1LZ4 Block Format Description 2============================ 3Last revised: 2019-03-30. 4Author : Yann Collet 5 6 7This specification is intended for developers 8willing to produce LZ4-compatible compressed data blocks 9using any programming language. 10 11LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. 12There is no entropy encoder back-end nor framing layer. 13The latter is assumed to be handled by other parts of the system 14(see [LZ4 Frame format]). 15This design is assumed to favor simplicity and speed. 16It helps later on for optimizations, compactness, and features. 17 18This document describes only the block format, 19not how the compressor nor decompressor actually work. 20The correctness of the decompressor should not depend 21on implementation details of the compressor, and vice versa. 22 23[LZ4 Frame format]: lz4_Frame_format.md 24 25 26 27Compressed block format 28----------------------- 29An LZ4 compressed block is composed of sequences. 30A sequence is a suite of literals (not-compressed bytes), 31followed by a match copy. 32 33Each sequence starts with a `token`. 34The `token` is a one byte value, separated into two 4-bits fields. 35Therefore each field ranges from 0 to 15. 36 37 38The first field uses the 4 high-bits of the token. 39It provides the length of literals to follow. 40 41If the field value is 0, then there is no literal. 42If it is 15, then we need to add some more bytes to indicate the full length. 43Each additional byte then represent a value from 0 to 255, 44which is added to the previous value to produce a total length. 45When the byte value is 255, another byte is output. 46There can be any number of bytes following `token`. There is no "size limit". 47(Side note : this is why a not-compressible input block is expanded by 0.4%). 48 49Example 1 : A literal length of 48 will be represented as : 50 51 - 15 : value for the 4-bits High field 52 - 33 : (=48-15) remaining length to reach 48 53 54Example 2 : A literal length of 280 will be represented as : 55 56 - 15 : value for the 4-bits High field 57 - 255 : following byte is maxed, since 280-15 >= 255 58 - 10 : (=280 - 15 - 255) ) remaining length to reach 280 59 60Example 3 : A literal length of 15 will be represented as : 61 62 - 15 : value for the 4-bits High field 63 - 0 : (=15-15) yes, the zero must be output 64 65Following `token` and optional length bytes, are the literals themselves. 66They are exactly as numerous as previously decoded (length of literals). 67It's possible that there are zero literal. 68 69 70Following the literals is the match copy operation. 71 72It starts by the `offset`. 73This is a 2 bytes value, in little endian format 74(the 1st byte is the "low" byte, the 2nd one is the "high" byte). 75 76The `offset` represents the position of the match to be copied from. 771 means "current position - 1 byte". 78The maximum `offset` value is 65535, 65536 cannot be coded. 79Note that 0 is an invalid value, not used. 80 81Then we need to extract the `matchlength`. 82For this, we use the second token field, the low 4-bits. 83Value, obviously, ranges from 0 to 15. 84However here, 0 means that the copy operation will be minimal. 85The minimum length of a match, called `minmatch`, is 4. 86As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. 87Similar to literal length, on reaching the highest possible value (15), 88we output additional bytes, one at a time, with values ranging from 0 to 255. 89They are added to total to provide the final match length. 90A 255 value means there is another byte to read and add. 91There is no limit to the number of optional bytes that can be output this way. 92(This points towards a maximum achievable compression ratio of about 250). 93 94Decoding the `matchlength` reaches the end of current sequence. 95Next byte will be the start of another sequence. 96But before moving to next sequence, 97it's time to use the decoded match position and length. 98The decoder copies `matchlength` bytes from match position to current position. 99 100In some cases, `matchlength` is larger than `offset`. 101Therefore, `match_pos + matchlength > current_pos`, 102which means that later bytes to copy are not yet decoded. 103This is called an "overlap match", and must be handled with special care. 104A common case is an offset of 1, 105meaning the last byte is repeated `matchlength` times. 106 107 108End of block restrictions 109----------------------- 110There are specific rules required to terminate a block. 111 1121. The last sequence contains only literals. 113 The block ends right after them. 1142. The last 5 bytes of input are always literals. 115 Therefore, the last sequence contains at least 5 bytes. 116 - Special : if input is smaller than 5 bytes, 117 there is only one sequence, it contains the whole input as literals. 118 Empty input can be represented with a zero byte, 119 interpreted as a final token without literal and without a match. 1203. The last match must start at least 12 bytes before the end of block. 121 The last match is part of the penultimate sequence. 122 It is followed by the last sequence, which contains only literals. 123 - Note that, as a consequence, 124 an independent block < 13 bytes cannot be compressed, 125 because the match must copy "something", 126 so it needs at least one prior byte. 127 - When a block can reference data from another block, 128 it can start immediately with a match and no literal, 129 so a block of 12 bytes can be compressed. 130 131When a block does not respect these end conditions, 132a conformant decoder is allowed to reject the block as incorrect. 133 134These rules are in place to ensure that a conformant decoder 135can be designed for speed, issuing speculatively instructions, 136while never reading nor writing beyond provided I/O buffers. 137 138 139Additional notes 140----------------------- 141If the decoder will decompress data from an external source, 142it is recommended to ensure that the decoder will not be vulnerable to 143buffer overflow manipulations. 144Always ensure that read and write operations 145remain within the limits of provided buffers. 146Test the decoder with fuzzers 147to ensure it's resilient to improbable combinations. 148 149The format makes no assumption nor limits to the way the compressor 150searches and selects matches within the source data block. 151Multiple techniques can be considered, 152featuring distinct time / performance trade offs. 153As long as the format is respected, 154the result will be compatible and decodable by any compliant decoder. 155An upper compression limit can be reached, 156using a technique called "full optimal parsing", at high cpu cost. 157