LZ4 Frame Format Description
============================

### Notices

Copyright (c) 2013-2020 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

1.6.2 (12/08/2020)


Introduction
------------

The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system,
file system and character set, suitable for
File compression, Pipe and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
| 4 bytes |  3-15 bytes   |       |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204

__Frame Descriptor__

3 to 15 Bytes, to be detailed in its own paragraph,
as it is the most important part of the spec.

The combined _Magic_Number_ and _Frame_Descriptor_ fields are sometimes
called ___LZ4 Frame Header___. Its size varies between 7 and 19 bytes.

__Data Blocks__

To be detailed in its own paragraph.
That’s where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block is followed by
the 32-bit value `0x00000000`.

__Content Checksum__

_Content_Checksum_ verifies that the full content has been decoded correctly.
The content checksum is the result of the [xxHash-32 algorithm]
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
Content Checksum validates the result,
that all blocks were fully transmitted in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

The combined _EndMark_ and _Content_Checksum_ fields might sometimes be
referred to as ___LZ4 Frame Footer___. Its size varies between 4 and 8 bytes.
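
As an illustration of the layout above, here is a minimal Python sketch that checks whether a buffer starts with the LZ4 frame magic number and computes the footer size. The function and constant names (`starts_with_lz4_frame`, `footer_size`, `LZ4_FRAME_MAGIC`) are hypothetical and chosen for this example only; they are not part of any LZ4 library.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # magic number value given in this document


def starts_with_lz4_frame(buf: bytes) -> bool:
    """Return True if buf begins with the LZ4 frame magic number."""
    if len(buf) < 4:
        return False
    # "<I" reads one unsigned 32-bit little-endian integer.
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC


def footer_size(has_content_checksum: bool) -> int:
    """Footer size in bytes: 4-byte EndMark plus optional 4-byte checksum."""
    return 4 + (4 if has_content_checksum else 0)


# Stored little-endian, the magic number appears as bytes 04 22 4D 18.
sample = bytes([0x04, 0x22, 0x4D, 0x18])
print(starts_with_lz4_frame(sample), footer_size(True))
```

Reading the value with an explicit little-endian format keeps the check independent of the host CPU byte order, in line with the portability goal of the format.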

__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | (Dictionary ID) | HC      |
| ------- | ------- |:--------------:|:---------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   |   0 - 4 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 15 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |  7-6  |   5   |    4     |  3   |    2     |    1     |   0  |
| ------- |-------|-------|----------|------|----------|----------|------|
|FieldName|Version|B.Indep|B.Checksum|C.Size|C.Checksum|*Reserved*|DictID|


__BD byte__

|  BitNb  |     7    |     6-5-4     |  3-2-1-0 |
| ------- | -------- | ------------- | -------- |
|FieldName|*Reserved*| Block MaxSize |*Reserved*|

In the tables, bit 7 is highest bit, while bit 0 is lowest.

__Version Number__

2-bits field, must be set to `01`.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such a case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes random access or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-bytes checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8 bytes unsigned little endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a 32-bits content checksum will be appended
after the EndMark.

__Dictionary ID flag__

If this flag is set, a 4-bytes Dict-ID field will be present,
after the descriptor flags and the Content Size.

__Block Maximum Size__

This information is useful to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one value among the following table :

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above any system-specific size.
Unused values may be used in a future revision of the spec.
A decoder conformant with the current version of the spec
is only able to decode block sizes defined in this spec.

__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
When this happens, a decoder respecting the current specification version
shall not be able to decode such a frame.

__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using unsigned 8 Bytes, for a maximum of 16 Exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.

__Dictionary ID__

Dict-ID is only present if the associated flag is set.
It's an unsigned 32-bits value, stored using little-endian convention.
A dictionary is useful to compress short input sequences.
The compressor can take advantage of the dictionary context
to encode the input in a more compact manner.
It works as a kind of “known prefix” which is used by
both the compressor and the decompressor to “warm-up” reference tables.

The decompressor can use Dict-ID identifier to determine
which dictionary must be used to correctly decode data.
The compressor and the decompressor must use exactly the same dictionary.
It's presumed that the 32-bits dictID uniquely identifies a dictionary.

Within a single frame, a single dictionary can be defined.
When the frame descriptor defines independent blocks,
each block will be initialized with the same dictionary.
If the frame descriptor defines linked blocks,
the dictionary will only be used once, at the beginning of the frame.

__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of `xxh32()` : ` (xxh32()>>8) & 0xFF `
using zero as a seed, and the full Frame Descriptor as an input
(including optional fields when they are present).
A wrong checksum indicates an error in the descriptor.
Header checksum is informational and can be skipped.
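
The descriptor layout described in this section can be sketched in Python as follows. This is an illustrative decoder under stated assumptions, not a reference implementation: the name `parse_frame_descriptor` and the returned dictionary keys are hypothetical, and the header checksum byte is returned without verification, since verifying it would require an xxHash-32 implementation that the Python standard library does not provide (the text above allows the check to be skipped).

```python
import struct

# Block Maximum Size table from this document; codes 0-3 are not defined.
BLOCK_MAX_SIZES = {4: 64 * 1024, 5: 256 * 1024, 6: 1024 * 1024, 7: 4 * 1024 * 1024}


def parse_frame_descriptor(buf: bytes) -> dict:
    """Decode a frame descriptor (the bytes right after the magic number)."""
    if len(buf) < 3:
        raise ValueError("descriptor needs at least 3 bytes")
    flg, bd = buf[0], buf[1]

    if (flg >> 6) != 0b01:
        raise ValueError("unsupported version")
    # FLG bit 1, BD bit 7 and BD bits 3-0 are reserved and must be zero.
    if flg & 0x02 or bd & 0x8F:
        raise ValueError("reserved bits must be zero")

    block_code = (bd >> 4) & 0x07
    if block_code not in BLOCK_MAX_SIZES:
        raise ValueError("unsupported Block Maximum Size code")

    has_content_size = bool(flg & 0x08)
    has_dict_id = bool(flg & 0x01)
    length = 3 + (8 if has_content_size else 0) + (4 if has_dict_id else 0)
    if len(buf) < length:
        raise ValueError("truncated descriptor")

    pos = 2
    content_size = dict_id = None
    if has_content_size:
        (content_size,) = struct.unpack_from("<Q", buf, pos)  # 8 bytes, little-endian
        pos += 8
    if has_dict_id:
        (dict_id,) = struct.unpack_from("<I", buf, pos)  # 4 bytes, little-endian
        pos += 4

    return {
        "block_independence": bool(flg & 0x20),
        "block_checksum": bool(flg & 0x10),
        "content_checksum": bool(flg & 0x04),
        "block_max_size": BLOCK_MAX_SIZES[block_code],
        "content_size": content_size,
        "dict_id": dict_id,
        "header_checksum": buf[pos],  # stored HC byte, NOT verified here
        "length": length,
    }
```

For example, a descriptor with FLG `0x64` and BD `0x70` decodes as version `01`, independent blocks, content checksum enabled, and a 4 MB block maximum size. A stricter decoder would additionally compare the stored HC byte against `(xxh32(descriptor_without_HC, seed=0) >> 8) & 0xFF`.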


Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4-bytes, format is little-endian.

If the highest bit is set (`1`), the block is uncompressed.

If the highest bit is not set (`0`), the block is LZ4-compressed,
using the [LZ4 block format specification](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).

All other bits give the size, in bytes, of the data section.
The size does not include the block checksum if present.

_Block_Size_ shall never be larger than _Block_Maximum_Size_.
Such an outcome could potentially happen for non-compressible sources.
In such a case, such data block must be passed using uncompressed format.

A value of `0x00000000` is invalid, and signifies an _EndMark_ instead.
Note that this is different from a value of `0x80000000` (highest bit set),
which is an uncompressed block of size 0 (empty),
which is valid, and therefore doesn't end a frame.
Note that, if _Block_checksum_ is enabled,
even an empty block must be followed by a 32-bit block checksum.

__Data__

Where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.

When compressed, the data must respect the [LZ4 block format specification](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).

Note that a block is not necessarily full.
Uncompressed size of data can be any size __up to__ _Block_Maximum_Size_,
so it may contain less data than the maximum block size.

__Block checksum__

Only present if the associated flag is set.
This is a 4-bytes checksum value, in little endian format,
calculated by using the [xxHash-32 algorithm] on the __raw__ (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

_Block_checksum_ can be cumulative with _Content_checksum_.

[xxHash-32 algorithm]: https://github.com/Cyan4973/xxHash/blob/release/doc/xxhash_spec.md


Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
Their design is straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

For the purpose of facilitating identification,
it is discouraged to start a flow of concatenated frames with a skippable frame.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it’s recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This will make it easier for file type identifiers.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(without including the magic number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.


Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

Where the actual compressed data stands.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This policy makes it possible to concatenate legacy frames.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within acceptable range.


Version changes
---------------

1.6.2 : clarifies specification of _EndMark_

1.6.1 : introduced terms "LZ4 Frame Header" and "LZ4 Frame Footer"

1.6.0 : restored Dictionary ID field in Frame header

1.5.1 : changed document format to MarkDown

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument