.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against corruption in the case of a system crash. A small
contiguous region of disk (default 128MiB) is reserved inside the
filesystem as a place to land “important” data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
the journal code writes the transactions to their final locations on
disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 1 1 78
   :header-rows: 1

   * - Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 1 1 1 1 76
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - h\_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - \_\_be32
     - h\_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - \_\_be32
     - h\_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find
the start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal\_header\_t (12 bytes)
     - s\_header
     - Common header identifying this as a superblock.
   * - 0xC
     - \_\_be32
     - s\_blocksize
     - Journal device block size.
   * - 0x10
     - \_\_be32
     - s\_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - \_\_be32
     - s\_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - \_\_be32
     - s\_sequence
     - First commit ID expected in log.
   * - 0x1C
     - \_\_be32
     - s\_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - \_\_be32
     - s\_errno
     - Error value, as set by jbd2\_journal\_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - \_\_be32
     - s\_feature\_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - \_\_be32
     - s\_feature\_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - \_\_be32
     - s\_feature\_ro\_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - \_\_u8
     - s\_uuid[16]
     - 128-bit UUID for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - \_\_be32
     - s\_nr\_users
     - Number of file systems sharing this journal.
   * - 0x44
     - \_\_be32
     - s\_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - \_\_be32
     - s\_max\_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - \_\_be32
     - s\_max\_trans\_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - \_\_u8
     - s\_checksum\_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - \_\_u8[3]
     - s\_padding2
     -
   * - 0x54
     - \_\_u32
     - s\_padding[42]
     -
   * - 0xFC
     - \_\_be32
     - s\_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - \_\_u8
     - s\_users[16\*48]
     - IDs of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal\_block\_tag\_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be32
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
       not enabled.
   * - 0xC
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be16
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - \_\_be16
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
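
The tag layouts above determine how far a reader must advance to reach
the next tag. The following is a minimal sketch, not the kernel's
implementation, of walking the tags in one descriptor block. The function
name ``iter_tags`` and the constant names are made up for illustration;
the numeric values come from the tables in this document. It assumes the
caller passes the raw block and the ``s_feature_incompat`` value, and that
the last four bytes of the block are reserved for a checksum tail when v2
or v3 checksums are enabled.

.. code-block:: python

   import struct

   # Feature and flag values copied from the tables above.
   INCOMPAT_64BIT = 0x2
   INCOMPAT_CSUM_V2 = 0x8
   INCOMPAT_CSUM_V3 = 0x10
   TAG_FLAG_SAME_UUID = 0x2
   TAG_FLAG_LAST_TAG = 0x8


   def iter_tags(block, incompat):
       """Yield (final_block_number, flags) for each tag in a descriptor block.

       Hypothetical helper: ``block`` is the raw descriptor block and
       ``incompat`` is s_feature_incompat from the journal superblock.
       """
       v3 = bool(incompat & INCOMPAT_CSUM_V3)
       has_64bit = bool(incompat & INCOMPAT_64BIT)
       # Assumption: with v2/v3 checksums the last 4 bytes are the block tail.
       tail = 4 if incompat & (INCOMPAT_CSUM_V2 | INCOMPAT_CSUM_V3) else 0
       # Size of a tag before the optional trailing UUID.
       fixed = 16 if v3 else (12 if has_64bit else 8)
       end = len(block) - tail
       offset = 12  # skip the common 12-byte block header
       while offset + fixed <= end:
           if v3:
               low, flags, high, _csum = struct.unpack_from(">IIII", block, offset)
           else:
               low, _csum, flags = struct.unpack_from(">IHH", block, offset)
               high = 0
               if has_64bit:
                   (high,) = struct.unpack_from(">I", block, offset + 8)
           yield (high << 32) | low, flags
           offset += fixed
           if not flags & TAG_FLAG_SAME_UUID:
               offset += 16  # a 16-byte UUID trails this tag
           if flags & TAG_FLAG_LAST_TAG:
               break

All fields are unpacked with the ``>`` (big-endian) prefix because jbd2
stores everything in big-endian order. The sketch does not verify any
checksums and does not undo the "escaped" flag on data blocks.
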

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - r\_header
     - Common block header.
   * - 0xC
     - \_\_be32
     - r\_count
     - Number of bytes used in this block.
   * - 0x10
     - \_\_be32 or \_\_be64
     - blocks[0]
     - Blocks to revoke.

After r\_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - r\_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long (but uses a full block):

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h\_chksum\_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h\_chksum\_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h\_padding[2]
     -
   * - 0x10
     - \_\_be32
     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - \_\_be64
     - h\_commit\_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - \_\_be32
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.
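
The revocation and commit block layouts above can be decoded with a few
fixed-offset reads. The following is a minimal sketch rather than a
reference parser. The function names (``parse_header``, ``parse_revoke``,
``parse_commit``) are hypothetical, and it assumes that ``r_count``
includes the 16 bytes that precede ``blocks[0]``. No checksum is
verified.

.. code-block:: python

   import struct

   JBD2_MAGIC = 0xC03B3998
   BLOCKTYPE_COMMIT = 2
   BLOCKTYPE_REVOKE = 5


   def parse_header(block):
       """Return (blocktype, sequence), or None if the magic does not match."""
       magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
       if magic != JBD2_MAGIC:
           return None
       return blocktype, sequence


   def parse_revoke(block, has_64bit):
       """Return the revoked block numbers stored in one revocation block."""
       header = parse_header(block)
       if header is None or header[0] != BLOCKTYPE_REVOKE:
           raise ValueError("not a revocation block")
       (r_count,) = struct.unpack_from(">I", block, 0xC)
       # Assumption: r_count counts bytes from the start of the block.
       used = min(r_count, len(block))
       width, fmt = (8, ">Q") if has_64bit else (4, ">I")
       return [struct.unpack_from(fmt, block, off)[0]
               for off in range(0x10, used - width + 1, width)]


   def parse_commit(block):
       """Return the commit block fields as a dict."""
       header = parse_header(block)
       if header is None or header[0] != BLOCKTYPE_COMMIT:
           raise ValueError("not a commit block")
       chksum_type, chksum_size = struct.unpack_from(">BB", block, 0xC)
       (first_chksum,) = struct.unpack_from(">I", block, 0x10)
       commit_sec, commit_nsec = struct.unpack_from(">QI", block, 0x30)
       return {
           "sequence": header[1],
           "chksum_type": chksum_type,
           "chksum_size": chksum_size,
           "first_chksum": first_chksum,
           "commit_sec": commit_sec,
           "commit_nsec": commit_nsec,
       }

A replay tool built on this sketch would still need to compare the
sequence number against the transaction being replayed and validate the
checksums before trusting a commit block.
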