.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with a file, then load the inode to find the metadata for
that file. ext4 appears to cheat (for performance reasons) a little bit
by storing a copy of the file type (normally stored in the inode) in the
directory entry. (Compare all this to FAT, which stores all the file
information directly in the directory entry, but does not support hard
links and is in general more seek-happy than ext4 due to its simpler
block allocator and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - i\_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - \_\_le16
     - i\_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - \_\_le32
     - i\_size\_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - \_\_le32
     - i\_atime
     - Last access time, in seconds since the epoch.
       However, if the EA\_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - \_\_le32
     - i\_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - \_\_le32
     - i\_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - \_\_le32
     - i\_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - \_\_le16
     - i\_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - \_\_le16
     - i\_links\_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR\_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - \_\_le32
     - i\_blocks\_lo
     - Lower 32-bits of “block” count. If the huge\_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
       ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
       << 32)`` 512-byte blocks on disk. If huge\_file is set and
       EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file
       consumes ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on
       disk.
   * - 0x20
     - \_\_le32
     - i\_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i\_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i\_block[EXT4\_N\_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i\_block”.
   * - 0x64
     - \_\_le32
     - i\_generation
     - File version (for NFS).
   * - 0x68
     - \_\_le32
     - i\_file\_acl\_lo
     - Lower 32-bits of extended attribute block. ACLs are of course one of
       many possible extended attributes; I think the name of this field is a
       result of the first use of extended attributes being for ACLs.
   * - 0x6C
     - \_\_le32
     - i\_size\_high / i\_dir\_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i\_dir\_acl, though it was usually set to zero and never used.
   * - 0x70
     - \_\_le32
     - i\_obso\_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i\_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - \_\_le16
     - i\_extra\_isize
     - Size of this inode - 128. Alternately, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - \_\_le16
     - i\_checksum\_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - \_\_le32
     - i\_ctime\_extra
     - Extra change time bits. This provides sub-second precision. See Inode
       Timestamps section.
   * - 0x88
     - \_\_le32
     - i\_mtime\_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - \_\_le32
     - i\_atime\_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - \_\_le32
     - i\_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - \_\_le32
     - i\_crtime\_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - \_\_le32
     - i\_version\_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - \_\_le32
     - i\_projid
     - Project ID.

.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S\_IXOTH (Others may execute)
   * - 0x2
     - S\_IWOTH (Others may write)
   * - 0x4
     - S\_IROTH (Others may read)
   * - 0x8
     - S\_IXGRP (Group members may execute)
   * - 0x10
     - S\_IWGRP (Group members may write)
   * - 0x20
     - S\_IRGRP (Group members may read)
   * - 0x40
     - S\_IXUSR (Owner may execute)
   * - 0x80
     - S\_IWUSR (Owner may write)
   * - 0x100
     - S\_IRUSR (Owner may read)
   * - 0x200
     - S\_ISVTX (Sticky bit)
   * - 0x400
     - S\_ISGID (Set GID)
   * - 0x800
     - S\_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S\_IFIFO (FIFO)
   * - 0x2000
     - S\_IFCHR (Character device)
   * - 0x4000
     - S\_IFDIR (Directory)
   * - 0x6000
     - S\_IFBLK (Block device)
   * - 0x8000
     - S\_IFREG (Regular file)
   * - 0xA000
     - S\_IFLNK (Symbolic link)
   * - 0xC000
     - S\_IFSOCK (Socket)

.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4\_UNRM\_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
   * - 0x10
     - File is immutable (EXT4\_IMMUTABLE\_FL).
   * - 0x20
     - File can only be appended (EXT4\_APPEND\_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
   * - 0x80
     - Do not update access time (EXT4\_NOATIME\_FL).
   * - 0x100
     - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
       EXT4\_ECOMPR\_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4\_INDEX\_FL).
   * - 0x2000
     - AFS magic directory (EXT4\_IMAGIC\_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4\_JOURNAL\_DATA\_FL).
   * - 0x8000
     - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4\_DIRSYNC\_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
   * - 0x40000
     - This is a huge file (EXT4\_HUGE\_FILE\_FL).
   * - 0x80000
     - Inode uses extents (EXT4\_EXTENTS\_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4\_EA\_INODE\_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4\_RESERVED\_FL).
   * -
     - Aggregate flags:
   * - 0x4BDFFF
     - User-visible flags.
   * - 0x4B80FF
     - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
       EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
       EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
       these flags in a special manner and they are masked out of the set of
       flags that are saved directly to i\_flags.

.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - l\_i\_version
     - Inode version. However, if the EA\_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - h\_i\_translator
     - ??

Masix:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - m\_i\_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - l\_i\_blocks\_high
     - Upper 16-bits of the block count. Please see the note attached to
       i\_blocks\_lo.
   * - 0x2
     - \_\_le16
     - l\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - \_\_le16
     - l\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - l\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_le16
     - l\_i\_checksum\_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - \_\_le16
     - l\_i\_reserved
     - Unused.

Hurd:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - h\_i\_mode\_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - \_\_le16
     - h\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - h\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_u32
     - h\_i\_author
     - Author code?

Masix:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - m\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - \_\_u32
     - m\_i\_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by struct ext4\_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows struct ext4\_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes.
Access to fields beyond
EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of October 2013) the inode structure is 156 bytes
(``i_extra_isize = 28``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. For inodes that are not linked from any directory but are
still open (orphan inodes), the dtime field is overloaded for use with
the orphan list. The superblock field ``s_last_orphan`` points to the
first inode in the orphan list; dtime is then the number of the next
orphaned inode, or zero if there are no more orphans.
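
The orphan list described above is a singly linked list threaded through
``dtime``. The following is a minimal Python sketch of walking it, not the
kernel's implementation. The ``dtime_of`` mapping is a hypothetical stand-in
for reading ``i_dtime`` from the on-disk inode table, and the loop check is
an added precaution rather than something the on-disk format specifies.

.. code-block:: python

   def walk_orphan_list(s_last_orphan, dtime_of):
       """Return the orphan inode numbers in list order.

       s_last_orphan: inode number of the first orphan (0 means empty list).
       dtime_of: hypothetical mapping from inode number to that inode's raw
           i_dtime field, which holds the next orphan's inode number.
       """
       orphans = []
       seen = set()
       ino = s_last_orphan
       while ino != 0:
           # A repeated inode number would mean a corrupted (cyclic) list.
           if ino in seen:
               raise ValueError(f"orphan list loops back to inode {ino}")
           seen.add(ino)
           orphans.append(ino)
           # dtime is overloaded as the "next" pointer; zero ends the list.
           ino = dtime_of[ino]
       return orphans


   # Made-up example: the superblock points at inode 12, whose dtime names
   # inode 40, whose dtime names inode 17, which terminates the list.
   print(walk_orphan_list(12, {12: 40, 40: 17, 17: 0}))
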

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64-bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv\_sec
     - Decoded 64-bit tv\_sec
     - valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

This is a somewhat odd encoding
since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.
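
The decoding rule from the table above can be sketched in a few lines of
Python. This is an illustration of the documented encoding only: it follows
the table rather than the buggy behaviour mentioned in the previous
paragraph, and the function name ``decode_timestamp`` and its arguments are
hypothetical, standing for the raw 32-bit seconds field and the matching
``*_extra`` field read from the inode.

.. code-block:: python

   def decode_timestamp(seconds_field, extra_field):
       """Return (tv_sec, tv_nsec) for one ext4 timestamp pair."""
       # Interpret the on-disk seconds field as a signed 32-bit integer.
       tv_sec = seconds_field & 0xFFFFFFFF
       if tv_sec & 0x80000000:
           tv_sec -= 1 << 32
       # The lower two bits of the extra field are the extra epoch bits;
       # they add a multiple of 2^32 to the signed seconds value.
       epoch_bits = extra_field & 0x3
       tv_sec += epoch_bits << 32
       # The upper 30 bits of the extra field hold the nanoseconds.
       tv_nsec = (extra_field >> 2) & 0x3FFFFFFF
       return tv_sec, tv_nsec


   # Extra epoch bits "0 1" with the MSB of the seconds field set land at
   # the start of the 2038-2106 range in the table.
   sec, nsec = decode_timestamp(0x80000000, 0x1)
   print(hex(sec), nsec)

With extra epoch bits of zero and no ``*_extra`` field present, the same
function reduces to the plain signed 32-bit interpretation used for dtime.
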