Lines Matching +full:differential +full:- +full:pair

1 .. SPDX-License-Identifier: GPL-2.0
8 Heading 3 uses "----"
25 - To help kernel distributors understand exactly what the XFS online fsck
28 - To help people reading the code to familiarize themselves with the relevant
31 - To help developers maintaining the system by capturing the reasons
59 - Provide a hierarchy of names through which application programs can associate
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
79 cross-references different types of metadata records with each other to look
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 …ps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-servi…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
116 --------------
132 It uses extent-based in-memory data structures to reduce memory consumption,
141 -----------------
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
193 +--------------------------------------------------------------------------+
212 autonomous self-healing of XFS maximizes service availability.
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
238 -----
260 --------------
311 The ability to use hardware-assisted data file integrity checking is new
315 7. Re-check the summary counters and presents the caller with a summary of
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
349 --------------------------
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
383 errors and cross-references healthy records against other metadata to look for
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
429 - Reverse mapping information
431 - Directory parent pointers
444 Instead, repair functions set up an in-memory staging structure to store
455 Once the scan is done, the owning object is re-locked, the live data is used to
478 2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File")
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
562 ---------------
569 - **Decreased performance**: Adding metadata indices to the filesystem
576 - **Incorrect repairs**: As with all software, there might be defects in the
583 The xfsprogs build system has a configure option (``--enable-scrub=no``) that
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
609 spraying exploit code onto the public mailing list for instant zero-day
616 Automated testing should front-load some of the risk while the feature is
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
640 fsck project involves differential analysis against the existing fsck tools and
645 -------------------------------
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
679 ---------------------------------------
693 in-kernel validation functions and the ability of the offline fsck tool to
711 -----------------------------------------
745 2. Offline checking (``xfs_repair -n``)
747 4. Online checking (``xfs_scrub -n``)
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793 force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797 remounting the filesystem read-only and read-write.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
819 ------------------
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
843 ------------------
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
867 * ``xfs_scrub_all.cron`` on non-systemd systems
879 This was performed via ``systemd-analyze security``, after which privileges
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
901 download this information into a human-readable format.
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
934 ------------------------
965 Documentation/filesystems/xfs-self-describing-metadata.rst
968 ---------------
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
987 Like any system that improves redundancy, the reverse-mapping feature increases
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
1003 | copy-writes, which age the filesystem prematurely. |
1011 +--------------------------------------------------------------------------+
1015 .. code-block:: c
1031 Finally, the flags field provides extra information about the space usage --
1063 Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1080 Checking and Cross-Referencing
1081 ------------------------------
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here are very limited -- verifiers can only
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1174 Validation of Userspace-Controlled Record Attributes
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1200 cross-referencing records between metadata structures.
1204 The exact set of cross-referencing is highly dependent on the context of the
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurate reflect the contents
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure are cross-referenced as
1243 - Does the reverse mapping index list only the appropriate owner as the
1246 - Are none of the blocks claimed as free space?
1248 - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253 - Everything in class 1 above.
1255 - If there's a parent node block, do the keys listed for this block match the
1258 - Do the sibling pointers point to valid blocks? Of the same level?
1260 - Do the child pointers point to valid blocks? Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264 - Everything in class 1 and 2 above.
1266 - Does the reverse mapping index list no owners of this space?
1268 - Is this space not claimed by the inode index for inodes?
1270 - Is it not mentioned by the reference count index?
1272 - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276 - Everything in class 1 and 2 above.
1278 - Is there a matching record in free inode btree?
1280 - Do cleared bits in the holemask correspond with inode clusters?
1282 - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287 - Everything in class 1.
1289 - Do all the fields that summarize information about the file forks actually
1292 - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297 - Everything in class 1 and 2 above.
1299 - Is this space not mentioned by the inode btrees?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306 - Everything in class 1 and 2 above.
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-
1317 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gap…
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-
1331 Extended attributes implement a key-value store that enable fragments of data
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1374 Checking and Cross-Referencing Directories
1380 255-byte sequence (name) to an inumber.
1385 Each non-directory file may have multiple directories point to it.
1390 Each data block contains variable-sized records associating a user-provided
1393 exists as post-EOF extents) is populated with a block containing free space
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1446 The format of leaf and node records are the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurate reflect the
1477 - For each dabtree leaf record, does the record key accurate reflect the
1480 Cross-Referencing Summary Counters
1490 Cross-referencing these values against the filesystem metadata should be a
1498 Post-Repair Reverification
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1570 It then attaches to the in-memory transaction an action item to schedule
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1608 In other words, all per-AG metadata updates for an unmapped block must be
1654 For copy-on-write updates this is even worse, because this must be done once to
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1672 For example, if a thread performing a copy-on-write has completed a reverse
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1733 calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this have an overhead of 40-50ns per access.
1804 - The hooked part of XFS should declare a static-scoped static key that
1809 - When deciding to invoke code that's only used by scrub, the regular
1811 scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1839 Documentation/staging/static-keys.rst.
1844 ----------------------
1860 difficult, especially on 32-bit systems.
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1887 | The second edition solved the half-rebuilt structure problem by storing |
1892 +--------------------------------------------------------------------------+
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1922 XFS is very record-based, which suggests that the ability to load and store
1924 To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
1953 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1966 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1984 Arrays of Fixed-Sized Records
1988 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1990 Directories have a set of fixed-size dirent records that point to the names,
1991 and extended attributes have a set of fixed-size attribute keys that point to
2001 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
2002 the byte-accessible xfile.
2019 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2047 `big in-memory array
2048 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
2057 .. code-block:: c
2075 .. code-block:: c
2138 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2141 low-effort robust (resistant) location in large samples`, in *Contributions to
2145 The partitioning of quicksort is fairly textbook -- rearrange the record
2184 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2188 In-Memory B+Trees
2200 Another option is to skip the side-log and commit live updates from the
2218 The XFS buffer cache specializes in abstracting IO to block-oriented address
2225 `in-memory btree
2226 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2235 per-AG structure.
2240 With this adaptation in place, users of the xfile-backed buffer cache use
2241 exactly the same APIs as users of the disk-backed buffer cache.
2244 updates to an in-memory btree.
2250 Space management for an xfile is very simple -- each btree block is one memory
2252 These blocks use the same header format as an on-disk btree, but the in-memory
2254 corruption-prone than regular DRAM.
2294 and to update the in-memory btree.
2309 structure, the ephemeral nature of the in-memory btree block storage presents
2342 ------------------------------
2353 rebuilding a btree index from a collection of records -- bulk btree loading.
2354 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2355 had separate copy-pasted implementations for each btree type.
2376 maxrecs = (block_size - header_size) / record_size
2402 maxrecs = (block_size - header_size) / (key_size + ptr_size)
2421 - For AG-rooted btrees, this level is the root level, so the height of the new
2425 - For inode-rooted btrees where the records in the top level do not fit in the
2430 - For inode-rooted btrees where the records in the top level can be stored in
2434 This only becomes relevant when non-bmap btrees gain the ability to root in
2447 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2452 extent, it updates the in-memory reservation to reflect the claimed space.
2472 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2475 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2481 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2627 This xfarray is walked twice during the btree creation step -- once to populate
2629 free inode btree with records for chunks that have free non-sparse inodes.
2636 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2650 In other words, the record emission stimulus is level-triggered::
2661 Extents being used to stage copy-on-write operations should be the only records
2663 Single-owner file blocks aren't recorded in either the free space or the
2697 - Until the reverse mapping btree runs out of records:
2699 - Retrieve the next record from the btree and put it in a bag.
2701 - Collect all records with the same starting block from the btree and put
2704 - While the bag isn't empty:
2706 - Among the mappings in the bag, compute the lowest block number where the
2712 - Remove all mappings from the bag that end at this position.
2714 - Collect all reverse mappings that start at this position from the btree
2717 - If the size of the bag changed and is greater than one, create a new
2721 The bag-like structure in this case is a type 2 xfarray as discussed in the
2729 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2776 ---------------------------
2811 Repairs for file-based metadata such as extended attributes, directories,
2823 - If zero, the block has a single owner and can be freed.
2825 - If not, the block is part of a crosslinked structure and must not be
2855 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2926 information changes the number of free space records, repair must re-estimate
2955 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2984 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2998 other owner, to avoid re-adding crosslinked blocks to the AGFL.
3002 5. The next operation to fix the freelist will right-size the list.
3007 --------------------
3010 ("dinodes") and an in-memory ("cached") representation.
3013 badly damaged that the filesystem cannot load the in-memory representation.
3015 specialized resource acquisition functions that return either the in-memory
3020 is necessary to get the in-core structure loaded.
3025 Once the in-memory representation is loaded, repair can lock the inode and can
3027 Most inode attributes are easy to check and constrain, or are user-controlled
3036 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3040 --------------------
3043 an in-memory representation, and hence are subject to the same cache coherency
3048 whatever is necessary to get the in-core structure loaded.
3049 Once the in-memory representation is loaded, the only attributes needing
3057 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3063 --------------------------------
3073 Writer threads reserve the worst-case quantities of resources from the
3086 The high-performance nature of the summary counters makes it difficult for
3095 For repairs, the in-memory counters must be stabilized while walking the
3109 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3113 - It does not quiesce the log.
3118 +--------------------------------------------------------------------------+
3120 +--------------------------------------------------------------------------+
3127 | - Other programs can unfreeze the filesystem without our knowledge. |
3130 | - Adding an extra lock to prevent others from thawing the filesystem |
3131 | required the addition of a ``->freeze_super`` function to wrap |
3143 | - The log need not be quiesced to check the summary counters, but a VFS |
3147 | - Quiescing the log means that XFS flushes the (possibly incorrect) |
3150 | - A bug in the VFS meant that freeze could complete even when |
3153 +--------------------------------------------------------------------------+
3157 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3161 ---------------------
3174 - How does scrub manage the scan while it is collecting data?
3176 - How does the scan keep abreast of changes being made to the system by other
3186 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3191 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3193 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3194 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3195 1913-4.
3199 They form a continuous keyspace that can be expressed as a 64-bit integer,
3213 Advancing the scan cursor is a multi-step process encapsulated in
3221 2. Use the per-AG inode btree to look up the next inumber after the one that
3289 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3293 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3312 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3315 - Speculative preallocations need to be unreserved.
3317 - An unlinked file may have lost its last reference, in which case the entire
3322 Inactivation has two parts -- the VFS part, which initiates writeback on all
3323 dirty file pages, and the XFS part, which cleans up XFS-specific information
3360 then decide to cross-reference the object with an object that is acquired
3376 save time if another process re-opens the file before the system runs out
3378 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3393 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3395 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3403 in a well-known order: parent → child when updating the directory tree, and
3421 Solving both of these problems is straightforward -- any time online fsck
3455 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
3484 overhead for single-threaded applications.
3490 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3491 a well-known incore filesystem object.
3493 - Each hook must define an action code and a structure containing more context
3496 - Hook providers should provide appropriate wrapper functions and structs
3500 - A callsite in the regular filesystem code must be chosen to call
3510 - The online fsck function should define a structure to hold scan data, a lock
3515 - The online fsck code must contain a C function to catch the hook action code
3520 - Prior to unlocking inodes to start the scan, online fsck must call
3524 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3568 - Prior to invoking the notifier call chain, the filesystem function being
3572 - The scanning function and the scrub hook function must coordinate access to
3575 - Scrub hook function must not add the live update information to the scan
3580 - Scrub hook functions must not change the caller's state, including the
3585 - The hook function can abort the inode scan to avoid breaking the other rules.
3589 - ``xchk_iscan_start`` starts a scan
3591 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3594 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3597 in-memory scan information.
3599 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3602 - ``xchk_iscan_teardown`` to finish the scan
3606 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3698 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3708 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3739 Non-directories never have children of any kind.
3756 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3766 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3768 repair code -- code and data are entirely contained within the scrub module,
3771 A secondary advantage of this repair approach is atomicity -- once the kernel
3783 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3802 a. Create a btree cursor for the in-memory btree.
3804 b. Use the rmap code to add the record to the in-memory btree.
3813 a. Create a btree cursor for the in-memory btree.
3815 b. Replay the operation into the in-memory btree.
3840 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3844 --------------------------------------------
3849 File forks map 64-bit logical file fork space extents to physical storage space
3850 extents, similar to how a memory management unit maps 64-bit virtual addresses
3852 Therefore, file-based tree structures (such as directories and extended
3854 to other blocks mapped within that same address space, and file-based linear
3860 Therefore, online repair of file-based metadata createas a temporary file in
3878 There is a downside to the reaping process -- if the system crashes during the
3890 +--------------------------------------------------------------------------+
3892 +--------------------------------------------------------------------------+
3907 | - Array structures are linearly addressed, and the regular filesystem |
3911 | - Extended attributes are allowed to use the entire attr fork offset |
3914 | - Even if repair could build an alternate copy of a data structure in a |
3920 | - A crash after construction of the secondary tree but before the range |
3924 | - Reaping blocks after a repair is not a simple operation, and |
3928 | - Directory entry blocks and quota records record the file fork offset |
3935 | - Each block in a directory or extended attributes btree index contains |
3944 +--------------------------------------------------------------------------+
3951 This allocates an inode, marks the in-core inode private, and attaches it to
3959 The usage patterns of these two locks are the same as for any other XFS file --
3982 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3986 ----------------------
3996 a. When the reverse-mapping btree is enabled, the swap code must keep the
4001 b. Reverse-mapping is critical for the operation of online fsck, so the old
4007 For this use case, an incomplete exchange will not result in a user-visible
4012 For directory and xattr repairs, the user-visible contents might be the
4015 e. Old blocks in the file may be cross-linked with another structure and must
4016 not reappear if the system goes down mid-repair.
4022 the reverse-mapping extent swap code.
4026 The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
4032 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4035 +--------------------------------------------------------------------------+
4036 | **Sidebar: Using Log-Incompatible Feature Flags** |
4037 +--------------------------------------------------------------------------+
4075 | Log-assisted extended attribute updates and atomic extent swaps both use |
4078 +--------------------------------------------------------------------------+
4093 .. code-block:: c
4166 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4179 The operation manager completes the deferred work in steps 3b-3e before
4182 4. Perform any post-processing.
4204 - Data device blocks needed to handle the repeated updates to the fork
4206 - Change in data and realtime block counts for both files.
4207 - Increase in quota usage for both files, if the two files do not share the
4209 - The number of extent mappings that will be added to each file.
4210 - Whether or not there are partially written realtime extents.
4229 - If both forks are in local format and the fork areas are large enough, the
4235 - If both forks map blocks, then the regular atomic extent swap is used.
4237 - Otherwise, only one fork is in local format.
4256 reaping <reaping>` mechanism that is done post-repair.
4260 However, this iunlink processing omits the cross-link detection of online
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4305 and can satisfy a power-of-two allocation request.
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4378 ------------------
4452 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4480 +--------------------------------------------------------------------------+
4482 +--------------------------------------------------------------------------+
4521 +--------------------------------------------------------------------------+
4568 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repai…
4605 <https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
4641 5. Copy all non-parent pointer extended attributes to the temporary file.
4651 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-re…
4668 per-AG in-memory slab.
4672 a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
4678 Record the names in a per-file xfblob, and store ``(parent_inum,
4679 parent_gen, dirent_pos)`` tuples in a per-file slab.
4681 2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
4684 per-AG tuple slab.
4685 This should be trivial since the per-AG tuples are in child inumber
4688 4. Position a second slab cursor at the start of the per-file tuple slab.
4693 a. Tuples in the per-AG list but not the per-file list are missing and
4696 b. Tuples in the per-file list but not the per-AG list are dangling
4706 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
4710 challenging because it currently uses a single-pass scan of the filesystem
4712 This scan would have to be converted into a multi-pass scan:
4729 the dirents and add them to the now-empty directories.
4736 -------------
4741 downwards either to more subdirectories or to non-directory files.
4791 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4804 -----------------
4846 - Phase 1 checks that the provided path maps to an XFS filesystem and detect
4849 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
4851 - Phase 3 scans inodes in parallel.
4854 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
4857 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
4860 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
4863 - Phase 7 checks group (a), having validated everything else.
4869 --------------------
4872 Given that XFS targets installations with large high-performance storage,
4909 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
4912 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
4918 ------------------
4938 filesystem object, it became much more memory-efficient to track all eligible
4940 Each repair item represents a single lockable object -- AGs, metadata files,
4997 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
5000 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5003 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5006 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5010 -----------------------------------------------
5020 - Slashes and null bytes are not allowed in directory entries.
5022 - Null bytes are not allowed in userspace-visible extended attributes.
5024 - Null bytes are not allowed in the filesystem label.
5029 presented together -- all the names in a directory, or all the attributes of a
5033 modern-day Linux systems is that programs work with Unicode character code
5035 These programs typically encode those code points in UTF-8 when interfacing
5036 with the C library because the kernel expects null-terminated names.
5038 UTF-8 encoded Unicode data.
5046 The standard also permits characters to be constructed in multiple ways --
5056 For example, the character "Right-to-Left Override" U+202E can trick some
5073 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5076 `libicu <https://github.com/unicode-org/icu>`_
5079 Names are also checked for control characters, non-rendering characters, and
5085 ---------------------------------------
5099 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5105 construct file paths from inode numbers for user-friendly reporting.
5120 ----------------
5137 When XFS v5 came along with self-describing metadata, this old mechanism grew
5144 This mechanism is identical to steps 2-3 from the procedure above except for
5156 - **Atomic commit of file writes**: A userspace process opens a file that it
5167 - **Transactional file updates**: The same mechanism as above, but the caller
5178 - **Emulation of atomic block device writes**: Export a block device with a
5188 ----------------
5210 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5213 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5217 ------------------------------------
5232 ------------------------
5249 uses the new fallocate map-freespace call to map any free space in that region
5263 contents; any changes will be written somewhere else via copy-on-write.
5302 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5305 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5309 ---------------------