Lines Matching refs:a

15 in the context of microcontrollers. The question was: How would you build a
40 reactive, with no concept of a shutdown routine. This presents a big challenge
42 leave a device unrecoverable.
47 If a power loss corrupts any persistent data structures, this can cause the
49 recover from a power loss during any write operation.
51 1. **Wear leveling** - Writing to flash is destructive. If a filesystem
54 used to store frequently updated metadata and cause a device's early death.
62 were possible. For RAM we have a stronger requirement: all RAM usage is
64 size or number of files. This creates a unique challenge as even presumably
72 power-loss resilience and wear leveling, we can narrow these down to a handful
77 Here storage is divided into blocks, with each file being stored in a
79 power-loss resilient, so updating a file is as simple as rewriting the blocks
109 2. In a completely different direction, we have logging filesystems, such as
110 [JFFS], [YAFFS], and [SPIFFS]. Here, storage location is not bound to a piece of
111 data; instead, the entire storage is used for a circular log which is
113 changes, while reading requires traversing the log to reconstruct a file.
115 at a tradeoff of RAM.
126 Logging filesystems are beautifully elegant. With a checksum, we can easily
133 see a pure logging filesystem that does not have one of these two costs:
138 SPIFFS is a very interesting case here, as it uses the fact that repeated
139 programs to NOR flash are both atomic and masking. This is a very neat
142 3. Perhaps the most common type of filesystem, a journaling filesystem is the
143 offspring that happens when you mate a block based filesystem with a logging
144 filesystem. [ext4] and [NTFS] are good examples. Here, we take a normal
145 block based filesystem and add a bounded log where we note every change
174 as fast as a block based filesystem (though updating the journal does have
175 a small cost), and atomic updates to the journal allow the filesystem to
176 recover in the event of a power loss.
178 Unfortunately, journaling filesystems have a couple of problems. They are
180 parallel, which comes with a code size cost. They also offer no protection
187 a copy with the changes and replacing any references to the old block with
189 reach the root of our filesystem, which is often stored in a very small log.
217 storing data changes directly in a log. They even disassociate the storage
221 problems. Because updates to a COW filesystem don't stop until they've
222 reached the root, an update can cascade into a larger set of writes than
238 Consider logging. It has either an _O(n²)_ runtime or an _O(n)_ RAM cost. We
244 In the case of COW data structures, we can try twisting the definition a bit.
245 Let's say that our COW structure doesn't copy after a single write, but instead
249 each level we divide the propagation of wear by _n_. With a sufficiently
250 large _n_ (> branching factor) wear propagation is no longer a problem.
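
To make the bounded-writes idea a bit more concrete, here is a rough sketch (not littlefs's actual code) of a log block that only copies itself once it runs out of room, assuming hypothetical block-device helpers `bd_alloc`, `bd_erase`, and `bd_prog`:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

// Hypothetical block-device helpers, standing in for a real driver.
extern uint32_t bd_alloc(void);                    // pick a free block
extern void bd_erase(uint32_t block);              // erase a block
extern void bd_prog(uint32_t block, uint32_t off,  // program some bytes
                    const void *buf, size_t size);

struct cobw_log {
    uint32_t block;  // block currently holding the log
    uint32_t off;    // end of the log inside that block
};

// Append an entry, copying to a new block only when the current one is full.
// References to this log only change once per copy, so a structure built out
// of these logs sees wear from a "hot" entry divided by the number of appends
// that fit in a block, at every level.
void cobw_append(struct cobw_log *log, const void *entry, size_t size) {
    if (log->off + size > BLOCK_SIZE) {
        // out of room: this is the copy (COW) step, amortized over many
        // appends (a real implementation would also compact away any
        // superseded entries here)
        uint32_t nblock = bd_alloc();
        bd_erase(nblock);
        log->block = nblock;
        log->off = 0;
    }

    bd_prog(log->block, log->off, entry, size);
    log->off += size;
}
```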
258 on the filesystem. At the super-block level, littlefs is a CObW tree of blocks
293 storage, in the worst case a small log costs 4x the size of the original data.
303 Why two blocks? Well, logs work by appending entries to a circular buffer
305 incrementally program new data onto erased blocks, but we need to erase a full
306 block at a time. This means that in order for our circular buffer to work, we
311 during writes, using a data structure to track these blocks is complicated.
312 The simple solution here is to store two block addresses for every metadata
317 In order to determine which metadata block is the most recent, we store a
320 revision count also gives us a rough idea of how many erases have occurred on
351 So how do we atomically update our metadata pairs? Atomicity (a type of
353 Error detection can be provided with a checksum, and in littlefs's case we
354 use a 32-bit [CRC][wikipedia-crc]. Maintaining redundancy, on the other hand,
381 Note that littlefs doesn't maintain a checksum for each entry. Many logging
382 filesystems do this, but it limits what you can update in a single atomic
383 operation. What we can do instead is group multiple entries into a commit
384 that shares a single checksum. This lets us update multiple unrelated pieces
411 simple. We want to avoid RAM consumption, so we use a sort of brute force
412 solution where for each entry we check to see if a newer entry has been
443 more space is available, but because we have small logs, overflowing a log
447 containing half of the entries, connected by a tail pointer. Instead of
449 associated with larger logs, we form a linked list of small bounded logs.
450 This is a tradeoff as this approach does use more storage space, but at the
506 Looking at the problem generically, consider a log with ![n] bytes for each
521 Substituting these in for ![d] and ![s] gives us a nice formula for the cost of
526 Assuming 100 byte entries in a 4 KiB log, we can graph this using the entry
527 size to find a multiplicative cost:
542 If we look at metadata pairs and linked-lists of metadata pairs at a high
544 each containing _m_ metadata entries, the _lookup_ cost for a specific
545 entry has a worst case runtime complexity of _O(nm)_. For _updating_ a specific
553 that they can only use a quarter of their original storage. Metadata pairs
554 provide a mechanism for performing atomic updates, but we need a separate
559 Metadata pairs provide efficient atomic updates but unfortunately have a large
565 data structures, are a category of data structures where the underlying
567 elements containing a copy of the updated data and replacing any references
568 with references to the new elements. Generally, the performance of a COW data
574 with a constant amount of RAM. Notably this rules out
581 So, what can we do? First let's consider storing files in a simple COW
582 linked-list. Appending a block, which is the basis for writing files, means we
583 have to update the last block to point to our new block. This requires a COW
596 To avoid a full copy during appends, we can store the data backwards. Appending
598 updated. If we update a block in the middle, we still need to copy the
612 However, a backwards linked-list does have a rather glaring problem. Iterating
613 over a file _in order_ has a runtime cost of _O(n²)_. A quadratic runtime
614 just to read a file! That's awful.
616 Fortunately we can do better. Instead of a singly linked list, littlefs
617 uses a multilayered linked-list often called a
623 is divisible by 2‍_ˣ_, that block contains a pointer to block
629 lets us calculate the power-of-two factors efficiently. For a given block _n_,
642 efficiently than in a singly linked list.
644 Consider a path from data block 5 to data block 1. You can see how data block 3
665 the search space for the block in half, giving us a runtime of _O(log n)_.
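
As a rough sketch, lookup in such a skip-list might look like the following; `bd_read_ptr` is a hypothetical helper that reads the k-th pointer stored at the front of a block, and the names are illustrative rather than littlefs's actual API:

```c
#include <stdint.h>

// Hypothetical helper: read the k-th skip pointer stored at the front of a
// block. Stands in for a real block-device read.
extern uint32_t bd_read_ptr(uint32_t block, unsigned k);

// Walk from the head of a CTZ skip-list (which has index `current`) back to
// the block with index `target`, always taking the largest skip that doesn't
// overshoot. Block i holds pointers to blocks i-1, i-2, i-4, ..., i-2^ctz(i),
// so each hop can cut the remaining distance roughly in half, giving the
// O(log n) lookup described above.
uint32_t ctz_find(uint32_t head, uint32_t current, uint32_t target) {
    while (current > target) {
        unsigned skip = (unsigned)__builtin_ctz(current);  // largest pointer here
        while ((1u << skip) > current - target) {
            skip -= 1;  // don't skip past the target
        }

        head = bd_read_ptr(head, skip);
        current -= 1u << skip;
    }

    return head;  // block address of block `target`
}
```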
671 So now we have a [COW] data structure that is cheap to append with a runtime
672 of _O(1)_, and can be read with a worst case runtime of _O(n log n)_. Given
673 that this runtime is also divided by the amount of data we can store in a
678 This is a new data structure, so we still have several questions. What is the
679 storage overhead? Can the number of pointers exceed the size of a block? How do
680 we store a CTZ skip-list in our metadata pairs?
685 the previous. As we approach infinity, the storage overhead forms a geometric
692 also solve for the maximum number of pointers we would ever need to store in a
694 following equation. Note that both a smaller block size (![B][bigB]) and larger
705 littlefs uses a 32-bit word width, so our blocks can only overflow with
711 The last question is: how do we store CTZ skip-lists? We need a pointer to the
713 offset in the head block. But it's worth noting that each size maps to a unique
714 index + offset pair. So in theory we can store only a single pointer and size.
716 However, calculating the index + offset pair from the size is a bit
717 complicated. We can start with a summation that loops through all of the blocks
725 runtime of reading a file up to _O(n² log n)_. Fortunately, that summation
728 However, despite the integration of a bitwise operation, we can actually reduce
729 this equation to an _O(1)_ form. While browsing the amazing resource that is
733 surprise, these both result from simple equations, leading us to a rather
747 handle deviations from this average. Of course, just to make sure I wrote a
750 Now we can substitute into our original equation to find a more efficient
765 equation to find the offset. We run into a bit of a problem with integer
766 overflow, but we can avoid this by rearranging the equation a bit:
770 Our solution requires quite a bit of math, but computers are very good at math.
771 Now we can find both our block index and offset from a size in _O(1)_, letting
772 us store CTZ skip-lists with only a pointer and size.
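
Put together, a sketch of the index + offset computation might look like this, assuming a 32-bit word width (so each pointer costs 4 bytes) and the compiler's popcount builtin; the names are illustrative:

```c
#include <stdint.h>

// popcount, i.e. the number of 1 bits.
static uint32_t popc(uint32_t x) { return (uint32_t)__builtin_popcount(x); }

// Given a block size B and a file position N, find which block of the CTZ
// skip-list the position lands in (returned) and the byte offset inside that
// block (written to *off), in O(1). Assumes w = 32-bit pointers, so each
// pointer costs 4 bytes and 2w/8 = 8.
uint32_t ctz_index(uint32_t B, uint32_t N, uint32_t *off) {
    uint32_t b = B - 2*4;  // B - 2w/8
    uint32_t i = N / b;    // first guess at the block index
    if (i == 0) {
        *off = N;
        return 0;
    }

    // n = (N - w/8 (popcount(N/(B-2w/8) - 1) + 2)) / (B - 2w/8)
    i = (N - 4*(popc(i-1) + 2)) / b;

    // off = N - (B-2w/8) n - w/8 popcount(n)
    *off = N - b*i - 4*popc(i);
    return i;
}
```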
774 CTZ skip-lists give us a COW data structure that is easily traversable in
776 these operations work in a bounded amount of RAM and require only two words of
847 In filesystem design, block allocation is often a second-class citizen, but in
848 a COW filesystem its role becomes much more important as it is needed for
857 littlefs takes a cautious approach. Instead of trusting a free list on disk,
858 littlefs relies on the fact that the filesystem on disk is a mirror image of
859 the free blocks on the disk. The block allocator operates much like a garbage
860 collector in a scripting language, scanning for unused blocks on demand.
886 While this approach may sound complicated, the decision to not maintain a free
888 languages, there are only a handful of data structures we need to traverse.
890 is simply a noop. This "drop it on the floor" strategy greatly reduces the
901 Looking at existing designs, some larger filesystems that use a similar "drop
902 it on the floor" strategy store a bitmap of the entire storage in [RAM]. This
905 to modify the idea into a workable solution.
918 The block allocator in littlefs is a compromise between a disk-sized bitmap and
919 a brute force traversal. Instead of a bitmap the size of storage, we keep track
920 of a small, fixed-size bitmap called the lookahead buffer. During block
926 Here's what it might look like to allocate 4 blocks on a decently busy
927 filesystem with a 32 bit lookahead and a total of 128 blocks (512 KiB
952 This lookahead approach has a runtime complexity of _O(n²)_ to completely
961 The block allocator has a secondary role: wear leveling.
965 wear on a single block in the storage.
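
As a rough sketch of the lookahead buffer described above, the allocator might look something like this, assuming a hypothetical `fs_traverse` that visits every block reachable from the filesystem tree (and an `alloc_rescan(la, 0)` call at mount); none of these names are littlefs's actual API:

```c
#include <stdbool.h>
#include <stdint.h>

#define LOOKAHEAD_BITS 32

struct lookahead {
    uint32_t start;        // first block covered by the window
    uint32_t off;          // next bit to consider inside the window
    uint32_t bitmap;       // 1 = in use, 0 = free
    uint32_t block_count;  // total blocks on the device
};

// Mark a block as in-use if it falls inside the current window. Passed as a
// callback to fs_traverse below.
static void mark_in_use(struct lookahead *la, uint32_t block) {
    uint32_t rel = (block + la->block_count - la->start) % la->block_count;
    if (rel < LOOKAHEAD_BITS) {
        la->bitmap |= 1u << rel;
    }
}

// Hypothetical helper: walk every block reachable from the filesystem tree
// and hand each one to the callback.
extern void fs_traverse(struct lookahead *la,
                        void (*cb)(struct lookahead *, uint32_t));

// Rebuild the bitmap for the window starting at `start`; called at mount and
// whenever the window slides.
void alloc_rescan(struct lookahead *la, uint32_t start) {
    la->start = start;
    la->off = 0;
    la->bitmap = 0;
    fs_traverse(la, mark_in_use);
}

// Allocate one block, sliding the window (and rescanning) when the current
// one runs out. In the worst case the filesystem is walked once per window
// across the whole device, which is where the O(n^2) full-scan cost comes
// from.
bool alloc_block(struct lookahead *la, uint32_t *block) {
    for (uint32_t windows = 0;
            windows <= la->block_count/LOOKAHEAD_BITS + 1; windows++) {
        // hand out the next free bit in the current window, if any
        for (; la->off < LOOKAHEAD_BITS; la->off++) {
            if (!(la->bitmap & (1u << la->off))) {
                *block = (la->start + la->off) % la->block_count;
                la->off += 1;  // don't hand the same block out twice
                return true;
            }
        }

        // window exhausted: slide it forward and rescan
        alloc_rescan(la, (la->start + LOOKAHEAD_BITS) % la->block_count);
    }

    return false;  // every block is in use
}
```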
979 write to a block, we can read the data back and verify that it was written
981 RAM, a write error has occurred and we most likely have a bad block.
983 Once we detect a bad block, we need to recover from it. In the case of write
984 errors, we have a copy of the corrupted data in RAM, so all we need to do is
985 evict the bad block, allocate a new, hopefully good block, and repeat the write
988 The actual act of evicting the bad block and replacing it with a new block is
990 property of CObW data structures is that any block can be replaced during a
991 COW operation. The bounded-writes part is normally triggered by a counter, but
992 nothing prevents us from triggering a COW operation as soon as we find a bad
1173 cycle we'll eventually find a new block where a write succeeds. If we don't,
1182 Read errors, on the other hand, are quite a bit more complicated. We don't have
1183 a copy of the data lingering around in RAM, so we need a way to reconstruct the
1187 ECC is an extension to the idea of a checksum. Where a checksum such as CRC can
1189 correct some amount of errors. However, there is a limit to how many errors ECC
1204 at the block device level to modestly extend the life of a device. littlefs
1205 respects any errors reported by the block device, allowing a block device to
1213 One way to do this is to detect when the number of errors in a block exceeds
1215 time, and treat the error as a write error, evicting the block before fatal
1216 read errors have a chance to develop.
1233 As a tradeoff for code size and complexity, littlefs (currently) only provides
1234 dynamic wear leveling. This is a best effort solution. Wear is not distributed
1236 life of a device.
1238 On top of this, littlefs uses a statistical wear leveling algorithm. What this
1239 means is that we don’t actively track wear; instead, we rely on a uniform
1240 distribution of wear across storage to approximate a dynamic wear leveling
1241 algorithm. Despite the long name, this is actually a simplification of dynamic
1245 creates a uniform distribution in two parts. The easy part is when the device
1249 Instead, we start the allocator at a random offset every time we mount the
1251 pattern is also a uniform distribution.
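
A sketch of what this might look like, assuming the per-boot seed is accumulated from checksums the filesystem already computes while mounting (one convenient source of the variation discussed next); the names are illustrative:

```c
#include <stdint.h>

// Sketch: derive the allocator's starting offset from state that naturally
// differs between boots.
struct allocator {
    uint32_t seed;         // accumulated while mounting
    uint32_t start;        // block where scanning begins
    uint32_t block_count;
};

// Called for each metadata commit checksummed during mount; folding the CRCs
// (which cover things like revision counts) into a seed costs no extra RAM
// or storage.
void alloc_feed_seed(struct allocator *a, uint32_t commit_crc) {
    a->seed ^= commit_crc;
}

// Called once mount completes: start allocation at a pseudorandom offset so
// that, across many boots, allocations approximate a uniform distribution
// over the storage.
void alloc_reset(struct allocator *a) {
    a->start = a->seed % a->block_count;
}
```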
1255 Initially, this approach to wear leveling looks like it creates a difficult
1256 dependency on a power-independent random number generator, which must return
1257 different random numbers on each boot. However, the filesystem is in a
1258 relatively unique situation in that it is sitting on top of a large amount of
1299 Together, bad block detection and dynamic wear leveling provide a best effort
1300 solution for avoiding the early death of a filesystem due to wear. Importantly,
1301 littlefs's wear leveling algorithm provides a key feature: You can increase the
1302 life of a device simply by increasing the size of storage. And if more
1303 aggressive wear leveling is desired, you can always combine littlefs with a
1304 [flash translation layer (FTL)][wikipedia-ftl] to get a small power resilient
1310 our filesystem as a whole.
1316 a skip-list stored in a metadata pair that acts as an inode for the file.
1337 Consider a small 4-byte file. With a two-block metadata pair and one block for
1338 the CTZ skip-list, we find ourselves using a full 3 blocks. On most NOR flash
1378 metadata pair, we can store multiple files in a single metadata pair. One way
1379 to do this is to directly associate a directory with a metadata pair (or a
1426 have a ~4x storage cost, so if our file is smaller than 1/4 the block size,
1430 We call this an inline file, and it allows a directory to store many small
1431 files quite efficiently. Our previous 4-byte file now only takes up a
1470 Once the file exceeds 1/4 the block size, we switch to a CTZ skip-list. This
1479 a strict binding of directories and metadata pairs, but there are a few
1482 On their own, each directory is a linked-list of metadata pairs. This lets us
1485 directory pointers in our metadata pairs, which gives us a directory tree, much
1512 The main complication is, once again, traversal with a constant amount of
1513 [RAM]. The directory tree is a tree, and the unfortunate fact is you can't
1514 traverse a tree with constant RAM.
1518 thread a linked-list through our tree, explicitly enabling cheap traversal
1548 set off a bunch of red flags.
1550 To work around this, our threaded linked-list has a bit of leeway. Instead of
1552 contain metadata pairs that have no parent because of a power loss. These are
1556 that maintain a filesystem tree threaded with a linked-list for traversal.
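
As a rough sketch, traversal over the threaded linked-list might look like this, assuming each metadata pair stores a tail pointer to the next pair and an all-ones address marks the end of the thread; the names are illustrative, not littlefs's actual API:

```c
#include <stdbool.h>
#include <stdint.h>

// A metadata pair is addressed by its two blocks.
typedef struct { uint32_t blocks[2]; } mpair_t;

// Assumed sentinel: an all-ones pair marks the end of the thread.
static bool pair_isnull(mpair_t p) {
    return p.blocks[0] == 0xffffffff && p.blocks[1] == 0xffffffff;
}

// Hypothetical helper: fetch a metadata pair (picking the newer of its two
// blocks by revision count) and return the tail pointer stored in it.
extern mpair_t mpair_fetch_tail(mpair_t pair);

// Visit every metadata pair in the filesystem, starting from the superblock
// pair. Only the current pair lives in RAM, so the traversal cost stays at a
// few words of memory no matter how large the directory tree grows.
void fs_traverse_pairs(mpair_t superblock, void (*visit)(mpair_t)) {
    mpair_t pair = superblock;
    while (!pair_isnull(pair)) {
        visit(pair);
        pair = mpair_fetch_tail(pair);
    }
}
```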
1558 Adding a directory to our tree:
1635 Removing a directory:
1689 blocks in a metadata pair when the block goes bad or exceeds its allocated
1690 erases. If we lose power while evicting a metadata block, we may end up with
1691 a situation where the filesystem references the replacement block while the
1692 threaded linked-list still contains the evicted block. We call this a
1761 insert replacement in threaded linked-list, creating a half-orphan
1807 Finding orphans and half-orphans is expensive, requiring an _O(n²)_
1809 is a power resilient filesystem that works with only a bounded amount of RAM.
1811 boot, and a read-only littlefs can ignore the threaded linked-list entirely.
1813 If we only had some sort of global state, then we could also store a flag and
1821 How do you atomically move a file between two directories?
1825 through a minimum of two distinct states to complete a move.
1827 To make matters worse, file moves are a common form of synchronization for
1828 filesystems. As a filesystem designed for power-loss, it's important we get
1838 - Some filesystems propagate COW operations up the tree until a common parent
1842 - In a previous version of littlefs we tried to solve this problem by going
1846 a unique identifier for each file.
1848 In the end, solving the move problem required creating a new mechanism for
1850 introduction of a mechanism called "global state".
1854 Global state is a small set of state that can be updated from _any_ metadata
1856 entries in one commit gives us a powerful tool for crafting complex atomic
1861 Global state exists as a set of deltas that are distributed across the metadata
1876 To update the global state from a metadata pair, we take the global state we
1907 To make this efficient, we always keep a copy of the global state in RAM. We
1911 You may have noticed that global state is very expensive. We keep a copy in
1912 RAM and a delta in an unbounded number of metadata pairs. Even if we reset
1915 state bounded and extremely small. But, even with a strict budget, global
1996 describing an ongoing move, we know we lost power during a move and the file
2104 Global state gives us a powerful tool we can use to solve the move problem.
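
As a rough sketch, the xor-of-deltas mechanism might look like this, assuming a small fixed-size state blob; the names and sizes are illustrative, not littlefs's actual API:

```c
#include <stdint.h>
#include <string.h>

#define GSTATE_SIZE 12  // assumed size of the global state blob

typedef struct { uint8_t data[GSTATE_SIZE]; } gstate_t;

static void gstate_xor(gstate_t *a, const gstate_t *b) {
    for (int i = 0; i < GSTATE_SIZE; i++) {
        a->data[i] ^= b->data[i];
    }
}

// While mounting (or during any full traversal), xoring together every delta
// found in the metadata pairs reconstructs the current global state, which we
// then keep cached in RAM.
void gstate_build(gstate_t *out, const gstate_t *deltas, int ndeltas) {
    memset(out, 0, sizeof(*out));
    for (int i = 0; i < ndeltas; i++) {
        gstate_xor(out, &deltas[i]);
    }
}

// To change the global state atomically, commit (old xor new) as the delta in
// whatever metadata pair we were already writing; the same commit can also
// carry unrelated metadata updates, which is what lets a move change both the
// global state and a directory entry in one atomic step.
void gstate_delta_for_update(const gstate_t *old, const gstate_t *new_,
                             gstate_t *delta) {
    *delta = *old;
    gstate_xor(delta, new_);
}
```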
2106 of states and using the same number of commits as a naive move. Additionally,
2107 global state gives us a bit of persistent state we can use for some other