1## The design of littlefs
2
3A little fail-safe filesystem designed for microcontrollers.
4
5```
6   | | |     .---._____
7  .-----.   |          |
8--|o    |---| littlefs |
9--|     |---|          |
10  '-----'   '----------'
11   | | |
12```
13
14littlefs was originally built as an experiment to learn about filesystem design
15in the context of microcontrollers. The question was: How would you build a
16filesystem that is resilient to power-loss and flash wear without using
17unbounded memory?
18
19This document covers the high-level design of littlefs, how it is different
20than other filesystems, and the design decisions that got us here. For the
21low-level details covering every bit on disk, check out [SPEC.md](SPEC.md).
22
23## The problem
24
25The embedded systems littlefs targets are usually 32-bit microcontrollers with
26around 32 KiB of RAM and 512 KiB of ROM. These are often paired with SPI NOR
27flash chips with about 4 MiB of flash storage. These devices are too small for
28Linux and most existing filesystems, requiring code written specifically with
29size in mind.
30
31Flash itself is an interesting piece of technology with its own quirks and
32nuance. Unlike other forms of storage, writing to flash requires two
33operations: erasing and programming. Programming (setting bits to 0) is
34relatively cheap and can be very granular. Erasing however (setting bits to 1),
35requires an expensive and destructive operation which gives flash its name.
36[Wikipedia][wikipedia-flash] has more information on how exactly flash works.
37
38To make the situation more annoying, it's very common for these embedded
39systems to lose power at any time. Usually, microcontroller code is simple and
40reactive, with no concept of a shutdown routine. This presents a big challenge
41for persistent storage, where an unlucky power loss can corrupt the storage and
42leave a device unrecoverable.
43
44This leaves us with three major requirements for an embedded filesystem.
45
461. **Power-loss resilience** - On these systems, power can be lost at any time.
47   If a power loss corrupts any persistent data structures, this can cause the
48   device to become unrecoverable. An embedded filesystem must be designed to
49   recover from a power loss during any write operation.
50
511. **Wear leveling** - Writing to flash is destructive. If a filesystem
52   repeatedly writes to the same block, eventually that block will wear out.
53   Filesystems that don't take wear into account can easily burn through blocks
54   used to store frequently updated metadata and cause a device's early death.
55
561. **Bounded RAM/ROM** - If the above requirements weren't enough, these
57   systems also have very limited amounts of memory. This prevents many
58   existing filesystem designs, which can lean on relatively large amounts of
59   RAM to temporarily store filesystem metadata.
60
61   For ROM, this means we need to keep our design simple and reuse code paths
62   were possible. For RAM we have a stronger requirement, all RAM usage is
63   bounded. This means RAM usage does not grow as the filesystem changes in
64   size or number of files. This creates a unique challenge as even presumably
65   simple operations, such as traversing the filesystem, become surprisingly
66   difficult.
67
68## Existing designs?
69
70So, what's already out there? There are, of course, many different filesystems,
71however they often share and borrow feature from each other. If we look at
72power-loss resilience and wear leveling, we can narrow these down to a handful
73of designs.
74
751. First we have the non-resilient, block based filesystems, such as [FAT] and
76   [ext2]. These are the earliest filesystem designs and often the most simple.
77   Here storage is divided into blocks, with each file being stored in a
78   collection of blocks. Without modifications, these filesystems are not
79   power-loss resilient, so updating a file is a simple as rewriting the blocks
80   in place.
81
82   ```
83               .--------.
84               |  root  |
85               |        |
86               |        |
87               '--------'
88               .-'    '-.
89              v          v
90         .--------.  .--------.
91         |   A    |  |   B    |
92         |        |  |        |
93         |        |  |        |
94         '--------'  '--------'
95         .-'         .-'    '-.
96        v           v          v
97   .--------.  .--------.  .--------.
98   |   C    |  |   D    |  |   E    |
99   |        |  |        |  |        |
100   |        |  |        |  |        |
101   '--------'  '--------'  '--------'
102   ```
103
104   Because of their simplicity, these filesystems are usually both the fastest
105   and smallest. However the lack of power resilience is not great, and the
106   binding relationship of storage location and data removes the filesystem's
107   ability to manage wear.
108
1092. In a completely different direction, we have logging filesystems, such as
110   [JFFS], [YAFFS], and [SPIFFS], storage location is not bound to a piece of
111   data, instead the entire storage is used for a circular log which is
112   appended with every change made to the filesystem. Writing appends new
113   changes, while reading requires traversing the log to reconstruct a file.
114   Some logging filesystems cache files to avoid the read cost, but this comes
115   at a tradeoff of RAM.
116
117   ```
118                                                             v
119   .--------.--------.--------.--------.--------.--------.--------.--------.
120   |        C        | new B  | new A  |                 |   A    |   B    |
121   |                 |        |        |->               |        |        |
122   |                 |        |        |                 |        |        |
123   '--------'--------'--------'--------'--------'--------'--------'--------'
124   ```
125
126   Logging filesystem are beautifully elegant. With a checksum, we can easily
127   detect power-loss and fall back to the previous state by ignoring failed
128   appends. And if that wasn't good enough, their cyclic nature means that
129   logging filesystems distribute wear across storage perfectly.
130
131   The main downside is performance. If we look at garbage collection, the
132   process of cleaning up outdated data from the end of the log, I've yet to
133   see a pure logging filesystem that does not have one of these two costs:
134
135   1. _O(n²)_ runtime
136   2. _O(n)_ RAM
137
138   SPIFFS is a very interesting case here, as it uses the fact that repeated
139   programs to NOR flash is both atomic and masking. This is a very neat
140   solution, however it limits the type of storage you can support.
141
1423. Perhaps the most common type of filesystem, a journaling filesystem is the
143   offspring that happens when you mate a block based filesystem with a logging
144   filesystem. [ext4] and [NTFS] are good examples. Here, we take a normal
145   block based filesystem and add a bounded log where we note every change
146   before it occurs.
147
148   ```
149                             journal
150                            .--------.--------.
151               .--------.   | C'| D'|     | E'|
152               |  root  |-->|   |   |->   |   |
153               |        |   |   |   |     |   |
154               |        |   '--------'--------'
155               '--------'
156               .-'    '-.
157              v          v
158         .--------.  .--------.
159         |   A    |  |   B    |
160         |        |  |        |
161         |        |  |        |
162         '--------'  '--------'
163         .-'         .-'    '-.
164        v           v          v
165   .--------.  .--------.  .--------.
166   |   C    |  |   D    |  |   E    |
167   |        |  |        |  |        |
168   |        |  |        |  |        |
169   '--------'  '--------'  '--------'
170   ```
171
172
173   This sort of filesystem takes the best from both worlds. Performance can be
174   as fast as a block based filesystem (though updating the journal does have
175   a small cost), and atomic updates to the journal allow the filesystem to
176   recover in the event of a power loss.
177
178   Unfortunately, journaling filesystems have a couple of problems. They are
179   fairly complex, since there are effectively two filesystems running in
180   parallel, which comes with a code size cost. They also offer no protection
181   against wear because of the strong relationship between storage location
182   and data.
183
1844. Last but not least we have copy-on-write (COW) filesystems, such as
185   [btrfs] and [ZFS]. These are very similar to other block based filesystems,
186   but instead of updating block inplace, all updates are performed by creating
187   a copy with the changes and replacing any references to the old block with
188   our new block. This recursively pushes all of our problems upwards until we
189   reach the root of our filesystem, which is often stored in a very small log.
190
191   ```
192               .--------.                  .--------.
193               |  root  |       write      |new root|
194               |        |        ==>       |        |
195               |        |                  |        |
196               '--------'                  '--------'
197               .-'    '-.                    |    '-.
198              |  .-------|------------------'        v
199              v v        v                       .--------.
200         .--------.  .--------.                  | new B  |
201         |   A    |  |   B    |                  |        |
202         |        |  |        |                  |        |
203         |        |  |        |                  '--------'
204         '--------'  '--------'                  .-'    |
205         .-'         .-'    '-.    .------------|------'
206        |           |          |  |             v
207        v           v          v  v        .--------.
208   .--------.  .--------.  .--------.      | new D  |
209   |   C    |  |   D    |  |   E    |      |        |
210   |        |  |        |  |        |      |        |
211   |        |  |        |  |        |      '--------'
212   '--------'  '--------'  '--------'
213   ```
214
215   COW filesystems are interesting. They offer very similar performance to
216   block based filesystems while managing to pull off atomic updates without
217   storing data changes directly in a log. They even disassociate the storage
218   location of data, which creates an opportunity for wear leveling.
219
220   Well, almost. The unbounded upwards movement of updates causes some
221   problems. Because updates to a COW filesystem don't stop until they've
222   reached the root, an update can cascade into a larger set of writes than
223   would be needed for the original data. On top of this, the upward motion
224   focuses these writes into the block, which can wear out much earlier than
225   the rest of the filesystem.
226
227## littlefs
228
229So what does littlefs do?
230
231If we look at existing filesystems, there are two interesting design patterns
232that stand out, but each have their own set of problems. Logging, which
233provides independent atomicity, has poor runtime performance. And COW data
234structures, which perform well, push the atomicity problem upwards.
235
236Can we work around these limitations?
237
238Consider logging. It has either a _O(n²)_ runtime or _O(n)_ RAM cost. We
239can't avoid these costs, _but_ if we put an upper bound on the size we can at
240least prevent the theoretical cost from becoming problem. This relies on the
241super secret computer science hack where you can pretend any algorithmic
242complexity is _O(1)_ by bounding the input.
243
244In the case of COW data structures, we can try twisting the definition a bit.
245Let's say that our COW structure doesn't copy after a single write, but instead
246copies after _n_ writes. This doesn't change most COW properties (assuming you
247can write atomically!), but what it does do is prevent the upward motion of
248wear. This sort of copy-on-bounded-writes (CObW) still focuses wear, but at
249each level we divide the propagation of wear by _n_. With a sufficiently
250large _n_ (> branching factor) wear propagation is no longer a problem.
251
252See where this is going? Separate, logging and COW are imperfect solutions and
253have weaknesses that limit their usefulness. But if we merge the two they can
254mutually solve each other's limitations.
255
256This is the idea behind littlefs. At the sub-block level, littlefs is built
257out of small, two block logs that provide atomic updates to metadata anywhere
258on the filesystem. At the super-block level, littlefs is a CObW tree of blocks
259that can be evicted on demand.
260
261```
262                    root
263                   .--------.--------.
264                   | A'| B'|         |
265                   |   |   |->       |
266                   |   |   |         |
267                   '--------'--------'
268                .----'   '--------------.
269       A       v                 B       v
270      .--------.--------.       .--------.--------.
271      | C'| D'|         |       | E'|new|         |
272      |   |   |->       |       |   | E'|->       |
273      |   |   |         |       |   |   |         |
274      '--------'--------'       '--------'--------'
275      .-'   '--.                  |   '------------------.
276     v          v              .-'                        v
277.--------.  .--------.        v                       .--------.
278|   C    |  |   D    |   .--------.       write       | new E  |
279|        |  |        |   |   E    |        ==>        |        |
280|        |  |        |   |        |                   |        |
281'--------'  '--------'   |        |                   '--------'
282                         '--------'                   .-'    |
283                         .-'    '-.    .-------------|------'
284                        v          v  v              v
285                   .--------.  .--------.       .--------.
286                   |   F    |  |   G    |       | new F  |
287                   |        |  |        |       |        |
288                   |        |  |        |       |        |
289                   '--------'  '--------'       '--------'
290```
291
292There are still some minor issues. Small logs can be expensive in terms of
293storage, in the worst case a small log costs 4x the size of the original data.
294CObW structures require an efficient block allocator since allocation occurs
295every _n_ writes. And there is still the challenge of keeping the RAM usage
296constant.
297
298## Metadata pairs
299
300Metadata pairs are the backbone of littlefs. These are small, two block logs
301that allow atomic updates anywhere in the filesystem.
302
303Why two blocks? Well, logs work by appending entries to a circular buffer
304stored on disk. But remember that flash has limited write granularity. We can
305incrementally program new data onto erased blocks, but we need to erase a full
306block at a time. This means that in order for our circular buffer to work, we
307need more than one block.
308
309We could make our logs larger than two blocks, but the next challenge is how
310do we store references to these logs? Because the blocks themselves are erased
311during writes, using a data structure to track these blocks is complicated.
312The simple solution here is to store a two block addresses for every metadata
313pair. This has the added advantage that we can change out blocks in the
314metadata pair independently, and we don't reduce our block granularity for
315other operations.
316
317In order to determine which metadata block is the most recent, we store a
318revision count that we compare using [sequence arithmetic][wikipedia-sna]
319(very handy for avoiding problems with integer overflow). Conveniently, this
320revision count also gives us a rough idea of how many erases have occurred on
321the block.
322
323```
324metadata pair pointer: {block 0, block 1}
325                           |        '--------------------.
326                            '-.                           |
327disk                           v                          v
328.--------.--------.--------.--------.--------.--------.--------.--------.
329|                 |        |metadata|                 |metadata|        |
330|                 |        |block 0 |                 |block 1 |        |
331|                 |        |        |                 |        |        |
332'--------'--------'--------'--------'--------'--------'--------'--------'
333                               '--.                  .----'
334                                   v                v
335             metadata pair .----------------.----------------.
336                           |   revision 11  |   revision 12  |
337             block 1 is    |----------------|----------------|
338             most recent   |       A        |       A''      |
339                           |----------------|----------------|
340                           |    checksum    |    checksum    |
341                           |----------------|----------------|
342                           |       B        |       A'''     | <- most recent A
343                           |----------------|----------------|
344                           |       A''      |    checksum    |
345                           |----------------|----------------|
346                           |    checksum    |       |        |
347                           |----------------|       v        |
348                           '----------------'----------------'
349```
350
351So how do we atomically update our metadata pairs? Atomicity (a type of
352power-loss resilience) requires two parts: redundancy and error detection.
353Error detection can be provided with a checksum, and in littlefs's case we
354use a 32-bit [CRC][wikipedia-crc]. Maintaining redundancy, on the other hand,
355requires multiple stages.
356
3571. If our block is not full and the program size is small enough to let us
358   append more entries, we can simply append the entries to the log. Because
359   we don't overwrite the original entries (remember rewriting flash requires
360   an erase), we still have the original entries if we lose power during the
361   append.
362
363   ```
364                                    commit A
365   .----------------.----------------.    .----------------.----------------.
366   |   revision 1   |   revision 0   | => |   revision 1   |   revision 0   |
367   |----------------|----------------|    |----------------|----------------|
368   |       |        |                |    |       A        |                |
369   |       v        |                |    |----------------|                |
370   |                |                |    |    checksum    |                |
371   |                |                |    |----------------|                |
372   |                |                |    |       |        |                |
373   |                |                |    |       v        |                |
374   |                |                |    |                |                |
375   |                |                |    |                |                |
376   |                |                |    |                |                |
377   |                |                |    |                |                |
378   '----------------'----------------'    '----------------'----------------'
379   ```
380
381   Note that littlefs doesn't maintain a checksum for each entry. Many logging
382   filesystems do this, but it limits what you can update in a single atomic
383   operation. What we can do instead is group multiple entries into a commit
384   that shares a single checksum. This lets us update multiple unrelated pieces
385   of metadata as long as they reside on the same metadata pair.
386
387   ```
388                                 commit B and A'
389   .----------------.----------------.    .----------------.----------------.
390   |   revision 1   |   revision 0   | => |   revision 1   |   revision 0   |
391   |----------------|----------------|    |----------------|----------------|
392   |       A        |                |    |       A        |                |
393   |----------------|                |    |----------------|                |
394   |    checksum    |                |    |    checksum    |                |
395   |----------------|                |    |----------------|                |
396   |       |        |                |    |       B        |                |
397   |       v        |                |    |----------------|                |
398   |                |                |    |       A'       |                |
399   |                |                |    |----------------|                |
400   |                |                |    |    checksum    |                |
401   |                |                |    |----------------|                |
402   '----------------'----------------'    '----------------'----------------'
403   ```
404
4052. If our block _is_ full of entries, we need to somehow remove outdated
406   entries to make space for new ones. This process is called garbage
407   collection, but because littlefs has multiple garbage collectors, we
408   also call this specific case compaction.
409
410   Compared to other filesystems, littlefs's garbage collector is relatively
411   simple. We want to avoid RAM consumption, so we use a sort of brute force
412   solution where for each entry we check to see if a newer entry has been
413   written. If the entry is the most recent we append it to our new block. This
414   is where having two blocks becomes important, if we lose power we still have
415   everything in our original block.
416
417   During this compaction step we also erase the metadata block and increment
418   the revision count. Because we can commit multiple entries at once, we can
419   write all of these changes to the second block without worrying about power
420   loss. It's only when the commit's checksum is written that the compacted
421   entries and revision count become committed and readable.
422
423   ```
424                           commit B', need to compact
425   .----------------.----------------.    .----------------.----------------.
426   |   revision 1   |   revision 0   | => |   revision 1   |   revision 2   |
427   |----------------|----------------|    |----------------|----------------|
428   |       A        |                |    |       A        |       A'       |
429   |----------------|                |    |----------------|----------------|
430   |    checksum    |                |    |    checksum    |       B'       |
431   |----------------|                |    |----------------|----------------|
432   |       B        |                |    |       B        |    checksum    |
433   |----------------|                |    |----------------|----------------|
434   |       A'       |                |    |       A'       |       |        |
435   |----------------|                |    |----------------|       v        |
436   |    checksum    |                |    |    checksum    |                |
437   |----------------|                |    |----------------|                |
438   '----------------'----------------'    '----------------'----------------'
439   ```
440
4413. If our block is full of entries _and_ we can't find any garbage, then what?
442   At this point, most logging filesystems would return an error indicating no
443   more space is available, but because we have small logs, overflowing a log
444   isn't really an error condition.
445
446   Instead, we split our original metadata pair into two metadata pairs, each
447   containing half of the entries, connected by a tail pointer. Instead of
448   increasing the size of the log and dealing with the scalability issues
449   associated with larger logs, we form a linked list of small bounded logs.
450   This is a tradeoff as this approach does use more storage space, but at the
451   benefit of improved scalability.
452
453   Despite writing to two metadata pairs, we can still maintain power
454   resilience during this split step by first preparing the new metadata pair,
455   and then inserting the tail pointer during the commit to the original
456   metadata pair.
457
458   ```
459                         commit C and D, need to split
460   .----------------.----------------.    .----------------.----------------.
461   |   revision 1   |   revision 2   | => |   revision 3   |   revision 2   |
462   |----------------|----------------|    |----------------|----------------|
463   |       A        |       A'       |    |       A'       |       A'       |
464   |----------------|----------------|    |----------------|----------------|
465   |    checksum    |       B'       |    |       B'       |       B'       |
466   |----------------|----------------|    |----------------|----------------|
467   |       B        |    checksum    |    |      tail    ---------------------.
468   |----------------|----------------|    |----------------|----------------| |
469   |       A'       |       |        |    |    checksum    |                | |
470   |----------------|       v        |    |----------------|                | |
471   |    checksum    |                |    |       |        |                | |
472   |----------------|                |    |       v        |                | |
473   '----------------'----------------'    '----------------'----------------' |
474                                                   .----------------.---------'
475                                                  v                v
476                                          .----------------.----------------.
477                                          |   revision 1   |   revision 0   |
478                                          |----------------|----------------|
479                                          |       C        |                |
480                                          |----------------|                |
481                                          |       D        |                |
482                                          |----------------|                |
483                                          |    checksum    |                |
484                                          |----------------|                |
485                                          |       |        |                |
486                                          |       v        |                |
487                                          |                |                |
488                                          |                |                |
489                                          '----------------'----------------'
490   ```
491
492There is another complexity the crops up when dealing with small logs. The
493amortized runtime cost of garbage collection is not only dependent on its
494one time cost (_O(n&sup2;)_ for littlefs), but also depends on how often
495garbage collection occurs.
496
497Consider two extremes:
498
4991. Log is empty, garbage collection occurs once every _n_ updates
5002. Log is full, garbage collection occurs **every** update
501
502Clearly we need to be more aggressive than waiting for our metadata pair to
503be full. As the metadata pair approaches fullness the frequency of compactions
504grows very rapidly.
505
506Looking at the problem generically, consider a log with ![n] bytes for each
507entry, ![d] dynamic entries (entries that are outdated during garbage
508collection), and ![s] static entries (entries that need to be copied during
509garbage collection). If we look at the amortized runtime complexity of updating
510this log we get this formula:
511
512![cost = n + n (s / d+1)][metadata-formula1]
513
514If we let ![r] be the ratio of static space to the size of our log in bytes, we
515find an alternative representation of the number of static and dynamic entries:
516
517![s = r (size/n)][metadata-formula2]
518
519![d = (1 - r) (size/n)][metadata-formula3]
520
521Substituting these in for ![d] and ![s] gives us a nice formula for the cost of
522updating an entry given how full the log is:
523
524![cost = n + n (r (size/n) / ((1-r) (size/n) + 1))][metadata-formula4]
525
526Assuming 100 byte entries in a 4 KiB log, we can graph this using the entry
527size to find a multiplicative cost:
528
529![Metadata pair update cost graph][metadata-cost-graph]
530
531So at 50% usage, we're seeing an average of 2x cost per update, and at 75%
532usage, we're already at an average of 4x cost per update.
533
534To avoid this exponential growth, instead of waiting for our metadata pair
535to be full, we split the metadata pair once we exceed 50% capacity. We do this
536lazily, waiting until we need to compact before checking if we fit in our 50%
537limit. This limits the overhead of garbage collection to 2x the runtime cost,
538giving us an amortized runtime complexity of _O(1)_.
539
540---
541
542If we look at metadata pairs and linked-lists of metadata pairs at a high
543level, they have fairly nice runtime costs. Assuming _n_ metadata pairs,
544each containing _m_ metadata entries, the _lookup_ cost for a specific
545entry has a worst case runtime complexity of _O(nm)_. For _updating_ a specific
546entry, the worst case complexity is _O(nm&sup2;)_, with an amortized complexity
547of only _O(nm)_.
548
549However, splitting at 50% capacity does mean that in the best case our
550metadata pairs will only be 1/2 full. If we include the overhead of the second
551block in our metadata pair, each metadata entry has an effective storage cost
552of 4x the original size. I imagine users would not be happy if they found
553that they can only use a quarter of their original storage. Metadata pairs
554provide a mechanism for performing atomic updates, but we need a separate
555mechanism for storing the bulk of our data.
556
557## CTZ skip-lists
558
559Metadata pairs provide efficient atomic updates but unfortunately have a large
560storage cost. But we can work around this storage cost by only using the
561metadata pairs to store references to more dense, copy-on-write (COW) data
562structures.
563
564[Copy-on-write data structures][wikipedia-cow], also called purely functional
565data structures, are a category of data structures where the underlying
566elements are immutable.  Making changes to the data requires creating new
567elements containing a copy of the updated data and replacing any references
568with references to the new elements. Generally, the performance of a COW data
569structure depends on how many old elements can be reused after replacing parts
570of the data.
571
572littlefs has several requirements of its COW structures. They need to be
573efficient to read and write, but most frustrating, they need to be traversable
574with a constant amount of RAM. Notably this rules out
575[B-trees][wikipedia-B-tree], which can not be traversed with constant RAM, and
576[B+-trees][wikipedia-B+-tree], which are not possible to update with COW
577operations.
578
579---
580
581So, what can we do? First let's consider storing files in a simple COW
582linked-list. Appending a block, which is the basis for writing files, means we
583have to update the last block to point to our new block. This requires a COW
584operation, which means we need to update the second-to-last block, and then the
585third-to-last, and so on until we've copied out the entire file.
586
587```
588A linked-list
589.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
590| data 0 |->| data 1 |->| data 2 |->| data 4 |->| data 5 |->| data 6 |
591|        |  |        |  |        |  |        |  |        |  |        |
592|        |  |        |  |        |  |        |  |        |  |        |
593'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
594```
595
596To avoid a full copy during appends, we can store the data backwards. Appending
597blocks just requires adding the new block and no other blocks need to be
598updated. If we update a block in the middle, we still need to copy the
599following blocks, but can reuse any blocks before it. Since most file writes
600are linear, this design gambles that appends are the most common type of data
601update.
602
603```
604A backwards linked-list
605.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
606| data 0 |<-| data 1 |<-| data 2 |<-| data 4 |<-| data 5 |<-| data 6 |
607|        |  |        |  |        |  |        |  |        |  |        |
608|        |  |        |  |        |  |        |  |        |  |        |
609'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
610```
611
612However, a backwards linked-list does have a rather glaring problem. Iterating
613over a file _in order_ has a runtime cost of _O(n&sup2;)_. A quadratic runtime
614just to read a file! That's awful.
615
616Fortunately we can do better. Instead of a singly linked list, littlefs
617uses a multilayered linked-list often called a
618[skip-list][wikipedia-skip-list]. However, unlike the most common type of
619skip-list, littlefs's skip-lists are strictly deterministic built around some
620interesting properties of the count-trailing-zeros (CTZ) instruction.
621
622The rules CTZ skip-lists follow are that for every _n_&zwj;th block where _n_
623is divisible by 2&zwj;_&#739;_, that block contains a pointer to block
624_n_-2&zwj;_&#739;_. This means that each block contains anywhere from 1 to
625log&#8322;_n_ pointers that skip to different preceding elements of the
626skip-list.
627
628The name comes from heavy use of the [CTZ instruction][wikipedia-ctz], which
629lets us calculate the power-of-two factors efficiently. For a give block _n_,
630that block contains ctz(_n_)+1 pointers.
631
632```
633A backwards CTZ skip-list
634.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
635| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
636|        |<-|        |--|        |<-|        |--|        |  |        |
637|        |<-|        |--|        |--|        |--|        |  |        |
638'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
639```
640
641The additional pointers let us navigate the data-structure on disk much more
642efficiently than in a singly linked list.
643
644Consider a path from data block 5 to data block 1. You can see how data block 3
645was completely skipped:
646```
647.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
648| data 0 |  | data 1 |<-| data 2 |  | data 3 |  | data 4 |<-| data 5 |
649|        |  |        |  |        |<-|        |--|        |  |        |
650|        |  |        |  |        |  |        |  |        |  |        |
651'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
652```
653
654The path to data block 0 is even faster, requiring only two jumps:
655```
656.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
657| data 0 |  | data 1 |  | data 2 |  | data 3 |  | data 4 |<-| data 5 |
658|        |  |        |  |        |  |        |  |        |  |        |
659|        |<-|        |--|        |--|        |--|        |  |        |
660'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
661```
662
663We can find the runtime complexity by looking at the path to any block from
664the block containing the most pointers. Every step along the path divides
665the search space for the block in half, giving us a runtime of _O(log n)_.
666To get _to_ the block with the most pointers, we can perform the same steps
667backwards, which puts the runtime at _O(2 log n)_ = _O(log n)_. An interesting
668note is that this optimal path occurs naturally if we greedily choose the
669pointer that covers the most distance without passing our target.
670
671So now we have a [COW] data structure that is cheap to append with a runtime
672of _O(1)_, and can be read with a worst case runtime of _O(n log n)_. Given
673that this runtime is also divided by the amount of data we can store in a
674block, this cost is fairly reasonable.
675
676---
677
678This is a new data structure, so we still have several questions. What is the
679storage overhead? Can the number of pointers exceed the size of a block? How do
680we store a CTZ skip-list in our metadata pairs?
681
682To find the storage overhead, we can look at the data structure as multiple
683linked-lists. Each linked-list skips twice as many blocks as the previous,
684or from another perspective, each linked-list uses half as much storage as
685the previous. As we approach infinity, the storage overhead forms a geometric
686series. Solving this tells us that on average our storage overhead is only
6872 pointers per block.
688
689![lim,n->inf((1/n)sum,i,0->n(ctz(i)+1)) = sum,i,0->inf(1/2^i) = 2][ctz-formula1]
690
691Because our file size is limited the word width we use to store sizes, we can
692also solve for the maximum number of pointers we would ever need to store in a
693block. If we set the overhead of pointers equal to the block size, we get the
694following equation. Note that both a smaller block size (![B][bigB]) and larger
695word width (![w]) result in more storage overhead.
696
697![B = (w/8)ceil(log2(2^w / (B-2w/8)))][ctz-formula2]
698
699Solving the equation for ![B][bigB] gives us the minimum block size for some
700common word widths:
701
7021. 32-bit CTZ skip-list => minimum block size of 104 bytes
7032. 64-bit CTZ skip-list => minimum block size of 448 bytes
704
705littlefs uses a 32-bit word width, so our blocks can only overflow with
706pointers if they are smaller than 104 bytes. This is an easy requirement, as
707in practice, most block sizes start at 512 bytes. As long as our block size
708is larger than 104 bytes, we can avoid the extra logic needed to handle
709pointer overflow.
710
711This last question is how do we store CTZ skip-lists? We need a pointer to the
712head block, the size of the skip-list, the index of the head block, and our
713offset in the head block. But it's worth noting that each size maps to a unique
714index + offset pair. So in theory we can store only a single pointer and size.
715
716However, calculating the index + offset pair from the size is a bit
717complicated. We can start with a summation that loops through all of the blocks
718up until our given size. Let ![B][bigB] be the block size in bytes, ![w] be the
719word width in bits, ![n] be the index of the block in the skip-list, and
720![N][bigN] be the file size in bytes:
721
722![N = sum,i,0->n(B-(w/8)(ctz(i)+1))][ctz-formula3]
723
724This works quite well, but requires _O(n)_ to compute, which brings the full
725runtime of reading a file up to _O(n&sup2; log n)_. Fortunately, that summation
726doesn't need to touch the disk, so the practical impact is minimal.
727
728However, despite the integration of a bitwise operation, we can actually reduce
729this equation to a _O(1)_ form.  While browsing the amazing resource that is
730the [On-Line Encyclopedia of Integer Sequences (OEIS)][oeis], I managed to find
731[A001511], which matches the iteration of the CTZ instruction,
732and [A005187], which matches its partial summation. Much to my
733surprise, these both result from simple equations, leading us to a rather
734unintuitive property that ties together two seemingly unrelated bitwise
735instructions:
736
737![sum,i,0->n(ctz(i)+1) = 2n-popcount(n)][ctz-formula4]
738
739where:
740
7411. ctz(![x]) = the number of trailing bits that are 0 in ![x]
7422. popcount(![x]) = the number of bits that are 1 in ![x]
743
744Initial tests of this surprising property seem to hold. As ![n] approaches
745infinity, we end up with an average overhead of 2 pointers, which matches our
746assumption from earlier. During iteration, the popcount function seems to
747handle deviations from this average. Of course, just to make sure I wrote a
748quick script that verified this property for all 32-bit integers.
749
750Now we can substitute into our original equation to find a more efficient
751equation for file size:
752
753![N = Bn - (w/8)(2n-popcount(n))][ctz-formula5]
754
755Unfortunately, the popcount function is non-injective, so we can't solve this
756equation for our index. But what we can do is solve for an ![n'] index that
757is greater than ![n] with error bounded by the range of the popcount function.
758We can repeatedly substitute ![n'] into the original equation until the error
759is smaller than our integer resolution. As it turns out, we only need to
760perform this substitution once, which gives us this formula for our index:
761
762![n = floor((N-(w/8)popcount(N/(B-2w/8))) / (B-2w/8))][ctz-formula6]
763
764Now that we have our index ![n], we can just plug it back into the above
765equation to find the offset. We run into a bit of a problem with integer
766overflow, but we can avoid this by rearranging the equation a bit:
767
768![off = N - (B-2w/8)n - (w/8)popcount(n)][ctz-formula7]
769
770Our solution requires quite a bit of math, but computers are very good at math.
771Now we can find both our block index and offset from a size in _O(1)_, letting
772us store CTZ skip-lists with only a pointer and size.
773
774CTZ skip-lists give us a COW data structure that is easily traversable in
775_O(n)_, can be appended in _O(1)_, and can be read in _O(n log n)_. All of
776these operations work in a bounded amount of RAM and require only two words of
777storage overhead per block. In combination with metadata pairs, CTZ skip-lists
778provide power resilience and compact storage of data.
779
780```
781                                    .--------.
782                                   .|metadata|
783                                   ||        |
784                                   ||        |
785                                   |'--------'
786                                   '----|---'
787                                        v
788.--------.  .--------.  .--------.  .--------.
789| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |
790|        |<-|        |--|        |  |        |
791|        |  |        |  |        |  |        |
792'--------'  '--------'  '--------'  '--------'
793
794write data to disk, create copies
795=>
796                                    .--------.
797                                   .|metadata|
798                                   ||        |
799                                   ||        |
800                                   |'--------'
801                                   '----|---'
802                                        v
803.--------.  .--------.  .--------.  .--------.
804| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |
805|        |<-|        |--|        |  |        |
806|        |  |        |  |        |  |        |
807'--------'  '--------'  '--------'  '--------'
808     ^ ^           ^
809     | |           |    .--------.  .--------.  .--------.  .--------.
810     | |           '----| new    |<-| new    |<-| new    |<-| new    |
811     | '----------------| data 2 |<-| data 3 |--| data 4 |  | data 5 |
812     '------------------|        |--|        |--|        |  |        |
813                        '--------'  '--------'  '--------'  '--------'
814
815commit to metadata pair
816=>
817                                                            .--------.
818                                                           .|new     |
819                                                           ||metadata|
820                                                           ||        |
821                                                           |'--------'
822                                                           '----|---'
823                                                                |
824.--------.  .--------.  .--------.  .--------.                  |
825| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |                  |
826|        |<-|        |--|        |  |        |                  |
827|        |  |        |  |        |  |        |                  |
828'--------'  '--------'  '--------'  '--------'                  |
829     ^ ^           ^                                            v
830     | |           |    .--------.  .--------.  .--------.  .--------.
831     | |           '----| new    |<-| new    |<-| new    |<-| new    |
832     | '----------------| data 2 |<-| data 3 |--| data 4 |  | data 5 |
833     '------------------|        |--|        |--|        |  |        |
834                        '--------'  '--------'  '--------'  '--------'
835```
836
837## The block allocator
838
839So we now have the framework for an atomic, wear leveling filesystem. Small two
840block metadata pairs provide atomic updates, while CTZ skip-lists provide
841compact storage of data in COW blocks.
842
843But now we need to look at the [elephant] in the room. Where do all these
844blocks come from?
845
846Deciding which block to use next is the responsibility of the block allocator.
847In filesystem design, block allocation is often a second-class citizen, but in
848a COW filesystem its role becomes much more important as it is needed for
849nearly every write to the filesystem.
850
851Normally, block allocation involves some sort of free list or bitmap stored on
852the filesystem that is updated with free blocks. However, with power
853resilience, keeping these structures consistent becomes difficult. It doesn't
854help that any mistake in updating these structures can result in lost blocks
855that are impossible to recover.
856
857littlefs takes a cautious approach. Instead of trusting a free list on disk,
858littlefs relies on the fact that the filesystem on disk is a mirror image of
859the free blocks on the disk. The block allocator operates much like a garbage
860collector in a scripting language, scanning for unused blocks on demand.
861
862```
863          .----.
864          |root|
865          |    |
866          '----'
867   v-------'  '-------v
868.----.    .    .    .----.
869| A  |    .    .    | B  |
870|    |    .    .    |    |
871'----'    .    .    '----'
872.    .    .    .  v--'  '------------v---------v
873.    .    .    .----.    .         .----.    .----.
874.    .    .    | C  |    .         | D  |    | E  |
875.    .    .    |    |    .         |    |    |    |
876.    .    .    '----'    .         '----'    '----'
877.    .    .    .    .    .         .    .    .    .
878.----.----.----.----.----.----.----.----.----.----.----.----.
879| A  |    |root| C  | B  |         | D  |    | E  |         |
880|    |    |    |    |    |         |    |    |    |         |
881'----'----'----'----'----'----'----'----'----'----'----'----'
882        ^                   ^    ^                   ^    ^
883         '-------------------'----'-------------------'----'-- free blocks
884```
885
886While this approach may sound complicated, the decision to not maintain a free
887list greatly simplifies the overall design of littlefs. Unlike programming
888languages, there are only a handful of data structures we need to traverse.
889And block deallocation, which occurs nearly as often as block allocation,
890is simply a noop. This "drop it on the floor" strategy greatly reduces the
891complexity of managing on disk data structures, especially when handling
892high-risk error conditions.
893
894---
895
896Our block allocator needs to find free blocks efficiently. You could traverse
897through every block on storage and check each one against our filesystem tree;
898however, the runtime would be abhorrent. We need to somehow collect multiple
899blocks per traversal.
900
901Looking at existing designs, some larger filesystems that use a similar "drop
902it on the floor" strategy store a bitmap of the entire storage in [RAM]. This
903works well because bitmaps are surprisingly compact. We can't use the same
904strategy here, as it violates our constant RAM requirement, but we may be able
905to modify the idea into a workable solution.
906
907```
908.----.----.----.----.----.----.----.----.----.----.----.----.
909| A  |    |root| C  | B  |         | D  |    | E  |         |
910|    |    |    |    |    |         |    |    |    |         |
911'----'----'----'----'----'----'----'----'----'----'----'----'
912  1    0    1    1    1    0    0    1    0    1    0    0
913 \---------------------------+----------------------------/
914                             v
915               bitmap: 0xb94 (0b101110010100)
916```
917
918The block allocator in littlefs is a compromise between a disk-sized bitmap and
919a brute force traversal. Instead of a bitmap the size of storage, we keep track
920of a small, fixed-size bitmap called the lookahead buffer. During block
921allocation, we take blocks from the lookahead buffer. If the lookahead buffer
922is empty, we scan the filesystem for more free blocks, populating our lookahead
923buffer. In each scan we use an increasing offset, circling the storage as
924blocks are allocated.
925
926Here's what it might look like to allocate 4 blocks on a decently busy
927filesystem with a 32 bit lookahead and a total of 128 blocks (512 KiB
928of storage if blocks are 4 KiB):
929```
930boot...         lookahead:
931                fs blocks: fffff9fffffffffeffffffffffff0000
932scanning...     lookahead: fffff9ff
933                fs blocks: fffff9fffffffffeffffffffffff0000
934alloc = 21      lookahead: fffffdff
935                fs blocks: fffffdfffffffffeffffffffffff0000
936alloc = 22      lookahead: ffffffff
937                fs blocks: fffffffffffffffeffffffffffff0000
938scanning...     lookahead:         fffffffe
939                fs blocks: fffffffffffffffeffffffffffff0000
940alloc = 63      lookahead:         ffffffff
941                fs blocks: ffffffffffffffffffffffffffff0000
942scanning...     lookahead:         ffffffff
943                fs blocks: ffffffffffffffffffffffffffff0000
944scanning...     lookahead:                 ffffffff
945                fs blocks: ffffffffffffffffffffffffffff0000
946scanning...     lookahead:                         ffff0000
947                fs blocks: ffffffffffffffffffffffffffff0000
948alloc = 112     lookahead:                         ffff8000
949                fs blocks: ffffffffffffffffffffffffffff8000
950```
951
952This lookahead approach has a runtime complexity of _O(n&sup2;)_ to completely
953scan storage; however, bitmaps are surprisingly compact, and in practice only
954one or two passes are usually needed to find free blocks. Additionally, the
955performance of the allocator can be optimized by adjusting the block size or
956size of the lookahead buffer, trading either write granularity or RAM for
957allocator performance.
958
959## Wear leveling
960
961The block allocator has a secondary role: wear leveling.
962
963Wear leveling is the process of distributing wear across all blocks in the
964storage to prevent the filesystem from experiencing an early death due to
965wear on a single block in the storage.
966
967littlefs has two methods of protecting against wear:
9681. Detection and recovery from bad blocks
9692. Evenly distributing wear across dynamic blocks
970
971---
972
973Recovery from bad blocks doesn't actually have anything to do with the block
974allocator itself. Instead, it relies on the ability of the filesystem to detect
975and evict bad blocks when they occur.
976
977In littlefs, it is fairly straightforward to detect bad blocks at write time.
978All writes must be sourced by some form of data in RAM, so immediately after we
979write to a block, we can read the data back and verify that it was written
980correctly. If we find that the data on disk does not match the copy we have in
981RAM, a write error has occurred and we most likely have a bad block.
982
983Once we detect a bad block, we need to recover from it. In the case of write
984errors, we have a copy of the corrupted data in RAM, so all we need to do is
985evict the bad block, allocate a new, hopefully good block, and repeat the write
986that previously failed.
987
988The actual act of evicting the bad block and replacing it with a new block is
989left up to the filesystem's copy-on-bounded-writes (CObW) data structures. One
990property of CObW data structures is that any block can be replaced during a
991COW operation. The bounded-writes part is normally triggered by a counter, but
992nothing prevents us from triggering a COW operation as soon as we find a bad
993block.
994
995```
996     .----.
997     |root|
998     |    |
999     '----'
1000   v--'  '----------------------v
1001.----.                        .----.
1002| A  |                        | B  |
1003|    |                        |    |
1004'----'                        '----'
1005.    .                      v---'  .
1006.    .                   .----.    .
1007.    .                   | C  |    .
1008.    .                   |    |    .
1009.    .                   '----'    .
1010.    .                   .    .    .
1011.----.----.----.----.----.----.----.----.----.----.
1012| A  |root|              | C  | B  |              |
1013|    |    |              |    |    |              |
1014'----'----'----'----'----'----'----'----'----'----'
1015
1016update C
1017=>
1018     .----.
1019     |root|
1020     |    |
1021     '----'
1022   v--'  '----------------------v
1023.----.                        .----.
1024| A  |                        | B  |
1025|    |                        |    |
1026'----'                        '----'
1027.    .                      v---'  .
1028.    .                   .----.    .
1029.    .                   |bad |    .
1030.    .                   |blck|    .
1031.    .                   '----'    .
1032.    .                   .    .    .
1033.----.----.----.----.----.----.----.----.----.----.
1034| A  |root|              |bad | B  |              |
1035|    |    |              |blck|    |              |
1036'----'----'----'----'----'----'----'----'----'----'
1037
1038oh no! bad block! relocate C
1039=>
1040     .----.
1041     |root|
1042     |    |
1043     '----'
1044   v--'  '----------------------v
1045.----.                        .----.
1046| A  |                        | B  |
1047|    |                        |    |
1048'----'                        '----'
1049.    .                      v---'  .
1050.    .                   .----.    .
1051.    .                   |bad |    .
1052.    .                   |blck|    .
1053.    .                   '----'    .
1054.    .                   .    .    .
1055.----.----.----.----.----.----.----.----.----.----.
1056| A  |root|              |bad | B  |bad |         |
1057|    |    |              |blck|    |blck|         |
1058'----'----'----'----'----'----'----'----'----'----'
1059                            --------->
1060oh no! bad block! relocate C
1061=>
1062     .----.
1063     |root|
1064     |    |
1065     '----'
1066   v--'  '----------------------v
1067.----.                        .----.
1068| A  |                        | B  |
1069|    |                        |    |
1070'----'                        '----'
1071.    .                      v---'  .
1072.    .                   .----.    .    .----.
1073.    .                   |bad |    .    | C' |
1074.    .                   |blck|    .    |    |
1075.    .                   '----'    .    '----'
1076.    .                   .    .    .    .    .
1077.----.----.----.----.----.----.----.----.----.----.
1078| A  |root|              |bad | B  |bad | C' |    |
1079|    |    |              |blck|    |blck|    |    |
1080'----'----'----'----'----'----'----'----'----'----'
1081                            -------------->
1082successfully relocated C, update B
1083=>
1084     .----.
1085     |root|
1086     |    |
1087     '----'
1088   v--'  '----------------------v
1089.----.                        .----.
1090| A  |                        |bad |
1091|    |                        |blck|
1092'----'                        '----'
1093.    .                      v---'  .
1094.    .                   .----.    .    .----.
1095.    .                   |bad |    .    | C' |
1096.    .                   |blck|    .    |    |
1097.    .                   '----'    .    '----'
1098.    .                   .    .    .    .    .
1099.----.----.----.----.----.----.----.----.----.----.
1100| A  |root|              |bad |bad |bad | C' |    |
1101|    |    |              |blck|blck|blck|    |    |
1102'----'----'----'----'----'----'----'----'----'----'
1103
1104oh no! bad block! relocate B
1105=>
1106     .----.
1107     |root|
1108     |    |
1109     '----'
1110   v--'  '----------------------v
1111.----.                        .----.         .----.
1112| A  |                        |bad |         |bad |
1113|    |                        |blck|         |blck|
1114'----'                        '----'         '----'
1115.    .                      v---'  .         .    .
1116.    .                   .----.    .    .----.    .
1117.    .                   |bad |    .    | C' |    .
1118.    .                   |blck|    .    |    |    .
1119.    .                   '----'    .    '----'    .
1120.    .                   .    .    .    .    .    .
1121.----.----.----.----.----.----.----.----.----.----.
1122| A  |root|              |bad |bad |bad | C' |bad |
1123|    |    |              |blck|blck|blck|    |blck|
1124'----'----'----'----'----'----'----'----'----'----'
1125                                 -------------->
1126oh no! bad block! relocate B
1127=>
1128     .----.
1129     |root|
1130     |    |
1131     '----'
1132   v--'  '----------------------v
1133.----.    .----.              .----.
1134| A  |    | B' |              |bad |
1135|    |    |    |              |blck|
1136'----'    '----'              '----'
1137.    .    .  | .            .---'  .
1138.    .    .  '--------------v-------------v
1139.    .    .    .         .----.    .    .----.
1140.    .    .    .         |bad |    .    | C' |
1141.    .    .    .         |blck|    .    |    |
1142.    .    .    .         '----'    .    '----'
1143.    .    .    .         .    .    .    .    .
1144.----.----.----.----.----.----.----.----.----.----.
1145| A  |root| B' |         |bad |bad |bad | C' |bad |
1146|    |    |    |         |blck|blck|blck|    |blck|
1147'----'----'----'----'----'----'----'----'----'----'
1148------------>                    ------------------
1149successfully relocated B, update root
1150=>
1151     .----.
1152     |root|
1153     |    |
1154     '----'
1155   v--'  '--v
1156.----.    .----.
1157| A  |    | B' |
1158|    |    |    |
1159'----'    '----'
1160.    .    .   '---------------------------v
1161.    .    .    .                        .----.
1162.    .    .    .                        | C' |
1163.    .    .    .                        |    |
1164.    .    .    .                        '----'
1165.    .    .    .                        .    .
1166.----.----.----.----.----.----.----.----.----.----.
1167| A  |root| B' |         |bad |bad |bad | C' |bad |
1168|    |    |    |         |blck|blck|blck|    |blck|
1169'----'----'----'----'----'----'----'----'----'----'
1170```
1171
1172We may find that the new block is also bad, but hopefully after repeating this
1173cycle we'll eventually find a new block where a write succeeds. If we don't,
1174that means that all blocks in our storage are bad, and we've reached the end of
1175our device's usable life. At this point, littlefs will return an "out of space"
1176error. This is technically true, as there are no more good blocks, but as an
1177added benefit it also matches the error condition expected by users of
1178dynamically sized data.
1179
1180---
1181
1182Read errors, on the other hand, are quite a bit more complicated. We don't have
1183a copy of the data lingering around in RAM, so we need a way to reconstruct the
1184original data even after it has been corrupted. One such mechanism for this is
1185[error-correction-codes (ECC)][wikipedia-ecc].
1186
1187ECC is an extension to the idea of a checksum. Where a checksum such as CRC can
1188detect that an error has occurred in the data, ECC can detect and actually
1189correct some amount of errors. However, there is a limit to how many errors ECC
1190can detect: the [Hamming bound][wikipedia-hamming-bound]. As the number of
1191errors approaches the Hamming bound, we may still be able to detect errors, but
1192can no longer fix the data. If we've reached this point the block is
1193unrecoverable.
1194
1195littlefs by itself does **not** provide ECC. The block nature and relatively
1196large footprint of ECC does not work well with the dynamically sized data of
1197filesystems, correcting errors without RAM is complicated, and ECC fits better
1198with the geometry of block devices. In fact, several NOR flash chips have extra
1199storage intended for ECC, and many NAND chips can even calculate ECC on the
1200chip itself.
1201
1202In littlefs, ECC is entirely optional. Read errors can instead be prevented
1203proactively by wear leveling. But it's important to note that ECC can be used
1204at the block device level to modestly extend the life of a device. littlefs
1205respects any errors reported by the block device, allowing a block device to
1206provide additional aggressive error detection.
1207
1208---
1209
1210To avoid read errors, we need to be proactive, as opposed to reactive as we
1211were with write errors.
1212
1213One way to do this is to detect when the number of errors in a block exceeds
1214some threshold, but is still recoverable. With ECC we can do this at write
1215time, and treat the error as a write error, evicting the block before fatal
1216read errors have a chance to develop.
1217
1218A different, more generic strategy, is to proactively distribute wear across
1219all blocks in the storage, with the hope that no single block fails before the
1220rest of storage is approaching the end of its usable life. This is called
1221wear leveling.
1222
1223Generally, wear leveling algorithms fall into one of two categories:
1224
12251. [Dynamic wear leveling][wikipedia-dynamic-wear-leveling], where we
1226   distribute wear over "dynamic" blocks. The can be accomplished by
1227   only considering unused blocks.
1228
12292. [Static wear leveling][wikipedia-static-wear-leveling], where we
1230   distribute wear over both "dynamic" and "static" blocks. To make this work,
1231   we need to consider all blocks, including blocks that already contain data.
1232
1233As a tradeoff for code size and complexity, littlefs (currently) only provides
1234dynamic wear leveling. This is a best effort solution. Wear is not distributed
1235perfectly, but it is distributed among the free blocks and greatly extends the
1236life of a device.
1237
1238On top of this, littlefs uses a statistical wear leveling algorithm. What this
1239means is that we don’t actively track wear, instead we rely on a uniform
1240distribution of wear across storage to approximate a dynamic wear leveling
1241algorithm. Despite the long name, this is actually a simplification of dynamic
1242wear leveling.
1243
1244The uniform distribution of wear is left up to the block allocator, which
1245creates a uniform distribution in two parts. The easy part is when the device
1246is powered, in which case we allocate the blocks linearly, circling the device.
1247The harder part is what to do when the device loses power. We can't just
1248restart the allocator at the beginning of storage, as this would bias the wear.
1249Instead, we start the allocator as a random offset every time we mount the
1250filesystem. As long as this random offset is uniform, the combined allocation
1251pattern is also a uniform distribution.
1252
1253![Cumulative wear distribution graph][wear-distribution-graph]
1254
1255Initially, this approach to wear leveling looks like it creates a difficult
1256dependency on a power-independent random number generator, which must return
1257different random numbers on each boot. However, the filesystem is in a
1258relatively unique situation in that it is sitting on top of a large of amount
1259of entropy that persists across power loss.
1260
1261We can actually use the data on disk to directly drive our random number
1262generator. In practice, this is implemented by xoring the checksums of each
1263metadata pair, which is already calculated to fetch and mount the filesystem.
1264
1265```
1266            .--------. \                         probably random
1267           .|metadata| |                                ^
1268           ||        | +-> crc ----------------------> xor
1269           ||        | |                                ^
1270           |'--------' /                                |
1271           '---|--|-'                                   |
1272            .-'    '-------------------------.          |
1273           |                                  |         |
1274           |        .--------------> xor ------------> xor
1275           |        |                 ^       |         ^
1276           v       crc               crc      v        crc
1277      .--------. \  ^   .--------. \  ^   .--------. \  ^
1278     .|metadata|-|--|-->|metadata| |  |  .|metadata| |  |
1279     ||        | +--'  ||        | +--'  ||        | +--'
1280     ||        | |     ||        | |     ||        | |
1281     |'--------' /     |'--------' /     |'--------' /
1282     '---|--|-'        '----|---'        '---|--|-'
1283      .-'    '-.            |             .-'    '-.
1284     v          v           v            v          v
1285.--------.  .--------.  .--------.  .--------.  .--------.
1286|  data  |  |  data  |  |  data  |  |  data  |  |  data  |
1287|        |  |        |  |        |  |        |  |        |
1288|        |  |        |  |        |  |        |  |        |
1289'--------'  '--------'  '--------'  '--------'  '--------'
1290```
1291
1292Note that this random number generator is not perfect. It only returns unique
1293random numbers when the filesystem is modified. This is exactly what we want
1294for distributing wear in the allocator, but means this random number generator
1295is not useful for general use.
1296
1297---
1298
1299Together, bad block detection and dynamic wear leveling provide a best effort
1300solution for avoiding the early death of a filesystem due to wear. Importantly,
1301littlefs's wear leveling algorithm provides a key feature: You can increase the
1302life of a device simply by increasing the size of storage. And if more
1303aggressive wear leveling is desired, you can always combine littlefs with a
1304[flash translation layer (FTL)][wikipedia-ftl] to get a small power resilient
1305filesystem with static wear leveling.
1306
1307## Files
1308
1309Now that we have our building blocks out of the way, we can start looking at
1310our filesystem as a whole.
1311
1312The first step: How do we actually store our files?
1313
1314We've determined that CTZ skip-lists are pretty good at storing data compactly,
1315so following the precedent found in other filesystems we could give each file
1316a skip-list stored in a metadata pair that acts as an inode for the file.
1317
1318
1319```
1320                                    .--------.
1321                                   .|metadata|
1322                                   ||        |
1323                                   ||        |
1324                                   |'--------'
1325                                   '----|---'
1326                                        v
1327.--------.  .--------.  .--------.  .--------.
1328| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |
1329|        |<-|        |--|        |  |        |
1330|        |  |        |  |        |  |        |
1331'--------'  '--------'  '--------'  '--------'
1332```
1333
1334However, this doesn't work well when files are small, which is common for
1335embedded systems. Compared to PCs, _all_ data in an embedded system is small.
1336
1337Consider a small 4-byte file. With a two block metadata-pair and one block for
1338the CTZ skip-list, we find ourselves using a full 3 blocks. On most NOR flash
1339with 4 KiB blocks, this is 12 KiB of overhead. A ridiculous 3072x increase.
1340
1341```
1342file stored as inode, 4 bytes costs ~12 KiB
1343
1344 .----------------.                  \
1345.|    revision    |                  |
1346||----------------|    \             |
1347||    skiplist   ---.  +- metadata   |
1348||----------------| |  /  4x8 bytes  |
1349||    checksum    | |     32 bytes   |
1350||----------------| |                |
1351||       |        | |                +- metadata pair
1352||       v        | |                |  2x4 KiB
1353||                | |                |  8 KiB
1354||                | |                |
1355||                | |                |
1356||                | |                |
1357|'----------------' |                |
1358'----------------'  |                /
1359          .--------'
1360         v
1361 .----------------.    \             \
1362 |      data      |    +- data       |
1363 |----------------|    /  4 bytes    |
1364 |                |                  |
1365 |                |                  |
1366 |                |                  |
1367 |                |                  +- data block
1368 |                |                  |  4 KiB
1369 |                |                  |
1370 |                |                  |
1371 |                |                  |
1372 |                |                  |
1373 |                |                  |
1374 '----------------'                  /
1375```
1376
1377We can make several improvements. First, instead of giving each file its own
1378metadata pair, we can store multiple files in a single metadata pair. One way
1379to do this is to directly associate a directory with a metadata pair (or a
1380linked list of metadata pairs). This makes it easy for multiple files to share
1381the directory's metadata pair for logging and reduces the collective storage
1382overhead.
1383
1384The strict binding of metadata pairs and directories also gives users
1385direct control over storage utilization depending on how they organize their
1386directories.
1387
1388```
1389multiple files stored in metadata pair, 4 bytes costs ~4 KiB
1390
1391       .----------------.
1392      .|    revision    |
1393      ||----------------|
1394      ||    A name      |
1395      ||   A skiplist  -----.
1396      ||----------------|   |  \
1397      ||    B name      |   |  +- metadata
1398      ||   B skiplist  ---. |  |  4x8 bytes
1399      ||----------------| | |  /  32 bytes
1400      ||    checksum    | | |
1401      ||----------------| | |
1402      ||       |        | | |
1403      ||       v        | | |
1404      |'----------------' | |
1405      '----------------'  | |
1406         .----------------' |
1407        v                   v
1408.----------------.  .----------------.  \           \
1409|     A data     |  |     B data     |  +- data     |
1410|                |  |----------------|  /  4 bytes  |
1411|                |  |                |              |
1412|                |  |                |              |
1413|                |  |                |              |
1414|                |  |                |              + data block
1415|                |  |                |              | 4 KiB
1416|                |  |                |              |
1417|----------------|  |                |              |
1418|                |  |                |              |
1419|                |  |                |              |
1420|                |  |                |              |
1421'----------------'  '----------------'              /
1422```
1423
1424The second improvement we can make is noticing that for very small files, our
1425attempts to use CTZ skip-lists for compact storage backfires. Metadata pairs
1426have a ~4x storage cost, so if our file is smaller than 1/4 the block size,
1427there's actually no benefit in storing our file outside of our metadata pair.
1428
1429In this case, we can store the file directly in our directory's metadata pair.
1430We call this an inline file, and it allows a directory to store many small
1431files quite efficiently. Our previous 4 byte file now only takes up a
1432theoretical 16 bytes on disk.
1433
1434```
1435inline files stored in metadata pair, 4 bytes costs ~16 bytes
1436
1437 .----------------.
1438.|    revision    |
1439||----------------|
1440||    A name      |
1441||   A skiplist  ---.
1442||----------------| |  \
1443||    B name      | |  +- data
1444||    B data      | |  |  4x4 bytes
1445||----------------| |  /  16 bytes
1446||    checksum    | |
1447||----------------| |
1448||       |        | |
1449||       v        | |
1450|'----------------' |
1451'----------------'  |
1452          .---------'
1453         v
1454 .----------------.
1455 |     A data     |
1456 |                |
1457 |                |
1458 |                |
1459 |                |
1460 |                |
1461 |                |
1462 |                |
1463 |----------------|
1464 |                |
1465 |                |
1466 |                |
1467 '----------------'
1468```
1469
1470Once the file exceeds 1/4 the block size, we switch to a CTZ skip-list. This
1471means that our files never use more than 4x storage overhead, decreasing as
1472the file grows in size.
1473
1474![File storage cost graph][file-cost-graph]
1475
1476## Directories
1477
1478Now we just need directories to store our files. As mentioned above we want
1479a strict binding of directories and metadata pairs, but there are a few
1480complications we need to sort out.
1481
1482On their own, each directory is a linked-list of metadata pairs. This lets us
1483store an unlimited number of files in each directory, and we don't need to
1484worry about the runtime complexity of unbounded logs. We can store other
1485directory pointers in our metadata pairs, which gives us a directory tree, much
1486like what you find on other filesystems.
1487
1488```
1489            .--------.
1490           .| root   |
1491           ||        |
1492           ||        |
1493           |'--------'
1494           '---|--|-'
1495            .-'    '-------------------------.
1496           v                                  v
1497      .--------.        .--------.        .--------.
1498     .| dir A  |------->| dir A  |       .| dir B  |
1499     ||        |       ||        |       ||        |
1500     ||        |       ||        |       ||        |
1501     |'--------'       |'--------'       |'--------'
1502     '---|--|-'        '----|---'        '---|--|-'
1503      .-'    '-.            |             .-'    '-.
1504     v          v           v            v          v
1505.--------.  .--------.  .--------.  .--------.  .--------.
1506| file C |  | file D |  | file E |  | file F |  | file G |
1507|        |  |        |  |        |  |        |  |        |
1508|        |  |        |  |        |  |        |  |        |
1509'--------'  '--------'  '--------'  '--------'  '--------'
1510```
1511
1512The main complication is, once again, traversal with a constant amount of
1513[RAM]. The directory tree is a tree, and the unfortunate fact is you can't
1514traverse a tree with constant RAM.
1515
1516Fortunately, the elements of our tree are metadata pairs, so unlike CTZ
1517skip-lists, we're not limited to strict COW operations. One thing we can do is
1518thread a linked-list through our tree, explicitly enabling cheap traversal
1519over the entire filesystem.
1520
1521```
1522            .--------.
1523           .| root   |-.
1524           ||        | |
1525   .-------||        |-'
1526   |       |'--------'
1527   |       '---|--|-'
1528   |        .-'    '-------------------------.
1529   |       v                                  v
1530   |  .--------.        .--------.        .--------.
1531   '->| dir A  |------->| dir A  |------->| dir B  |
1532     ||        |       ||        |       ||        |
1533     ||        |       ||        |       ||        |
1534     |'--------'       |'--------'       |'--------'
1535     '---|--|-'        '----|---'        '---|--|-'
1536      .-'    '-.            |             .-'    '-.
1537     v          v           v            v          v
1538.--------.  .--------.  .--------.  .--------.  .--------.
1539| file C |  | file D |  | file E |  | file F |  | file G |
1540|        |  |        |  |        |  |        |  |        |
1541|        |  |        |  |        |  |        |  |        |
1542'--------'  '--------'  '--------'  '--------'  '--------'
1543```
1544
1545Unfortunately, not sticking to pure COW operations creates some problems. Now,
1546whenever we want to manipulate the directory tree, multiple pointers need to be
1547updated. If you're familiar with designing atomic data structures this should
1548set off a bunch of red flags.
1549
1550To work around this, our threaded linked-list has a bit of leeway. Instead of
1551only containing metadata pairs found in our filesystem, it is allowed to
1552contain metadata pairs that have no parent because of a power loss. These are
1553called orphaned metadata pairs.
1554
1555With the possibility of orphans, we can build power loss resilient operations
1556that maintain a filesystem tree threaded with a linked-list for traversal.
1557
1558Adding a directory to our tree:
1559
1560```
1561         .--------.
1562        .| root   |-.
1563        ||        | |
1564.-------||        |-'
1565|       |'--------'
1566|       '---|--|-'
1567|        .-'    '-.
1568|       v          v
1569|  .--------.  .--------.
1570'->| dir A  |->| dir C  |
1571  ||        | ||        |
1572  ||        | ||        |
1573  |'--------' |'--------'
1574  '--------'  '--------'
1575
1576allocate dir B
1577=>
1578         .--------.
1579        .| root   |-.
1580        ||        | |
1581.-------||        |-'
1582|       |'--------'
1583|       '---|--|-'
1584|        .-'    '-.
1585|       v          v
1586|  .--------.    .--------.
1587'->| dir A  |--->| dir C  |
1588  ||        | .->|        |
1589  ||        | | ||        |
1590  |'--------' | |'--------'
1591  '--------'  | '--------'
1592              |
1593   .--------. |
1594  .| dir B  |-'
1595  ||        |
1596  ||        |
1597  |'--------'
1598  '--------'
1599
1600insert dir B into threaded linked-list, creating an orphan
1601=>
1602         .--------.
1603        .| root   |-.
1604        ||        | |
1605.-------||        |-'
1606|       |'--------'
1607|       '---|--|-'
1608|        .-'    '-------------.
1609|       v                      v
1610|  .--------.  .--------.  .--------.
1611'->| dir A  |->| dir B  |->| dir C  |
1612  ||        | || orphan!| ||        |
1613  ||        | ||        | ||        |
1614  |'--------' |'--------' |'--------'
1615  '--------'  '--------'  '--------'
1616
1617add dir B to parent directory
1618=>
1619               .--------.
1620              .| root   |-.
1621              ||        | |
1622.-------------||        |-'
1623|             |'--------'
1624|             '--|-|-|-'
1625|        .------'  |  '-------.
1626|       v          v           v
1627|  .--------.  .--------.  .--------.
1628'->| dir A  |->| dir B  |->| dir C  |
1629  ||        | ||        | ||        |
1630  ||        | ||        | ||        |
1631  |'--------' |'--------' |'--------'
1632  '--------'  '--------'  '--------'
1633```
1634
1635Removing a directory:
1636
1637```
1638               .--------.
1639              .| root   |-.
1640              ||        | |
1641.-------------||        |-'
1642|             |'--------'
1643|             '--|-|-|-'
1644|        .------'  |  '-------.
1645|       v          v           v
1646|  .--------.  .--------.  .--------.
1647'->| dir A  |->| dir B  |->| dir C  |
1648  ||        | ||        | ||        |
1649  ||        | ||        | ||        |
1650  |'--------' |'--------' |'--------'
1651  '--------'  '--------'  '--------'
1652
1653remove dir B from parent directory, creating an orphan
1654=>
1655         .--------.
1656        .| root   |-.
1657        ||        | |
1658.-------||        |-'
1659|       |'--------'
1660|       '---|--|-'
1661|        .-'    '-------------.
1662|       v                      v
1663|  .--------.  .--------.  .--------.
1664'->| dir A  |->| dir B  |->| dir C  |
1665  ||        | || orphan!| ||        |
1666  ||        | ||        | ||        |
1667  |'--------' |'--------' |'--------'
1668  '--------'  '--------'  '--------'
1669
1670remove dir B from threaded linked-list, returning dir B to free blocks
1671=>
1672         .--------.
1673        .| root   |-.
1674        ||        | |
1675.-------||        |-'
1676|       |'--------'
1677|       '---|--|-'
1678|        .-'    '-.
1679|       v          v
1680|  .--------.  .--------.
1681'->| dir A  |->| dir C  |
1682  ||        | ||        |
1683  ||        | ||        |
1684  |'--------' |'--------'
1685  '--------'  '--------'
1686```
1687
1688In addition to normal directory tree operations, we can use orphans to evict
1689blocks in a metadata pair when the block goes bad or exceeds its allocated
1690erases. If we lose power while evicting a metadata block we may end up with
1691a situation where the filesystem references the replacement block while the
1692threaded linked-list still contains the evicted block. We call this a
1693half-orphan.
1694
1695```
1696               .--------.
1697              .| root   |-.
1698              ||        | |
1699.-------------||        |-'
1700|             |'--------'
1701|             '--|-|-|-'
1702|        .------'  |  '-------.
1703|       v          v           v
1704|  .--------.  .--------.  .--------.
1705'->| dir A  |->| dir B  |->| dir C  |
1706  ||        | ||        | ||        |
1707  ||        | ||        | ||        |
1708  |'--------' |'--------' |'--------'
1709  '--------'  '--------'  '--------'
1710
1711try to write to dir B
1712=>
1713                  .--------.
1714                 .| root   |-.
1715                 ||        | |
1716.----------------||        |-'
1717|                |'--------'
1718|                '-|-||-|-'
1719|        .--------'  ||  '-----.
1720|       v            |v         v
1721|  .--------.     .--------.  .--------.
1722'->| dir A  |---->| dir B  |->| dir C  |
1723  ||        |-.   |        | ||        |
1724  ||        | |   |        | ||        |
1725  |'--------' |   '--------' |'--------'
1726  '--------'  |      v       '--------'
1727              |  .--------.
1728              '->| dir B  |
1729                 | bad    |
1730                 | block! |
1731                 '--------'
1732
1733oh no! bad block detected, allocate replacement
1734=>
1735                  .--------.
1736                 .| root   |-.
1737                 ||        | |
1738.----------------||        |-'
1739|                |'--------'
1740|                '-|-||-|-'
1741|        .--------'  ||  '-------.
1742|       v            |v           v
1743|  .--------.     .--------.    .--------.
1744'->| dir A  |---->| dir B  |--->| dir C  |
1745  ||        |-.   |        | .->|        |
1746  ||        | |   |        | | ||        |
1747  |'--------' |   '--------' | |'--------'
1748  '--------'  |      v       | '--------'
1749              |  .--------.  |
1750              '->| dir B  |  |
1751                 | bad    |  |
1752                 | block! |  |
1753                 '--------'  |
1754                             |
1755                 .--------.  |
1756                 | dir B  |--'
1757                 |        |
1758                 |        |
1759                 '--------'
1760
1761insert replacement in threaded linked-list, creating a half-orphan
1762=>
1763                  .--------.
1764                 .| root   |-.
1765                 ||        | |
1766.----------------||        |-'
1767|                |'--------'
1768|                '-|-||-|-'
1769|        .--------'  ||  '-------.
1770|       v            |v           v
1771|  .--------.     .--------.    .--------.
1772'->| dir A  |---->| dir B  |--->| dir C  |
1773  ||        |-.   |        | .->|        |
1774  ||        | |   |        | | ||        |
1775  |'--------' |   '--------' | |'--------'
1776  '--------'  |      v       | '--------'
1777              |  .--------.  |
1778              |  | dir B  |  |
1779              |  | bad    |  |
1780              |  | block! |  |
1781              |  '--------'  |
1782              |              |
1783              |  .--------.  |
1784              '->| dir B  |--'
1785                 | half   |
1786                 | orphan!|
1787                 '--------'
1788
1789fix reference in parent directory
1790=>
1791               .--------.
1792              .| root   |-.
1793              ||        | |
1794.-------------||        |-'
1795|             |'--------'
1796|             '--|-|-|-'
1797|        .------'  |  '-------.
1798|       v          v           v
1799|  .--------.  .--------.  .--------.
1800'->| dir A  |->| dir B  |->| dir C  |
1801  ||        | ||        | ||        |
1802  ||        | ||        | ||        |
1803  |'--------' |'--------' |'--------'
1804  '--------'  '--------'  '--------'
1805```
1806
1807Finding orphans and half-orphans is expensive, requiring a _O(n&sup2;)_
1808comparison of every metadata pair with every directory entry. But the tradeoff
1809is a power resilient filesystem that works with only a bounded amount of RAM.
1810Fortunately, we only need to check for orphans on the first allocation after
1811boot, and a read-only littlefs can ignore the threaded linked-list entirely.
1812
1813If we only had some sort of global state, then we could also store a flag and
1814avoid searching for orphans unless we knew we were specifically interrupted
1815while manipulating the directory tree (foreshadowing!).
1816
1817## The move problem
1818
1819We have one last challenge: the move problem. Phrasing the problem is simple:
1820
1821How do you atomically move a file between two directories?
1822
1823In littlefs we can atomically commit to directories, but we can't create
1824an atomic commit that spans multiple directories. The filesystem must go
1825through a minimum of two distinct states to complete a move.
1826
1827To make matters worse, file moves are a common form of synchronization for
1828filesystems. As a filesystem designed for power-loss, it's important we get
1829atomic moves right.
1830
1831So what can we do?
1832
1833- We definitely can't just let power-loss result in duplicated or lost files.
1834  This could easily break users' code and would only reveal itself in extreme
1835  cases. We were only able to be lazy about the threaded linked-list because
1836  it isn't user facing and we can handle the corner cases internally.
1837
1838- Some filesystems propagate COW operations up the tree until a common parent
1839  is found. Unfortunately this interacts poorly with our threaded tree and
1840  brings back the issue of upward propagation of wear.
1841
1842- In a previous version of littlefs we tried to solve this problem by going
1843  back and forth between the source and destination, marking and unmarking the
1844  file as moving in order to make the move atomic from the user perspective.
1845  This worked, but not well. Finding failed moves was expensive and required
1846  a unique identifier for each file.
1847
1848In the end, solving the move problem required creating a new mechanism for
1849sharing knowledge between multiple metadata pairs. In littlefs this led to the
1850introduction of a mechanism called "global state".
1851
1852---
1853
1854Global state is a small set of state that can be updated from _any_ metadata
1855pair. Combining global state with metadata pairs' ability to update multiple
1856entries in one commit gives us a powerful tool for crafting complex atomic
1857operations.
1858
1859How does global state work?
1860
1861Global state exists as a set of deltas that are distributed across the metadata
1862pairs in the filesystem. The actual global state can be built out of these
1863deltas by xoring together all of the deltas in the filesystem.
1864
1865```
1866 .--------.  .--------.  .--------.  .--------.  .--------.
1867.|        |->| gdelta |->|        |->| gdelta |->| gdelta |
1868||        | || 0x23   | ||        | || 0xff   | || 0xce   |
1869||        | ||        | ||        | ||        | ||        |
1870|'--------' |'--------' |'--------' |'--------' |'--------'
1871'--------'  '----|---'  '--------'  '----|---'  '----|---'
1872                 v                       v           v
1873       0x00 --> xor ------------------> xor ------> xor --> gstate 0x12
1874```
1875
1876To update the global state from a metadata pair, we take the global state we
1877know and xor it with both our changes and any existing delta in the metadata
1878pair. Committing this new delta to the metadata pair commits the changes to
1879the filesystem's global state.
1880
1881```
1882 .--------.  .--------.  .--------.  .--------.  .--------.
1883.|        |->| gdelta |->|        |->| gdelta |->| gdelta |
1884||        | || 0x23   | ||        | || 0xff   | || 0xce   |
1885||        | ||        | ||        | ||        | ||        |
1886|'--------' |'--------' |'--------' |'--------' |'--------'
1887'--------'  '----|---'  '--------'  '--|---|-'  '----|---'
1888                 v                     v   |         v
1889       0x00 --> xor ----------------> xor -|------> xor --> gstate = 0x12
1890                                           |                          |
1891                                           |                          |
1892change gstate to 0xab --> xor <------------|--------------------------'
1893=>                         |               v
1894                           '------------> xor
1895                                           |
1896                                           v
1897 .--------.  .--------.  .--------.  .--------.  .--------.
1898.|        |->| gdelta |->|        |->| gdelta |->| gdelta |
1899||        | || 0x23   | ||        | || 0x46   | || 0xce   |
1900||        | ||        | ||        | ||        | ||        |
1901|'--------' |'--------' |'--------' |'--------' |'--------'
1902'--------'  '----|---'  '--------'  '----|---'  '----|---'
1903                 v                       v           v
1904       0x00 --> xor ------------------> xor ------> xor --> gstate = 0xab
1905```
1906
1907To make this efficient, we always keep a copy of the global state in RAM. We
1908only need to iterate over our metadata pairs and build the global state when
1909the filesystem is mounted.
1910
1911You may have noticed that global state is very expensive. We keep a copy in
1912RAM and a delta in an unbounded number of metadata pairs. Even if we reset
1913the global state to its initial value, we can't easily clean up the deltas on
1914disk. For this reason, it's very important that we keep the size of global
1915state bounded and extremely small. But, even with a strict budget, global
1916state is incredibly valuable.
1917
1918---
1919
1920Now we can solve the move problem. We can create global state describing our
1921move atomically with the creation of the new file, and we can clear this move
1922state atomically with the removal of the old file.
1923
1924```
1925               .--------.    gstate = no move
1926              .| root   |-.
1927              ||        | |
1928.-------------||        |-'
1929|             |'--------'
1930|             '--|-|-|-'
1931|        .------'  |  '-------.
1932|       v          v           v
1933|  .--------.  .--------.  .--------.
1934'->| dir A  |->| dir B  |->| dir C  |
1935  ||        | ||        | ||        |
1936  ||        | ||        | ||        |
1937  |'--------' |'--------' |'--------'
1938  '----|---'  '--------'  '--------'
1939       v
1940   .--------.
1941   | file D |
1942   |        |
1943   |        |
1944   '--------'
1945
1946begin move, add reference in dir C, change gstate to have move
1947=>
1948               .--------.    gstate = moving file D in dir A (m1)
1949              .| root   |-.
1950              ||        | |
1951.-------------||        |-'
1952|             |'--------'
1953|             '--|-|-|-'
1954|        .------'  |  '-------.
1955|       v          v           v
1956|  .--------.  .--------.  .--------.
1957'->| dir A  |->| dir B  |->| dir C  |
1958  ||        | ||        | || gdelta |
1959  ||        | ||        | || =m1    |
1960  |'--------' |'--------' |'--------'
1961  '----|---'  '--------'  '----|---'
1962       |     .----------------'
1963       v    v
1964     .--------.
1965     | file D |
1966     |        |
1967     |        |
1968     '--------'
1969
1970complete move, remove reference in dir A, change gstate to no move
1971=>
1972               .--------.    gstate = no move (m1^~m1)
1973              .| root   |-.
1974              ||        | |
1975.-------------||        |-'
1976|             |'--------'
1977|             '--|-|-|-'
1978|        .------'  |  '-------.
1979|       v          v           v
1980|  .--------.  .--------.  .--------.
1981'->| dir A  |->| dir B  |->| dir C  |
1982  || gdelta | ||        | || gdelta |
1983  || =~m1   | ||        | || =m1    |
1984  |'--------' |'--------' |'--------'
1985  '--------'  '--------'  '----|---'
1986                               v
1987                           .--------.
1988                           | file D |
1989                           |        |
1990                           |        |
1991                           '--------'
1992```
1993
1994
1995If, after building our global state during mount, we find information
1996describing an ongoing move, we know we lost power during a move and the file
1997is duplicated in both the source and destination directories. If this happens,
1998we can resolve the move using the information in the global state to remove
1999one of the files.
2000
2001```
2002                 .--------.    gstate = moving file D in dir A (m1)
2003                .| root   |-.             ^
2004                ||        |------------> xor
2005.---------------||        |-'             ^
2006|               |'--------'               |
2007|               '--|-|-|-'                |
2008|        .--------'  |  '---------.       |
2009|       |            |             |      |
2010|       |     .----------> xor --------> xor
2011|       v     |      v      ^      v      ^
2012|  .--------. |  .--------. |  .--------. |
2013'->| dir A  |-|->| dir B  |-|->| dir C  | |
2014  ||        |-' ||        |-' || gdelta |-'
2015  ||        |   ||        |   || =m1    |
2016  |'--------'   |'--------'   |'--------'
2017  '----|---'    '--------'    '----|---'
2018       |     .---------------------'
2019       v    v
2020     .--------.
2021     | file D |
2022     |        |
2023     |        |
2024     '--------'
2025```
2026
2027We can also move directories the same way we move files. There is the threaded
2028linked-list to consider, but leaving the threaded linked-list unchanged works
2029fine as the order doesn't really matter.
2030
2031```
2032               .--------.    gstate = no move (m1^~m1)
2033              .| root   |-.
2034              ||        | |
2035.-------------||        |-'
2036|             |'--------'
2037|             '--|-|-|-'
2038|        .------'  |  '-------.
2039|       v          v           v
2040|  .--------.  .--------.  .--------.
2041'->| dir A  |->| dir B  |->| dir C  |
2042  || gdelta | ||        | || gdelta |
2043  || =~m1   | ||        | || =m1    |
2044  |'--------' |'--------' |'--------'
2045  '--------'  '--------'  '----|---'
2046                               v
2047                           .--------.
2048                           | file D |
2049                           |        |
2050                           |        |
2051                           '--------'
2052
2053begin move, add reference in dir C, change gstate to have move
2054=>
2055                .--------.    gstate = moving dir B in root (m1^~m1^m2)
2056               .| root   |-.
2057               ||        | |
2058.--------------||        |-'
2059|              |'--------'
2060|              '--|-|-|-'
2061|        .-------'  |  '----------.
2062|       v           |              v
2063|  .--------.       |          .--------.
2064'->| dir A  |-.     |       .->| dir C  |
2065  || gdelta | |     |       | || gdelta |
2066  || =~m1   | |     |       | || =m1^m2 |
2067  |'--------' |     |       | |'--------'
2068  '--------'  |     |       | '---|--|-'
2069              |     |    .-------'   |
2070              |     v   v   |        v
2071              |  .--------. |    .--------.
2072              '->| dir B  |-'    | file D |
2073                ||        |      |        |
2074                ||        |      |        |
2075                |'--------'      '--------'
2076                '--------'
2077
2078complete move, remove reference in root, change gstate to no move
2079=>
2080             .--------.    gstate = no move (m1^~m1^m2^~m2)
2081            .| root   |-.
2082            || gdelta | |
2083.-----------|| =~m2   |-'
2084|           |'--------'
2085|           '---|--|-'
2086|        .-----'    '-----.
2087|       v                  v
2088|  .--------.          .--------.
2089'->| dir A  |-.     .->| dir C  |
2090  || gdelta | |     | || gdelta |
2091  || =~m1   | |     '-|| =m1^m2 |-------.
2092  |'--------' |       |'--------'       |
2093  '--------'  |       '---|--|-'        |
2094              |        .-'    '-.       |
2095              |       v          v      |
2096              |  .--------.  .--------. |
2097              '->| dir B  |--| file D |-'
2098                ||        |  |        |
2099                ||        |  |        |
2100                |'--------'  '--------'
2101                '--------'
2102```
2103
2104Global state gives us a powerful tool we can use to solve the move problem.
2105And the result is surprisingly performant, only needing the minimum number
2106of states and using the same number of commits as a naive move. Additionally,
2107global state gives us a bit of persistent state we can use for some other
2108small improvements.
2109
2110## Conclusion
2111
2112And that's littlefs, thanks for reading!
2113
2114
2115[wikipedia-flash]: https://en.wikipedia.org/wiki/Flash_memory
2116[wikipedia-sna]: https://en.wikipedia.org/wiki/Serial_number_arithmetic
2117[wikipedia-crc]: https://en.wikipedia.org/wiki/Cyclic_redundancy_check
2118[wikipedia-cow]: https://en.wikipedia.org/wiki/Copy-on-write
2119[wikipedia-B-tree]: https://en.wikipedia.org/wiki/B-tree
2120[wikipedia-B+-tree]: https://en.wikipedia.org/wiki/B%2B_tree
2121[wikipedia-skip-list]: https://en.wikipedia.org/wiki/Skip_list
2122[wikipedia-ctz]: https://en.wikipedia.org/wiki/Count_trailing_zeros
2123[wikipedia-ecc]: https://en.wikipedia.org/wiki/Error_correction_code
2124[wikipedia-hamming-bound]: https://en.wikipedia.org/wiki/Hamming_bound
2125[wikipedia-dynamic-wear-leveling]: https://en.wikipedia.org/wiki/Wear_leveling#Dynamic_wear_leveling
2126[wikipedia-static-wear-leveling]: https://en.wikipedia.org/wiki/Wear_leveling#Static_wear_leveling
2127[wikipedia-ftl]: https://en.wikipedia.org/wiki/Flash_translation_layer
2128
2129[oeis]: https://oeis.org
2130[A001511]: https://oeis.org/A001511
2131[A005187]: https://oeis.org/A005187
2132
2133[fat]: https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system
2134[ext2]: http://e2fsprogs.sourceforge.net/ext2intro.html
2135[jffs]: https://www.sourceware.org/jffs2/jffs2-html
2136[yaffs]: https://yaffs.net/documents/how-yaffs-works
2137[spiffs]: https://github.com/pellepl/spiffs/blob/master/docs/TECH_SPEC
2138[ext4]: https://ext4.wiki.kernel.org/index.php/Ext4_Design
2139[ntfs]: https://en.wikipedia.org/wiki/NTFS
2140[btrfs]: https://btrfs.wiki.kernel.org/index.php/Btrfs_design
2141[zfs]: https://en.wikipedia.org/wiki/ZFS
2142
2143[cow]: https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg
2144[elephant]: https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg
2145[ram]: https://upload.wikimedia.org/wikipedia/commons/9/97/New_Mexico_Bighorn_Sheep.JPG
2146
2147[metadata-formula1]: https://latex.codecogs.com/svg.latex?cost%20%3D%20n%20&plus;%20n%20%5Cfrac%7Bs%7D%7Bd&plus;1%7D
2148[metadata-formula2]: https://latex.codecogs.com/svg.latex?s%20%3D%20r%20%5Cfrac%7Bsize%7D%7Bn%7D
2149[metadata-formula3]: https://latex.codecogs.com/svg.latex?d%20%3D%20%281-r%29%20%5Cfrac%7Bsize%7D%7Bn%7D
2150[metadata-formula4]: https://latex.codecogs.com/svg.latex?cost%20%3D%20n%20&plus;%20n%20%5Cfrac%7Br%5Cfrac%7Bsize%7D%7Bn%7D%7D%7B%281-r%29%5Cfrac%7Bsize%7D%7Bn%7D&plus;1%7D
2151
2152[ctz-formula1]: https://latex.codecogs.com/svg.latex?%5Clim_%7Bn%5Cto%5Cinfty%7D%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D0%7D%5E%7Bn%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29&plus;1%5Cright%29%20%3D%20%5Csum_%7Bi%3D0%7D%5Cfrac%7B1%7D%7B2%5Ei%7D%20%3D%202
2153[ctz-formula2]: https://latex.codecogs.com/svg.latex?B%20%3D%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%5Clceil%5Clog_2%5Cleft%28%5Cfrac%7B2%5Ew%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%29%5Cright%5Crceil
2154[ctz-formula3]: https://latex.codecogs.com/svg.latex?N%20%3D%20%5Csum_i%5En%5Cleft%5BB-%5Cfrac%7Bw%7D%7B8%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29&plus;1%5Cright%29%5Cright%5D
2155[ctz-formula4]: https://latex.codecogs.com/svg.latex?%5Csum_i%5En%5Cleft%28%5Ctext%7Bctz%7D%28i%29&plus;1%5Cright%29%20%3D%202n-%5Ctext%7Bpopcount%7D%28n%29
2156[ctz-formula5]: https://latex.codecogs.com/svg.latex?N%20%3D%20Bn%20-%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%282n-%5Ctext%7Bpopcount%7D%28n%29%5Cright%29
2157[ctz-formula6]: https://latex.codecogs.com/svg.latex?n%20%3D%20%5Cleft%5Clfloor%5Cfrac%7BN-%5Cfrac%7Bw%7D%7B8%7D%5Cleft%28%5Ctext%7Bpopcount%7D%5Cleft%28%5Cfrac%7BN%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D-1%5Cright%29&plus;2%5Cright%29%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%5Crfloor
2158[ctz-formula7]: https://latex.codecogs.com/svg.latex?%5Cmathit%7Boff%7D%20%3D%20N%20-%20%5Cleft%28B-2%5Cfrac%7Bw%7D%7B8%7D%5Cright%29n%20-%20%5Cfrac%7Bw%7D%7B8%7D%5Ctext%7Bpopcount%7D%28n%29
2159
2160[bigB]: https://latex.codecogs.com/svg.latex?B
2161[d]: https://latex.codecogs.com/svg.latex?d
2162[m]: https://latex.codecogs.com/svg.latex?m
2163[bigN]: https://latex.codecogs.com/svg.latex?N
2164[n]: https://latex.codecogs.com/svg.latex?n
2165[n']: https://latex.codecogs.com/svg.latex?n%27
2166[r]: https://latex.codecogs.com/svg.latex?r
2167[s]: https://latex.codecogs.com/svg.latex?s
2168[w]: https://latex.codecogs.com/svg.latex?w
2169[x]: https://latex.codecogs.com/svg.latex?x
2170
2171[metadata-cost-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/metadata-cost.svg?sanitize=true
2172[wear-distribution-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/wear-distribution.svg?sanitize=true
2173[file-cost-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/file-cost.svg?sanitize=true
2174