xfs-delayed-logging-design.txt - OpenGrok cross reference for /Linux-v5.4/Documentation/filesystems/xfs-delayed-logging-design.txt

Lines Matching full:that
12 required for objects that are frequently logged. Some parts of inodes are more
17 The reason that this is such a concern is that XFS allows multiple separate
21 "re-logging". Conceptually, this is quite simple - all it requires is that any
23 changes in the new transaction that is written to the log.
25 That is, if we have a sequence of changes A through to F, and the object was
43 that an object being relogged does not prevent the tail of the log from ever
56 progresses, ensuring that the current operation never gets blocked by itself if the
59 Hence it can be seen that the relogging operation is fundamental to the correct
63 the log over and over again. Worse is the fact that objects tend to get
67 Another feature of the XFS transaction subsystem is that most transactions are
68 asynchronous. That is, they don't commit to disk until either a log buffer is
70 forces the log buffers holding the transactions to disk. This means that XFS is
80 that can be made to the filesystem at any point in time - if all the log
91 relogging technique XFS uses is that we can be relogging changed objects
93 return to the previous relogging example, it is entirely possible that
96 That is, a single log buffer may contain multiple copies of the same object,
99 necessary copy in the log buffer, and three stale copies that are simply
102 buffers. It is clear that reducing the number of stale objects written to the
122 One of the key changes that delayed logging makes to the operation of the
123 journalling subsystem is that it disassociates the amount of outstanding
130 It should be noted that this does not change the guarantee that log recovery
131 will result in a consistent filesystem. What it does mean is that as far as the
133 that simply did not occur as a result of the crash. This makes it even more
134 important that applications that care about their data use fsync() where they
137 It should be noted that delayed logging is not an innovative new concept that
141 no time is spent in this document trying to convince the reader that the
162 existing log item dirty region tracking) is that when it comes to writing the
163 changes to the log buffers, we need to ensure that the object we are formatting
168 This introduces lots of scope for deadlocks with transactions that are already
173 to be an unsolvable deadlock condition, and it was solving this problem that
178 vector array that points to the changed regions in the item. The log write code
186 the changes in a format that is compatible with the log buffer writing code.
187 that does not require us to lock the item to access. This formatting and
189 resulting in a vector that is transactionally consistent and can be accessed
224 relogged we can replace the current memory buffer with a new memory buffer that
234 region state that needs to be placed into the headers during the log write.
238 self-describing object that can be passed to the log buffer write code to be
240 Hence we avoid needing a new on-disk format to handle items that have been
246 Now that we can record transactional changes in memory in a form that allows
248 them so that they can be written to the log at some later point in time.  The
250 to be the object that is used to track committed objects as it will always
253 The log item is already used to track the log items that have been written to
258 that is in the AIL can be relogged, which causes the object to be pinned again
259 and then moved forward in the AIL when the log buffer IO completes for that
262 Essentially, this shows that an item that is in the AIL can still be modified
265 can we store state in any field that is protected by the AIL lock. Hence the
270 called the Committed Item List (CIL).  The list tracks log items that have been
275 ones that are most recently modified. Ordering of the CIL is not necessary for
284 We need to write these items in the order that they exist in the CIL, and they
294 transaction, nor does the log replay code. The only fundamental limit is that
296 reason for this limit is that to find the head and tail of the log, there must
298 transaction is larger than half the log, then there is the possibility that a
308 bigger with a lot more items in it. The worst case effect of this is that we
318 per-checkpoint context that travels through the log write process through to
321 Hence a checkpoint has a context that tracks the state of the current
323 at the same time a checkpoint transaction is started. That is, when we remove
326 context and attach that to the CIL for aggregation of new transactions.
333 requires that we strictly order the commit records in the log so that
336 To ensure that we can be writing an item into a checkpoint transaction at
339 to store the list of log vectors that need to be written into the transaction.
341 detached from the log items. That is, when the CIL is flushed the memory
343 checkpoint context so that the log item can be released. In diagrammatic form,
394 attached to the log buffer that the commit record was written to along with a
395 completion callback. Log IO completion will call that callback, which can then
402 it. The fact that we walk the log items (in the CIL) just to chain the log
403 vectors and break the link between the log item and the log vector means that
410 compare" situation that can be done after a working and reviewed implementation
415 One of the key aspects of the XFS transaction subsystem is that it tags
418 future operations that cannot be completed until that transaction is fully
419 committed to the log. In the rare case that a dependent operation occurs (e.g.
433 atomically, it is simple to ensure that each new context has a monotonically
440 operations that track transactions that have not yet completed know what
442 result, the code that forces the log to a specific LSN now needs to ensure that
445 To ensure that we can do this, we need to track all the checkpoint contexts
446 that are currently committing to the log. When we flush a checkpoint, the
450 we can also wait on the log buffer that contains the commit record, thereby
453 It should be noted that the synchronous forces may need to be extended with
459 The main concern with log forces is to ensure that all the previous checkpoints
461 need to check that all the prior contexts in the committing list are also
463 synchronisation in the log force code so that we don't need to wait anywhere
466 The only remaining complexity is that a log force now also has to handle the
467 case where the forcing sequence number is the same as the current context. That
473 force the log at the LSN of that transaction) and so the higher level code
495 there are lots of transactions that only contain an inode core and an inode log
496 format structure. That is, two vectors totaling roughly 150 bytes. If we modify
502 space.  From this, it should be obvious that a static log space reservation is
527 The problem with this is that it can lead to deadlocks as we may need to commit
530 space available in the log if we are to use static reservations, and that is
538 the difference in space required is removed from the transaction that causes
554 a CIL push triggered by a log force, only that there is no waiting for the
562 manner that is done for the existing logging method. A discussion point is
572 that items get pinned once for every transaction that is committed to the log
573 buffers. Hence items that are relogged in the log buffers will have a pin count
583 That is, we now have a many-to-one relationship between transaction commit and
584 log item completion. The result of this is that pinning and unpinning of the
598 for the pin count means that the pinning of an item must take place under the
605 lock to guarantee that we pin the items correctly.
609 A fundamental requirement for the CIL is that accesses through transaction
616 for concurrency from the ground up. It is obvious that there are serialisation
624 that we have a many-to-one interaction here. That is, the only restriction on
625 the number of concurrent transactions that can be trying to commit at once is
628 128MB log, which means that it is generally one per CPU in a machine.
632 while we are holding out a CIL flush, so at the moment that means it is held
640 want every other CPU in the machine spinning on the CIL lock. Given that
647 It should also be noted that CIL flushing is also a relatively rare operation
658 possible that this lock will become a contention point, but given the short
659 hold time once per transaction I think that contention is unlikely.
662 that is run as part of the checkpoint commit and log force sequencing. The code
663 path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
780 From this, it can be seen that the only life cycle differences between the two
785 behaviour, allocation or freeing that don't already exist.