| author | Linus Torvalds <torvalds@linux-foundation.org> | 2022-08-04 20:19:16 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2022-08-04 20:19:16 -0700 |
| commit | b2a88c212e652e94f1e4b635910972ac57ba4e97 (patch) | |
| tree | f575188c9788c091d896218946261f75be7b8eb8 | |
| parent | 9daee913dc8d15eb65e0ff560803ab1c28bb480b (diff) | |
| parent | 5e9466a5d0604e20082d828008047b3165592caf (diff) | |
Merge tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"The biggest changes for this release are the log scalability
improvements, lockless lookups for the buffer cache, and making the
attr fork a permanent part of the incore inode in preparation for
directory parent pointers.
There's also a bunch of bug fixes that have accumulated since -rc5. I
might send you a second pull request with some more bug fixes that I'm
still working on.
Once the merge window ends, I will hand maintainership back to Dave
Chinner until the 6.1-rc1 release so that I can conduct the design
review for the online fsck feature, and try to get it merged.
Summary:
- Improve scalability of the XFS log by removing spinlocks and global
synchronization points.
- Add security labels to whiteout inodes to match the other
filesystems.
- Clean up per-ag pointer passing to simplify call sites.
- Reduce verifier overhead by precalculating more AG geometry.
- Implement fast-path lockless lookups in the buffer cache to reduce
spinlock hammering.
- Make attr forks a permanent part of the inode structure to fix a
UAF bug and because most files these days tend to have security
labels and soon will have parent pointers too.
- Clean up XFS_IFORK_Q usage and give it a better name.
- Fix more UAF bugs in the xattr code.
- SOB my tags.
- Fix some typos in the timestamp range documentation.
- Fix a few more memory leaks.
- Code cleanups and typo fixes.
- Fix an unlocked inode fork pointer access in getbmap"
* tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits)
xfs: delete extra space and tab in blank line
xfs: fix NULL pointer dereference in xfs_getbmap()
xfs: Fix typo 'the the' in comment
xfs: Fix comment typo
xfs: don't leak memory when attr fork loading fails
xfs: fix for variable set but not used warning
xfs: xfs_buf cache destroy isn't RCU safe
xfs: delete unnecessary NULL checks
xfs: fix comment for start time value of inode with bigtime enabled
xfs: fix use-after-free in xattr node block inactivation
xfs: lockless buffer lookup
xfs: remove a superflous hash lookup when inserting new buffers
xfs: reduce the number of atomic when locking a buffer after lookup
xfs: merge xfs_buf_find() and xfs_buf_get_map()
xfs: break up xfs_buf_find() into individual pieces
xfs: add in-memory iunlink log item
xfs: add log item precommit operation
xfs: combine iunlink inode update functions
xfs: clean up xfs_iunlink_update_inode()
xfs: double link the unlinked inode list
...
86 files changed, 2306 insertions, 1696 deletions
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst
index 464405d2801e..4ef419f54663 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
@@ -1,29 +1,314 @@
 .. SPDX-License-Identifier: GPL-2.0

-==========================
-XFS Delayed Logging Design
-==========================
-
-Introduction to Re-logging in XFS
-=================================
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
+==================
+XFS Logging Design
+==================
+
+Preamble
+========
+
+This document describes the design and algorithms that the XFS journalling
+subsystem is based on.
This document describes the design and algorithms that
+the XFS journalling subsystem is based on so that readers may familiarize
+themselves with the general concepts of how transaction processing in XFS works.
+
+We begin with an overview of transactions in XFS, followed by describing how
+transaction reservations are structured and accounted, and then move into how we
+guarantee forwards progress for long running transactions with finite initial
+reservation bounds. At this point we need to explain how relogging works. With
+the basic concepts covered, the design of the delayed logging mechanism is
+documented.
+
+
+Introduction
+============
+
+XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
+are atomic and recoverable. For reasons of space and time efficiency, the
+logging mechanisms are varied and complex, combining intents, logical and
+physical logging mechanisms to provide the necessary recovery guarantees the
+filesystem requires.
+
+Some objects, such as inodes and dquots, are logged in logical format where the
+details logged are made up of the changes to in-core structures rather than
+on-disk structures. Other objects - typically buffers - have their physical
+changes logged. Long running atomic modifications have individual changes
+chained together by intents, ensuring that journal recovery can restart and
+finish an operation that was only partially done when the system stopped
+functioning.
+
+The reason for these differences is to keep the amount of log space and CPU time
+required to process objects being modified as small as possible and hence the
+logging overhead as low as possible. Some items are very frequently modified,
+and some parts of objects are more frequently modified than others, so keeping
+the overhead of metadata logging low is of prime importance.
+
+The method used to log an item or chain modifications together isn't
+particularly important in the scope of this document.
It suffices to know that
+the methods used for logging a particular object or chaining modifications
+together are different and depend on the object and/or modification being
+performed. The logging subsystem only cares that certain specific rules are
+followed to guarantee forwards progress and prevent deadlocks.
+
+
+Transactions in XFS
+===================
+
+XFS has two types of high level transactions, defined by the type of log space
+reservation they take. These are known as "one shot" and "permanent"
+transactions. Permanent transaction reservations can take reservations that span
+commit boundaries, whilst "one shot" transactions are for a single atomic
+modification.
+
+The type and size of reservation must be matched to the modification taking
+place. This means that permanent transactions can be used for one-shot
+modifications, but one-shot reservations cannot be used for permanent
+transactions.
+
+In the code, a one-shot transaction pattern looks somewhat like this::
+
+        tp = xfs_trans_alloc(<reservation>)
+        <lock items>
+        <join item to transaction>
+        <do modification>
+        xfs_trans_commit(tp);
+
+As items are modified in the transaction, the dirty regions in those items are
+tracked via the transaction handle. Once the transaction is committed, all
+resources joined to it are released, along with the remaining unused reservation
+space that was taken at the transaction allocation time.
+
+In contrast, a permanent transaction is made up of multiple linked individual
+transactions, and the pattern looks like this::
+
+        tp = xfs_trans_alloc(<reservation>)
+        xfs_ilock(ip, XFS_ILOCK_EXCL)
+
+        loop {
+                xfs_trans_ijoin(tp, 0);
+                <do modification>
+                xfs_trans_log_inode(tp, ip);
+                xfs_trans_roll(&tp);
+        }
+
+        xfs_trans_commit(tp);
+        xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+While this might look similar to a one-shot transaction, there is an important
+difference: xfs_trans_roll() performs a specific operation that links two
+transactions together::
+
+        ntp = xfs_trans_dup(tp);
+        xfs_trans_commit(tp);
+        xfs_log_reserve(ntp);
+
+This results in a series of "rolling transactions" where the inode is locked
+across the entire chain of transactions. Hence while this series of rolling
+transactions is running, nothing else can read from or write to the inode and
+this provides a mechanism for complex changes to appear atomic from an external
+observer's point of view.
+
+It is important to note that a series of rolling transactions in a permanent
+transaction does not form an atomic change in the journal. While each
+individual modification is atomic, the chain is *not atomic*. If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stable storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first transaction is persisted in
+the on-disk journal.
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default.
This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+For single shot operations that need to reach stable storage immediately, or
+to ensure that a long running permanent transaction is fully committed once it
+is complete, we can explicitly tag a transaction as synchronous. This will trigger
+a "log force" to flush the outstanding committed transactions to stable storage
+in the journal and wait for that to complete.
+
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case.
This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees. Inserting
+the extent into the inode's extent map might require a split of the extent map
+btree, which requires another allocation that can modify the free space trees
+again. Then we might have to update reverse mappings, which modifies yet
+another btree which might require more space. And so on. Hence the amount of
+metadata that a "simple" operation can modify can be quite large.
+
+This "worst case" calculation provides us with the static "unit reservation"
+for the transaction that is calculated at mount time. We must guarantee that the
+log has this much space available before the transaction is allowed to proceed
+so that when we come to write the dirty metadata into the log we don't run out
+of log space half way through the write.
+
+For one-shot transactions, a single unit space reservation is all that is
+required for the transaction to proceed. For permanent transactions, however, we
+also have a "log count" that affects the size of the reservation that is to be
+made.
+
+While a permanent transaction can get by with a single unit of space
+reservation, it is somewhat inefficient to do this as it requires the
+transaction rolling mechanism to re-reserve space on every transaction roll. We
+know from the implementation of the permanent transactions how many transaction
+rolls are likely for the common modifications that need to be made.
+
+For example, an inode allocation is typically two transactions - one to
+physically allocate a free inode chunk on disk, and another to allocate an inode
+from an inode chunk that has free inodes in it.
Hence for an inode allocation
+transaction, we might set the reservation log count to a value of 2 to indicate
+that the common/fast path transaction will commit two linked transactions in a
+chain. Each time a permanent transaction rolls, it consumes an entire unit
+reservation.
+
+Hence when the permanent transaction is first allocated, the log space
+reservation is increased from a single unit reservation to multiple unit
+reservations. That multiple is defined by the reservation log count, and this
+means we can roll the transaction multiple times before we have to re-reserve
+log space when we roll the transaction. This ensures that the common
+modifications we make only need to reserve log space once.
+
+If the log count for a permanent transaction reaches zero, then it needs to
+re-reserve physical space in the log. This is somewhat complex, and requires
+an understanding of how the log accounts for space that has been reserved.
+
+
+Log Space Accounting
+====================
+
+The position in the log is typically referred to as a Log Sequence Number (LSN).
+The log is circular, so the positions in the log are defined by the combination
+of a cycle number - the number of times the log has been overwritten - and the
+offset into the log. An LSN carries the cycle in the upper 32 bits and the
+offset in the lower 32 bits. The offset is in units of "basic blocks" (512
+bytes). Hence we can do relatively simple LSN based math to keep track of
+available space in the log.
+
+Log space accounting is done via a pair of constructs called "grant heads". The
+position of the grant heads is an absolute value, so the amount of space
+available in the log is defined by the distance between the position of the
+grant head and the current log tail. That is, how much space can be
+reserved/consumed before the grant heads would fully wrap the log and overtake
+the tail position.
+
+The first grant head is the "reserve" head.
This tracks the byte count of the
+reservations currently held by active transactions. It is a purely in-memory
+accounting of the space reservation and, as such, actually tracks byte offsets
+into the log rather than basic blocks. Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue and as the log tail moves
+forward it will wake them in order once sufficient
