path: root/fs/xfs/xfs_log_priv.h
Age  Commit message  (Author, files changed, -lines/+lines)
2022-07-07  xfs: xlog_sync() manually adjusts grant head space  (Dave Chinner, 1 file, -1/+2)
When xlog_sync() rounds off the tail of the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually reserved by the transaction ticket that covers the xlog_sync() call from xlog_write(), but we don't plumb the ticket down far enough for it to account for the space consumed in the current log ticket. The grant heads are hot, so we really should be accounting this to the ticket if we can, rather than adding thousands of extra grant head updates every CIL commit. Interestingly, this actually indicates a potential log space overrun can occur when we force the log. By the time that xfs_log_force() pushes out an active iclog and consumes the roundoff space, the reservation for that roundoff space has been returned to the grant heads and is no longer covered by a reservation. In theory the roundoff added to log force on an already full log could push the write head past the tail. In practice, the CIL commit that writes to the log and needs the iclog pushed will have reserved space for roundoff, so when it releases the ticket there will still be physical space for the roundoff to be committed to the log, even though it is no longer reserved. This roundoff won't be enough space to allow a transaction to be woken if the log is full, so overruns should not actually occur in practice. That said, it indicates that we should not release the CIL context log ticket until after we've released the commit iclog. It also means that xlog_sync() still needs the direct grant head manipulation if we don't provide it with a ticket. Log forces are rare when we are in fast paths running 1.5 million transactions/s that make the grant heads hot, so let's optimise the hot case and pass CIL log tickets down to the xlog_sync() code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
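A minimal hedged sketch of the accounting choice described above (structure and function names are simplified assumptions, not the XFS source): the roundoff is charged to the caller's ticket when one is supplied, and only falls back to direct grant head manipulation when it is not:

    /* Hedged sketch: charge iclog roundoff to the commit ticket if we have one. */
    struct example_ticket {
            int     t_curr_res;     /* unused reservation, in bytes */
    };

    static void example_account_roundoff(struct example_ticket *ticket,
                                         int roundoff,
                                         void (*grant_head_sub)(int bytes))
    {
            if (ticket)
                    ticket->t_curr_res -= roundoff; /* hot path: CIL commit ticket */
            else
                    grant_head_sub(roundoff);       /* rare: log force without a ticket */
    }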
2022-07-07  xfs: convert log vector chain to use list heads  (Dave Chinner, 1 file, -2/+2)
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily with a singly linked list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07  xfs: convert CIL to unordered per cpu lists  (Dave Chinner, 1 file, -2/+1)
So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list like the busy extent list and reconstruct it before the push starts. This is what we're trying to avoid: - 75.35% 1.83% [kernel] [k] xfs_log_commit_cil - 46.35% xfs_log_commit_cil - 41.54% _raw_spin_lock - 67.30% do_raw_spin_lock 66.96% __pv_queued_spin_lock_slowpath Which happens on a 32p system when running a 32-way 'rm -rf' workload. After this patch: - 20.90% 3.23% [kernel] [k] xfs_log_commit_cil - 17.67% xfs_log_commit_cil - 6.51% xfs_log_ticket_ungrant 1.40% xfs_log_space_wake 2.32% memcpy_erms - 2.18% xfs_buf_item_committing - 2.12% xfs_buf_item_release - 1.03% xfs_buf_unlock 0.96% up 0.72% xfs_buf_rele 1.33% xfs_inode_item_format 1.19% down_read 0.91% up_read 0.76% xfs_buf_item_format - 0.68% kmem_alloc_large - 0.67% kmem_alloc 0.64% __kmalloc 0.50% xfs_buf_item_size It kinda looks like the workload is running out of log space all the time. But all the spinlock contention is gone and the transaction commit rate has gone from 800k/s to 1.3M/s so the amount of real work being done has gone up a *lot*. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07  xfs: Add order IDs to log items in CIL  (Dave Chinner, 1 file, -0/+1)
Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rules around the order in which related items must physically appear in the log even inside a single checkpoint transaction. An example of this is intents - an intent must appear in the log before its intent done record so that log recovery can cancel the intent correctly. If we have these two records misordered in the CIL, then they will not be recovered correctly by journal replay. We also will not be able to move items to the tail of the CIL list when they are relogged, hence the log items will need some mechanism to allow the correct log item order to be recreated before we write log items to the journal. Hence we need to have a mechanism for recording global order of transactions in the log items so that we can recover that order from un-ordered per-cpu lists. Do this with a simple monotonic increasing commit counter in the CIL context. Each log item in the transaction gets stamped with the current commit order ID before it is added to the CIL. If the item is already in the CIL, leave it where it is instead of moving it to the tail of the list and instead sort the list before we start the push work. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
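A hedged sketch of this stamp-then-sort idea (field and function names are illustrative assumptions, not the kernel's): items are stamped with the context's monotonic order ID at commit time, and the reconstructed list is sorted by that ID before the push formats it:

    #include <linux/list.h>
    #include <linux/list_sort.h>

    struct example_log_item {
            struct list_head        li_cil;
            unsigned int            li_order_id;    /* assumed field name */
    };

    /* stamp the item with the current commit order; do not relocate it */
    static void example_cil_stamp(struct example_log_item *lip,
                                  unsigned int commit_order_id,
                                  struct list_head *cil_items)
    {
            lip->li_order_id = commit_order_id;
            if (list_empty(&lip->li_cil))
                    list_add_tail(&lip->li_cil, cil_items);
    }

    static int example_order_cmp(void *priv, const struct list_head *a,
                                 const struct list_head *b)
    {
            struct example_log_item *la = list_entry(a, struct example_log_item, li_cil);
            struct example_log_item *lb = list_entry(b, struct example_log_item, li_cil);

            return la->li_order_id > lb->li_order_id;
    }

    /* before the push formats the checkpoint:
     *      list_sort(NULL, cil_items, example_order_cmp);
     */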
2022-07-07  xfs: track CIL ticket reservation in percpu structure  (Dave Chinner, 1 file, -0/+1)
To get it out from under the cil spinlock. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07  xfs: implement percpu cil space used calculation  (Dave Chinner, 1 file, -1/+3)
Now that we have the CIL percpu structures in place, implement the space used counter as a per-cpu counter. We have to be really careful now about ensuring that the checks and updates run without arbitrary delays, which means they need to run with pre-emption disabled. We do this by careful placement of the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures for that CPU. We need to be able to reliably detect that the CIL has reached the hard limit threshold so we can take extra reservations for the iclog headers when the space used overruns the original reservation. Hence we factor out xlog_cil_over_hard_limit() from xlog_cil_push_background(). The global CIL space used is an atomic variable that is backed by per-cpu aggregation to minimise the number of atomic updates we do to the global state in the fast path. While we are under the soft limit, we aggregate only when the per-cpu aggregation is over the proportion of the soft limit assigned to that CPU. This means that all CPUs can use all but one byte of their aggregation threshold and we will not go over the soft limit. Hence once we detect that we've gone over both a per-cpu aggregation threshold and the soft limit, we know that we have only exceeded the soft limit by one per-cpu aggregation threshold. Even if all CPUs hit this at the same time, we can't be over the hard limit, so we can run an aggregation back into the atomic counter at this point and still be under the hard limit. At this point, we will be over the soft limit and hence we'll aggregate into the global atomic used space directly rather than the per-cpu counters, hence providing accurate detection of hard limit excursion for accounting and reservation purposes. Hence we get the best of both worlds - lockless, scalable per-cpu fast path plus accurate, atomic detection of hard limit excursion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
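A hedged sketch of that aggregation scheme (names, threshold maths and limits are simplified assumptions, not the XFS code): each CPU accumulates locally and only folds into the global atomic once its share of the soft limit is consumed, or once the global counter already shows the soft limit has been crossed:

    #include <linux/types.h>
    #include <linux/atomic.h>
    #include <linux/percpu.h>
    #include <linux/cpumask.h>

    struct example_cil {
            atomic_t                space_used;     /* global aggregate, bytes */
            unsigned int __percpu   *pcp_space;     /* per-cpu accumulation */
            unsigned int            soft_limit;
    };

    static bool example_cil_account(struct example_cil *cil, unsigned int len)
    {
            unsigned int *pcp = get_cpu_ptr(cil->pcp_space); /* preemption off */
            unsigned int thresh = cil->soft_limit / num_online_cpus();
            bool over_soft = false;

            *pcp += len;
            if (*pcp > thresh ||
                atomic_read(&cil->space_used) > cil->soft_limit) {
                    /* fold the local count into the global atomic */
                    atomic_add(*pcp, &cil->space_used);
                    *pcp = 0;
                    over_soft = atomic_read(&cil->space_used) > cil->soft_limit;
            }
            put_cpu_ptr(cil->pcp_space);
            return over_soft;
    }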
2022-07-02  xfs: introduce per-cpu CIL tracking structure  (Dave Chinner, 1 file, -0/+18)
The CIL push lock is highly contended on larger machines, becoming a hard bottleneck at about 700,000 transaction commits/s on >16p machines. To address this, start moving the CIL tracking infrastructure to utilise per-CPU structures. We need to track the space used, the amount of log reservation space reserved to write the CIL, the log items in the CIL and the busy extents that need to be completed by the CIL commit. This requires a couple of per-cpu counters, an unordered per-cpu list and a globally ordered per-cpu list. Create a per-cpu structure to hold these and all the management interfaces needed, as well as the hooks to handle hotplug CPUs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
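A hedged sketch of what such a per-cpu tracking structure might contain (field names are assumptions; the real struct added to xfs_log_priv.h may differ):

    #include <linux/list.h>
    #include <linux/percpu.h>

    struct example_cil_pcp {
            unsigned int            space_used;     /* bytes added by this CPU */
            unsigned int            space_reserved; /* stolen iclog header space */
            struct list_head        busy_extents;   /* unordered per-cpu list */
            struct list_head        log_items;      /* per-cpu CIL item list */
    };

    /* allocation/teardown would use alloc_percpu()/free_percpu(), with a CPU
     * hotplug callback registered via cpuhp_setup_state() so a dying CPU's
     * lists can be drained back into the global CIL context. */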
2022-07-02  xfs: rework per-iclog header CIL reservation  (Dave Chinner, 1 file, -9/+11)
For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do this accurately with a per-cpu counter without expensive summing of the counter in every commit. However, we know what the maximum CIL size is going to be because of the hard space limit we have, and hence we know exactly how many iclogs we are going to need to write out the CIL. We are constrained by the requirement that small transactions only have reservation space for a single iclog header built into them. At commit time we don't know how much of the current transaction reservation is made up of iclog header reservations as calculated by xfs_log_calc_unit_res() when the ticket was reserved. As larger reservations have multiple header spaces reserved, we can steal more than one iclog header reservation at a time, but we only steal the exact number needed for the given log vector size delta. As a result, we don't know exactly when we are going to steal iclog header reservations, nor do we know exactly how many we are going to need for a given CIL. To make things simple, start by calculating the worst case number of iclog headers a full CIL push will require. Record this into an atomic variable in the CIL. Then add a byte counter to the log ticket that records exactly how much iclog header space has been reserved in this ticket by xfs_log_calc_unit_res(). This tells us exactly how much space we can steal from the ticket at transaction commit time. Now, at transaction commit time, we can check if the CIL has a full iclog header reservation and, if not, steal the entire reservation the current ticket holds for iclog headers. This minimises the number of times we need to do atomic operations in the fast path, but still guarantees we get all the reservations we need. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
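A hedged, simplified sketch of the stealing logic described above (structure and field names are illustrative, not the kernel's): the CIL tracks how much iclog header space it already holds, and a committing ticket hands over its entire header reservation only if the CIL is still short:

    #include <linux/atomic.h>

    struct example_ticket {
            int     t_curr_res;     /* remaining reservation, bytes */
            int     t_iclog_hdrs;   /* iclog header bytes held by this ticket */
    };

    /*
     * If the CIL does not yet hold its worst-case iclog header reservation,
     * take everything this ticket reserved for iclog headers in one go.
     */
    static void example_steal_iclog_hdr_space(atomic_t *cil_hdr_space,
                                              int space_needed,
                                              struct example_ticket *tic)
    {
            if (atomic_read(cil_hdr_space) >= space_needed)
                    return;

            atomic_add(tic->t_iclog_hdrs, cil_hdr_space);
            tic->t_curr_res -= tic->t_iclog_hdrs;
            tic->t_iclog_hdrs = 0;
    }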
2022-07-02  xfs: use the CIL space used counter for emptiness checks  (Dave Chinner, 1 file, -0/+4)
In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check if the list is empty. Replace the list_empty() checks with a flag in the CIL to indicate we have committed at least one transaction to the CIL and hence the CIL is not empty. We need this flag to be an atomic so that we can clear it without holding any locks in the commit fast path, but we also need to be careful to avoid atomic operations in the fast path. Hence we use the fact that test_bit() is not an atomic op to first check if the flag is set and then run the atomic test_and_clear_bit() operation to clear it and steal the initial unit reservation for the CIL context checkpoint. When we are switching to a new context in a push, we place the setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This allows all the other places that need to check whether the CIL is empty to use test_bit() and still be serialised correctly with the CIL context swaps that set the bit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
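A hedged sketch of the check-then-clear pattern this relies on (the flag bit and helper name below are simplified stand-ins for the real XLOG_CIL_EMPTY handling):

    #include <linux/types.h>
    #include <linux/bitops.h>

    #define EXAMPLE_CIL_EMPTY       0

    static bool example_take_empty_reservation(unsigned long *xc_flags)
    {
            /* cheap non-atomic read first: almost always already clear */
            if (!test_bit(EXAMPLE_CIL_EMPTY, xc_flags))
                    return false;

            /* only the winner of the atomic clear steals the unit reservation */
            return test_and_clear_bit(EXAMPLE_CIL_EMPTY, xc_flags);
    }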
2022-05-27  xfs: refactor buffer cancellation table allocation  (Darrick J. Wong, 1 file, -3/+0)
Move the code that allocates and frees the buffer cancellation tables used by log recovery into the file that actually uses the tables. This is a precursor to some cleanups and a memory leak fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-12  xfs: can't use kmem_zalloc() for attribute buffers  (Dave Chinner, 1 file, -0/+34)
Because heap allocation of 64kB buffers will fail: .... XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) .... I'd use kvmalloc() instead, but.... - 48.19% xfs_attr_create_intent - 46.89% xfs_attri_init - kvmalloc_node - 46.04% __kmalloc_node - kmalloc_large_node - 45.99% __alloc_pages - 39.39% __alloc_pages_slowpath.constprop.0 - 38.89% __alloc_pages_direct_compact - 38.71% try_to_compact_pages - compact_zone_order - compact_zone - 21.09% isolate_migratepages_block 10.31% PageHuge 5.82% set_pfnblock_flags_mask 0.86% get_pfnblock_flags_mask - 4.48% __reset_isolation_suitable 4.44% __reset_isolation_pfn - 3.56% __pageblock_pfn_to_page 1.33% pfn_to_online_page 2.83% get_pfnblock_flags_mask - 0.87% migrate_pages 0.86% compaction_alloc 0.84% find_suitable_fallback - 6.60% get_page_from_freelist 4.99% clear_page_erms - 1.19% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.86% __vmalloc_node_range 0.65% __alloc_pages_bulk .... this is just yet another reminder of how much kvmalloc() sucks. So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use that instead.... We also clean up the attribute name and value lengths as they no longer need to be rounded out to sizes compatible with log vectors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
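A hedged approximation of the xlog_kvmalloc() shape the commit describes (the exact flags and structure in the kernel source may differ): try a cheap, no-direct-reclaim kmalloc first and fall straight back to vmalloc, rather than letting kvmalloc() grind through compaction:

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static inline void *example_log_kvmalloc(size_t size)
    {
            gfp_t   flags = GFP_KERNEL;
            void    *p;

            flags &= ~__GFP_DIRECT_RECLAIM;         /* no compaction/reclaim stalls */
            flags |= __GFP_NOWARN | __GFP_NORETRY;  /* fail fast and quietly */
            do {
                    p = kmalloc(size, flags);
                    if (!p)
                            p = vmalloc(size);      /* fall back to vmap space */
            } while (!p);
            return p;
    }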
2022-04-21  Merge branch 'guilt/xlog-write-rework' into xfs-5.19-for-next  (Dave Chinner, 1 file, -43/+4)
2022-04-21  xfs: convert log ticket and iclog flags to unsigned.  (Dave Chinner, 1 file, -4/+4)
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: CIL context doesn't need to count iovecs  (Dave Chinner, 1 file, -1/+0)
Now that we account for log opheaders in the log item formatting code, we don't actually use the aggregated count of log iovecs in the CIL for anything. Remove it and the tracking code that calculates it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: xlog_write() doesn't need optype anymore  (Dave Chinner, 1 file, -1/+1)
So remove it from the interface and callers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: remove xlog_verify_dest_ptr  (Christoph Hellwig, 1 file, -4/+0)
Just check that the offset in xlog_write_vec is smaller than the iclog size and remove the expensive cycling through all iclogs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: introduce xlog_write_partial()  (Dave Chinner, 1 file, -8/+0)
Re-implement writing of a log vector that does not fit into the current iclog. The iclog will already be in XLOG_STATE_WANT_SYNC because xlog_get_iclog_space() will have reserved all the remaining iclog space for us, hence we can simply iterate over the iovecs in the log vector getting more iclog space until the entire log vector is written. Handling this partial write case separately means we do not need to pass unnecessary state around for the common, fast path case when the log vector fits entirely within the current iclog. It isolates the complexity and allows us to modify and improve the partial write case without impacting the simple fast path. This change includes several improvements incorporated from patches written by Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: change the type of ic_datap  (Christoph Hellwig, 1 file, -1/+1)
Turn ic_datap from a char into a void pointer given that it points to arbitrary data. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [dgc: also remove (char *) cast in xlog_alloc_log()] Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: pass lv chain length into xlog_write()  (Dave Chinner, 1 file, -1/+1)
The caller of xlog_write() usually has a close accounting of the aggregated vector length contained in the log vector chain passed to xlog_write(). There is no need to iterate the chain to calculate the length of the data in xlog_write_calculate_len() if the caller is already iterating that chain to build it. Passing in the vector length avoids doing an extra chain iteration, which can be a significant amount of work given that large CIL commits can have hundreds of thousands of vectors attached to the chain. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: log ticket region debug is largely useless  (Dave Chinner, 1 file, -20/+0)
xlog_tic_add_region() is used to trace the regions being added to a log ticket to provide information in the situation where a ticket reservation overrun occurs. The information gathered is stored in the ticket, and dumped if xlog_print_tic_res() is called. For a front end struct xfs_trans overrun, the ticket only contains reservation tracking information - the ticket is never handed to the log so has no regions attached to it. The overrun debug information in this case comes from xlog_print_trans(), which walks the items attached to the transaction and dumps their attached formatted log vectors directly. It also dumps the ticket state, but that only contains reservation accounting and nothing else. Hence xlog_print_tic_res() never dumps region or overrun information from this path. xlog_tic_add_region() is actually called from xlog_write(), which means it is being used to track the regions seen in a CIL checkpoint log vector chain. In looking at CIL behaviour recently, I've seen 32MB checkpoints regularly exceed 250,000 regions in the LV chain. The log ticket debug code can track *15* regions. IOWs, if there is a ticket overrun in the CIL code, the ticket region tracking code is going to be completely useless for determining what went wrong. The only thing it can tell us is how much of an overrun occurred, and we really don't need extra debug information in the log ticket to tell us that. Indeed, the main place we call xlog_tic_add_region() is also adding up the number of regions and the space used so that xlog_write() knows how much will be written to the log. This is exactly the same information that log ticket is storing once we take away the useless region tracking array. Hence xlog_tic_add_region() is not useful, but can be called 250,000 times per CIL push... Just strip all that debug "information" out of the log ticket and only have it report reservation space information when an overrun occurs. This also reduces the size of a log ticket down by about 150 bytes... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21  xfs: log tickets don't need log client id  (Dave Chinner, 1 file, -8/+2)
We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount records, and so anything using a high level transaction allocated through xfs_trans_alloc() does not need a log ticket client ID to be set. For the CIL checkpoint, the client ID written to the journal is always XFS_TRANSACTION, and for the unmount record it is always XFS_LOG, and nothing else writes to the log. All of these operations tell xlog_write() exactly what they need to write to the log (the optype) and build their own opheaders for start, commit and unmount records. Hence we no longer need to set the client id in either the log ticket or the xfs_trans. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-03-29  xfs: drop async cache flushes from CIL commits.  (Dave Chinner, 1 file, -2/+1)
Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally"). Whilst developing the journal flush/fua optimisations this cache flush was part of, it appeared to make a significant difference to performance. However, now that this patchset has settled and all the correctness issues fixed, there does not appear to be any significant performance benefit to asynchronous cache flushes. In fact, the opposite is true on some storage types and workloads, where additional cache flushes that can occur from fsync heavy workloads have measurable and significant impact on overall throughput. Local dbench testing shows little difference on dbench runs with sync vs async cache flushes on either fast or slow SSD storage, and no difference in streaming concurrent async transaction workloads like fs-mark. Fast NVME storage. From `dbench -t 30`, CIL scale: clients async sync BW Latency BW Latency 1 935.18 0.855 915.64 0.903 8 2404.51 6.873 2341.77 6.511 16 3003.42 6.460 2931.57 6.529 32 3697.23 7.939 3596.28 7.894 128 7237.43 15.495 7217.74 11.588 512 5079.24 90.587 5167.08 95.822 fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize create chown unlink async 1m41s 1m16s 2m03s sync 1m40s 1m19s 1m54s Slower SATA SSD storage: From `dbench -t 30`, CIL scale: clients async sync BW Latency BW Latency 1 78.59 15.792 83.78 10.729 8 367.88 92.067 404.63 59.943 16 564.51 72.524 602.71 76.089 32 831.66 105.984 870.26 110.482 128 1659.76 102.969 1624.73 91.356 512 2135.91 223.054 2603.07 161.160 fsmark, 16 threads, create w/32k logbsize create unlink async 5m06s 4m15s sync 5m00s 4m22s And on Jan's test machine: 5.18-rc8-vanilla 5.18-rc8-patched Amean 1 71.22 ( 0.00%) 64.94 * 8.81%* Amean 2 93.03 ( 0.00%) 84.80 * 8.85%* Amean 4 150.54 ( 0.00%) 137.51 * 8.66%* Amean 8 252.53 ( 0.00%) 242.24 * 4.08%* Amean 16 454.13 ( 0.00%) 439.08 * 3.31%* Amean 32 835.24 ( 0.00%) 829.74 * 0.66%* Amean 64 1740.59 ( 0.00%) 1686.73 * 3.09%* Performance and cache flush behaviour is restored to pre-regression levels. As such, we can now consider the async cache flush mechanism an unnecessary exercise in premature optimisation and hence we can now remove it and the infrastructure it requires completely. Fixes: bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally") Reported-and-tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-29  xfs: xfs_do_force_shutdown needs to block racing shutdowns  (Dave Chinner, 1 file, -0/+11)
When we call xfs_forced_shutdown(), the caller often expects the filesystem to be completely shut down when it returns. However, if we have racing xfs_forced_shutdown() calls, the first caller sets the mount shutdown flag then goes to shutdown the log. The second caller sees the mount shutdown flag and returns immediately - it does not wait for the log to be shut down. Unfortunately, xfs_forced_shutdown() is used in some places that expect it to completely shut down the filesystem before it returns (e.g. xfs_trans_log_inode()). As such, returning before the log has been shut down leaves us in a place where the transaction failed to complete correctly but we still call xfs_trans_commit(). This situation arises because xfs_trans_log_inode() does not return an error and instead calls xfs_force_shutdown() to ensure that the transaction being committed is aborted. Unfortunately, we have a race condition where xfs_trans_commit() needs to check xlog_is_shutdown() because it can't abort log items before the log is shut down, but it needs to use xfs_is_shutdown() because xfs_forced_shutdown() does not block waiting for the log to shut down. To fix this conundrum, first we make all calls to xfs_forced_shutdown() block until the log is also shut down. This means we can then safely use xfs_forced_shutdown() as a mechanism that ensures the currently running transaction will be aborted by xfs_trans_commit() regardless of the shutdown check it uses. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-22  xfs: rename _zone variables to _cache  (Darrick J. Wong, 1 file, -1/+1)
Now that we've gotten rid of the kmem_zone_t typedef, rename the variables to _cache since that's what they are. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22  xfs: remove kmem_zone typedef  (Darrick J. Wong, 1 file, -1/+1)
Remove these typedefs by referencing kmem_cache directly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-08-16  xfs: move the CIL workqueue to the CIL  (Dave Chinner, 1 file, -0/+1)
We only use the CIL workqueue in the CIL, so it makes no sense to hang it off the xfs_mount and have to walk multiple pointers back up to the mount when we have the CIL structures right there. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: CIL work is serialised, not pipelined  (Dave Chinner, 1 file, -1/+1)
Because we use a single work structure attached to the CIL rather than the CIL context, we can only queue a single work item at a time. This results in the CIL being single threaded and limits performance when it becomes CPU bound. The design of the CIL is that it is pipelined and multiple commits can be running concurrently, but the way the work is currently implemented means that it is not pipelining as it was intended. The critical work to switch the CIL context can take a few milliseconds to run, but the rest of the CIL context flush can take hundreds of milliseconds to complete. The context switching is the serialisation point of the CIL, once the context has been switched the rest of the context push can run asynchronously with all other context pushes. Hence we can move the work to the CIL context so that we can run multiple CIL pushes at the same time and spread the majority of the work out over multiple CPUs. We can keep the per-cpu CIL commit state on the CIL rather than the context, because the context is pinned to the CIL until the switch is done and we aggregate and drain the per-cpu state held on the CIL during the context switch. However, because we no longer serialise the CIL work, we can have effectively unlimited CIL pushes in progress. We don't want to do this - not only does it create contention on the iclogs and the state machine locks, we can run the log right out of space with outstanding pushes. Instead, limit the work concurrency to 4 concurrent works being processed at a time. This is enough concurrency to remove the CIL from being a CPU bound bottleneck but not enough to create new contention points or unbound concurrency issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
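A hedged sketch of how such a concurrency bound can be expressed (the workqueue name and flags below are assumptions; the real XFS allocation uses additional flags): each CIL context carries its own work_struct, and the workqueue's max_active parameter caps how many pushes run at once:

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *example_cil_wq;

    static int example_cil_wq_init(void)
    {
            /* up to 4 CIL push works processed concurrently */
            example_cil_wq = alloc_workqueue("example-cil", WQ_UNBOUND, 4);
            return example_cil_wq ? 0 : -ENOMEM;
    }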
2021-08-16  xfs: AIL needs asynchronous CIL forcing  (Dave Chinner, 1 file, -0/+4)
The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from AIL when the checkpoint that marks the items stale is committed to the journal. This results in relatively few items in the AIL, but those that are there are often pinned, as the directories the items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon: .... 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 270ms delay> 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 400.099623: xfsaild: first lsn 0x11002f3600 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 <wanted 50ms, got 500ms delay> 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 400.589349: xfsaild: first lsn 0x1100305000 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 <wanted 50ms, got 460ms delay> 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 400.950343: xfsaild: first lsn 0x1100317c00 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 <wanted 20ms, got 200ms delay> 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 401.142334: xfsaild: first lsn 0x110032e600 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 <wanted 10ms, got 10ms delay> 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 401.154328: xfsaild: first lsn 0x1100353000 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 300ms delay> 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 401.451526: xfsaild: first lsn 0x1100353000 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 <wanted 50ms, got 500ms delay> 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 .... In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that create stale items which never reach the AIL but still consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available. The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue.
Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: order CIL checkpoint start records  (Dave Chinner, 1 file, -0/+1)
Because log recovery depends on strictly ordered start records as well as strictly ordered commit records. This is a zero day bug in the way XFS writes pipelined transactions to the journal which is exposed by fixing the zero day bug that prevents the CIL from pipelining checkpoints. This re-introduces explicit concurrent commits back into the on-disk journal and hence out of order start records. The XFS journal commit code has never ordered start records and we have relied on strict commit record ordering for correct recovery ordering of concurrently written transactions. Unfortunately, root cause analysis uncovered the fact that log recovery uses the LSN of the start record for transaction commit processing. Hence, whilst the commits are processed in strict order by recovery, the LSNs associated with the commits can be out of order and so recovery may stamp incorrect LSNs into objects and/or misorder intents in the AIL for later processing. This can result in log recovery failures and/or on disk corruption, sometimes silent. Because this is a long standing log recovery issue, we can't just fix log recovery and call it good. This still leaves older kernels susceptible to recovery failures and corruption when replaying a log from a kernel that pipelines checkpoints. There is also the issue that in-memory ordering for AIL pushing and data integrity operations are based on checkpoint start LSNs, and if the start LSN is incorrect in the journal, it is also incorrect in memory. Hence there's really only one choice for fixing this zero-day bug: we need to strictly order checkpoint start records in ascending sequence order in the log, the same way we already strictly order commit records. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state()  (Dave Chinner, 1 file, -1/+2)
Now that we have a mechanism to guarantee that the callbacks attached to an iclog are owned by the context that attaches them until they drop their reference to the iclog via xlog_state_release_iclog(), we can attach callbacks to the iclog at any time we have an active reference to the iclog. xlog_state_get_iclog_space() always guarantees that the commit record will fit in the iclog it returns, so we can move this IO callback setting to xlog_cil_set_ctx_write_state(), record the commit iclog in the context and remove the need for the commit iclog to be returned by xlog_write() altogether. This, in turn, allows us to move the wakeup for ordered commit record writes up into xlog_cil_set_ctx_write_state(), too, because we have been guaranteed that this commit record will be physically located in the iclog before any waiting commit record at a higher sequence number will be granted iclog space. This further cleans up the post commit record write processing in the CIL push code, especially as xlog_state_release_iclog() will now clean up the context when shutdown errors occur. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: pass a CIL context to xlog_write()  (Dave Chinner, 1 file, -2/+5)
Pass the CIL context to xlog_write() rather than a pointer to a LSN variable. Only the CIL checkpoint calls to xlog_write() need to know about the start LSN of the writes, so rework xlog_write to directly write the LSNs into the CIL context structure. This removes the commit_lsn variable from xlog_cil_push_work(), so now we only have to issue the commit record ordering wakeup from there. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: move xlog_commit_record to xfs_log_cil.c  (Dave Chinner, 1 file, -2/+0)
It is only used by the CIL checkpoints, and is the counterpart to start record formatting and writing that is already local to xfs_log_cil.c. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: convert log flags to an operational state field  (Dave Chinner, 1 file, -11/+23)
log->l_flags doesn't actually contain "flags" as such, it contains operational state information that can change at runtime. For the shutdown state, this at least should be an atomic bit because it is read without holding locks in many places and so using atomic bitops for the state field modifications makes sense. This allows us to use things like test_and_set_bit() on state changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the state when we aren't holding locks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
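A hedged sketch of replacing a flags word with atomic state bits (the bit name mirrors the XLOG_TAIL_WARN example mentioned above, but the surrounding structure is a simplified assumption):

    #include <linux/bitops.h>
    #include <linux/printk.h>

    enum example_log_state {
            EXAMPLE_TAIL_WARN,      /* tail-space warning already issued */
            EXAMPLE_SHUTDOWN,       /* log has been shut down */
    };

    static void example_warn_tail_space(unsigned long *l_opstate)
    {
            /* only the first racer to set the bit emits the warning */
            if (!test_and_set_bit(EXAMPLE_TAIL_WARN, l_opstate))
                    pr_warn("log tail space warning\n");
    }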
2021-08-16  xfs: XLOG_STATE_IOERROR must die  (Dave Chinner, 1 file, -3/+1)
We don't need an iclog state field to tell us the log has been shut down. We can just check xlog_is_shutdown() instead. This avoids the need to have shutdown overwrite the current iclog state while it is being actively used by the log code and so having to ensure that every iclog state check handles XLOG_STATE_IOERROR appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16  xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()  (Dave Chinner, 1 file, -2/+5)
Make it less shouty and a static inline before adding more calls through the log code. Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount) to use xlog_is_shutdown(log) as well. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
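A hedged sketch of that macro-to-static-inline conversion (bit and field names below are simplified stand-ins for the real ones in xfs_log_priv.h):

    #include <linux/types.h>
    #include <linux/bitops.h>

    #define EXAMPLE_IO_ERROR        1       /* assumed shutdown state bit */

    struct example_xlog {
            unsigned long   l_opstate;      /* operational state bits */
    };

    static inline bool example_log_is_shutdown(struct example_xlog *log)
    {
            return test_bit(EXAMPLE_IO_ERROR, &log->l_opstate);
    }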
2021-08-09  xfs: clear log incompat feature bits when the log is idle  (Darrick J. Wong, 1 file, -0/+3)
When there are no ongoing transactions and the log contents have been checkpointed back into the filesystem, the log performs 'covering', which is to say that it logs a dummy transaction to record the fact that the tail has caught up with the head. This is a good time to clear log incompat feature flags, because they are flags that are temporarily set to limit the range of kernels that can replay a dirty log. Since it's possible that some other higher level thread is about to start logging items protected by a log incompat flag, we create a rwsem so that upper level threads can coordinate this with the log. It would probably be more performant to use a percpu rwsem, but the ability to /try/ taking the write lock during covering is critical, and percpu rwsems do not provide that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
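A hedged sketch of the "try the write lock while covering" idea (the rwsem name and helper functions are illustrative assumptions, not the kernel's interfaces):

    #include <linux/types.h>
    #include <linux/rwsem.h>

    static DECLARE_RWSEM(example_incompat_users);

    /* upper layers hold the read side while logging incompat-protected items */
    static void example_incompat_get(void)
    {
            down_read(&example_incompat_users);
    }

    static void example_incompat_put(void)
    {
            up_read(&example_incompat_users);
    }

    /* log covering only clears feature bits if it can win the write side now */
    static bool example_try_clear_incompat(void)
    {
            if (!down_write_trylock(&example_incompat_users))
                    return false;
            /* ... clear log incompat feature bits here ... */
            up_write(&example_incompat_users);
            return true;
    }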
2021-07-29  xfs: need to see iclog flags in tracing  (Dave Chinner, 1 file, -3/+10)
Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29  xfs: fix ordering violation between cache flushes and tail updates  (Dave Chinner, 1 file, -1/+2)
There is a race between the new CIL async data device metadata IO completion cache flush and the log tail in the iclog the flush covers being updated. This can be seen by repeating generic/482 in a loop and eventually log recovery fails with a failures such as this: XFS (dm-3): Starting recovery (logdev: internal) XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0) XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify XFS (dm-3): Unmount and run xfs_repair XFS (dm-3): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117 Analysis of the logwrite replay shows that there were no writes to the data device between the FUA @ write 124 and the FUA at write @ 125, but log recovery @ 125 failed. The difference was the one log write @ 125 moved the tail of the log forwards from (1,8) to (1,32) and so the inode create intent in (1,8) was not replayed and so the inode cluster was zero on disk when replay of the first inode item in (1,32) was attempted. What this meant was that the journal write that occurred at @ 125 did not ensure that metadata completed before the iclog was written was correctly on stable storage. The tail of the log moved forward, so IO must have been completed between the two iclog writes. This means that there is a race condition between the unconditional async cache flush in the CIL push work and the tail LSN that is written to the iclog. This happens like so: CIL push work AIL push work ------------- ------------- Add to committing list start async data dev cache flush ..... <flush completes> <all writes to old tail lsn are stable> xlog_write .... push inode create buffer <start IO> ..... xlog_write(commit record) .... <IO completes> log tail moves xlog_assign_tail_lsn() start_lsn == commit_lsn <no iclog preflush!> xlog_state_release_iclog __xlog_state_release_iclog() <writes *new* tail_lsn into iclog> xlog_sync() .... submit_bio() <tail in