summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent-tree.c
AgeCommit message (Collapse)AuthorFilesLines
2022-12-05btrfs: split the bio submission path into a separate fileChristoph Hellwig1-0/+1
The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use btrfs_dev_name() helper to handle missing devices betterQu Wenruo1-1/+1
[BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: concentrate all tree block parentness check parameters into one structureQu Wenruo1-4/+8
There are several different tree block parentness check parameters used across several helpers: - level Mandatory - transid Under most cases it's mandatory, but there are several backref cases which skips this check. - owner_root - first_key Utilized by most top-down tree search routine. Otherwise can be skipped. Those four members are not always mandatory checks, and some of them are the same u64, which means if some arguments got swapped compiler will not catch it. Furthermore if we're going to further expand the parentness check, we need to modify quite some helpers just to add one more parameter. This patch will concentrate all these members into a structure called btrfs_tree_parent_check, and pass that structure for the following helpers: - btrfs_read_extent_buffer() - read_tree_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move orphan prototypes into orphan.hJosef Bacik1-0/+1
Move these out of ctree.h into orphan.h to cut down on code in ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file-item prototypes into their own headerJosef Bacik1-0/+1
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move root tree prototypes to their own headerJosef Bacik1-0/+1
Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move extent-tree helpers into their own header fileJosef Bacik1-0/+1
Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move btrfs_account_ro_block_groups_free_space into space-info.cJosef Bacik1-34/+0
This was prototyped in ctree.h and the code existed in extent-tree.c, but it's space-info related so move it into space-info.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik1-0/+1
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move fs wide helpers out of ctree.hJosef Bacik1-0/+1
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-24btrfs: fix tree mod log mishandling of reallocated nodesJosef Bacik1-12/+13
We have been seeing the following panic in production kernel BUG at fs/btrfs/tree-mod-log.c:677! invalid opcode: 0000 [#1] SMP RIP: 0010:tree_mod_log_rewind+0x1b4/0x200 RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293 RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000 RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00 RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001 R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00 FS: 00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: btrfs_get_old_root+0x12b/0x420 btrfs_search_old_slot+0x64/0x2f0 ? tree_mod_log_oldest_root+0x3d/0xf0 resolve_indirect_ref+0xfd/0x660 ? ulist_alloc+0x31/0x60 ? kmem_cache_alloc_trace+0x114/0x2c0 find_parent_nodes+0x97a/0x17e0 ? ulist_alloc+0x30/0x60 btrfs_find_all_roots_safe+0x97/0x150 iterate_extent_inodes+0x154/0x370 ? btrfs_search_path_in_tree+0x240/0x240 iterate_inodes_from_logical+0x98/0xd0 ? btrfs_search_path_in_tree+0x240/0x240 btrfs_ioctl_logical_to_ino+0xd9/0x180 btrfs_ioctl+0xe2/0x2ec0 ? __mod_memcg_lruvec_state+0x3d/0x280 ? do_sys_openat2+0x6d/0x140 ? kretprobe_dispatcher+0x47/0x70 ? kretprobe_rethook_handler+0x38/0x50 ? rethook_trampoline_handler+0x82/0x140 ? arch_rethook_trampoline_callback+0x3b/0x50 ? kmem_cache_free+0xfb/0x270 ? do_sys_openat2+0xd5/0x140 __x64_sys_ioctl+0x71/0xb0 do_syscall_64+0x2d/0x40 Which is this code in tree_mod_log_rewind() switch (tm->op) { case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING: BUG_ON(tm->slot < n); This occurs because we replay the nodes in order that they happened, and when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot, starting at 0. 'n' here is the number of items in this block, which in this case was 1, but we had 2 REMOVE_WHILE_FREEING operations. The actual root cause of this was that we were replaying operations for a block that shouldn't have been replayed. Consider the following sequence of events 1. We have an already modified root, and we do a btrfs_get_tree_mod_seq(). 2. We begin removing items from this root, triggering KEY_REPLACE for it's child slots. 3. We remove one of the 2 children this root node points to, thus triggering the root node promotion of the remaining child, and freeing this node. 4. We modify a new root, and re-allocate the above node to the root node of this other root. The tree mod log looks something like this logical 0 op KEY_REPLACE (slot 1) seq 2 logical 0 op KEY_REMOVE (slot 1) seq 3 logical 0 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 4 logical 4096 op LOG_ROOT_REPLACE (old logical 0) seq 5 logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 1) seq 6 logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 7 logical 0 op LOG_ROOT_REPLACE (old logical 8192) seq 8 >From here the bug is triggered by the following steps 1. Call btrfs_get_old_root() on the new_root. 2. We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is currently logical 0. 3. tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which gives us the KEY_REPLACE seq 2, and since that's not a LOG_ROOT_REPLACE we incorrectly believe that we don't have an old root, because we expect that the most recent change should be a LOG_ROOT_REPLACE. 4. Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE, so we don't set old_root, we simply use our existing extent buffer. 5. Since we're using our existing extent buffer (logical 0) we call tree_mod_log_search(0) in order to get the newest change to start the rewind from, which ends up being the LOG_ROOT_REPLACE at seq 8. 6. Again since we didn't find an old_root we simply clone logical 0 at it's current state. 7. We call tree_mod_log_rewind() with the cloned extent buffer. 8. Set n = btrfs_header_nritems(logical 0), which would be whatever the original nritems was when we COWed the original root, say for this example it's 2. 9. We start from the newest operation and work our way forward, so we see LOG_ROOT_REPLACE which we ignore. 10. Next we see KEY_REMOVE_WHILE_FREEING for slot 0, which triggers the BUG_ON(tm->slot < n), because it expects if we've done this we have a completely empty extent buffer to replay completely. The correct thing would be to find the first LOG_ROOT_REPLACE, and then get the old_root set to logical 8192. In fact making that change fixes this particular problem. However consider the much more complicated case. We have a child node in this tree and the above situation. In the above case we freed one of the child blocks at the seq 3 operation. If this block was also re-allocated and got new tree mod log operations we would have a different problem. btrfs_search_old_slot(orig root) would get down to the logical 0 root that still pointed at that node. However in btrfs_search_old_slot() we call tree_mod_log_rewind(buf) directly. This is not context aware enough to know which operations we should be replaying. If the block was re-allocated multiple times we may only want to replay a range of operations, and determining what that range is isn't possible to determine. We could maybe solve this by keeping track of which root the node belonged to at every tree mod log operation, and then passing this around to make sure we're only replaying operations that relate to the root we're trying to rewind. However there's a simpler way to solve this problem, simply disallow reallocations if we have currently running tree mod log users. We already do this for leaf's, so we're simply expanding this to nodes as well. This is a relatively uncommon occurrence, and the problem is complicated enough I'm worried that we will still have corner cases in the reallocation case. So fix this in the most straightforward way possible. Fixes: bd989ba359f2 ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 3.3+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: set generation before calling btrfs_clean_tree_block in ↵Tetsuo Handa1-0/+3
btrfs_init_new_buffer syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for commit bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code") missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must not be moved to after clean_tree_block() because clean_tree_block() is calling btrfs_header_generation() since commit 55c69072d6bd5be1 ("Btrfs: Fix extent_buffer usage when nodesize != leafsize"). Since memzero_extent_buffer() will reset "struct btrfs_header" part, we can't move btrfs_set_header_generation() to before memzero_extent_buffer(). Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block(). Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1] Reported-by: syzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com> Fixes: bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: make can_nocow_extent nowait compatibleJosef Bacik1-0/+6
If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: speedup checking for extent sharedness during fiemapFilipe Manana1-1/+9
One of the most expensive tasks performed during fiemap is to check if an extent is shared. This task has two major steps: 1) Check if the data extent is shared. This implies checking the extent item in the extent tree, checking delayed references, etc. If we find the data extent is directly shared, we terminate immediately; 2) If the data extent is not directly shared (its extent item has a refcount of 1), then it may be shared if we have snapshots that share subtrees of the inode's subvolume b+tree. So we check if the leaf containing the file extent item is shared, then its parent node, then the parent node of the parent node, etc, until we reach the root node or we find one of them is shared - in which case we stop immediately. During fiemap we process the extents of a file from left to right, from file offset 0 to EOF. This means that we iterate b+tree leaves from left to right, and has the implication that we keep repeating that second step above several times for the same b+tree path of the inode's subvolume b+tree. For example, if we have two file extent items in leaf X, and the path to leaf X is A -> B -> C -> X, then when we try to determine if the data extent referenced by the first extent item is shared, we check if the data extent is shared - if it's not, then we check if leaf X is shared, if not, then we check if node C is shared, if not, then check if node B is shared, if not than check if node A is shared. When we move to the next file extent item, after determining the data extent is not shared, we repeat the checks for X, C, B and A - doing all the expensive searches in the extent tree, delayed refs, etc. If we have thousands of tile extents, then we keep repeating the sharedness checks for the same paths over and over. On a file that has no shared extents or only a small portion, it's easy to see that this scales terribly with the number of extents in the file and the sizes of the extent and subvolume b+trees. This change eliminates the repeated sharedness check on extent buffers by caching the results of the last path used. The results can be used as long as no snapshots were created since they were cached (for not shared extent buffers) or no roots were dropped since they were cached (for shared extent buffers). This greatly reduces the time spent by fiemap for files with thousands of extents and/or large extent and subvolume b+trees. Example performance test: $ cat fiemap-perf-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 40G gives 327680 128K file extents (due to compression). xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before this patch: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 3597 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 2107 milliseconds (metadata cached) After this patch: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1646 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 698 milliseconds (metadata cached) That's about 2.2x faster when no metadata is cached, and about 3x faster when all metadata is cached. On a real filesystem with many other files, data, directories, etc, the b+trees will be 2 or 3 levels higher, therefore this optimization will have a higher impact. Several reports of a slow fiemap show up often, the two Link tags below refer to two recent reports of such slowness. This patch, together with the next ones in the series, is meant to address that. Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: get rid of block group caching progress logicOmar Sandoval1-7/+2
struct btrfs_caching_ctl::progress and struct btrfs_block_group::last_byte_to_unpin were previously needed to ensure that unpin_extent_range() didn't return a range to the free space cache before the caching thread had a chance to cache that range. However, the commit "btrfs: fix space cache corruption and potential double allocations" made it so that we always synchronously cache the block group at the time that we pin the extent, so this machinery is no longer necessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: convert block group bit field to use bit helpersJosef Bacik1-2/+3
We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23btrfs: fix space cache corruption and potential double allocationsOmar Sandoval1-24/+6
When testing space_cache v2 on a large set of machines, we encountered a few symptoms: 1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree. All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time. Specifically, the race is as follows: 1. An extent is deleted from an uncached block group in transaction A. 2. btrfs_commit_transaction() is called for transaction A. 3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent. 4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree. 5. do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start. 6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet. 7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX. 8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED. 9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances. 10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX. 11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again! This explains all of our symptoms above: * If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST. * If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated *again*, the free space tree will be corrupted (namely, the wrong item will be deleted). * If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated. The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug. The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions. A few extra changes fall out of this change. Namely, this fix does the following, with step 2 being the crucial fix: 1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to. 2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option. 3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done(). 4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning. 5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead. 6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore. Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17btrfs: fix lockdep splat with reloc root extent buffersJosef Bacik1-1/+17
We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); *** DEADLOCK *** 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has a owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merged. This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: write out partially allocated regionNaohiro Aota1-0/+10
cow_file_range() works in an all-or-nothing way: if it fails to allocate an extent for a part of the given region, it gives up all the region including the successfully allocated parts. On cow_file_range(), run_delalloc_zoned() writes data for the region only when it successfully allocate all the region. This all-or-nothing allocation and write-out are problematic when available space in all the block groups are get tight with the active zone restriction. btrfs_reserve_extent() try hard to utilize the left space in the active block groups and gives up finally and fails with -ENOSPC. However, if we send IOs for the successfully allocated region, we can finish a zone and can continue on the rest of the allocation on a newly allocated block group. This patch implements the partial write-out for run_delalloc_zoned(). With this patch applied, cow_file_range() returns -EAGAIN to tell the caller to do something to progress the further allocation, and tells the successfully allocated region with done_offset. Furthermore, the zoned extent allocator returns -EAGAIN to tell cow_file_range() going back to the caller side. Actually, we still need to wait for an IO to complete to continue the allocation. The next patch implements that part. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: finish least available block group on data bg allocationNaohiro Aota1-10/+40
When we run out of active zones and no sufficient space is left in any block groups, we need to finish one block group to make room to activate a new block group. However, we cannot do this for metadata block groups because we can cause a deadlock by waiting for a running transaction commit. So, do that only for a data block group. Furthermore, the block group to be finished has two requirements. First, the block group must not have reserved bytes left. Having reserved bytes means we have an allocated region but did not yet send bios for it. If that region is allocated by the thread calling btrfs_zone_finish(), it results in a deadlock. Second, the block group to be finished must not be a SYSTEM block group. Finishing a SYSTEM block group easily breaks further chunk allocation by nullifying the SYSTEM free space. In a certain case, we cannot find any zone finish candidate or btrfs_zone_finish() may fail. In that case, we fall back to split the allocation bytes and fill the last spaces left in the block groups. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: let can_allocate_chunk return errorNaohiro Aota1-7/+8
For the later patch, convert the return type from bool to int and return errors. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: use named constant for reserved device spaceQu Wenruo1-3/+3
There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: split discard handling out of btrfs_map_blockChristoph Hellwig1-44/+28
Mapping block for discard doesn't really share any code with the regular block mapping case. Split it out into an entirely separate helper that just returns an array of btrfs_discard_stripe structures and the number of stripes. This removes the need for the length field in the btrfs_io_context structure, so remove tht. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-16Merge tag 'for-5.19-rc7-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs reverts from David Sterba: "Due to a recent report [1] we need to revert the radix tree to xarray conversion patches. There's a problem with sleeping under spinlock, when xa_insert could allocate memory under pressure. We use GFP_NOFS so this is a real problem that we unfortunately did not discover during review. I'm sorry to do such change at rc6 time but the revert is IMO the safer option, there are patches to use mutex instead of the spin locks but that would need more testing. The revert branch has been tested on a few setups, all seem ok. The conversion to xarray will be revisited in the future" Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1] * tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: turn delayed_nodes_tree into an XArray" Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" Revert "btrfs: turn fs_info member buffer_radix into XArray" Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
2022-07-15Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"David Sterba1-1/+1
This reverts commit 48b36a602a335c184505346b5b37077840660634. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-26Merge tag 'for-5.19-rc3-tag' of ↵Linus Torvalds1-2/+18
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned relocation fixes: - fix critical section end for extent writeback, this could lead to out of order write - prevent writing to previous data relocation block group if space gets low - reflink fixes: - fix race between reflinking and ordered extent completion - proper error handling when block reserve migration fails - add missing inode iversion/mtime/ctime updates on each iteration when replacing extents - fix deadlock when running fsync/fiemap/commit at the same time - fix false-positive KCSAN report regarding pid tracking for read locks and data race - minor documentation update and link to new site * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Documentation: update btrfs list of features and link to readthedocs.io btrfs: fix deadlock with fsync+fiemap+transaction commit btrfs: don't set lock_owner when locking extent buffer for reading btrfs: zoned: fix critical section of relocation inode writeback btrfs: zoned: prevent allocation from previous data relocation BG btrfs: do not BUG_ON() on failure to migrate space when replacing extents btrfs: add missing inode updates on each iteration when replacing extents btrfs: fix race between reflinking and ordered extent completion
2022-06-21btrfs: zoned: prevent allocation from previous data relocation BGNaohiro Aota1-2/+18
After commit 5f0addf7b890 ("btrfs: zoned: use dedicated lock for data relocation"), we observe IO errors on e.g, btrfs/232 like below. [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs] <snip> [09.9][T4038707] Call Trace: [09.5][T4038707] <TASK> [09.3][T4038707] run_delalloc_nocow+0x7f1/0x11a0 [btrfs] [09.6][T4038707] ? test_range_bit+0x174/0x320 [btrfs] [09.2][T4038707] ? fallback_to_cow+0x980/0x980 [btrfs] [09.3][T4038707] ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs] [09.5][T4038707] btrfs_run_delalloc_range+0x445/0x1320 [btrfs] [09.2][T4038707] ? test_range_bit+0x320/0x320 [btrfs] [09.4][T4038707] ? lock_downgrade+0x6a0/0x6a0 [09.2][T4038707] ? orc_find.part.0+0x1ed/0x300 [09.5][T4038707] ? __module_address.part.0+0x25/0x300 [09.0][T4038707] writepage_delalloc+0x159/0x310 [btrfs] <snip> [09.4][ C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current] [09.9][ C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00 [09.4][ C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0 [09.9][ C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 The IO errors occur when we allocate a regular extent in previous data relocation block group. On zoned btrfs, we use a dedicated block group to relocate a data extent. Thus, we allocate relocating data extents (pre-alloc) only from the dedicated block group and vice versa. Once the free space in the dedicated block group gets tight, a relocating extent may not fit into the block group. In that case, we need to switch the dedicated block group to the next one. Then, the previous one is now freed up for allocating a regular extent. The BG is already not enough to allocate the relocating extent, but there is still room to allocate a smaller extent. Now the problem happens. By allocating a regular extent while nocow IOs for the relocation is still on-going, we will issue WRITE IOs (for relocation) and ZONE APPEND IOs (for the regular writes) at the same time. That mixed IOs confuses the write pointer and arises the unaligned write errors. This commit introduces a new bit 'zoned_data_reloc_ongoing' to the btrfs_block_group. We set this bit before releasing the dedicated block group, and no extent are allocated from a block group having this bit set. This bit is similar to setting block_group->ro, but is different from it by allowing nocow writes to start. Once all the nocow IO for relocation is done (hooked from btrfs_finish_ordered_io), we reset the bit to release the block group for further allocation. Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-24Merge tag 'for-5.19-tag' of ↵Linus Torvalds1-33/+28
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - subpage: - support for PAGE_SIZE > 4K (previously only 64K) - make it work with raid56 - repair super block num_devices automatically if it does not match the number of device items - defrag can convert inline extents to regular extents, up to now inline files were skipped but the setting of mount option max_inline could affect the decision logic - zoned: - minimal accepted zone size is explicitly set to 4MiB - make zone reclaim less aggressive and don't reclaim if there are enough free zones - add per-profile sysfs tunable of the reclaim threshold - allow automatic block group reclaim for non-zoned filesystems, with sysfs tunables - tree-checker: new check, compare extent buffer owner against owner rootid Performance: - avoid blocking on space reservation when doing nowait direct io writes (+7% throughput for reads and writes) - NOCOW write throughput improvement due to refined locking (+3%) - send: reduce pressure to page cache by dropping extent pages right after they're processed Core: - convert all radix trees to xarray - add iterators for b-tree node items - support printk message index - user bulk page allocation for extent buffers - switch to bio_alloc API, use on-stack bios where convenient, other bio cleanups - use rw lock for block groups to favor concurrent reads - simplify workques, don't allocate high priority threads for all normal queues as we need only one - refactor scrub, process chunks based on their constraints and similarity - allocate direct io structures on stack and pass around only pointers, avoids allocation and reduces potential error handling Fixes: - fix count of reserved transaction items for various inode operations - fix deadlock between concurrent dio writes when low on free data space - fix a few cases when zones need to be finished VFS, iomap: - add helper to check if sb write has started (usable for assertions) - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io" * tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits) btrfs: zoned: introduce a minimal zone size 4M and reject mount btrfs: allow defrag to convert inline extents to regular extents btrfs: add "0x" prefix for unsupported optional features btrfs: do not account twice for inode ref when reserving metadata units btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer btrfs: send: avoid trashing the page cache btrfs: send: keep the current inode open while processing it btrfs: allocate the btrfs_dio_private as part of the iomap dio bio btrfs: move struct btrfs_dio_private to inode.c btrfs: remove the disk_bytenr in struct btrfs_dio_private btrfs: allocate dio_data on stack iomap: add per-iomap_iter private data iomap: allow the file system to provide a bio_set for direct I/O btrfs: add a btrfs_dio_rw wrapper btrfs: zoned: zone finish unused block group btrfs: zoned: properly finish block group on metadata write btrfs: zoned: finish block group when there are no more allocatable bytes left btrfs: zoned: consolidate zone finish functions btrfs: zoned: introduce btrfs_zoned_bg_is_full btrfs: improve error reporting in lookup_inline_extent_backref ...
2022-05-16btrfs: zoned: introduce btrfs_zoned_bg_is_fullNaohiro Aota1-2/+1
Introduce a wrapper to check if all the space in a block group is allocated or not. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: improve error reporting in lookup_inline_extent_backrefNikolay Borisov1-1/+7
When iterating the backrefs in an extent item if the ptr to the 'current' backref record goes beyond the extent item a warning is generated and -ENOENT is returned. However what's more appropriate to debug such cases would be to return EUCLEAN and also print identifying information about the performed search as well as the current content of the leaf containing the possibly corrupted extent item. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16