summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_map.c
AgeCommit message (Collapse)AuthorFilesLines
2024-10-31btrfs: fix extent map merging not happening for adjacent extentsFilipe Manana1-1/+6
If we have 3 or more adjacent extents in a file, that is, consecutive file extent items pointing to adjacent extents, within a contiguous file range and compatible flags, we end up not merging all the extents into a single extent map. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -d -c "pwrite -b 64K 0 64K" \ -c "pwrite -b 64K 64K 64K" \ -c "pwrite -b 64K 128K 64K" \ -c "pwrite -b 64K 192K 64K" \ /mnt/sdc/foo After all the ordered extents complete we unpin the extent maps and try to merge them, but instead of getting a single extent map we get two because: 1) When the first ordered extent completes (file range [0, 64K)) we unpin its extent map and attempt to merge it with the extent map for the range [64K, 128K), but we can't because that extent map is still pinned; 2) When the second ordered extent completes (file range [64K, 128K)), we unpin its extent map and merge it with the previous extent map, for file range [0, 64K), but we can't merge with the next extent map, for the file range [128K, 192K), because this one is still pinned. The merged extent map for the file range [0, 128K) gets the flag EXTENT_MAP_MERGED set; 3) When the third ordered extent completes (file range [128K, 192K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [0, 128K), but we can't because that extent map has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false due to different flags) while the extent map for the range [128K, 192K) doesn't have that flag set. We also can't merge it with the next extent map, for file range [192K, 256K), because that one is still pinned. At this moment we have 3 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 192K). One for file range [192K, 256K) which is still pinned; 4) When the fourth and final extent completes (file range [192K, 256K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [128K, 192K), which succeeds since none of these extent maps have the EXTENT_MAP_MERGED flag set. So we end up with 2 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set. Since after merging extent maps we don't attempt to merge again, that is, merge the resulting extent map with the one that is now preceding it (and the one following it), we end up with those two extent maps, when we could have had a single extent map to represent the whole file. Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag. While this doesn't present any functional issue, it prevents the merging of extent maps which allows to save memory, and can make defrag not merging extents too (that will be addressed in the next patch). Fixes: 199257a78bb0 ("btrfs: defrag: don't use merged extent map for their generation check") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: fix read corruption due to race with extent map mergingBoris Burkov1-15/+16
In debugging some corrupt squashfs files, we observed symptoms of corrupt page cache pages but correct on-disk contents. Further investigation revealed that the exact symptom was a correct page followed by an incorrect, duplicate, page. This got us thinking about extent maps. commit ac05ca913e9f ("Btrfs: fix race between using extent maps and merging them") enforces a reference count on the primary `em` extent_map being merged, as that one gets modified. However, since, commit 3d2ac9922465 ("btrfs: introduce new members for extent_map") both 'em' and 'merge' get modified, which started modifying 'merge' and thus introduced the same race. We were able to reproduce this by looping the affected squashfs workload in parallel on a bunch of separate btrfs-es while also dropping caches. We are still working on a simple enough reproducer to make into an fstest. The simplest fix is to stop modifying 'merge', which is not essential, as it is dropped immediately after the merge. This behavior is simply a consequence of the order of the two extent maps being important in computing the new values. Modify merge_ondisk_extents to take prev and next by const* and also take a third merged parameter that it puts the results in. Note that this introduces the rather odd behavior of passing 'em' to merge_ondisk_extents as a const * and as a regular ptr. Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: reduce size and overhead of extent_map_block_end()Filipe Manana1-3/+6
At extent_map_block_end() we are calling the inline functions extent_map_block_start() and extent_map_block_len() multiple times, which results in expanding their code multiple times, increasing the compiled code size and repeating the computations those functions do. Improve this by caching their results in local variables. The size of the module before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1755770 163800 16920 1936490 1d8c6a fs/btrfs/btrfs.ko And after this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1755656 163800 16920 1936376 1d8bf8 fs/btrfs/btrfs.ko Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13btrfs: only run the extent map shrinker from kswapd tasksFilipe Manana1-16/+6
Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-30Merge tag 'for-6.11-rc1-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix regression in extent map rework when handling insertion of overlapping compressed extent - fix unexpected file length when appending to a file using direct io and buffer not faulted in - in zoned mode, fix accounting of unusable space when flipping read-only block group back to read-write - fix page locking when COWing an inline range, assertion failure found by syzbot - fix calculation of space info in debugging print - tree-checker, add validation of data reference item - fix a few -Wmaybe-uninitialized build warnings * tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry() btrfs: fix corruption after buffer fault in during direct IO append write btrfs: zoned: fix zone_unusable accounting on making block group read-write again btrfs: do not subtract delalloc from avail bytes btrfs: make cow_file_range_inline() honor locked_page on error btrfs: fix corrupt read due to bad offset of a compressed extent map btrfs: tree-checker: validate dref root and objectid
2024-07-25btrfs: fix corrupt read due to bad offset of a compressed extent mapFilipe Manana1-1/+1
If we attempt to insert a compressed extent map that has a range that overlaps another extent map we have in the inode's extent map tree, we can end up with an incorrect offset after adjusting the new extent map at merge_extent_mapping() because we don't update the extent map's offset. For example consider the following scenario: 1) We have a file extent item for a compressed extent covering the file range [108K, 144K) and currently there's no corresponding extent map in the inode's extent map tree; 2) The inode's size is 141K; 3) We have an encoded write (compressed) into the file range [120K, 128K), which overlaps the existing file extent item. The encoded write creates a matching extent map, adds it to the inode's extent map tree and creates an ordered extent for it. Note that the corresponding file extent item is added to the subvolume tree only when the ordered extent completes (when executing btrfs_finish_one_ordered()); 4) We have a write into the file range [160K, 164K). This writes increases the i_size of the file, and there's a hole between the current i_size (141K) and the start offset of this write, and since the old i_size is in the middle of the block [140K, 144K), we have to write zeroes to the range [141K, 144K) (3072 bytes) and therefore dirty that page. We then call btrfs_set_extent_delalloc() with a start offset of 140K. We then end up at btrfs_find_new_delalloc_bytes() which will call btrfs_get_extent() for the range [140K, 144K); 5) The btrfs_get_extent() doesn't find any extent map in the inode's extent map tree covering the range [140K, 144K), so it searches the subvolume tree for any file extent items covering that range. There it finds the file extent item for the range [108K, 144K), creates a compressed extent map for that range and then calls btrfs_add_extent_mapping() with that extent map and passes the range [140K, 144K) via the "start" and "len" parameters; 6) The call to add_extent_mapping() done by btrfs_add_extent_mapping() fails with -EEXIST because there's an extent map, created at step 2 for the [120K, 128K) range, that covers that overlaps with the range of the given extent map ([108K, 144K)). Then it does a lookup for extent map from step 2 add calls merge_extent_mapping() to adjust the input extent map ([108K, 144K)). That adjust the extent map to a start offset of 128K and a length of 16K (starting just after the extent map from step 2), but it does not update the offset field of the extent map, leaving it with a value of zero instead of updating to a value of 20K (128K - 108K = 20K). As a result any read for the range [128K, 144K) can return incorrect data since we read from a wrong section of the extent (unless both the correct and incorrect ranges happen to have the same data). So fix this by changing merge_extent_mapping() to update the extent map's offset even if it's compressed. Also add a test case to the self tests. This didn't happen before the patchset that does big changes in the extent map structure (which includes the commit in the Fixes tag below) because we kept track of the original start offset in the extent map (member "orig_start") so we could always calculate the correct offset by subtracting that offset from the start offset. A test case for fstests that triggered this problem using send/receive with compressed writes will be added soon. Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-17Merge tag 'for-6.11-tag' of ↵Linus Torvalds1-80/+163
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The highlights are new logic behind background block group reclaim, automatic removal of qgroup after removing a subvolume and new 'rescue=' mount options. The rest is optimizations, cleanups and refactoring. User visible features: - dynamic block group reclaim: - tunable framework to avoid situations where eager data allocations prevent creating new metadata chunks due to lack of unallocated space - reuse sysfs knob bg_reclaim_threshold (otherwise used only in zoned mode) for a fixed value threshold - new on/off sysfs knob "dynamic_reclaim" calculating the value based on heuristics, aiming to keep spare working space for relocating chunks but not to needlessly relocate partially utilized block groups or reclaim newly allocated ones - stats are exported in sysfs per block group type, files "reclaim_*" - this may increase IO load at unexpected times but the corner case of no allocatable block groups is known to be worse - automatically remove qgroup of deleted subvolumes: - adjust qgroup removal conditions, make sure all related subvolume data are already removed, or return EBUSY, also take into account setting of sysfs drop_subtree_threshold - also works in squota mode - mount option updates: new modes of 'rescue=' that allow to mount images (read-only) that could have been partially converted by user space tools - ignoremetacsums - invalid metadata checksums are ignored - ignoresuperflags - super block flags that track conversion in progress (like UUID or checksums) Core: - size of struct btrfs_inode is now below 1024 (on a release config), improved memory packing and other secondary effects - switch tracking of open inodes from rb-tree to xarray, minor performance improvement - reduce number of empty transaction commits when there are no dirty data/metadata - memory allocation optimizations (reduced numbers, reordering out of critical sections) - extent map structure optimizations and refactoring, more sanity checks - more subpage in zoned mode preparations or fixes - general snapshot code cleanups, improvements and documentation - tree-checker updates: more file extent ram_bytes fixes, continued - raid-stripe-tree update (not backward compatible): - remove extent encoding field from the structure, can be inferred from other information - requires btrfs-progs 6.9.1 or newer - cleanups and refactoring - error message updates - error handling improvements - return type and parameter cleanups and improvements" * tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits) btrfs: fix extent map use-after-free when adding pages to compressed bio btrfs: fix bitmap leak when loading free space cache on duplicate entry btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() btrfs: move extent_range_clear_dirty_for_io() into inode.c btrfs: enhance compression error messages btrfs: fix data race when accessing the last_trans field of a root btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array() btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array() btrfs: introduce new "rescue=ignoresuperflags" mount option btrfs: introduce new "rescue=ignoremetacsums" mount option btrfs: output the unrecognized super block flags as hex btrfs: remove unused Opt enums btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check btrfs: fix the ram_bytes assignment for truncated ordered extents btrfs: make validate_extent_map() catch ram_bytes mismatch btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map() btrfs: fix typo in error message in btrfs_validate_super() btrfs: move the direct IO code into its own file btrfs: pass a btrfs_inode to btrfs_set_prop() ...
2024-07-11btrfs: avoid races when tracking progress for extent map shrinkingFilipe Manana1-21/+63
We store the progress (root and inode numbers) of the extent map shrinker in fs_info without any synchronization but we can have multiple tasks calling into the shrinker during memory allocations when there's enough memory pressure for example. This can result in a task A reading fs_info->extent_map_shrinker_last_ino after another task B updates it, and task A reading fs_info->extent_map_shrinker_last_root before task B updates it, making task A see an odd state that isn't necessarily harmful but may make it skip certain inode ranges or do more work than necessary by going over the same inodes again. These unprotected accesses would also trigger warnings from tools like KCSAN. So add a lock to protect access to these progress fields. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: stop extent map shrinker if reschedule is neededFilipe Manana1-8/+31
The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use delayed iput during extent map shrinkingFilipe Manana1-1/+1
When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/ Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: make validate_extent_map() catch ram_bytes mismatchQu Wenruo1-0/+5
Previously validate_extent_map() is only to catch bugs related to extent_map member cleanups. But with recent btrfs-check enhancement to catch ram_bytes mismatch with disk_num_bytes, it would be much better to catch such extent maps earlier. So this patch adds extra ram_bytes validation for extent maps. Please note that, older filesystems with such mismatch won't trigger this error: - extent_map::ram_bytes is already fixed Previous patch has already fixed the ram_bytes for affected file extents. So this enhanced sanity check should not affect end users. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::block_start memberQu Wenruo1-39/+14
The member extent_map::block_start can be calculated from extent_map::disk_bytenr + extent_map::offset for regular extents. And otherwise just extent_map::disk_bytenr. And this is already validated by the validate_extent_map(). Now we can remove the member. However there is a special case in btrfs_create_dio_extent() where we for NOCOW/PREALLOC ordered extents cannot directly use the resulting btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them yet. So for that call site, we pass file_extent->disk_bytenr + file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for offset. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::block_len memberQu Wenruo1-27/+14
The extent_map::block_len is either extent_map::len (non-compressed extent) or extent_map::disk_num_bytes (compressed extent). Since we already have sanity checks to do the cross-checks between the new and old members, we can drop the old extent_map::block_len now. For most call sites, they can manually select extent_map::len or extent_map::disk_num_bytes, since most if not all of them have checked if the extent is compressed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::orig_start memberQu Wenruo1-19/+2
Since we have extent_map::offset, the old extent_map::orig_start is just extent_map::start - extent_map::offset for non-hole/inline extents. And since the new extent_map::offset is already verified by validate_extent_map() while the old orig_start is not, let's just remove the old member from all call sites. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: introduce extra sanity checks for extent mapsQu Wenruo1-0/+59
Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: introduce new members for extent_mapQu Wenruo1-4/+75
Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: rename extent_map::orig_block_len to disk_num_bytesQu Wenruo1-8/+8
This would make it very obvious that the member just matches btrfs_file_extent_item::disk_num_bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use a regular rb_root instead of cached rb_root for extent_map_treeFilipe Manana1-22/+26
We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: rename rb_root member of extent_map_tree from map to rootFilipe Manana1-11/+11
Currently we name the rb_root member of struct extent_map_tree as 'map', which is odd and confusing. Since it's a root node, rename it to 'root'. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: add tracepoints for extent map shrinker eventsFilipe Manana1-0/+13
Add some tracepoints for the extent map shrinker to help debug and analyse main events. These have proved useful during development of the shrinker. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: add a shrinker for extent mapsFilipe Manana1-0/+160
Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: add a global per cpu counter to track number of used extent mapsFilipe Manana1-0/+17
Add a per cpu counter that tracks the total number of extent maps that are in extent trees of inodes that belong to fs trees. This is going to be used in an upcoming change that adds a shrinker for extent maps. Only extent maps for fs trees are considered, because for special trees such as the data relocation tree we don't want to evict their extent maps which are critical for the relocation to work, and since those are limited, it's not a concern to have them in memory during the relocation of a block group. Another case are extent maps for free space cache inodes, which must always remain in memory, but those are limited (there's only one per free space cache inode, which means one per block group). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: pass the extent map tree's inode to try_merge_map()Filipe Manana1-7/+6
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to try_merge_map(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change try_merge_map() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: pass the extent map tree's inode to setup_extent_mapping()Filipe Manana1-5/+5
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to setup_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change setup_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: pass the extent map tree's inode to replace_extent_mapping()Filipe Manana1-5/+6
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to replace_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change replace_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: pass the extent map tree's inode to remove_extent_mapping()Filipe Manana1-9/+13
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remove_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change remove_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: pass the extent map tree's inode to clear_em_logging()Filipe Manana1-1/+3
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to clear_em_logging(). In order t