summaryrefslogtreecommitdiff
path: root/fs/btrfs/ioctl.c
AgeCommit message (Collapse)AuthorFilesLines
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik1-10/+10
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename err to ret in btrfs_ioctl_snap_destroy()Anand Jain1-33/+33
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02Merge tag 'for-6.9-rc6-tag' of ↵Linus Torvalds1-3/+30
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - set correct ram_bytes when splitting ordered extent. This can be inconsistent on-disk but harmless as it's not used for calculations and it's only advisory for compression - fix lockdep splat when taking cleaner mutex in qgroups disable ioctl - fix missing mutex unlock on error path when looking up sys chunk for relocation * tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: set correct ram_bytes when splitting ordered extent btrfs: take the cleaner_mutex earlier in qgroup disable btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks()
2024-04-25btrfs: take the cleaner_mutex earlier in qgroup disableJosef Bacik1-3/+30
One of my CI runs popped the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.9.0-rc4+ #1 Not tainted ------------------------------------------------------ btrfs/471533 is trying to acquire lock: ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0 but task is already holding lock: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->subvol_sem){++++}-{3:3}: down_read+0x42/0x170 btrfs_rename+0x607/0xb00 btrfs_rename2+0x2e/0x70 vfs_rename+0xaf8/0xfc0 do_renameat2+0x586/0x600 __x64_sys_rename+0x43/0x50 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}: down_write+0x3f/0xc0 btrfs_inode_lock+0x40/0x70 prealloc_file_extent_cluster+0x1b0/0x370 relocate_file_extent_cluster+0xb2/0x720 relocate_data_extent+0x107/0x160 relocate_block_group+0x442/0x550 btrfs_relocate_block_group+0x2cb/0x4b0 btrfs_relocate_chunk+0x50/0x1b0 btrfs_balance+0x92f/0x13d0 btrfs_ioctl+0x1abf/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}: __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 __mutex_lock+0xbe/0xc00 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->subvol_sem); lock(&sb->s_type->i_mutex_key#16); lock(&fs_info->subvol_sem); lock(&fs_info->cleaner_mutex); *** DEADLOCK *** 2 locks held by btrfs/471533: #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600 #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 stack backtrace: CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1 Call Trace: <TASK> dump_stack_lvl+0x77/0xb0 check_noncircular+0x148/0x160 ? lock_acquire+0xcb/0x2e0 __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? lock_is_held_type+0x9a/0x110 __mutex_lock+0xbe/0xc00 ? btrfs_quota_disable+0x54/0x4c0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? btrfs_quota_disable+0x54/0x4c0 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 ? srso_return_thunk+0x5/0x5f ? __do_sys_statfs+0x61/0x70 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 ? srso_return_thunk+0x5/0x5f ? reacquire_held_locks+0xd1/0x1f0 ? do_user_addr_fault+0x307/0x8a0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? find_held_lock+0x2b/0x80 ? srso_return_thunk+0x5/0x5f ? lock_release+0xca/0x2a0 ? srso_return_thunk+0x5/0x5f ? do_user_addr_fault+0x35c/0x8a0 ? srso_return_thunk+0x5/0x5f ? trace_hardirqs_off+0x4b/0xc0 ? srso_return_thunk+0x5/0x5f ? lockdep_hardirqs_on_prepare+0xde/0x190 ? srso_return_thunk+0x5/0x5f This happens because when we call rename we already have the inode mutex held, and then we acquire the subvol_sem if we are a subvolume. This makes the dependency inode lock -> subvol sem When we're running data relocation we will preallocate space for the data relocation inode, and we always run the relocation under the ->cleaner_mutex. This now creates the dependency of cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem Qgroup delete is doing this in the opposite order, it is acquiring the subvol_sem and then it is acquiring the cleaner_mutex, which results in this lockdep splat. This deadlock can't happen in reality, because we won't ever rename the data reloc inode, nor is the data reloc inode a subvolume. However this is fairly easy to fix, simply take the cleaner mutex in the case where we are disabling qgroups before we take the subvol_sem. This resolves the lockdep splat. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-08Merge tag 'for-6.9-rc2-tag' of ↵Linus Torvalds1-9/+28
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Several fixes to qgroups that have been recently identified by test generic/475: - fix prealloc reserve leak in subvolume operations - various other fixes in reservation setup, conversion or cleanup" * tag 'for-6.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: always clear PERTRANS metadata during commit btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans btrfs: record delayed inode root in transaction btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations btrfs: qgroup: correctly model root qgroup rsv in convert
2024-04-02btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operationsBoris Burkov1-9/+28
Create subvolume, create snapshot and delete subvolume all use btrfs_subvolume_reserve_metadata() to reserve metadata for the changes done to the parent subvolume's fs tree, which cannot be mediated in the normal way via start_transaction. When quota groups (squota or qgroups) are enabled, this reserves qgroup metadata of type PREALLOC. Once the operation is associated to a transaction, we convert PREALLOC to PERTRANS, which gets cleared in bulk at the end of the transaction. However, the error paths of these three operations were not implementing this lifecycle correctly. They unconditionally converted the PREALLOC to PERTRANS in a generic cleanup step regardless of errors or whether the operation was fully associated to a transaction or not. This resulted in error paths occasionally converting this rsv to PERTRANS without calling record_root_in_trans successfully, which meant that unless that root got recorded in the transaction by some other thread, the end of the transaction would not free that root's PERTRANS, leaking it. Ultimately, this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount for the leaked reservation. The fix is to ensure that every qgroup PREALLOC reservation observes the following properties: 1. any failure before record_root_in_trans is called successfully results in freeing the PREALLOC reservation. 2. after record_root_in_trans, we convert to PERTRANS, and now the transaction owns freeing the reservation. This patch enforces those properties on the three operations. Without it, generic/269 with squotas enabled at mkfs time would fail in ~5-10 runs on my system. With this patch, it ran successfully 1000 times in a row. Fixes: e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-12Merge tag 'for-6.9-tag' of ↵Linus Torvalds1-47/+73
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Mostly stabilization, refactoring and cleanup changes. There rest are minor performance optimizations due to caching or lock contention reduction and a few notable fixes. Performance improvements: - minor speedup in logging when repeatedly allocated structure is preallocated only once, improves latency and decreases lock contention - minor throughput increase (+6%), reduced lock contention after clearing delayed allocation bits, applies to several common workload types - skip full quota rescan if a new relation is added in the same transaction Fixes: - zstd fix for inline compressed file in subpage mode, updated version from the 6.8 time - proper qgroup inheritance ioctl parameter validation - more fiemap followup fixes after reduced locking done in 6.8: - fix race when detecting delalloc ranges Core changes: - more debugging code: - added assertions for a very rare crash in raid56 calculation - tree-checker dumps page state to give more insights into possible reference counting issues - add checksum calculation offloading sysfs knob, for now enabled under DEBUG only to determine a good heuristic for deciding the offload or synchronous, depends on various factors (block group profile, device speed) and is not as clear as initially thought (checksum type) - error handling improvements, added assertions - more page to folio conversion (defrag, truncate), cached size and shift - preparation for more fine grained locking of sectors in subpage mode - cleanups and refactoring: - include cleanups, forward declarations - pointer-to-structure helpers - redundant argument removals - removed unused code - slab cache updates, last use of SLAB_MEM_SPREAD removed" * tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits) btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations btrfs: fix race when detecting delalloc ranges during fiemap btrfs: fix off-by-one chunk length calculation at contains_pending_extent() btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent btrfs: qgroup: validate btrfs_qgroup_inherit parameter btrfs: include device major and minor numbers in the device scan notice btrfs: mark btrfs_put_caching_control() static btrfs: remove SLAB_MEM_SPREAD flag use btrfs: qgroup: always free reserved space for extent records btrfs: tree-checker: dump the page status if hit something wrong btrfs: compression: remove dead comments in btrfs_compress_heuristic() btrfs: subpage: make writer lock utilize bitmap btrfs: subpage: make reader lock utilize bitmap btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer() btrfs: pass a valid extent map cache pointer to __get_extent_map() btrfs: merge btrfs_del_delalloc_inode() helpers btrfs: pass btrfs_device to btrfs_scratch_superblocks() btrfs: handle transaction commit errors in flush_reservations() btrfs: use KMEM_CACHE() to create btrfs_free_space cache btrfs: use KMEM_CACHE() to create delayed ref caches ...
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds1-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-05btrfs: qgroup: validate btrfs_qgroup_inherit parameterQu Wenruo1-13/+3
[BUG] Currently btrfs can create subvolume with an invalid qgroup inherit without triggering any error: # mkfs.btrfs -O quota -f $dev # mount $dev $mnt # btrfs subvolume create -i 2/0 $mnt/subv1 # btrfs qgroup show -prce --sync $mnt Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 16.00KiB 16.00KiB subv1 [CAUSE] We only do a very basic size check for btrfs_qgroup_inherit structure, but never really verify if the values are correct. Thus in btrfs_qgroup_inherit() function, we have to skip non-existing qgroups, and never return any error. [FIX] Fix the behavior and introduce extra checks: - Introduce early check for btrfs_qgroup_inherit structure Not only the size, but also all the qgroup ids would be verified. And the timing is very early, so we can return error early. This early check is very important for snapshot creation, as snapshot is delayed to transaction commit. - Drop support for btrfs_qgroup_inherit::num_ref_copies and num_excl_copies Those two members are used to specify to copy refr/excl numbers from other qgroups. This would definitely mark qgroup inconsistent, and btrfs-progs has dropped the support for them for a long time. It's time to drop the support for kernel. - Verify the supported btrfs_qgroup_inherit::flags Just in case we want to add extra flags for btrfs_qgroup_inherit. Now above subvolume creation would fail with -ENOENT other than silently ignore the non-existing qgroup. CC: stable@vger.kernel.org # 6.7+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: factor out validation of btrfs_ioctl_vol_args_v2::nameDavid Sterba1-3/+17
The validation of vol args v2 name in snapshot and device remove ioctls is not done properly. A terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: factor out validation of btrfs_ioctl_vol_args::nameDavid Sterba1-5/+29
The validation of vol args name in several ioctls is not done properly. a terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add helper to get fs_info from struct inode pointerDavid Sterba1-20/+20
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle directory and dentry mismatch in btrfs_may_delete()David Sterba1-1/+3
The helper btrfs_may_delete() is a copy of generic fs/namei.c:may_delete() to verify various conditions before deletion. There's a BUG_ON added before linux.git started, we can turn it to a proper error handling at least in our local helper. A mistmatch between directory and the deleted dentry is clearly invalid. This won't be probably ever hit due to the way how the parameters are set from the caller btrfs_ioctl_snap_destroy(), using a VFS helper lookup_one(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused included headersDavid Sterba1-4/+0
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: replace sb::s_blocksize by fs_info::sectorsizeDavid Sterba1-1/+1
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-01Merge tag 'for-6.8-rc6-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix freeing allocated id for anon dev when snapshot creation fails - fiemap fixes: - followup for a recent deadlock fix, ranges that fiemap can access can still race with ordered extent completion - make sure fiemap with SYNC flag does not race with writes * tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix double free of anonymous device after snapshot creation failure btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given btrfs: fix race between ordered extent completion and fiemap
2024-02-29btrfs: fix double free of anonymous device after snapshot creation failureFilipe Manana1-1/+1
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem. The steps that lead to this: 1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev; 2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot(); 3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev; 4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking; 5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1. Recently syzbot ran into this and reported the following trace: ------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK> Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot. To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free. CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-25btrfs: port device access to fileChristian Brauner1-8/+8
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-19-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-07Merge tag 'for-6.8-rc3-tag' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - two fixes preventing deletion and manual creation of subvolume qgroup - unify error code returned for unknown send flags - fix assertion during subvolume creation when anonymous device could be allocated by other thread (e.g. due to backref walk) * tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: do not ASSERT() if the newly created subvolume already got read btrfs: forbid deleting live subvol qgroup btrfs: forbid creating subvol qgroups btrfs: send: return EOPNOTSUPP on unknown flags
2024-01-31btrfs: forbid creating subvol qgroupsBoris Burkov1-0/+5
Creating a qgroup 0/subvolid leads to various races and it isn't helpful, because you can't specify a subvol id when creating a subvol, so you can't be sure it will be the right one. Any requirements on the automatic subvol can be gratified by using a higher level qgroup and the inheritance parameters of subvol creation. Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-22Merge tag 'for-6.8-rc1-tag' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned mode fixes: - fix slowdown when writing large file sequentially by looking up block groups with enough space faster - locking fixes when activating a zone - new mount API fixes: - preserve mount options for a ro/rw mount of the same subvolume - scrub fixes: - fix use-after-free in case the chunk length is not aligned to 64K, this does not happen normally but has been reported on images converted from ext4 - similar alignment check was missing with raid-stripe-tree - subvolume deletion fixes: - prevent calling ioctl on already deleted subvolume - properly track flag tracking a deleted subvolume - in subpage mode, fix decompression of an inline extent (zlib, lzo, zstd) - fix crash when starting writeback on a folio, after integration with recent MM changes this needs to be started conditionally - reject unknown flags in defrag ioctl - error handling, API fixes, minor warning fixes * tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: limit RST scrub to chunk boundary btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned btrfs: don't unconditionally call folio_start_writeback in subpage btrfs: use the original mount's mount options for the legacy reconfigure btrfs: don't warn if discard range is not aligned to sector btrfs: tree-checker: fix inline ref size in error messages btrfs: zstd: fix and simplify the inline extent decompression btrfs: lzo: fix and simplify the inline extent decompression btrfs: zlib: fix and simplify the inline extent decompression btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted btrfs: don't abort filesystem when attempting to snapshot deleted subvolume btrfs: zoned: fix lock ordering in btrfs_zone_activate() btrfs: fix unbalanced unlock of mapping_tree_lock btrfs: ref-verify: free ref cache before clearing mount opt btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send() btrfs: zoned: optimize hint byte for zoned allocator btrfs: zoned: factor out prepare_allocation_zoned()
2024-01-12btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_argsQu Wenruo1-0/+4
Add extra sanity check for btrfs_ioctl_defrag_range_args::flags. This is not really to enhance fuzzing tests, but as a preparation for future expansion on btrfs_ioctl_defrag_range_args. In the future we're going to add new members, allowing more fine tuning for btrfs defrag. Without the -ENONOTSUPP error, there would be no way to detect if the kernel supports those new defrag features. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-12btrfs: don't abort filesystem when attempting to snapshot deleted subvolumeOmar Sandoval1-0/+3
If the source file descriptor to the snapshot ioctl refers to a deleted subvolume, we get the following abort: BTRFS: Transaction aborted (error -2) WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs] Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs] RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027 RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840 RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998 R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80 FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0 Call Trace: <TASK> ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? __warn+0x81/0x130 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? report_bug+0x171/0x1a0 ? handle_bug+0x3a/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? create_pending_snapshot+0x1040/0x1190 [btrfs] create_pending_snapshots+0x92/0xc0 [btrfs] btrfs_commit_transaction+0x66b/0xf40 [btrfs] btrfs_mksubvol+0x301/0x4d0 [btrfs] btrfs_mksnapshot+0x80/0xb0 [btrfs] __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs] btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs] btrfs_ioctl+0x8a6/0x2650 [btrfs] ? kmem_cache_free+0x22/0x340 ? do_sys_openat2+0x97/0xe0 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x46/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fe20abe83af RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry BTRFS info (device vdc: state EA): forced readonly BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction. BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry This happens because create_pending_snapshot() initializes the new root item as a copy of the source root item. This includes the refs field, which is 0 for a deleted subvolume. The call to btrfs_insert_root() therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then finds the root and returns -ENOENT if refs == 0, which causes create_pending_snapshot() to abort. Fix it by checking the source root's refs before attempting the snapshot, but after locking subvol_sem to avoid racing with deletion. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-08Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-6/+6
Pull vfs rw updates from Christian Brauner: "This contains updates from Amir for read-write backing file helpers for stacking filesystems such as overlayfs: - Fanotify is currently in the process of introducing pre content events. Roughly, a new permission event will be added indicating that it is safe to write to the file being accessed. These events are used by hierarchical storage managers to e.g., fill the content of files on first access. During that work we noticed that our current permission checking is inconsistent in rw_verify_area() and remap_verify_area(). Especially in the splice code permission checking is done multiple times. For example, one time for the whole range and then again for partial ranges inside the iterator. In addition, we mostly do permission checking before we call file_start_write() except for a few places where we call it after. For pre-content events we need such permission checking to be done before file_start_write(). So this is a nice reason to clean this all up. After this series, all permission checking is done before file_start_write(). As part of this cleanup we also massaged the splice code a bit. We got rid of a few helpers because we are alredy drowning in special read-write helpers. We also cleaned up the return types for splice helpers. - Introduce generic read-write helpers for backing files. This lifts some overlayfs code to common code so it can be used by the FUSE passthrough work coming in over the next cycles. Make Amir and Miklos the maintainers for this new subsystem of the vfs" * tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: fix __sb_write_started() kerneldoc formatting fs: factor out backing_file_mmap() helper fs: factor out backing_file_splice_{read,write}() helpers fs: factor out backing_file_{read,write}_iter() helpers fs: prepare for stackable filesystems backing file helpers fsnotify: optionally pass access range in file permission hooks fsnotify: assert that file_start_write() is not held in permission hooks fsnotify: split fsnotify_perm() into two hooks fs: use splice_copy_file_range() inline helper splice: return type ssize_t from all helpers fs: use do_splice_direct() for nfsd/ksmbd server-side-copy fs: move file_start_write() into direct_splice_actor() fs: fork splice_file_range() from do_splice_direct() fs: create {sb,file}_write_not_started() helpers fs: create file_write_started() helper fs: create __sb_write_started() helper fs: move kiocb_start_write() into vfs_iocb_iter_write() fs: move permission hook out of do_iter_read() fs: move permission hook out of do_iter_write() fs: move file_start_write() into vfs_iter_write() ...
2023-12-17Merge tag 'for-6.7-rc5-tag' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix that verifies that the snapshot source is a root, same check is also done in user space but should be done by the ioctl as well" * tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: do not allow non subvolume root targets for snapshot
2023-12-15btrfs: do not allow non subvolume root targets for snapshotJosef Bacik1-0/+9
Our btrfs subvolume snapshot <source> <destination> utility enforces that <source> is the root of the subvolume, however this isn't enforced in the kernel. Update the kernel to also enforce this limitation to avoid problems with other users of this ioctl that don't have the appropriate checks in place. Reported-by: Martin Michaelis <code@mgjm.de> CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-28Merge tag 'for-6.7-rc3-tag' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes and message updates: - for simple quotas, handle the case when a snapshot is created and the target qgroup already exists - fix a warning when file descriptor given to send ioctl is not writable - fix off-by-one condition when checking chunk maps - free pages when page array allocation fails during compression read, other cases were handled - fix memory leak on error handling path in ref-verify debugging feature - copy missing struct member 'version' in 64/32bit compat send ioctl - tree-checker verifies inline backref ordering - print messages to syslog on first mount and last unmount - update error messages when reading chunk maps" * tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: ensure send_fd is writable btrfs: free the allocated memory if btrfs_alloc_page_array() fails btrfs: fix 64bit compat send ioctl arguments not initializing version member btrfs: make error messages more clear when getting a chunk map btrfs: fix off-by-one when checking chunk map includes logical address btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod() btrfs: add dmesg output for first mount and last unmount of a filesystem btrfs: do not abort transaction if there is already an existing qgroup btrfs: tree-checker: add type and sequence check for inline backrefs
2023-11-24btrfs: fix 64bit compat send ioctl arguments not initializing version memberDavid Sterba1-0/+1
When the send protocol versioning was added in 5.16 e77fbf990316 ("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was not updated (added by 2351f431f727 ("btrfs: fix send ioctl on 32bit with 64bit kernel")), missing the version struct member. The compat code is probably rarely used, nobody reported any bugs. Found by tool https://github.com/jirislaby/clang-struct . Fixes: e77fbf990316 ("btrfs: send: prepare for v2 protocol") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: move file_start_write() to after permission hookAmir Goldstein1-6/+6
In vfs code, file_start_write() is usually called after the permission hook in rw_verify_area(). btrfs_ioctl_encoded_write() in an exception to this rule. Move file_start_write() to after the rw_verify_area() check in encoded write to make the permission hook "start-write-safe". This is needed for fanotify "pre content" events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-9-amir73il@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-13Merge tag 'for-6.7-rc1-tag' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential overflow in returned value from SEARCH_TREE_V2 ioctl on 32bit architecture - zoned mode fixes: - drop unnecessary write pointer check for RAID0/RAID1/RAID10 profiles, now it works because of raid-stripe-tree - wait for finishing the zone when direct IO needs a new allocation - simple quota fixes: - pass correct owning root pointer when cleaning up an aborted transaction - fix leaking some structures when processing delayed refs - change key type number of BTRFS_EXTENT_OWNER_REF_KEY, reorder it before inline refs that are supposed to be sorted, keeping the original number would complicate a lot of things; this change needs an updated version of btrfs-progs to work and filesystems need to be recreated - fix error pointer dereference after failure to allocate fs devices - fix race between accounting qgroup extents and removing a qgroup * tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: make OWNER_REF_KEY type value smallest among inline refs btrfs: fix qgroup record leaks when using simple quotas btrfs: fix race between accounting qgroup extents and removing a qgroup btrfs: fix error pointer dereference after failure to allocate fs devices btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe() btrfs: get correct owning_root when dropping snapshot btrfs: zoned: wait for data BG to be finished on direct IO allocation btrfs: zoned: drop no longer valid write pointer check btrfs: directly return 0 on no error code in btrfs_insert_raid_extent() btrfs: use u64 for buffer sizes in the tree search ioctls
2023-11-03btrfs: use u64 for buffer sizes in the tree search ioctlsFilipe Manana1-5/+5
In the tree search v2 ioctl we use the type size_t, which is an unsigned long, to track the buffer size in the local variable 'buf_size'. An unsigned long is 32 bits wide on a 32 bits architecture. The buffer size defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later try to copy the local variable 'buf_size' to the argument struct, when the search returns -EOVERFLOW, we copy only 32 bits which will be a problem on big endian systems. Fix this by using a u64 type for the buffer sizes, not only at btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain so that we can use the u64 at btrfs_ioctl_tree_search_v2(). Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-30Merge tag 'for-6.7-tag' of ↵Linus Torvalds1-8/+15
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group