summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)AuthorFilesLines
2021-12-14btrfs: clear extent buffer uptodate when we fail to write itJosef Bacik1-0/+6
commit c2e39305299f0118298c2201f6d6cc7d3485f29e upstream. I got dmesg errors on generic/281 on our overnight fstests. Looking at the history this happens occasionally, with errors like this WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50 CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G W 5.16.0-rc2+ #469 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffae598230bc60 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000 RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340 RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000 R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0 R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000 FS: 0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1d6/0x430 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x270/0x5a0 worker_thread+0x55/0x3c0 ? process_one_work+0x5a0/0x5a0 kthread+0x174/0x1a0 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we're trying to read from a extent buffer page that is !PageUptodate. This happens because we will clear the page uptodate when we have an IO error, but we don't clear the extent buffer uptodate. If we do a read later and find this extent buffer we'll think its valid and not return an error, and then trip over this warning. Fix this by also clearing uptodate on the extent buffer when this happens, so that we get an error when we do a btrfs_search_slot() and find this block later. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-03btrfs: return whole extents in fiemapBoris Burkov1-1/+6
[ Upstream commit 15c7745c9a0078edad1f7df5a6bb7b80bc8cca23 ] `xfs_io -c 'fiemap <off> <len>' <file>` can give surprising results on btrfs that differ from xfs. btrfs prints out extents trimmed to fit the user input. If the user's fiemap request has an offset, then rather than returning each whole extent which intersects that range, we also trim the start extent to not have start < off. Documentation in filesystems/fiemap.txt and the xfs_io man page suggests that returning the whole extent is expected. Some cases which all yield the same fiemap in xfs, but not btrfs: dd if=/dev/zero of=$f bs=4k count=1 sudo xfs_io -c 'fiemap 0 1024' $f 0: [0..7]: 26624..26631 sudo xfs_io -c 'fiemap 2048 1024' $f 0: [4..7]: 26628..26631 sudo xfs_io -c 'fiemap 2048 4096' $f 0: [4..7]: 26628..26631 sudo xfs_io -c 'fiemap 3584 512' $f 0: [7..7]: 26631..26631 sudo xfs_io -c 'fiemap 4091 5' $f 0: [7..6]: 26631..26630 I believe this is a consequence of the logic for merging contiguous extents represented by separate extent items. That logic needs to track the last offset as it loops through the extent items, which happens to pick up the start offset on the first iteration, and trim off the beginning of the full extent. To fix it, start `off` at 0 rather than `start` so that we keep the iteration/merging intact without cutting off the start of the extent. after the fix, all the above commands give: 0: [0..7]: 26624..26631 The merging logic is exercised by fstest generic/483, and I have written a new fstest for checking we don't have backwards or zero-length fiemaps for cases like those above. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19btrfs: prevent NULL pointer dereference in extent_io_tree_panicSu Yue1-3/+1
commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream. Some extent io trees are initialized with NULL private member (e.g. btrfs_device::alloc_state and btrfs_fs_info::excluded_extents). Dereference of a NULL tree->private as inode pointer will cause panic. Pass tree->fs_info as it's known to be valid in all cases. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929 Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07btrfs: remove struct extent_io_opsNikolay Borisov1-2/+0
It's no longer used just remove the function and any related code which was initialising it for inodes. No functional changes. Removing 8 bytes from extent_io_tree in turn reduces size of other structures where it is embedded, notably btrfs_inode where it reduces size by 24 bytes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: call submit_bio_hook directly for metadata pagesNikolay Borisov1-2/+2
No need to go through a function pointer indirection simply call submit_bio_hook directly by exporting and renaming the helper to btrfs_submit_metadata_bio. This makes the code more readable and should result in somewhat faster code due to no longer paying the price for specualtive attack mitigations that come with indirect function calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: stop calling submit_bio_hook for data inodesNikolay Borisov1-3/+7
Instead export and rename the function to btrfs_submit_data_bio and call it directly in submit_one_bio. This avoids paying the cost for speculative attacks mitigations and improves code readability. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: don't opencode is_data_inode in end_bio_extent_readpageNikolay Borisov1-4/+2
Use the is_data_inode helper. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: call submit_bio_hook directly in submit_one_bioNikolay Borisov1-5/+2
BTRFS has 2 inode types (for the purposes of the code in submit_one_bio) - ordinary data inodes (including the freespace inode) and the btree inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can never be called from submit_one_bio so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: replace readpage_end_io_hook with direct callsNikolay Borisov1-3/+6
Don't call readpage_end_io_hook for the btree inode. Instead of relying on indirect calls to implement metadata buffer validation simply check if the inode whose page we are processing equals the btree inode. If it does call the necessary function. This is an improvement in 2 directions: 1. We aren't paying the penalty of indirect calls in a post-speculation attacks world. 2. The function is now named more explicitly so it's obvious what's going on This is in preparation to removing struct extent_io_ops altogether. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: open code extent_read_full_page to its sole callerNikolay Borisov1-19/+5
This makes reading the code a tad easier by decreasing the level of indirection by one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: sink mirror_num argument in __do_readpageNikolay Borisov1-6/+5
It's always set to 0 by the 2 callers so move it inside __do_readpage. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: sink read_flags argument into extent_read_full_pageNikolay Borisov1-2/+2
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: sink mirror_num argument in extent_read_full_pageNikolay Borisov1-3/+2
It's always set to 0 from the sole caller - btrfs_readpage. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: promote extent_read_full_page to btrfs_readpageNikolay Borisov1-17/+4
Now that btrfs_readpage is the only caller of extent_read_full_page the latter can be open coded in the former. Use the occassion to rename __extent_read_full_page to extent_read_full_page. To facillitate this change submit_one_bio has to be exported as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: remove mirror_num argument from extent_read_full_pageNikolay Borisov1-3/+3
It's called only from btrfs_readpage which always passes 0 so just sink the argument into extent_read_full_page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: remove btrfs_get_extent indirection from __do_readpageNikolay Borisov1-19/+12
Now that this function is only responsible for reading data pages it's no longer necessary to pass get_extent_t parameter across several layers of functions. This patch removes this parameter from multiple functions: __get_extent_map/__do_readpage/__extent_read_full_page/ extent_read_full_page and simply calls btrfs_get_extent directly in __get_extent_map. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: simplify metadata pages readingNikolay Borisov1-11/+10
Metadata pages currently use __do_readpage to read metadata pages, unfortunately this function is also used to deal with ordinary data pages. This makes the metadata pages reading code to go through multiple hoops in order to adhere to __do_readpage invariants. Most of these are necessary for data pages which could be compressed. For metadata it's enough to simply build a bio and submit it. To this effect simply call submit_extent_page directly from read_extent_buffer_pages which is the only callpath used to populate extent_buffers with data. This in turn enables further cleanups. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: make extent_fiemap take btrfs_inodeNikolay Borisov1-15/+13
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: make get_extent_skip_holes take btrfs_inodeNikolay Borisov1-5/+6
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: convert btrfs_inode_sectorsize to take btrfs_inodeNikolay Borisov1-3/+3
It's counterintuitive to have a function named btrfs_inode_xxx which takes a generic inode. Also move the function to btrfs_inode.h so that it has access to the definition of struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursedJosef Bacik1-1/+1
Nested locking with lockdep and everything else refers to lock hierarchy within the same lock map. This is how we indicate the same locks for different objects are ok to take in a specific order, for our use case that would be to take the lock on a leaf and then take a lock on an adjacent leaf. What ->lock_nested _actually_ refers to is if we happen to already be holding the write lock on the extent buffer and we're allowing a read lock to be taken on that extent buffer, which is recursion. Rename this so we don't get confused when we switch to a rwsem and have to start using the _nested helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: extent_io: do extra check for extent buffer read write functionsQu Wenruo1-37/+47
Although we have start, len check for extent buffer reader/write (e.g. read_extent_buffer()), these checks have limitations: - No overflow check Values like start = 1024 len = -1024 can still pass the basic (start + len) > eb->len check. - Checks are not consistent For read_extent_buffer() we only check (start + len) against eb->len. While for memcmp_extent_buffer() we also check start against eb->len. - Different error reporting mechanism We use WARN() in read_extent_buffer() but BUG() in memcpy_extent_buffer(). - Still modify memory if the request is obviously wrong In read_extent_buffer() even we find (start + len) > eb->len, we still call memset(dst, 0, len), which can easily cause memory access error if start + len overflows. To address above problems, this patch creates a new common function to check such access, check_eb_range(). - Add overflow check This function checks start, start + len against eb->len and overflow check. - Unified checks - Unified error reports Will call WARN() if CONFIG_BTRFS_DEBUG is configured. And also do btrfs_warn() message for non-debug build. - Exit ASAP if check fails No more possible memory corruption. - Add extra comment for @start @len used in those functions as it's sometimes confused with the logical addressing instead of a range inside the eb space Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817 [ Inspired by above report, the report itself is already addressed ] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use check_add_overflow ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: delete duplicated words + other fixes in commentsRandy Dunlap1-1/+1
Delete repeated words in fs/btrfs/. {to, the, a, and old} and change "into 2 part" to "into 2 parts". Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01Merge tag 'for-5.9-rc3-tag' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two small fixes and a bunch of lockdep fixes for warnings that show up with an upcoming tree locking update but are valid with current locks as well" * tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: fix the error message for transid error btrfs: set the lockdep class for log tree extent buffers btrfs: set the correct lockdep class for new nodes btrfs: allocate scrub workqueues outside of locks btrfs: fix potential deadlock in the search ioctl btrfs: drop path before adding new uuid tree entry btrfs: block-group: fix free-space bitmap threshold
2020-08-27btrfs: fix potential deadlock in the search ioctlJosef Bacik1-4/+4
With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted ------------------------------------------------------ compsize/11122 is trying to acquire lock: ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90 but task is already holding lock: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-fs-00){++++}-{3:3}: down_write_nested+0x3b/0x70 __btrfs_tree_lock+0x24/0x120 btrfs_search_slot+0x756/0x990 btrfs_lookup_inode+0x3a/0xb4 __btrfs_update_delayed_inode+0x93/0x270 btrfs_async_run_delayed_root+0x168/0x230 btrfs_work_helper+0xd4/0x570 process_one_work+0x2ad/0x5f0 worker_thread+0x3a/0x3d0 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #1 (&delayed_node->mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_delayed_update_inode+0x50/0x440 btrfs_update_inode+0x8a/0xf0 btrfs_dirty_inode+0x5b/0xd0 touch_atime+0xa1/0xd0 btrfs_file_mmap+0x3f/0x60 mmap_region+0x3a4/0x640 do_mmap+0x376/0x580 vm_mmap_pgoff+0xd5/0x120 ksys_mmap_pgoff+0x193/0x230 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&mm->mmap_lock#2){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-fs-00); lock(&delayed_node->mutex); lock(btrfs-fs-00); lock(&mm->mmap_lock#2); *** DEADLOCK *** 1 lock held by compsize/11122: #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __might_fault+0x3e/0x90 ? find_held_lock+0x72/0x90 __might_fault+0x68/0x90 ? __might_fault+0x3e/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 ? btrfs_search_forward+0x2a6/0x360 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ? __do_sys_newfstat+0x5a/0x70 ? ksys_ioctl+0x83/0xc0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is we're doing a copy_to_user() while holding tree locks, which can deadlock if we have to do a page fault for the copy_to_user(). This exists even without my locking changes, so it needs to be fixed. Rework the search ioctl to do the pre-fault and then copy_to_user_nofault for the copying. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-03Merge tag 'core-rcu-2020-08-03' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits) torture: Remove obsolete "cd $KVM" torture: Avoid duplicate specification of qemu command torture: Dump ftrace at shutdown only if requested torture: Add kvm-tranform.sh script for qemu-cmd files torture: Add more tracing crib notes to kvm.sh torture: Improve diagnostic for KCSAN-incapable compilers torture: Correctly summarize build-only runs torture: Pass --kmake-arg to all make invocations rcutorture: Check for unwatched readers torture: Abstract out console-log error detection torture: Add a stop-run capability torture: Create qemu-cmd in --buildonly runs rcu/rcutorture: Replace 0 with false torture: Add --allcpus argument to the kvm.sh script torture: Remove whitespace from identify_qemu_vcpus output rcutorture: NULL rcu_torture_current earlier in cleanup code rcutorture: Handle non-statistic bang-string error messages torture: Set configfile variable to current scenario rcutorture: Add races with task-exit processing locktorture: Use true and false to assign to bool variables ...
2020-07-31Merge branch 'for-mingo' of ↵Ingo Molnar1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull the v5.9 RCU bits from Paul E. McKenney: - Documentation updates - Miscellaneous fixes - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-27btrfs: do not set the full sync flag on the inode during page releaseFilipe Manana1-2/+8
When removing an extent map at try_release_extent_mapping(), called through the page release callback (btrfs_releasepage()), we always set the full sync flag on the inode, which forces the next fsync to use a slower code path. This hurts performance for workloads that dirty an amount of data that exceeds or is very close to the system's RAM memory and do frequent fsync operations (like database servers can for example). In particular if there are concurrent fsyncs against different files, by falling back to a full fsync we do a lot more checksum lookups in the checksums btree, as we do it for all the extents created in the current transaction, instead of only the new ones since the last fsync. These checksums lookups not only take some time but, more importantly, they also cause contention on the checksums btree locks due to the concurrency with checksum insertions in the btree by ordered extents from other inodes. We actually don't need to set the full sync flag on the inode, because we only remove extent maps that are in the list of modified extents if they were created in a past transaction, in which case an fsync skips them as it's pointless to log them. So stop setting the full fsync flag on the inode whenever we remove an extent map. This patch is part of a patchset that consists of 3 patches, which have the following subjects: 1/3 btrfs: fix race between page release and a fast fsync 2/3 btrfs: release old extent maps during page release 3/3 btrfs: do not set the full sync flag on the inode during page release Performance tests were ran against a branch (misc-next) containing the whole patchset. The test exercises a workload where there are multiple processes writing to files and fsyncing them (each writing and fsyncing its own file), and in total the amount of data dirtied ranges from 2x to 4x the system's RAM memory (16GiB), so that the page release callback is invoked frequently. The following script, using fio, was used to perform the tests: $ cat test-fsync.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 3 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 cat <<EOF > /tmp/fio-job.ini [writers] rw=write fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=64k ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS thread EOF echo "Using config:" echo cat /tmp/fio-job.ini echo mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were performed for different numbers of jobs, file sizes and fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has 12 cores, with cpu governance set to performance mode on all cores), 16GiB of ram (the host has 64GiB) and using a NVMe device directly (without an intermediary filesystem in the host). While running the tests, the host was not used for anything else, to avoid disturbing the tests. The obtained results were the following, and the last line printed by fio is pasted (includes aggregated throughput and test run time). ***************************************************** **** 1 job, 32GiB file, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec After patchset: WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec (+0.7% throughput, -0.8% run time) ***************************************************** **** 2 jobs, 16GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec After patchset: WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec (+19.1% throughput, -16.1% runtime) ***************************************************** **** 4 jobs, 8GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec After patchset: WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec (+37.8% throughput, -27.5% runtime) ***************************************************** **** 8 jobs, 4GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec After patchset: WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec (+74.7% throughput, -43.0% run time) ***************************************************** **** 16 jobs, 2GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec After patchset: WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec (+146.3% throughput, -59.5% run time) ***************************************************** **** 32 jobs, 2GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec After patchset: WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec (+212.7% throughput, -68.0% run time) ***************************************************** **** 64 jobs, 1GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec After patchset: WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec (+197.6% throughput, -66.4% run time) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: release old extent maps during page releaseFilipe Manana1-7/+24
When removing an extent map at try_release_extent_mapping(), called through the page release callback (btrfs_releasepage()), we never release an extent map that is in the list of modified extents. This is to prevent races with a concurrent fsync using the fast path, which could lead to not logging an extent created in the current transaction. However we can safely remove an extent map created in a past transaction that is still in the list of modified extents (because no one fsynced yet the inode after that transaction got commited), because such extents are skipped during an fsync as it is pointless to log them. This change does that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: fix race between page release and a fast fsyncFilipe Manana1-3/+13
When releasing an extent map, done through the page release callback, we can race with an ongoing fast fsync and cause the fsync to miss a new extent and not log it. The steps for this to happen are the following: 1) A page is dirtied for some inode I; 2) Writeback for that page is triggered by a path other than fsync, for example by the system due to memory pressure; 3) When the ordered extent for the extent (a single 4K page) finishes, we unpin the corresponding extent map and set its generation to N, the current transaction's generation; 4) The btrfs_releasepage() callback is invoked by the system due to memory pressure for that no longer dirty page of inode I; 5) At the same time, some task calls fsync on inode I, joins transaction N, and at btrfs_log_inode() it sees that the inode does not have the full sync flag set, so we proceed with a fast fsync. But before we get into btrfs_log_changed_extents() and lock the inode's extent map tree: 6) Through btrfs_releasepage() we end up at try_release_extent_mapping() and we remove the extent map for the new 4Kb extent, because it is neither pinned anymore nor locked. By calling remove_extent_mapping(), we remove the extent map from the list of modified extents, since the extent map does not have the logging flag set. We unlock the inode's extent map tree; 7) The task doing the fast fsync now enters btrfs_log_changed_extents(), locks the inode's extent map tree and iterates its list of modified extents, which no longer has the 4Kb extent in it, so it does not log the extent; 8) The fsync finishes; 9) Before transaction N is committed, a power failure happens. After replaying the log, the 4K extent of inode I will be missing, since it was not logged due to the race with try_release_extent_mapping(). So fix this by teaching try_release_extent_mapping() to not remove an extent map if it's still in the list of modified extents. Fixes: ff44c6e36dc9dc ("Btrfs: do not hold the write_lock on the extent tree while logging") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: return EROFS for BTRFS_FS_STATE_ERROR casesJosef Bacik1-1/+1
Eric reported seeing this message while running generic/475 BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted Full stack trace: BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-0): forced readonly BTRFS warning (device dm-0): Skipping commit of aborted transaction. ------------[ cut here ]------------ BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure BTRFS: Transaction aborted (error -117) BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10 WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs] BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10 CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs] RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001 RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7 RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000 R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70 FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0 Call Trace: ? finish_wait+0x90/0x90 ? __mutex_unlock_slowpath+0x45/0x2a0 ? lock_acquire+0xa3/0x440 ? lockref_put_or_lock+0x9/0x30 ? dput+0x20/0x4a0 ? dput+0x20/0x4a0 ? do_raw_spin_unlock+0x4b/0xc0 ? _raw_spin_unlock+0x1f/0x30 btrfs_sync_file+0x335/0x490 [btrfs] do_fsync+0x38/0x70 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f6a6ef1b6e3 Code: Bad RIP value. RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3 RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00 softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace af146e0e38433456 ]--- BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted This ret came from btrfs_write_marked_extents(). If we get an aborted transaction via EIO before, we'll see it in btree_write_cache_pages() and return EUCLEAN, which gets printed as "Filesystem corrupted". Except we shouldn't be returning EUCLEAN here, we need to be returning EROFS because EUCLEAN is reserved for actual corruption, not IO errors. We are inconsistent about our handling of BTRFS_FS_STATE_ERROR elsewhere, but we want to use EROFS for this particular case. The original transaction abort has the real error code for why we ended up with an aborted transaction, all subsequent actions just need to return EROFS because they may not have a trans handle and have no idea about the original cause of the abort. After patch "btrfs: don't WARN if we abort a transaction with EROFS" the stacktrace will not be dumped either. Reported-by: Eric Sandeen <esandeen@redhat.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add full test stacktrace ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: remove done label in writepage_delallocNikolay Borisov1-6/+2
Since there is not common cleanup run after the label it makes it somewhat redundant. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: streamline btrfs_get_io_failure_record logicNikolay Borisov1-66/+63
Make the function directly return a pointer to a failure record and adjust callers to handle it. Also refactor the logic inside so that the case which allocates the failure record for the first time is not handled in an 'if' arm, saving us a level of indentation. Finally make the function static as it's not used outside of extent_io.c . Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make get_state_failrec return failrec directlyNikolay Borisov1-11/+11
Only failure that get_state_failrec can get is if there is no failure for the given address. There is no reason why the function should return a status code and use a separate parameter for returning the actual failure rec (if one is found). Simplify it by making the return type a pointer and return ERR_PTR value in case of errors. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make writepage_delalloc take btrfs_inodeNikolay Borisov1-4/+5
Only find_lock_delalloc_range uses vfs_inode so let's take the btrfs_inode as a parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make __extent_writepage_io take btrfs_inodeNikolay Borisov1-8/+7
It has only a single use for a generic vfs inode vs 3 for btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_run_delalloc_range take btrfs_inodeNikolay Borisov1-1/+1
All children now take btrfs_inode so convert it to taking it as a parameter as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: don't use UAPI types for fiemap callbackDavid Sterba1-1/+1
The fiemap callback is not part of UAPI interface and the prototypes don't have the __u64 types either. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make extent_clear_unlock_delalloc take btrfs_inodeNikolay Borisov1-4/+3
It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode interface will allow seamless conversion of its callers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-24Merge tag 'for-5.8-rc6-tag' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into master Pull btrfs fixes from David Sterba: "A few resouce leak fixes from recent patches, all are stable material. The problems have been observed during testing or have a reproducer" * tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix mount failure caused by race with umount btrfs: fix page leaks after failure to lock page for delalloc btrfs: qgroup: fix data leak caused by race between writeback and truncate btrfs: fix double free on ulist after backref resolution failure
2020-07-21btrfs: fix page leaks after failure to lock page for delallocRobbie Ko1-1/+2
When locking pages for delalloc, we check if it's dirty and mapping still matches. If it does not match, we need to return -EAGAIN and release all pages. Only the current page was put though, iterate over all the remaining pages too. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-07Merge tag 'for-5.8-rc4-tag' of ↵Linus Torvalds1-16/+24
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fix of a leak in global block reserve accounting - fix a (hard to hit) race of readahead vs releasepage that could lead to crash - convert all remaining uses of comment fall through an