Age | Commit message (Collapse) | Author | Files | Lines |
|
When starting a transaction, we are assuming the number of bytes used for
each delayed ref update matches the number of bytes used for each item
update, that is the return value of:
btrfs_calc_insert_metadata_size(fs_info, num_items)
However that is not correct when we are using the free space tree, as we
need to multiply that value by 2, since delayed ref updates need to modify
the free space tree besides the extent tree.
So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
number of bytes used for delayed ref updates.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When starting a transaction we are comparing the result of a call to
btrfs_block_rsv_full() with 0, but the function returns a boolean. While
in practice it is not incorrect, as 0 is equivalent to false, it makes it
a bit odd and less readable. So update the check to not compare against 0
and instead use the logical not (!) operator.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function should_end_transaction() is very short and only has one
caller, which is btrfs_should_end_transaction(). So move the code from
should_end_transaction() into btrfs_should_end_transaction().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.
That results in stack traces like the following:
[42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
[42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[42.936] ------------[ cut here ]------------
[42.936] BTRFS: Transaction aborted (error -28)
[42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
[42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Code: ff ff 45 8b (...)
[42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
[42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
[42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
[42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
[42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
[42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
[42.936] FS: 00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
[42.936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
[42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42.936] Call Trace:
[42.936] <TASK>
[42.936] ? start_transaction+0xcb/0x610 [btrfs]
[42.936] prepare_to_relocate+0x111/0x1a0 [btrfs]
[42.936] relocate_block_group+0x57/0x5d0 [btrfs]
[42.936] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[42.936] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[42.936] ? __pfx_autoremove_wake_function+0x10/0x10
[42.936] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[42.936] btrfs_balance+0x8ff/0x11d0 [btrfs]
[42.936] ? __kmem_cache_alloc_node+0x14a/0x410
[42.936] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[42.937] ? mod_objcg_state+0xd2/0x360
[42.937] ? refill_obj_stock+0xb0/0x160
[42.937] ? seq_release+0x25/0x30
[42.937] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[42.937] ? percpu_counter_add_batch+0x2e/0xa0
[42.937] ? __x64_sys_ioctl+0x88/0xc0
[42.937] __x64_sys_ioctl+0x88/0xc0
[42.937] do_syscall_64+0x38/0x90
[42.937] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[42.937] RIP: 0033:0x7f381a6ffe9b
[42.937] Code: 00 48 89 44 24 (...)
[42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[42.937] </TASK>
[42.937] ---[ end trace 0000000000000000 ]---
[42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
[59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
[59.196] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.196] task:btrfs state:D stack:0 pid:346772 ppid:1 flags:0x00004002
[59.196] Call Trace:
[59.196] <TASK>
[59.196] __schedule+0x392/0xa70
[59.196] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.196] schedule+0x5d/0xd0
[59.196] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.197] ? __pfx_autoremove_wake_function+0x10/0x10
[59.197] scrub_pause_off+0x21/0x50 [btrfs]
[59.197] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.197] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] scrub_stripe+0x20d/0x740 [btrfs]
[59.198] scrub_chunk+0xc4/0x130 [btrfs]
[59.198] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.199] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.199] ? _copy_from_user+0x7b/0x80
[59.199] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.199] ? refill_stock+0x33/0x50
[59.199] ? should_failslab+0xa/0x20
[59.199] ? kmem_cache_alloc_node+0x151/0x460
[59.199] ? alloc_io_context+0x1b/0x80
[59.199] ? preempt_count_add+0x70/0xa0
[59.199] ? __x64_sys_ioctl+0x88/0xc0
[59.199] __x64_sys_ioctl+0x88/0xc0
[59.199] do_syscall_64+0x38/0x90
[59.199] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.199] RIP: 0033:0x7f82ffaffe9b
[59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
[59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
[59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
[59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.199] </TASK>
[59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
[59.200] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.201] task:btrfs state:D stack:0 pid:346773 ppid:1 flags:0x00004002
[59.201] Call Trace:
[59.201] <TASK>
[59.201] __schedule+0x392/0xa70
[59.201] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.201] schedule+0x5d/0xd0
[59.201] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.201] ? __pfx_autoremove_wake_function+0x10/0x10
[59.201] scrub_pause_off+0x21/0x50 [btrfs]
[59.202] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.202] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.202] ? __pfx_autoremove_wake_function+0x10/0x10
[59.202] scrub_stripe+0x20d/0x740 [btrfs]
[59.202] scrub_chunk+0xc4/0x130 [btrfs]
[59.203] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.203] ? __pfx_autoremove_wake_function+0x10/0x10
[59.203] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.203] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.203] ? _copy_from_user+0x7b/0x80
[59.203] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.204] ? should_failslab+0xa/0x20
[59.204] ? kmem_cache_alloc_node+0x151/0x460
[59.204] ? alloc_io_context+0x1b/0x80
[59.204] ? preempt_count_add+0x70/0xa0
[59.204] ? __x64_sys_ioctl+0x88/0xc0
[59.204] __x64_sys_ioctl+0x88/0xc0
[59.204] do_syscall_64+0x38/0x90
[59.204] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.204] RIP: 0033:0x7f82ffaffe9b
[59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
[59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
[59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
[59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.204] </TASK>
[59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
[59.205] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.206] task:btrfs state:D stack:0 pid:346774 ppid:1 flags:0x00004002
[59.206] Call Trace:
[59.206] <TASK>
[59.206] __schedule+0x392/0xa70
[59.206] schedule+0x5d/0xd0
[59.206] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.206] ? __pfx_autoremove_wake_function+0x10/0x10
[59.206] scrub_pause_off+0x21/0x50 [btrfs]
[59.207] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.207] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.207] ? __pfx_autoremove_wake_function+0x10/0x10
[59.207] scrub_stripe+0x20d/0x740 [btrfs]
[59.208] scrub_chunk+0xc4/0x130 [btrfs]
[59.208] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.208] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.208] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.208] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.209] ? _copy_from_user+0x7b/0x80
[59.209] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.209] ? should_failslab+0xa/0x20
[59.209] ? kmem_cache_alloc_node+0x151/0x460
[59.209] ? alloc_io_context+0x1b/0x80
[59.209] ? preempt_count_add+0x70/0xa0
[59.209] ? __x64_sys_ioctl+0x88/0xc0
[59.209] __x64_sys_ioctl+0x88/0xc0
[59.209] do_syscall_64+0x38/0x90
[59.209] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.209] RIP: 0033:0x7f82ffaffe9b
[59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
[59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
[59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
[59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.209] </TASK>
[59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
[59.210] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.211] task:btrfs state:D stack:0 pid:346775 ppid:1 flags:0x00004002
[59.211] Call Trace:
[59.211] <TASK>
[59.211] __schedule+0x392/0xa70
[59.211] schedule+0x5d/0xd0
[59.211] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.211] ? __pfx_autoremove_wake_function+0x10/0x10
[59.211] scrub_pause_off+0x21/0x50 [btrfs]
[59.212] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.212] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.212] ? __pfx_autoremove_wake_function+0x10/0x10
[59.212] scrub_stripe+0x20d/0x740 [btrfs]
[59.213] scrub_chunk+0xc4/0x130 [btrfs]
[59.213] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.213] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.213] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.213] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.214] ? _copy_from_user+0x7b/0x80
[59.214] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.214] ? should_failslab+0xa/0x20
[59.214] ? kmem_cache_alloc_node+0x151/0x460
[59.214] ? alloc_io_context+0x1b/0x80
[59.214] ? preempt_count_add+0x70/0xa0
[59.214] ? __x64_sys_ioctl+0x88/0xc0
[59.214] __x64_sys_ioctl+0x88/0xc0
[59.214] do_syscall_64+0x38/0x90
[59.214] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.214] RIP: 0033:0x7f82ffaffe9b
[59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
[59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
[59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
[59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.214] </TASK>
[59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
[59.215] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.217] task:btrfs state:D stack:0 pid:346776 ppid:1 flags:0x00004002
[59.217] Call Trace:
[59.217] <TASK>
[59.217] __schedule+0x392/0xa70
[59.217] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.217] schedule+0x5d/0xd0
[59.217] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.217] ? __pfx_autoremove_wake_function+0x10/0x10
[59.217] scrub_pause_off+0x21/0x50 [btrfs]
[59.217] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.217] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.218] ? __pfx_autoremove_wake_function+0x10/0x10
[59.218] scrub_stripe+0x20d/0x740 [btrfs]
[59.218] scrub_chunk+0xc4/0x130 [btrfs]
[59.218] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.219] ? __pfx_autoremove_wake_function+0x10/0x10
[59.219] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.219] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.219] ? _copy_from_user+0x7b/0x80
[59.219] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.219] ? should_failslab+0xa/0x20
[59.219] ? kmem_cache_alloc_node+0x151/0x460
[59.219] ? alloc_io_context+0x1b/0x80
[59.219] ? preempt_count_add+0x70/0xa0
[59.219] ? __x64_sys_ioctl+0x88/0xc0
[59.219] __x64_sys_ioctl+0x88/0xc0
[59.219] do_syscall_64+0x38/0x90
[59.219] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.219] RIP: 0033:0x7f82ffaffe9b
[59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
[59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
[59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
[59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.219] </TASK>
[59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
[59.220] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.222] task:btrfs state:D stack:0 pid:346822 ppid:1 flags:0x00004002
[59.222] Call Trace:
[59.222] <TASK>
[59.222] __schedule+0x392/0xa70
[59.222] schedule+0x5d/0xd0
[59.222] btrfs_scrub_cancel+0x91/0x100 [btrfs]
[59.222] ? __pfx_autoremove_wake_function+0x10/0x10
[59.222] btrfs_commit_transaction+0x572/0xeb0 [btrfs]
[59.223] ? start_transaction+0xcb/0x610 [btrfs]
[59.223] prepare_to_relocate+0x111/0x1a0 [btrfs]
[59.223] relocate_block_group+0x57/0x5d0 [btrfs]
[59.223] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[59.223] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[59.224] ? __pfx_autoremove_wake_function+0x10/0x10
[59.224] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[59.224] btrfs_balance+0x8ff/0x11d0 [btrfs]
[59.224] ? __kmem_cache_alloc_node+0x14a/0x410
[59.224] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[59.225] ? mod_objcg_state+0xd2/0x360
[59.225] ? refill_obj_stock+0xb0/0x160
[59.225] ? seq_release+0x25/0x30
[59.225] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[59.225] ? percpu_counter_add_batch+0x2e/0xa0
[59.225] ? __x64_sys_ioctl+0x88/0xc0
[59.225] __x64_sys_ioctl+0x88/0xc0
[59.225] do_syscall_64+0x38/0x90
[59.225] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.225] RIP: 0033:0x7f381a6ffe9b
[59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[59.225] </TASK>
What happens is the following:
1) A scrub is running, so fs_info->scrubs_running is 1;
2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
pauses scrub by calling btrfs_scrub_pause(). That increments
fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);
3) The scrub task pauses at scrub_pause_off(), waiting for
fs_info->scrub_pause_req to decrease to 0;
4) Task A then enters btrfs_relocate_block_group(), and down that call
chain we start a transaction and then attempt to commit it;
5) When task A calls btrfs_commit_transaction(), it either will do the
commit itself or wait for some other task that already started the
commit of the transaction - it doesn't matter which case;
6) The transaction commit enters state TRANS_STATE_COMMIT_START;
7) An error happens during the transaction commit, like -ENOSPC when
running delayed refs or delayed items for example;
8) This results in calling transaction.c:cleanup_transaction(), where
we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
to decrease to 0;
9) From this point on, both the transaction commit and the scrub task
hang forever:
1) The transaction commit is waiting for fs_info->scrubs_running to
be decreased to 0;
2) The scrub task is at scrub_pause_off() waiting for
fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.
Therefore resulting in a deadlock.
Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.
This was triggered with btrfs/061 from fstests.
Fixes: 55e3a601c81c ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Since the introduction of per-fs feature sysfs interface
(/sys/fs/btrfs/<UUID>/features/), the content of that directory is never
updated.
Thus for the following case, that directory will not show the new
features like RAID56:
# mkfs.btrfs -f $dev1 $dev2 $dev3
# mount $dev1 $mnt
# btrfs balance start -f -mconvert=raid5 $mnt
# ls /sys/fs/btrfs/$uuid/features/
extended_iref free_space_tree no_holes skinny_metadata
While after unmount and mount, we got the correct features:
# umount $mnt
# mount $dev1 $mnt
# ls /sys/fs/btrfs/$uuid/features/
extended_iref free_space_tree no_holes raid56 skinny_metadata
[CAUSE]
Because we never really try to update the content of per-fs features/
directory.
We had an attempt to update the features directory dynamically in commit
14e46e04958d ("btrfs: synchronize incompat feature bits with sysfs
files"), but unfortunately it get reverted in commit e410e34fad91
("Revert "btrfs: synchronize incompat feature bits with sysfs files"").
The problem in the original patch is, in the context of
btrfs_create_chunk(), we can not afford to update the sysfs group.
The exported but never utilized function, btrfs_sysfs_feature_update()
is the leftover of such attempt. As even if we go sysfs_update_group(),
new files will need extra memory allocation, and we have no way to
specify the sysfs update to go GFP_NOFS.
[FIX]
This patch will address the old problem by doing asynchronous sysfs
update in the cleaner thread.
This involves the following changes:
- Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers to set
BTRFS_FS_FEATURE_CHANGED flag when needed
- Update btrfs_sysfs_feature_update() to use sysfs_update_group()
And drop unnecessary arguments.
- Call btrfs_sysfs_feature_update() in cleaner_kthread
If we have the BTRFS_FS_FEATURE_CHANGED flag set.
- Wake up cleaner_kthread in btrfs_commit_transaction if we have
BTRFS_FS_FEATURE_CHANGED flag
By this, all the previously dangerous call sites like
btrfs_create_chunk() need no new changes, as above helpers would
have already set the BTRFS_FS_FEATURE_CHANGED flag.
The real work happens at cleaner_kthread, thus we pay the cost of
delaying the update to sysfs directory, but the delayed time should be
small enough that end user can not distinguish though it might get
delayed if the cleaner thread is busy with removing subvolumes or
defrag.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While trying to sync messages.[ch] I ended up with this dependency on
messages.h in the rest of btrfs-progs code base because it's where
btrfs_abort_transaction() was now held. We want to keep messages.[ch]
limited to the kernel code, and the btrfs_abort_transaction() code
better fits in the transaction code and not in messages.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the __cold attributes ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The div_factor* helpers calculate fraction or percentage fraction. The
name is a bit confusing, we use it only for percentage calculations and
there are two helpers.
There's a helper mult_frac that's for general fractions, that tries to
be accurate but we multiply and divide by small numbers so we can use
the div_u64 helper.
Rename the div_factor* helpers and use 1..100 percentage range, also drop
the case checking for percentage == 100, it's never hit.
The conversions:
* div_factor calculates tenths and the numbers need to be adjusted
* div_factor_fine is direct replacement
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these out of ctree.h into scrub.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these out of ctree.h into relocation.h to cut down on code in
ctree.h
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these out of ctree.h into uuid-tree.h to cut down on the code in
ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these prototypes out of ctree.h and into their own header file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the defrag code is all in one file, create a defrag.h and move
all the defrag related prototypes and helper out of ctree.h and into
defrag.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move all the root-tree.c prototypes to root-tree.h, and then update all
the necessary files to include the new header.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.
At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.
Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.
This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that we're not using this code anywhere we can remove it as well as
the member from fs_info.
We don't have any mount options or on/off features that would utilize
the pending infrastructure, the last one was inode_cache.
There was a patchset [1] to enable some features from sysfs that would
break things if it would be set immediately. In case we'll need that
kind of logic again the patch can be reverted, but for the current use
it can be replaced by the single state bit to do the commit.
[1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we are only using fs_info->pending_changes to indicate that we
need a transaction commit. The original users for this were removed
years ago and we don't have more usage in sight, so this is the only
remaining reason to have this field. Add a flag so we can remove this
code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We only use this for normal inodes, so don't set it if we're not a
normal inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
KCSAN reports that there's unlocked access mixed with locked access,
which is technically correct but is not a bug. To avoid false alerts at
least from KCSAN, add annotation and use a wrapper whenever ->full is
accessed for read outside of lock.
It is used as a fast check and only advisory. In the worst case the
block reserve is found !full and becomes full in the meantime, but
properly handled.
Depending on the value of ->full, btrfs_block_rsv_release decides
where to return the reservation, and block_rsv_release_bytes handles a
NULL pointer for block_rsv and if it's not NULL then it double checks
the full status under a lock.
Link: https://lore.kernel.org/linux-btrfs/CAAwBoOJDjei5Hnem155N_cJwiEkVwJYvgN-tQrwWbZQGhFU=cA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/YvHU/vsXd7uz5V6j@hungrycats.org
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The extent tree v2 needs a new root for storing all block group items,
the whole feature hasn't been finished yet so we can afford to do some
changes.
My initial proposal years ago just added a new tree rootid, and load it
from tree root, just like what we did for quota/free space tree/uuid/extent
roots.
But the extent tree v2 patches introduced a completely new way to store
block group tree root into super block which is arguably wasteful.
Currently there are only 3 trees stored in super blocks, and they all
have their valid reasons:
- Chunk root
Needed for bootstrap.
- Tree root
Really the entry point for all trees.
- Log root
This is special as log root has to be updated out of existing
transaction mechanism.
There is not even any reason to put block group root into super blocks,
the block group tree is updated at the same time as the old extent tree,
no need for extra bootstrap/out-of-transaction update.
So just move block group root from super block into tree root.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
struct btrfs_caching_ctl::progress and struct
btrfs_block_group::last_byte_to_unpin were previously needed to ensure
that unpin_extent_range() didn't return a range to the free space cache
before the caching thread had a chance to cache that range. However, the
commit "btrfs: fix space cache corruption and potential double
allocations" made it so that we always synchronously cache the block
group at the time that we pin the extent, so this machinery is no longer
necessary.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In contrast to the num_writers and num_extwriters wait events, the
condition for the pending ordered wait event is signaled in a different
context from the wait event itself. The condition signaling occurs in
btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait
event is implemented in btrfs_commit_transaction() in
fs/btrfs/transaction.c
Thus the thread signaling the condition has to acquire the lockdep map
as a reader at the start of btrfs_remove_ordered_extent() and release it
after it has signaled the condition. In this case some dependencies
might be left out due to the placement of the annotation, but it is
better than no annotation at all.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add lockdep annotations for the transaction states that have wait
events;
1) TRANS_STATE_COMMIT_START
2) TRANS_STATE_UNBLOCKED
3) TRANS_STATE_SUPER_COMMITTED
4) TRANS_STATE_COMPLETED
The new macros introduced here to annotate the transaction states wait
events have the same effect as the generic lockdep annotation macros.
With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START
the transaction thread has to acquire the lockdep maps for the
transaction states as reader after the lockdep map for num_writers is
released so that lockdep does not complain.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Similarly to the num_writers wait event in fs/btrfs/transaction.c add a
lockdep annotation for the num_extwriters wait event.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Annotate the num_writers wait event in fs/btrfs/transaction.c with
lockdep in order to catch deadlocks involving this wait event.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception. Making it
consistent everywhere avoids surprises.
The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Track several stats about transaction commit, to be later exported via
sysfs:
- number of commits so far
- duration of the last commit in ns
- maximum commit duration seen so far in ns
- total duration for all commits so far in ns
The update of the commit stats occurs after the commit thread has gone
through all the logic that checks if there is another thread committing
at the same time. This means that we only account for actual commit work
in the commit stats we report and not the time the thread spends waiting
until it is ready to do the commit work.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This reverts commit 48b36a602a335c184505346b5b37077840660634.
Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.
Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.
[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
… rename it to simply fs_roots and adjust all usages of this object to use
the XArray API, because it is notionally easier to use and understand, as
it provides array semantics, and also takes care of locking for us,
further simplifying the code.
Also do some refactoring, esp. where the API change requires largely
rewriting some functions, anyway.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Da |