Age | Commit message (Collapse) | Author | Files | Lines |
|
The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if
none of the space infos is reclaimable (for example if periodic reclaim
is disabled, which is the default), so we return an undefined value.
This can be fixed my making btrfs_reclaim_sweep() not return any value
as well as do_reclaim_sweep() because:
1) do_reclaim_sweep() always returns 0, so we can make it return void;
2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't
care about its return value, and in its context there's nothing to do
about any errors anyway.
Therefore remove the return value from btrfs_reclaim_sweep() and
do_reclaim_sweep().
Fixes: e4ca3932ae90 ("btrfs: periodic block_group reclaim")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is an internal report that KASAN is reporting use-after-free, with
the following backtrace:
BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs]
Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45
CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
dump_stack_lvl+0x61/0x80
print_address_description.constprop.0+0x5e/0x2f0
print_report+0x118/0x216
kasan_report+0x11d/0x1f0
btrfs_check_read_bio+0xa68/0xb70 [btrfs]
process_one_work+0xce0/0x12a0
worker_thread+0x717/0x1250
kthread+0x2e3/0x3c0
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x11/0x20
Allocated by task 20917:
kasan_save_stack+0x37/0x60
kasan_save_track+0x10/0x30
__kasan_slab_alloc+0x7d/0x80
kmem_cache_alloc_noprof+0x16e/0x3e0
mempool_alloc_noprof+0x12e/0x310
bio_alloc_bioset+0x3f0/0x7a0
btrfs_bio_alloc+0x2e/0x50 [btrfs]
submit_extent_page+0x4d1/0xdb0 [btrfs]
btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
btrfs_readahead+0x29a/0x430 [btrfs]
read_pages+0x1a7/0xc60
page_cache_ra_unbounded+0x2ad/0x560
filemap_get_pages+0x629/0xa20
filemap_read+0x335/0xbf0
vfs_read+0x790/0xcb0
ksys_read+0xfd/0x1d0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Freed by task 20917:
kasan_save_stack+0x37/0x60
kasan_save_track+0x10/0x30
kasan_save_free_info+0x37/0x50
__kasan_slab_free+0x4b/0x60
kmem_cache_free+0x214/0x5d0
bio_free+0xed/0x180
end_bbio_data_read+0x1cc/0x580 [btrfs]
btrfs_submit_chunk+0x98d/0x1880 [btrfs]
btrfs_submit_bio+0x33/0x70 [btrfs]
submit_one_bio+0xd4/0x130 [btrfs]
submit_extent_page+0x3ea/0xdb0 [btrfs]
btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
btrfs_readahead+0x29a/0x430 [btrfs]
read_pages+0x1a7/0xc60
page_cache_ra_unbounded+0x2ad/0x560
filemap_get_pages+0x629/0xa20
filemap_read+0x335/0xbf0
vfs_read+0x790/0xcb0
ksys_read+0xfd/0x1d0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[CAUSE]
Although I cannot reproduce the error, the report itself is good enough
to pin down the cause.
The call trace is the regular endio workqueue context, but the
free-by-task trace is showing that during btrfs_submit_chunk() we
already hit a critical error, and is calling btrfs_bio_end_io() to error
out. And the original endio function called bio_put() to free the whole
bio.
This means a double freeing thus causing use-after-free, e.g.:
1. Enter btrfs_submit_bio() with a read bio
The read bio length is 128K, crossing two 64K stripes.
2. The first run of btrfs_submit_chunk()
2.1 Call btrfs_map_block(), which returns 64K
2.2 Call btrfs_split_bio()
Now there are two bios, one referring to the first 64K, the other
referring to the second 64K.
2.3 The first half is submitted.
3. The second run of btrfs_submit_chunk()
3.1 Call btrfs_map_block(), which by somehow failed
Now we call btrfs_bio_end_io() to handle the error
3.2 btrfs_bio_end_io() calls the original endio function
Which is end_bbio_data_read(), and it calls bio_put() for the
original bio.
Now the original bio is freed.
4. The submitted first 64K bio finished
Now we call into btrfs_check_read_bio() and tries to advance the bio
iter.
But since the original bio (thus its iter) is already freed, we
trigger the above use-after free.
And even if the memory is not poisoned/corrupted, we will later call
the original endio function, causing a double freeing.
[FIX]
Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(),
which has the extra check on split bios and do the proper refcounting
for cloned bios.
Furthermore there is already one extra btrfs_cleanup_bio() call, but
that is duplicated to btrfs_orig_bbio_end_io() call, so remove that
label completely.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
extent_fiemap()
There's a warning (probably on some older compiler version):
fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]: => 822:19
Initialize the variable to 0 although it's not necessary as it's either
properly set or not used after an error. The called function is in the
same file so this is a false alert but we want to fix all
-Wmaybe-uninitialized reports.
Link: https://lore.kernel.org/all/20240819070639.2558629-1-geert@linux-m68k.org/
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have transient failures with btrfs/301, specifically in the part
where we do
for i in $(seq 0 10); do
write 50m to file
rm -f file
done
Sometimes this will result in a transient quota error, and it's because
sometimes we start writeback on the file which results in a delayed
iput, and thus the rm doesn't actually clean the file up. When we're
flushing the quota space we need to run the delayed iputs to make sure
all the unlinks that we think have completed have actually completed.
This removes the small window where we could fail to find enough space
in our quota.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although there are several patches improving the extent map shrinker,
there are still reports of too frequent shrinker behavior, taking too
much CPU for the kswapd process.
So let's only enable extent shrinker for now, until we got more
comprehensive understanding and a better solution.
Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
__btrfs_add_free_space_zoned() references and modifies bg's alloc_offset,
ro, and zone_unusable, but without taking the lock. It is mostly safe
because they monotonically increase (at least for now) and this function is
mostly called by a transaction commit, which is serialized by itself.
Still, taking the lock is a safer and correct option and I'm going to add a
change to reset zone_unusable while a block group is still alive. So, add
locking around the operations.
Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[REPORT]
There is a corruption report that btrfs refused to mount a fs that has
overlapping dev extents:
BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272
BTRFS error (device sdc): failed to verify dev extents against chunks: -117
BTRFS error (device sdc): open_ctree failed
[CAUSE]
The direct cause is very obvious, there is a bad dev extent item with
incorrect length.
With btrfs check reporting two overlapping extents, the second one shows
some clue on the cause:
ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272
ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144
ERROR: errors found in extent allocation tree or chunk allocation
The second one looks like a bitflip happened during new chunk
allocation:
hex(2257707008000) = 0x20da9d30000
hex(2257707270144) = 0x20da9d70000
diff = 0x00000040000
So it looks like a bitflip happened during new dev extent allocation,
resulting the second overlap.
Currently we only do the dev-extent verification at mount time, but if the
corruption is caused by memory bitflip, we really want to catch it before
writing the corruption to the storage.
Furthermore the dev extent items has the following key definition:
(<device id> DEV_EXTENT <physical offset>)
Thus we can not just rely on the generic key order check to make sure
there is no overlapping.
[ENHANCEMENT]
Introduce dedicated dev extent checks, including:
- Fixed member checks
* chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3)
* chunk_objectid should always be
BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256)
- Alignment checks
* chunk_offset should be aligned to sectorsize
* length should be aligned to sectorsize
* key.offset should be aligned to sectorsize
- Overlap checks
If the previous key is also a dev-extent item, with the same
device id, make sure we do not overlap with the previous dev extent.
Reported: Stefan N <stefannnau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.
According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:
"Upon successful completion, unlink() shall mark for update the last data
modification and last file status change timestamps of the parent
directory. Also, if the file's link count is not 0, the last file status
change timestamp of the file shall be marked for update."
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add the __counted_by compiler attribute to the flexible array member
name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In __extent_writepage_io(), we call btrfs_set_range_writeback() ->
folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the
mapping xarray if the folio is not dirty. This worked fine before commit
97713b1a2ced ("btrfs: do not clear page dirty inside
extent_write_locked_range()").
After the commit, however, the folio is still dirty at this point, so the
mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io()
calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That
results in the page being unlocked with a "strange" state. The page is not
PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY.
This strange state looks like causing a hang with a call trace below when
running fstests generic/091 on a null_blk device. It is waiting for a folio
lock.
While I don't have an exact relation between this hang and the strange
state, fixing the state also fixes the hang. And, that state is worth
fixing anyway.
This commit reorders btrfs_folio_clear_dirty() and
btrfs_set_range_writeback() in __extent_writepage_io(), so that the
PAGECACHE_TAG_DIRTY tag is properly removed from the xarray.
[464.274] task:fsx state:D stack:0 pid:3034 tgid:3034 ppid:2853 flags:0x00004002
[464.286] Call Trace:
[464.291] <TASK>
[464.295] __schedule+0x10ed/0x6260
[464.301] ? __pfx___blk_flush_plug+0x10/0x10
[464.308] ? __submit_bio+0x37c/0x450
[464.314] ? __pfx___schedule+0x10/0x10
[464.321] ? lock_release+0x567/0x790
[464.327] ? __pfx_lock_acquire+0x10/0x10
[464.334] ? __pfx_lock_release+0x10/0x10
[464.340] ? __pfx_lock_acquire+0x10/0x10
[464.347] ? __pfx_lock_release+0x10/0x10
[464.353] ? do_raw_spin_lock+0x12e/0x270
[464.360] schedule+0xdf/0x3b0
[464.365] io_schedule+0x8f/0xf0
[464.371] folio_wait_bit_common+0x2ca/0x6d0
[464.378] ? folio_wait_bit_common+0x1cc/0x6d0
[464.385] ? __pfx_folio_wait_bit_common+0x10/0x10
[464.392] ? __pfx_filemap_get_folios_tag+0x10/0x10
[464.400] ? __pfx_wake_page_function+0x10/0x10
[464.407] ? __pfx___might_resched+0x10/0x10
[464.414] ? do_raw_spin_unlock+0x58/0x1f0
[464.420] extent_write_cache_pages+0xe49/0x1620 [btrfs]
[464.428] ? lock_acquire+0x435/0x500
[464.435] ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs]
[464.443] ? btrfs_do_write_iter+0x493/0x640 [btrfs]
[464.451] ? orc_find.part.0+0x1d4/0x380
[464.457] ? __pfx_lock_release+0x10/0x10
[464.464] ? __pfx_lock_release+0x10/0x10
[464.471] ? btrfs_do_write_iter+0x493/0x640 [btrfs]
[464.478] btrfs_writepages+0x1cc/0x460 [btrfs]
[464.485] ? __pfx_btrfs_writepages+0x10/0x10 [btrfs]
[464.493] ? is_bpf_text_address+0x6e/0x100
[464.500] ? kernel_text_address+0x145/0x160
[464.507] ? unwind_get_return_address+0x5e/0xa0
[464.514] ? arch_stack_walk+0xac/0x100
[464.521] do_writepages+0x176/0x780
[464.527] ? lock_release+0x567/0x790
[464.533] ? __pfx_do_writepages+0x10/0x10
[464.540] ? __pfx_lock_acquire+0x10/0x10
[464.546] ? __pfx_stack_trace_save+0x10/0x10
[464.553] ? do_raw_spin_lock+0x12e/0x270
[464.560] ? do_raw_spin_unlock+0x58/0x1f0
[464.566] ? _raw_spin_unlock+0x23/0x40
[464.573] ? wbc_attach_and_unlock_inode+0x3da/0x7d0
[464.580] filemap_fdatawrite_wbc+0x113/0x180
[464.587] ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
[464.596] __filemap_fdatawrite_range+0xaf/0xf0
[464.603] ? __pfx___filemap_fdatawrite_range+0x10/0x10
[464.611] ? trace_irq_enable.constprop.0+0xce/0x110
[464.618] ? kasan_quarantine_put+0xd7/0x1e0
[464.625] btrfs_start_ordered_extent+0x46f/0x570 [btrfs]
[464.633] ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs]
[464.642] ? __clear_extent_bit+0x2c0/0x9d0 [btrfs]
[464.650] btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs]
[464.659] ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs]
[464.669] btrfs_read_folio+0x12a/0x1d0 [btrfs]
[464.676] ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs]
[464.684] ? __pfx_filemap_add_folio+0x10/0x10
[464.691] ? __pfx___might_resched+0x10/0x10
[464.698] ? __filemap_get_folio+0x1c5/0x450
[464.705] prepare_uptodate_page+0x12e/0x4d0 [btrfs]
[464.713] prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
[464.721] ? fault_in_iov_iter_readable+0xd2/0x240
[464.729] btrfs_buffered_write+0x5bd/0x12f0 [btrfs]
[464.737] ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs]
[464.745] ? __pfx_lock_release+0x10/0x10
[464.752] ? generic_write_checks+0x275/0x400
[464.759] ? down_write+0x118/0x1f0
[464.765] ? up_write+0x19b/0x500
[464.770] btrfs_direct_write+0x731/0xba0 [btrfs]
[464.778] ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs]
[464.785] ? __pfx___might_resched+0x10/0x10
[464.792] ? lock_acquire+0x435/0x500
[464.798] ? lock_acquire+0x435/0x500
[464.804] btrfs_do_write_iter+0x494/0x640 [btrfs]
[464.811] ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs]
[464.819] ? __pfx___might_resched+0x10/0x10
[464.825] ? rw_verify_area+0x6d/0x590
[464.831] vfs_write+0x5d7/0xf50
[464.837] ? __might_fault+0x9d/0x120
[464.843] ? __pfx_vfs_write+0x10/0x10
[464.849] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
[464.856] ? lock_release+0x567/0x790
[464.862] ksys_write+0xfb/0x1d0
[464.867] ? __pfx_ksys_write+0x10/0x10
[464.873] ? _raw_spin_unlock+0x23/0x40
[464.879] ? btrfs_getattr+0x4af/0x670 [btrfs]
[464.886] ? vfs_getattr_nosec+0x79/0x340
[464.892] do_syscall_64+0x95/0x180
[464.898] ? __do_sys_newfstat+0xde/0xf0
[464.904] ? __pfx___do_sys_newfstat+0x10/0x10
[464.911] ? trace_irq_enable.constprop.0+0xce/0x110
[464.918] ? syscall_exit_to_user_mode+0xac/0x2a0
[464.925] ? do_syscall_64+0xa1/0x180
[464.931] ? trace_irq_enable.constprop.0+0xce/0x110
[464.939] ? trace_irq_enable.constprop.0+0xce/0x110
[464.946] ? syscall_exit_to_user_mode+0xac/0x2a0
[464.953] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
[464.960] ? do_syscall_64+0xa1/0x180
[464.966] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
[464.973] ? trace_irq_enable.constprop.0+0xce/0x110
[464.980] ? syscall_exit_to_user_mode+0xac/0x2a0
[464.987] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
[464.995] ? trace_irq_enable.constprop.0+0xce/0x110
[465.002] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
[465.010] ? do_syscall_64+0xa1/0x180
[465.016] ? lock_release+0x567/0x790
[465.022] ? __pfx_lock_acquire+0x10/0x10
[465.028] ? __pfx_lock_release+0x10/0x10
[465.034] ? trace_irq_enable.constprop.0+0xce/0x110
[465.042] ? syscall_exit_to_user_mode+0xac/0x2a0
[465.049] ? do_syscall_64+0xa1/0x180
[465.055] ? syscall_exit_to_user_mode+0xac/0x2a0
[465.062] ? do_syscall_64+0xa1/0x180
[465.068] ? syscall_exit_to_user_mode+0xac/0x2a0
[465.075] ? do_syscall_64+0xa1/0x180
[465.081] ? clear_bhb_loop+0x25/0x80
[465.087] ? clear_bhb_loop+0x25/0x80
[465.093] ? clear_bhb_loop+0x25/0x80
[465.099] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[465.106] RIP: 0033:0x7f093b8ee784
[465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784
[465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003
[465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000
[465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000
[465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001
[465.170] </TASK>
[465.174] INFO: lockdep is turned off.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 97713b1a2ced ("btrfs: do not clear page dirty inside extent_write_locked_range()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we a find that an extent is shared but its end offset is not sector
size aligned, then we don't clone it and issue write operations instead.
This is because the reflink (remap_file_range) operation does not allow
to clone unaligned ranges, except if the end offset of the range matches
the i_size of the source and destination files (and the start offset is
sector size aligned).
While this is not incorrect because send can only guarantee that a file
has the same data in the source and destination snapshots, it's not
optimal and generates confusion and surprising behaviour for users.
For example, running this test:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Use a file size not aligned to any possible sector size.
file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
dd if=/dev/random of=$MNT/foo bs=$file_size count=1
cp --reflink=always $MNT/foo $MNT/bar
btrfs subvolume snapshot -r $MNT/ $MNT/snap
rm -f /tmp/send-test
btrfs send -f /tmp/send-test $MNT/snap
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -vv -f /tmp/send-test $MNT
xfs_io -r -c "fiemap -v" $MNT/snap/bar
umount $MNT
Gives the following result:
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
write bar - offset=0 length=49152
write bar - offset=49152 length=49152
write bar - offset=98304 length=49152
write bar - offset=147456 length=49152
write bar - offset=196608 length=49152
write bar - offset=245760 length=49152
write bar - offset=294912 length=49152
write bar - offset=344064 length=49152
write bar - offset=393216 length=49152
write bar - offset=442368 length=49152
write bar - offset=491520 length=49152
write bar - offset=540672 length=49152
write bar - offset=589824 length=49152
write bar - offset=638976 length=49152
write bar - offset=688128 length=49152
write bar - offset=737280 length=49152
write bar - offset=786432 length=49152
write bar - offset=835584 length=49152
write bar - offset=884736 length=49152
write bar - offset=933888 length=49152
write bar - offset=983040 length=49152
write bar - offset=1032192 length=16389
chown bar - uid=0, gid=0
chmod bar - mode=0644
utimes bar
utimes
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2055]: 26624..28679 2056 0x1
There's no clone operation to clone extents from the file foo into file
bar and fiemap confirms there's no shared flag (0x2000).
So update send_write_or_clone() so that it proceeds with cloning if the
source and destination ranges end at the i_size of the respective files.
After this changes the result of the test is:
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
clone bar - source=foo source offset=0 offset=0 length=1048581
chown bar - uid=0, gid=0
chmod bar - mode=0644
utimes bar
utimes
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2055]: 26624..28679 2056 0x2001
A test case for fstests will also follow up soon.
Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the extent map shrinker can be run by any task when attempting
to allocate memory and there's enough memory pressure to trigger it.
To avoid too much latency we stop iterating over extent maps and removing
them once the task needs to reschedule. This logic was introduced in commit
b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed").
While that solved high latency problems for some use cases, it's still
not enough because with a too high number of tasks entering the extent map
shrinker code, either due to memory allocations or because they are a
kswapd task, we end up having a very high level of contention on some
spin locks, namely:
1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
roots to iterate over their inodes;
2) The spin lock of the xarray used to track open inodes for a root
(struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
be a red black tree and the spin lock was root->inode_lock;
3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
delayed iputs (calls btrfs_add_delayed_iput()).
Instead of allowing the extent map shrinker to be run by any task, make
it run only by kswapd tasks. This still solves the problem of running
into OOM situations due to an unbounded extent map creation, which is
simple to trigger by direct IO writes, as described in the changelog
of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and
by a similar case when doing buffered IO on files with a very large
number of holes (keeping the file open and creating many holes, whose
extent maps are only released when the file is closed).
Reported-by: kzd <kzd@56709.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
Reported-by: Octavia Togami <octavia.togami@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Tested-by: kzd <kzd@56709.net>
Tested-by: Octavia Togami <octavia.togami@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[REPORT]
There is a bug report that kernel is rejecting a mismatching inode mode
and its dir item:
[ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with
dir: inode mode=040700 btrfs type=2 dir type=0
[CAUSE]
It looks like the inode mode is correct, while the dir item type
0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all.
This may be caused by a memory bit flip.
[ENHANCEMENT]
Although tree-checker is not able to do any cross-leaf verification, for
this particular case we can at least reject any dir type with
BTRFS_FT_UNKNOWN.
So here we enhance the dir type check from [0, BTRFS_FT_MAX), to
(0, BTRFS_FT_MAX).
Although the existing corruption can not be fixed just by such enhanced
checking, it should prevent the same 0x2->0x0 bitflip for dir type to
reach disk in the future.
Reported-by: Kota <nospam@kota.moe>
Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the patch 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete
resume") I added some code to handle file systems that had been
corrupted by a bug that incorrectly skipped updating the drop progress
key while dropping a snapshot. This code would check to see if we had
already deleted our reference for a child block, and skip the deletion
if we had already.
Unfortunately there is a bug, as the check would only check the on-disk
references. I made an incorrect assumption that blocks in an already
deleted snapshot that was having the deletion resume on mount wouldn't
be modified.
If we have 2 pending deleted snapshots that share blocks, we can easily
modify the rules for a block. Take the following example
subvolume a exists, and subvolume b is a snapshot of subvolume a. They
share references to block 1. Block 1 will have 2 full references, one
for subvolume a and one for subvolume b, and it belongs to subvolume a
(btrfs_header_owner(block 1) == subvolume a).
When deleting subvolume a, we will drop our full reference for block 1,
and because we are the owner we will drop our full reference for all of
block 1's children, convert block 1 to FULL BACKREF, and add a shared
reference to all of block 1's children.
Then we will start the snapshot deletion of subvolume b. We look up the
extent info for block 1, which checks delayed refs and tells us that
FULL BACKREF is set, so sets parent to the bytenr of block 1. However
because this is a resumed snapshot deletion, we call into
check_ref_exists(). Because check_ref_exists() only looks at the disk,
it doesn't find the shared backref for the child of block 1, and thus
returns 0 and we skip deleting the reference for the child of block 1
and continue. This orphans the child of block 1.
The fix is to lookup the delayed refs, similar to what we do in
btrfs_lookup_extent_info(). However we only care about whether the
reference exists or not. If we fail to find our reference on disk, go
look up the bytenr in the delayed refs, and if it exists look for an
existing ref in the delayed ref head. If that exists then we know we
can delete the reference safely and carry on. If it doesn't exist we
know we have to skip over this block.
This bug has existed since I introduced this fix, however requires
having multiple deleted snapshots pending when we unmount. We noticed
this in production because our shutdown path stops the container on the
system, which deletes a bunch of subvolumes, and then reboots the box.
This gives us plenty of opportunities to hit this issue. Looking at the
history we've seen this occasionally in production, but we had a big
spike recently thanks to faster machines getting jobs with multiple
subvolumes in the job.
Chris Mason wrote a reproducer which does the following
mount /dev/nvme4n1 /btrfs
btrfs subvol create /btrfs/s1
simoop -E -f 4k -n 200000 -z /btrfs/s1
while(true) ; do
btrfs subvol snap /btrfs/s1 /btrfs/s2
simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
btrfs subvol snap /btrfs/s2 /btrfs/s3
btrfs balance start -dusage=80 /btrfs
btrfs subvol del /btrfs/s2 /btrfs/s3
umount /btrfs
btrfsck /dev/nvme4n1 || exit 1
mount /dev/nvme4n1 /btrfs
done
On the second loop this would fail consistently, with my patch it has
been running for hours and hasn't failed.
I also used dm-log-writes to capture the state of the failure so I could
debug the problem. Using the existing failure case to test my patch
validated that it fixes the problem.
Fixes: 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete resume")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix double inode unlock for direct IO sync writes (reported by
syzbot)
- fix root tree id/name map definitions, don't use fixed size buffers
for name (reported by -Werror=unterminated-string-initialization)
- fix qgroup reserve leaks in bufferd write path
- update scrub status structure more often so it can be reported in
user space more accurately and let 'resume' not repeat work
- in preparation to remove space cache v1 in the future print a warning
if it's detected
* tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid using fixed char array size for tree names
btrfs: fix double inode unlock for direct IO sync writes
btrfs: emit a warning about space cache v1 being deprecated
btrfs: fix qgroup reserve leaks in cow_file_range
btrfs: implement launder_folio for clearing dirty page reserve
btrfs: scrub: update last_physical after scrubbing one stripe
btrfs: factor out stripe length calculation into a helper
|
|
[BUG]
There is a bug report that using the latest trunk GCC 15, btrfs would cause
unterminated-string-initialization warning:
linux-6.6/fs/btrfs/print-tree.c:29:49: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization]
29 | { BTRFS_BLOCK_GROUP_TREE_OBJECTID, "BLOCK_GROUP_TREE" },
|
^~~~~~~~~~~~~~~~~~
[CAUSE]
To print tree names we have an array of root_name_map structure, which
uses "char name[16];" to store the name string of a tree.
But the following trees have names exactly at 16 chars length:
- "BLOCK_GROUP_TREE"
- "RAID_STRIPE_TREE"
This means we will have no space for the terminating '\0', and can lead
to unexpected access when printing the name.
[FIX]
Instead of "char name[16];" use "const char *" instead.
Since the name strings are all read-only data, and are all NULL
terminated by default, there is not much need to bother the length at
all.
Reported-by: Sam James <sam@gentoo.org>
Reported-by: Alejandro Colomar <alx@kernel.org>
Fixes: edde81f1abf29 ("btrfs: add raid stripe tree pretty printer")
Fixes: 9c54e80ddc6bd ("btrfs: add code to support the block group root")
CC: stable@vger.kernel.org # 6.1+
Suggested-by: Alejandro Colomar <alx@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Alejandro Colomar <alx@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode logging or we get an error starting a transaction or an error when
flushing delalloc, we end up unlocking the inode when we shouldn't under
the 'out_release_extents' label, and then unlock it again at
btrfs_direct_write().
Fix that by checking if we have to skip inode unlocking under that label.
Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We've been wanting to get rid of this for a while, add a message to
indicate that this feature is going away and when so we can finally have
a date when we're going to remove it. The output looks like this
BTRFS warning (device nvme0n1): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the buffered write path, the dirty page owns the qgroup reserve until
it creates an ordered_extent.
Therefore, any errors that occur before the ordered_extent is created
must free that reservation, or else the space is leaked. The fstest
generic/475 exercises various IO error paths, and is able to trigger
errors in cow_file_range where we fail to get to allocating the ordered
extent. Note that because we *do* clear delalloc, we are likely to
remove the inode from the delalloc list, so the inodes/pages to not have
invalidate/launder called on them in the commit abort path.
This results in failures at the unmount stage of the test that look like:
BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
------------[ cut here ]------------
WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W 6.10.0-rc7-gab56fde445b8 #21
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
Call Trace:
<TASK>
? close_ctree+0x222/0x4d0 [btrfs]
? __warn.cold+0x8e/0xea
? close_ctree+0x222/0x4d0 [btrfs]
? report_bug+0xff/0x140
? handle_bug+0x3b/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? close_ctree+0x222/0x4d0 [btrfs]
generic_shutdown_super+0x70/0x160
kill_anon_super+0x11/0x40
btrfs_kill_super+0x11/0x20 [btrfs]
deactivate_locked_super+0x2e/0xa0
cleanup_mnt+0xb5/0x150
task_work_run+0x57/0x80
syscall_exit_to_user_mode+0x121/0x130
do_syscall_64+0xab/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f916847a887
---[ end trace 0000000000000000 ]---
BTRFS error (device dm-8 state EA): qgroup reserved space leaked
Cases 2 and 3 in the out_reserve path both pertain to this type of leak
and must free the reserved qgroup data. Because it is already an error
path, I opted not to handle the possible errors in
btrfs_free_qgroup_data.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the buffered write path, dirty pages can be said to "own" the qgroup
reservation until they create an ordered_extent. It is possible for
there to be outstanding dirty pages when a transaction is aborted, in
which case there is no cancellation path for freeing this reservation
and it is leaked.
We do already walk the list of outstanding delalloc inodes in
btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.
This does *not* call btrfs_invalidate_folio(), as one might guess, but
rather calls launder_folio() and release_folio(). Since this is a
reservation associated with dirty pages only, rather than something
associated with the private bit (ordered_extent is cancelled separately
already in the cleanup transaction path), implementing this release
should be done via launder_folio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently sctx->stat.last_physical only got updated in the following
cases:
- When the last stripe of a non-RAID56 chunk is scrubbed
This implies a pitfall, if the last stripe is at the chunk boundary,
and we finished the scrub of the whole chunk, we won't update
last_physical at all until the next chunk.
- When a P/Q stripe of a RAID56 chunk is scrubbed
This leads the following two problems:
- sctx->stat.last_physical is not updated for a almost full chunk
This is especially bad, affecting scrub resume, as the resume would
start from last_physical, causing unnecessary re-scrub.
- "btrfs scrub status" will not report any progress for a long time
Fix the problem by properly updating @last_physical after each stripe is
scrubbed.
And since we're here, for the sake of consistency, use spin lock to
protect the update of @last_physical, just like all the remaining
call sites touching sctx->stat.
Reported-by: Michel Palleau <michel.palleau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMFk-+igFTv2E8svg=cQ6o3e6CrR5QwgQ3Ok9EyRaEvvthpqCQ@mail.gmail.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently there are two locations which need to calculate the real
length of a stripe (which can be at the end of a chunk, and the chunk
size may not always be 64K aligned).
Factor them into a helper as we're going to have a third user soon.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix regression in extent map rework when handling insertion of
overlapping compressed extent
- fix unexpected file length when appending to a file using direct io
and buffer not faulted in
- in zoned mode, fix accounting of unusable space when flipping
read-only block group back to read-write
- fix page locking when COWing an inline range, assertion failure found
by syzbot
- fix calculation of space info in debugging print
- tree-checker, add validation of data reference item
- fix a few -Wmaybe-uninitialized build warnings
* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
btrfs: fix corruption after buffer fault in during direct IO append write
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
btrfs: do not subtract delalloc from avail bytes
btrfs: make cow_file_range_inline() honor locked_page on error
btrfs: fix corrupt read due to bad offset of a compressed extent map
btrfs: tree-checker: validate dref root and objectid
|
|
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5
m68k-gcc8/m68k-allmodconfig
mips-gcc8/mips-allmodconfig
powerpc-gcc5/powerpc-all{mod,yes}config
powerpc-gcc5/ppc64_defconfig
Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.
The problem happens like this:
1) We have an empty file, with size 0, for example;
2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
buffer is not currently faulted in;
3) We enter btrfs_direct_write(), lock the inode and call
generic_write_checks(), which calls generic_write_checks_count(), and
that function sets the iocb position to 0 with the following code:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
4) We call btrfs_dio_write() and enter into iomap, which will end up
calling btrfs_dio_iomap_begin() and that calls
btrfs_get_blocks_direct_write(), where we update the i_size of the
inode to 4096 bytes;
5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
the page of the write input buffer (at iomap_dio_bio_iter(), with a
call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
fault in the write buffer and then goto to the label 'relock';
7) We lock again the inode, do all the necessary checks again and call
again generic_write_checks(), which calls generic_write_checks_count()
again, and there we set the iocb's position to 4K, which is the current
i_size of the inode, with the following code pointed above:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
8) Then we go again to btrfs_dio_write() and enter iomap and the write
succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
the [0, 4K) range and an i_size of 8K, which goes against the
expectations of having the data written to the range [0, 4K) and get an
i_size of 4K.
Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit
51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").
A sample reproducer provided by a reporter is the following:
$ cat test.c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
if (argc < 2) {
fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
return 1;
}
int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
O_APPEND, 0644);
if (fd < 0) {
perror("creating test file");
return 1;
}
char *buf = mmap(NULL, 4096, PROT_READ,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ssize_t ret = write(fd, buf, 4096);
if (ret < 0) {
perror("pwritev2");
return 1;
}
struct stat stbuf;
ret = fstat(fd, &stbuf);
if (ret < 0) {
perror("stat");
return 1;
}
printf("size: %llu\n", (unsigned long long)stbuf.st_size);
return stbuf.st_size == 4096 ? 0 : 1;
}
A test case for fstests will be sent soon.
Reported-by: Hanna Czenczek <hreitz@redhat.com>
Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Hanna Czenczek <hreitz@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
again
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.
Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.
This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.
Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The block group's avail bytes printed when dumping a space info subtract
the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
btrfs_free_reserved_bytes(), it is added or subtracted along with
"reserved" for the delalloc case, which means the "delalloc_bytes" is a
part of the "reserved" bytes. So, excluding it to calculate the avail space
counts delalloc_bytes twice, which can lead to an invalid result.
Fixes: e50b122b832b ("btrfs: print available space for a block group when |