Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 082cd4ec240b8734a82a89ffb890216ac98fec68 upstream.
We got follow bug_on when run fsstress with injecting IO fault:
[130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
[130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
......
[130747.334329] Call trace:
[130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4]
[130747.334975] ext4_cache_extents+0x64/0xe8 [ext4]
[130747.335368] ext4_find_extent+0x300/0x330 [ext4]
[130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4]
[130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4]
[130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
[130747.336995] ext4_readpage+0x54/0x100 [ext4]
[130747.337359] generic_file_buffered_read+0x410/0xae8
[130747.337767] generic_file_read_iter+0x114/0x190
[130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4]
[130747.338556] __vfs_read+0x11c/0x188
[130747.338851] vfs_read+0x94/0x150
[130747.339110] ksys_read+0x74/0xf0
This patch's modification is according to Jan Kara's suggestion in:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/
"I see. Now I understand your patch. Honestly, seeing how fragile is trying
to fix extent tree after split has failed in the middle, I would probably
go even further and make sure we fix the tree properly in case of ENOSPC
and EDQUOT (those are easily user triggerable). Anything else indicates a
HW problem or fs corruption so I'd rather leave the extent tree as is and
don't try to fix it (which also means we will not create overlapping
extents)."
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit afd09b617db3786b6ef3dc43e28fe728cfea84df upstream.
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). And then super block will be reread
again but using correct blocksize this time. sb_set_blocksize() didn't
fully free superblock page and buffer head, and being busy, they were
not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and
the bh leak get amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.
Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 20265d9a67e40eafd39a8884658ca2e36f05985d upstream.
Before this patch, in the unlikely event that gfs2_glock_dq encountered
a withdraw, it would do a wait_on_bit to wait for its journal to be
recovered, but it never released the glock's spin_lock, which caused a
scheduling-while-atomic error.
This patch unlocks the lockref spin_lock before waiting for recovery.
Fixes: 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8c3f9cd1603d0e4af6c50ebc6d974ab7bdd03cf4 ]
__io_cqring_fill_event() takes cflags as long to squeeze it into u32 in
an CQE, awhile all users pass int or unsigned. Replace it with unsigned
int and store it as u32 in struct io_completion to match CQE.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a298232ee6b9a1d5d732aa497ff8be0d45b5bd82 ]
WARNING: CPU: 0 PID: 10242 at lib/refcount.c:28 refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
io_put_req fs/io_uring.c:2140 [inline]
io_queue_linked_timeout fs/io_uring.c:6300 [inline]
__io_queue_sqe+0xbef/0xec0 fs/io_uring.c:6354
io_submit_sqe fs/io_uring.c:6534 [inline]
io_submit_sqes+0x2bbd/0x7c50 fs/io_uring.c:6660
__do_sys_io_uring_enter fs/io_uring.c:9240 [inline]
__se_sys_io_uring_enter+0x256/0x1d60 fs/io_uring.c:9182
io_link_timeout_fn() should put only one reference of the linked timeout
request, however in case of racing with the master request's completion
first io_req_complete() puts one and then io_put_req_deferred() is
called.
Cc: stable@vger.kernel.org # 5.12+
Fixes: 9ae1f8dd372e0 ("io_uring: fix inconsistent lock state")
Reported-by: syzbot+a2910119328ce8e7996f@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff51018ff29de5ffa76f09273ef48cb24c720368.1620417627.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 1119a72e223f3073a604f8fccb3a470ccd8a4416 upstream.
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system. Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like
key.objectid = <bytenr>
key.type = <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset = hash(tree, owner, offset)
However if key.offset collide with an unrelated extent reference we'll
simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree checker time, and thus we error
while writing out the transaction. This is relatively easy to
reproduce, simply do something like the following
xfs_io -f -c "pwrite 0 1M" file
offset=2
for i in {0..10000}
do
xfs_io -c "reflink file 0 ${offset}M 1M" file
offset=$(( offset + 2 ))
done
xfs_io -c "reflink file 0 17999258914816 1M" file
xfs_io -c "reflink file 0 35998517829632 1M" file
xfs_io -c "reflink file 0 53752752058368 1M" file
btrfs filesystem sync
And the sync will error out because we'll abort the transaction. The
magic values above are used because they generate hash collisions with
the first file in the main subvol.
The fix for this is to remove the hash value check from tree checker, as
we have no idea which offset ours should belong to.
Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bc6a385132601c29a6da1dbf8148c0d3c9ad36dc ]
When BLKRRPART is called concurrently with del_gendisk, the partitions
rescan can create a stale partition that will never be be cleaned up.
Fix this by checking the the disk is up before rescanning partitions
while under bd_mutex.
Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c0d46717b95735b0eacfddbcca9df37a49de9c7a ]
See MS-SMB2 3.2.4.1.4, file ids in compounded requests should be set to
0xFFFFFFFFFFFFFFFF (we were treating it as u32 not u64 and setting
it incorrectly).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 91df99a6eb50d5a1bc70fff4a09a0b7ae6aab96d ]
While doing error injection testing I got the following panic
kernel BUG at fs/btrfs/tree-log.c:1862!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 PID: 7836 Comm: mount Not tainted 5.13.0-rc1+ #305
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:link_to_fixup_dir+0xd5/0xe0
RSP: 0018:ffffb5800180fa30 EFLAGS: 00010216
RAX: fffffffffffffffb RBX: 00000000fffffffb RCX: ffff8f595287faf0
RDX: ffffb5800180fa37 RSI: ffff8f5954978800 RDI: 0000000000000000
RBP: ffff8f5953af9450 R08: 0000000000000019 R09: 0000000000000001
R10: 000151f408682970 R11: 0000000120021001 R12: ffff8f5954978800
R13: ffff8f595287faf0 R14: ffff8f5953c77dd0 R15: 0000000000000065
FS: 00007fc5284c8c40(0000) GS:ffff8f59bbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc5287f47c0 CR3: 000000011275e002 CR4: 0000000000370ee0
Call Trace:
replay_one_buffer+0x409/0x470
? btree_read_extent_buffer_pages+0xd0/0x110
walk_up_log_tree+0x157/0x1e0
walk_log_tree+0xa6/0x1d0
btrfs_recover_log_trees+0x1da/0x360
? replay_one_extent+0x7b0/0x7b0
open_ctree+0x1486/0x1720
btrfs_mount_root.cold+0x12/0xea
? __kmalloc_track_caller+0x12f/0x240
legacy_get_tree+0x24/0x40
vfs_get_tree+0x22/0xb0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
? vfs_parse_fs_string+0x4d/0x90
legacy_get_tree+0x24/0x40
vfs_get_tree+0x22/0xb0
path_mount+0x433/0xa10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
We can get -EIO or any number of legitimate errors from
btrfs_search_slot(), panicing here is not the appropriate response. The
error path for this code handles errors properly, simply return the
error.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6416954ca75baed71640bf3828625bf165fb9b5e ]
When cloning an inline extent there are a few cases, such as when we have
an implicit hole at file offset 0, where we start a transaction while
holding a read lock on a leaf. Starting the transaction results in a call
to sb_start_intwrite(), which results in doing a read lock on a percpu
semaphore. Lockdep doesn't like this and complains about it:
[46.580704] ======================================================
[46.580752] WARNING: possible circular locking dependency detected
[46.580799] 5.13.0-rc1 #28 Not tainted
[46.580832] ------------------------------------------------------
[46.580877] cloner/3835 is trying to acquire lock:
[46.580918] c00000001301d638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
[46.581167]
[46.581167] but task is already holding lock:
[46.581217] c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
[46.581293]
[46.581293] which lock already depends on the new lock.
[46.581293]
[46.581351]
[46.581351] the existing dependency chain (in reverse order) is:
[46.581410]
[46.581410] -> #1 (btrfs-tree-00){++++}-{3:3}:
[46.581464] down_read_nested+0x68/0x200
[46.581536] __btrfs_tree_read_lock+0x70/0x1d0
[46.581577] btrfs_read_lock_root_node+0x88/0x200
[46.581623] btrfs_search_slot+0x298/0xb70
[46.581665] btrfs_set_inode_index+0xfc/0x260
[46.581708] btrfs_new_inode+0x26c/0x950
[46.581749] btrfs_create+0xf4/0x2b0
[46.581782] lookup_open.isra.57+0x55c/0x6a0
[46.581855] path_openat+0x418/0xd20
[46.581888] do_filp_open+0x9c/0x130
[46.581920] do_sys_openat2+0x2ec/0x430
[46.581961] do_sys_open+0x90/0xc0
[46.581993] system_call_exception+0x3d4/0x410
[46.582037] system_call_common+0xec/0x278
[46.582078]
[46.582078] -> #0 (sb_internal#2){.+.+}-{0:0}:
[46.582135] __lock_acquire+0x1e90/0x2c50
[46.582176] lock_acquire+0x2b4/0x5b0
[46.582263] start_transaction+0x3cc/0x950
[46.582308] clone_copy_inline_extent+0xe4/0x5a0
[46.582353] btrfs_clone+0x5fc/0x880
[46.582388] btrfs_clone_files+0xd8/0x1c0
[46.582434] btrfs_remap_file_range+0x3d8/0x590
[46.582481] do_clone_file_range+0x10c/0x270
[46.582558] vfs_clone_file_range+0x1b0/0x310
[46.582605] ioctl_file_clone+0x90/0x130
[46.582651] do_vfs_ioctl+0x874/0x1ac0
[46.582697] sys_ioctl+0x6c/0x120
[46.582733] system_call_exception+0x3d4/0x410
[46.582777] system_call_common+0xec/0x278
[46.582822]
[46.582822] other info that might help us debug this:
[46.582822]
[46.582888] Possible unsafe locking scenario:
[46.582888]
[46.582942] CPU0 CPU1
[46.582984] ---- ----
[46.583028] lock(btrfs-tree-00);
[46.583062] lock(sb_internal#2);
[46.583119] lock(btrfs-tree-00);
[46.583174] lock(sb_internal#2);
[46.583212]
[46.583212] *** DEADLOCK ***
[46.583212]
[46.583266] 6 locks held by cloner/3835:
[46.583299] #0: c00000001301d448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
[46.583382] #1: c00000000f6d3768 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
[46.583477] #2: c00000000f6d72a8 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
[46.583574] #3: c00000000f6d7138 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
[46.583657] #4: c00000000f6d35f8 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
[46.583743] #5: c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
[46.583828]
[46.583828] stack backtrace:
[46.583872] CPU: 1 PID: 3835 Comm: cloner Not tainted 5.13.0-rc1 #28
[46.583931] Call Trace:
[46.583955] [c0000000167c7200] [c000000000c1ee78] dump_stack+0xec/0x144 (unreliable)
[46.584052] [c0000000167c7240] [c000000000274058] print_circular_bug.isra.32+0x3a8/0x400
[46.584123] [c0000000167c72e0] [c0000000002741f4] check_noncircular+0x144/0x190
[46.584191] [c0000000167c73b0] [c000000000278fc0] __lock_acquire+0x1e90/0x2c50
[46.584259] [c0000000167c74f0] [c00000000027aa94] lock_acquire+0x2b4/0x5b0
[46.584317] [c0000000167c75e0] [c000000000a0d6cc] start_transaction+0x3cc/0x950
[46.584388] [c0000000167c7690] [c000000000af47a4] clone_copy_inline_extent+0xe4/0x5a0
[46.584457] [c0000000167c77c0] [c000000000af525c] btrfs_clone+0x5fc/0x880
[46.584514] [c0000000167c7990] [c000000000af5698] btrfs_clone_files+0xd8/0x1c0
[46.584583] [c0000000167c7a00] [c000000000af5b58] btrfs_remap_file_range+0x3d8/0x590
[46.584652] [c0000000167c7ae0] [c0000000005d81dc] do_clone_file_range+0x10c/0x270
[46.584722] [c0000000167c7b40] [c0000000005d84f0] vfs_clone_file_range+0x1b0/0x310
[46.584793] [c0000000167c7bb0] [c00000000058bf80] ioctl_file_clone+0x90/0x130
[46.584861] [c0000000167c7c10] [c00000000058c894] do_vfs_ioctl+0x874/0x1ac0
[46.584922] [c0000000167c7d10] [c00000000058db4c] sys_ioctl+0x6c/0x120
[46.584978] [c0000000167c7d60] [c0000000000364a4] system_call_exception+0x3d4/0x410
[46.585046] [c0000000167c7e10] [c00000000000d45c] system_call_common+0xec/0x278
[46.585114] --- interrupt: c00 at 0x7ffff7e22990
[46.585160] NIP: 00007ffff7e22990 LR: 00000001000010ec CTR: 0000000000000000
[46.585224] REGS: c0000000167c7e80 TRAP: 0c00 Not tainted (5.13.0-rc1)
[46.585280] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28000244 XER: 00000000
[46.585374] IRQMASK: 0
[46.585374] GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f17100 0000000000000004
[46.585374] GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
[46.585374] GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
[46.585374] GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
[46.585374] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[46.585374] GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
[46.585374] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
[46.585374] GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
[46.585919] NIP [00007ffff7e22990] 0x7ffff7e22990
[46.585964] LR [00000001000010ec] 0x1000010ec
[46.586010] --- interrupt: c00
This should be a false positive, as both locks are acquired in read mode.
Nevertheless, we don't need to hold a leaf locked when we start the
transaction, so just release the leaf (path) before starting it.
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/linux-btrfs/20210513214404.xks77p566fglzgum@riteshh-domain/
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 15c7745c9a0078edad1f7df5a6bb7b80bc8cca23 ]
`xfs_io -c 'fiemap <off> <len>' <file>`
can give surprising results on btrfs that differ from xfs.
btrfs prints out extents trimmed to fit the user input. If the user's
fiemap request has an offset, then rather than returning each whole
extent which intersects that range, we also trim the start extent to not
have start < off.
Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
that returning the whole extent is expected.
Some cases which all yield the same fiemap in xfs, but not btrfs:
dd if=/dev/zero of=$f bs=4k count=1
sudo xfs_io -c 'fiemap 0 1024' $f
0: [0..7]: 26624..26631
sudo xfs_io -c 'fiemap 2048 1024' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 2048 4096' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 3584 512' $f
0: [7..7]: 26631..26631
sudo xfs_io -c 'fiemap 4091 5' $f
0: [7..6]: 26631..26630
I believe this is a consequence of the logic for merging contiguous
extents represented by separate extent items. That logic needs to track
the last offset as it loops through the extent items, which happens to
pick up the start offset on the first iteration, and trim off the
beginning of the full extent. To fix it, start `off` at 0 rather than
`start` so that we keep the iteration/merging intact without cutting off
the start of the extent.
after the fix, all the above commands give:
0: [0..7]: 26624..26631
The merging logic is exercised by fstest generic/483, and I have written
a new fstest for checking we don't have backwards or zero-length fiemaps
for cases like those above.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f610a5a29c3cfb7d37bdfa4ef52f72ea51f24a76 upstream.
Fix rename of one directory over another such that the nlink on the deleted
directory is cleared to 0 rather than being decremented to 1.
This was causing the generic/035 xfstest to fail.
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e67afa7ee4a59584d7253e45d7f63b9528819a13 upstream.
Since commit bdcc2cd14e4e ("NFSv4.2: handle NFS-specific llseek errors"),
nfs42_proc_llseek would return -EOPNOTSUPP rather than -ENOTSUPP when
SEEK_DATA on NFSv4.0/v4.1.
This will lead xfstests generic/285 not run on NFSv4.0/v4.1 when set the
CONFIG_NFS_V4_2, rather than run failed.
Fixes: bdcc2cd14e4e ("NFSv4.2: handle NFS-specific llseek errors")
Cc: <stable.vger.kernel.org> # 4.2
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0d0ea309357dea0d85a82815f02157eb7fcda39f upstream.
The value of mirror->pg_bytes_written should only be updated after a
successful attempt to flush out the requests on the list.
Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 56517ab958b7c11030e626250c00b9b1a24b41eb upstream.
Ensure that nfs_pageio_error_cleanup() resets the mirror array contents,
so that the structure reflects the fact that it is now empty.
Also change the test in nfs_pageio_do_add_request() to be more robust by
checking whether or not the list is empty rather than relying on the
value of pg_count.
Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 769b01ea68b6c49dc3cde6adf7e53927dacbd3a8 upstream.
The "sizeof(struct nfs_fh)" is two bytes too large and could lead to
memory corruption. It should be NFS_MAXFHSIZE because that's the size
of the ->data[] buffer.
I reversed the size of the arguments to put the variable on the left.
Fixes: 16b374ca439f ("NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bb002388901151fe35b6697ab116f6ed0721a9ed upstream.
We set the state of the current process to TASK_KILLABLE via
prepare_to_wait(). Should we use fatal_signal_pending() to detect
the signal here?
Fixes: b4868b44c562 ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE")
Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bfb819ea20ce8bbeeba17e1a6418bf8bda91fc28 upstream.
Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
files need to check the opener credentials, since these fds do not
transition state across execve(). Without this, it is possible to
trick another process (which may have different credentials) to write
to its own /proc/$pid/attr/ files, leading to unexpected and possibly
exploitable behaviors.
[1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a421d218603ffa822a0b8045055c03eae394a7eb upstream.
Commit de144ff4234f changes _pnfs_return_layout() to call
pnfs_mark_matching_lsegs_return() passing NULL as the struct
pnfs_layout_range argument. Unfortunately,
pnfs_mark_matching_lsegs_return() doesn't check if we have a value here
before dereferencing it, causing an oops.
I'm able to hit this crash consistently when running connectathon basic
tests on NFS v4.1/v4.2 against Ontap.
Fixes: de144ff4234f ("NFSv4: Don't discard segments marked for return in _pnfs_return_layout()")
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6d2fcfe6b517fe7cbf2687adfb0a16cdcd5d9243 upstream.
SMB3.0 doesn't have encryption negotiate context but simply uses
the SMB2_GLOBAL_CAP_ENCRYPTION flag.
When that flag is present in the neg response cifs.ko uses AES-128-CCM
which is the only cipher available in this context.
cipher_type was set to the server cipher only when parsing encryption
negotiate context (SMB3.1.1).
For SMB3.0 it was set to 0. This means cipher_type value can be 0 or 1
for AES-128-CCM.
Fix this by checking for SMB3.0 and encryption capability and setting
cipher_type appropriately.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e1436df2f2550bc89d832ffd456373fdf5d5b5d7 upstream.
This reverts commit 2c2a7552dd6465e8fde6bc9cccf8d66ed1c1eb72.
Because of recent interactions with developers from @umn.edu, all
commits from them have been recently re-reviewed to ensure if they were
correct or not.
Upon review, this commit was found to be incorrect for the reasons
below, so it must be reverted. It will be fixed up "correctly" in a
later kernel change.
The original commit log for this change was incorrect, no "error
handling code" was added, things will blow up just as badly as before if
any of these cases ever were true. As this BUG_ON() never fired, and
most of these checks are "obviously" never going to be true, let's just
revert to the original code for now until this gets unwound to be done
correctly in the future.
Cc: Aditya Pakki <pakki001@umn.edu>
Fixes: 2c2a7552dd64 ("ecryptfs: replace BUG_ON with error handling code")
Cc: stable <stable@vger.kernel.org>
Acked-by: Tyler Hicks <code@tyhicks.com>
Link: https://lore.kernel.org/r/20210503115736.2104747-49-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d201d7631ca170b038e7f8921120d05eec70d7c5 upstream.
When using smb2_copychunk_range() for large ranges we will
run through several iterations of a loop calling SMB2_ioctl()
but never actually free the returned buffer except for the final
iteration.
This leads to memory leaks everytime a large copychunk is requested.
Fixes: 9bf0c9cd4314 ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files")
Cc: <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 71795ee590111e3636cc3c148289dfa9fa0a5fc3 upstream.
Generally a delayed iput is added when we might do the final iput, so
usually we'll end up sleeping while processing the delayed iputs
naturally. However there's no guarantee of this, especially for small
files. In production we noticed 5 instances of RCU stalls while testing
a kernel release overnight across 1000 machines, so this is relatively
common:
host count: 5
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208
(t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1
CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1
Call Trace:
<IRQ> dump_stack+0x50/0x70
nmi_cpu_backtrace.cold.6+0x30/0x65
? lapic_can_unplug_cpu.cold.30+0x40/0x40
nmi_trigger_cpumask_backtrace+0xba/0xca
rcu_dump_cpu_stacks+0x99/0xc7
rcu_sched_clock_irq.cold.90+0x1b2/0x3a3
? trigger_load_balance+0x5c/0x200
? tick_sched_do_timer+0x60/0x60
? tick_sched_do_timer+0x60/0x60
update_process_times+0x24/0x50
tick_sched_timer+0x37/0x70
__hrtimer_run_queues+0xfe/0x270
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x5e/0x120
apic_timer_interrupt+0xf/0x20 </IRQ>
RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0
RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029
RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200
RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200
R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870
? kzalloc.constprop.57+0x30/0x30
list_lru_add+0x5a/0x100
inode_lru_list_add+0x20/0x40
iput+0x1c1/0x1f0
run_delayed_iput_locked+0x46/0x90
btrfs_run_delayed_iputs+0x3f/0x60
cleaner_kthread+0xf2/0x120
kthread+0x10b/0x130
Fix this by adding a cond_resched_lock() to the loop processing delayed
iputs so we can avoid these sort of stalls.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit cf7b39a0cbf6bf57aa07a008d46cf695add05b4c ]
We get a bug:
BUG: KASAN: slab-out-of-bounds in iov_iter_revert+0x11c/0x404
lib/iov_iter.c:1139
Read of size 8 at addr ffff0000d3fb11f8 by task
CPU: 0 PID: 12582 Comm: syz-executor.2 Not tainted
5.10.0-00843-g352c8610ccd2 #2
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x110/0x164 lib/dump_stack.c:118
print_address_description+0x78/0x5c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report+0x148/0x1e4 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:183 [inline]
__asan_load8+0xb4/0xbc mm/kasan/generic.c:252
iov_iter_revert+0x11c/0x404 lib/iov_iter.c:1139
io_read fs/io_uring.c:3421 [inline]
io_issue_sqe+0x2344/0x2d64 fs/io_uring.c:5943
__io_queue_sqe+0x19c/0x520 fs/io_uring.c:6260
io_queue_sqe+0x2a4/0x590 fs/io_uring.c:6326
io_submit_sqe fs/io_uring.c:6395 [inline]
io_submit_sqes+0x4c0/0xa04 fs/io_uring.c:6624
__do_sys_io_uring_enter fs/io_uring.c:9013 [inline]
__se_sys_io_uring_enter fs/io_uring.c:8960 [inline]
__arm64_sys_io_uring_enter+0x190/0x708 fs/io_uring.c:8960
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:227
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
Allocated by task 12570:
stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
__kmalloc+0x23c/0x334 mm/slub.c:3970
kmalloc include/linux/slab.h:557 [inline]
__io_alloc_async_data+0x68/0x9c fs/io_uring.c:3210
io_setup_async_rw fs/io_uring.c:3229 [inline]
io_read fs/io_uring.c:3436 [inline]
io_issue_sqe+0x2954/0x2d64 fs/io_uring.c:5943
__io_queue_sqe+0x19c/0x520 fs/io_uring.c:6260
io_queue_sqe+0x2a4/0x590 fs/io_uring.c:6326
io_submit_sqe fs/io_uring.c:6395 [inline]
io_submit_sqes+0x4c0/0xa04 fs/io_uring.c:6624
__do_sys_io_uring_enter fs/io_uring.c:9013 [inline]
__se_sys_io_uring_enter fs/io_uring.c:8960 [inline]
__arm64_sys_io_uring_enter+0x190/0x708 fs/io_uring.c:8960
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:227
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
Freed by task 12570:
stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x38/0x6c mm/kasan/common.c:56
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
__kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
slab_free_hook mm/slub.c:1544 [inline]
slab_free_freelist_hook mm/slub.c:1577 [inline]
slab_free mm/slub.c:3142 [inline]
kfree+0x104/0x38c mm/slub.c:4124
io_dismantle_req fs/io_uring.c:1855 [inline]
__io_free_req+0x70/0x254 fs/io_uring.c:1867
io_put_req_find_next fs/io_uring.c:2173 [inline]
__io_queue_sqe+0x1fc/0x520 fs/io_uring.c:6279
__io_req_task_submit+0x154/0x21c fs/io_uring.c:2051
io_req_task_submit+0x2c/0x44 fs/io_uring.c:2063
task_work_run+0xdc/0x128 kernel/task_work.c:151
get_signal+0x6f8/0x980 kernel/signal.c:2562
do_signal+0x108/0x3a4 arch/arm64/kernel/signal.c:658
do_notify_resume+0xbc/0x25c arch/arm64/kernel/signal.c:722
work_pending+0xc/0x180
blkdev_read_iter can truncate iov_iter's count since the count + pos may
exceed the size of the blkdev. This will confuse io_read that we have
consume the iovec. And once we do the iov_iter_revert in io_read, we
will trigger the slab-out-of-bounds. Fix it by reexpand the count with
size has been truncated.
blkdev_write_iter can trigger the problem too.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Acked-by: Pavel Begunkov <asml.silencec@gmail.com>
Link: https://lore.kernel.org/r/20210401071807.3328235-1-yangerkun@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d4f6b31d721779d91b5e2f8072478af73b196c34 ]
The MDS reserves a set of inodes for its own usage, and these should
never be accessible to clients. Add a new helper to vet a proposed
inode number against that range, and complain loudly and refuse to
create or look it up if it's in it.
Also, ensure that the MDS doesn't try to delegate inodes that are in
that range or lower. Print a warning if it does, and don't save the
range in the xarray.
URL: https://tracker.ceph.com/issues/49922
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d3c51ae1b8cce5bdaf91a1ce32b33cf5626075dc ]
We want the snapdir to mirror the non-snapped directory's attributes for
most things, but i_snap_caps represents the caps granted on the snapshot
directory by the MDS itself. A misbehaving MDS could issue different
caps for the snapdir and we lose them here.
Only reset i_snap_caps when the inode is I_NEW. Also, move the setting
of i_op and i_fop inside the if block since they should never change
anyway.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 10a7052c7868bc7bc72d947f5aac6f768928db87 ]
Ensure that we invalidate the fscache whenever we invalidate the
pagecache.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 50c7a7994dd20af56e4d47e90af10bab71b71001 ]
When we're looking to revalidate the page cache, we should just ensure
that we mark the change attribute invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit fcdf3c34b7abdcbb49690c94c7fa6ce224dc9749 upstream.
Using no_printk() for jbd_debug() revealed two warnings:
fs/jbd2/recovery.c: In function 'fc_do_one_pass':
fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int' argument [-Werror=format=]
256 | jbd_debug(3, "Processing fast commit blk with seq %d");
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range':
fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type 'int', but argument 2 has type 'long unsigned int' [-Werror=format=]
1732 | jbd_debug(1, "Converting from %d to %d %lld",
The first one was added incorrectly, and was also missing a few newlines
in debug output, and the second one happened when the type of an
argument changed.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: d556435156b7 ("jbd2: avoid -Wempty-body warnings")
Fixes: 6db074618969 ("ext4: use BIT() macro for BH_** state bits")
Fixes: 5b849b5f96b4 ("jbd2: fast commit recovery path")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20210409201211.1866633-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 312723a0b34d6d110aa4427a982536bb36ab8471 upstream.
Since debugfs_allow is only set at boot time during __init, make it
read-only after being set.
Fixes: a24c6f7bc923 ("debugfs: Add access restriction option")
Cc: Peter Enderborg <peter.enderborg@sony.com>
Reviewed-by: Peter Enderborg <peter.enderborg@sony.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210405213959.3079432-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8bfbfb0ddd706b1ce2e89259ecc45f192c0ec2bf ]
In f2fs_destroy_compress_ctx(), after f2fs_destroy_compress_ctx(),
cc.cluster_idx will be cleared w/ NULL_CLUSTER, f2fs_cluster_blocks()
may check wrong cluster metadata, fix it.
Fixes: 4c8ff7095bef ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a949dc5f2c5cfe0c910b664650f45371254c0744 ]
pos_fsstress testcase complains a panic as belew:
------------[ cut here ]------------
kernel BUG at fs/f2fs/compress.c:1082!
invalid opcode: 0000 [#1] SMP PTI
CPU: 4 PID: 2753477 Comm: kworker/u16:2 Tainted: G OE 5.12.0-rc1-custom #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: writeback wb_workfn (flush-252:16)
RIP: 0010:prepare_compress_overwrite+0x4c0/0x760 [f2fs]
Call Trace:
f2fs_prepare_compress_overwrite+0x5f/0x80 [f2fs]
f2fs_write_cache_pages+0x468/0x8a0 [f2fs]
f2fs_write_data_pages+0x2a4/0x2f0 [f2fs]
do_writepages+0x38/0xc0
__writeback_single_inode+0x44/0x2a0
writeback_sb_inodes+0x223/0x4d0
__writeback_inodes_wb+0x56/0xf0
wb_writeback+0x1dd/0x290
wb_workfn+0x309/0x500
process_one_work+0x220/0x3c0
worker_thread+0x53/0x420
kthread+0x12f/0x150
ret_from_fork+0x22/0x30
The root cause is truncate() may race with overwrite as below,
so that one reference count left in page can not guarantee the
page attaching in mapping tree all the time, after truncation,
later find_lock_page() may return NULL pointer.
- prepare_compress_overwrite
- f2fs_pagecache_get_page
- unlock_page
- f2fs_setattr
- truncate_setsize
- truncate_inode_page
- delete_from_page_cache
- find_lock_page
Fix this by avoiding referencing updated page.
Fixes: 4c8ff7095bef ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a12cc5b423d4f36dc1a1ea3911e49cf9dff43898 ]
In error path of f2fs_write_compressed_pages(), it needs to call
f2fs_compress_free_page() to release temporary page.
Fixes: 5e6bbde95982 ("f2fs: introduce mempool for {,de}compress intermediate page allocation")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 237388320deffde7c2d65ed8fc9eef670dc979b3 ]
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
the problem consistently.
So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.
invalidate_inode_pages2()
invalidate_inode_pages2_range()
invalidate_exceptional_entry2()
dax_invalidate_mapping_entry_sync()
__dax_invalidate_entry() {
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, 0);
...
...
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
...
...
put_unlocked_entry(&xas, entry);
xas_unlock_irq(&xas);
}
Say a fault in in progress and it has locked entry at offset say "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate entry at offset "0x1c". Given
dax entry is locked, all tree instances A, B, C will wait in wait queue.
When dax fault finishes, say A is woken up. It will store NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up next waiter, given
the current code. And that means C continues to wait and is not woken
up.
This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.
Reported-by: Sergio Lopez <slp@redhat.com>
Fixes: ac401cc78242 ("dax: New fault locking")
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-4-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4c3d043d271d4d629aa2328796cdfc96b37d3b3c ]
As of now put_unlocked_entry() always wakes up next waiter. In next
patches we want to wake up all waiters at one callsite. Hence, add a
parameter to the function.
This patch does not introduce any change of behavior.
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-3-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 698ab77aebffe08b312fbcdddeb0e8bd08b78717 ]
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it very
explicity at the callsite itself. Easier to read code.
This patch should not introduce any change of behavior.
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210428190314.1865312-2-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|