linux.git/fs, branch v5.4.63

io_uring: Fix NULL pointer dereference in io_sq_wq_submit_work()

2020-09-03T09:27:11+00:00

the commit <1c4404efcf2c0> ("") caused a crash in io_sq_wq_submit_work().
when io_ring-wq get a req form async_list, which not have been
added to task_list. Then try to delete the req from task_list will caused
a "NULL pointer dereference".

Ensure add req to async_list and task_list at the sametime.

The crash log looks like this:
[95995.973638] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[95995.979123] pgd = c20c8964
[95995.981803] [00000000] *pgd=1c72d831, *pte=00000000, *ppte=00000000
[95995.988043] Internal error: Oops: 817 [#1] SMP ARM
[95995.992814] Modules linked in: bpfilter(-)
[95995.996898] CPU: 1 PID: 15661 Comm: kworker/u8:5 Not tainted 5.4.56 #2
[95996.003406] Hardware name: Amlogic Meson platform
[95996.008108] Workqueue: io_ring-wq io_sq_wq_submit_work
[95996.013224] PC is at io_sq_wq_submit_work+0x1f4/0x5c4
[95996.018261] LR is at walk_stackframe+0x24/0x40
[95996.022685] pc : []    lr : []    psr: 600f0093
[95996.028936] sp : dc6f7e88  ip : dc6f7df0  fp : dc6f7ef4
[95996.034148] r10: deff9800  r9 : dc1d1694  r8 : dda58b80
[95996.039358] r7 : dc6f6000  r6 : dc6f7ebc  r5 : dc1d1600  r4 : deff99c0
[95996.045871] r3 : 0000cb5d  r2 : 00000000  r1 : ef6b9b80  r0 : c059b88c
[95996.052385] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[95996.059593] Control: 10c5387d  Table: 22be804a  DAC: 00000055
[95996.065325] Process kworker/u8:5 (pid: 15661, stack limit = 0x78013c69)
[95996.071923] Stack: (0xdc6f7e88 to 0xdc6f8000)
[95996.076268] 7e80:                   dc6f7ecc dc6f7e98 00000000 c1f06c08 de9dc800 deff9a04
[95996.084431] 7ea0: 00000000 dc6f7f7c 00000000 c1f65808 0000080c dc677a00 c1ee9bd0 dc6f7ebc
[95996.092594] 7ec0: dc6f7ebc d085c8f6 c0445a90 dc1d1e00 e008f300 c0288400 e4ef7100 00000000
[95996.100757] 7ee0: c20d45b0 e4ef7115 dc6f7f34 dc6f7ef8 c03725f0 c059b6b0 c0288400 c0288400
[95996.108921] 7f00: c0288400 00000001 c0288418 e008f300 c0288400 e008f314 00000088 c0288418
[95996.117083] 7f20: c1f03d00 dc6f6038 dc6f7f7c dc6f7f38 c0372df8 c037246c dc6f7f5c 00000000
[95996.125245] 7f40: c1f03d00 c1f03d00 c20d3cbe c0288400 dc6f7f7c e1c43880 e4fa7980 00000000
[95996.133409] 7f60: e008f300 c0372d9c e48bbe74 e1c4389c dc6f7fac dc6f7f80 c0379244 c0372da8
[95996.141570] 7f80: 600f0093 e4fa7980 c0379108 00000000 00000000 00000000 00000000 00000000
[95996.149734] 7fa0: 00000000 dc6f7fb0 c03010ac c0379114 00000000 00000000 00000000 00000000
[95996.157897] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[95996.166060] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[95996.174217] Backtrace:
[95996.176662] [] (io_sq_wq_submit_work) from [] (process_one_work+0x190/0x4c0)
[95996.185425]  r10:e4ef7115 r9:c20d45b0 r8:00000000 r7:e4ef7100 r6:c0288400 r5:e008f300
[95996.193237]  r4:dc1d1e00
[95996.195760] [] (process_one_work) from [] (worker_thread+0x5c/0x5bc)
[95996.203836]  r10:dc6f6038 r9:c1f03d00 r8:c0288418 r7:00000088 r6:e008f314 r5:c0288400
[95996.211647]  r4:e008f300
[95996.214173] [] (worker_thread) from [] (kthread+0x13c/0x168)
[95996.221554]  r10:e1c4389c r9:e48bbe74 r8:c0372d9c r7:e008f300 r6:00000000 r5:e4fa7980
[95996.229363]  r4:e1c43880
[95996.231888] [] (kthread) from [] (ret_from_fork+0x14/0x28)
[95996.239088] Exception stack(0xdc6f7fb0 to 0xdc6f7ff8)
[95996.244127] 7fa0:                                     00000000 00000000 00000000 00000000
[95996.252291] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[95996.260453] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[95996.267054]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0379108
[95996.274866]  r4:e4fa7980 r3:600f0093
[95996.278430] Code: eb3a59e1 e5952098 e5951094 e5812004 (e5821000)

Signed-off-by: Xin Yin 
Signed-off-by: Greg Kroah-Hartman

writeback: Fix sync livelock due to b_dirty_time processing

2020-09-03T09:27:04+00:00

commit f9cae926f35e8230330f28c7b743ad088611a8de upstream.

When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's steady stream of inodes added to
b_dirty_time list as writeback rechecks dirty lists after each writeback
round whether there's more work to be done. Fix the problem by using
sync(2) start time is inode expiry value when processing b_dirty_time
list similarly as for ordinarily dirtied inodes. This requires some
refactoring of older_than_this handling which simplifies the code
noticeably as a bonus.

Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

writeback: Avoid skipping inode writeback

2020-09-03T09:27:04+00:00

commit 5afced3bf28100d81fb2fe7e98918632a08feaf5 upstream.

Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.

However there are two flaws in the current logic:

1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).

2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:

writeback_single_inode()		__mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
					  inode->i_state |= I_DIRTY_PAGES;
					  if (inode->i_state & I_SYNC)
					    bail
  if (!(inode->i_state & I_DIRTY_ALL))
  - not true so nothing done

We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).

Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.

Reported-by: Martijn Coenen 
Reviewed-by: Martijn Coenen 
Tested-by: Martijn Coenen 
Reviewed-by: Christoph Hellwig 
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

writeback: Protect inode->i_io_list with inode->i_lock

2020-09-03T09:27:04+00:00

commit b35250c0816c7cf7d0a8de92f5fafb6a7508a708 upstream.

Currently, operations on inode->i_io_list are protected by
wb->list_lock. In the following patches we'll need to maintain
consistency between inode->i_state and inode->i_io_list so change the
code so that inode->i_lock protects also all inode's i_io_list handling.

Reviewed-by: Martijn Coenen 
Reviewed-by: Christoph Hellwig 
CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback"
Signed-off-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

btrfs: detect nocow for swap after snapshot delete

2020-09-03T09:27:02+00:00

commit a84d5d429f9eb56f81b388609841ed993f0ddfca upstream.

can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
detecting a must cow condition which is not exactly accurate, but saves
unnecessary tree traversal. The incorrect assumption is that if the
extent was created in a generation smaller than the last snapshot
generation, it must be referenced by that snapshot. That is true, except
the snapshot could have since been deleted, without affecting the last
snapshot generation.

The original patch claimed a performance win from this check, but it
also leads to a bug where you are unable to use a swapfile if you ever
snapshotted the subvolume it's in. Make the check slower and more strict
for the swapon case, without modifying the general cow checks as a
compromise. Turning swap on does not seem to be a particularly
performance sensitive operation, so incurring a possibly unnecessary
btrfs_search_slot seems worthwhile for the added usability.

Note: Until the snapshot is competely cleaned after deletion,
check_committed_refs will still cause the logic to think that cow is
necessary, so the user must until 'btrfs subvolu sync' finished before
activating the swapfile swapon.

CC: stable@vger.kernel.org # 5.4+
Suggested-by: Omar Sandoval 
Signed-off-by: Boris Burkov 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: fix space cache memory leak after transaction abort

2020-09-03T09:27:02+00:00

commit bbc37d6e475eee8ffa2156ec813efc6bbb43c06d upstream.

If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:

1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
   and it's about to start writing out dirty extent buffers;

2) Transaction N + 1 already started and another task, task A, just called
   btrfs_commit_transaction() on it;

3) Block group B was dirtied (extents allocated from it) by transaction
   N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
   very beginning of the transaction commit, it starts writeback for the
   block group's space cache by calling btrfs_write_out_cache(), which
   allocates the pages array for the block group's io_ctl with a call to
   io_ctl_init(). Block group A is added to the io_list of transaction
   N + 1 by btrfs_start_dirty_block_groups();

4) While transaction N's commit is writing out the extent buffers, it gets
   an IO error and aborts transaction N, also setting the file system to
   RO mode;

5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
   btrfs_commit_transaction() and has set transaction N + 1 state to
   TRANS_STATE_COMMIT_START. Immediately after that it checks that the
   filesystem was turned to RO mode, due to transaction N's abort, and
   jumps to the "cleanup_transaction" label. After that we end up at
   btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
   That helper finds block group B in the transaction's io_list but it
   never releases the pages array of the block group's io_ctl, resulting in
   a memory leak.

In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().

We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.

So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().

This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:

unreferenced object 0xffff9bbf009fa600 (size 512):
  comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
  hex dump (first 32 bytes):
    00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
    80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
  backtrace:
    [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
    [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
    [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
    [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
    [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
    [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
    [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
    [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
    [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
    [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
    [<0000000043ace2c9>] do_syscall_64+0x33/0x80
    [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik 
Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: check the right error variable in btrfs_del_dir_entries_in_log

2020-09-03T09:27:02+00:00

commit fb2fecbad50964b9f27a3b182e74e437b40753ef upstream.

With my new locking code dbench is so much faster that I tripped over a
transaction abort from ENOSPC.  This turned out to be because
btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
function sets err on error, and returns err.  So instead of properly
marking the inode as needing a full commit, we were returning -ENOSPC
and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
variable so that we return the correct thing in the case of ENOSPC.

The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
can return -ENOENT if the dir item isn't in the tree log (which would
happen if we hadn't fsync'ed this guy).  We actually handle that case in
__btrfs_unlink_inode, so it's an expected error to get back.

Fixes: 4a500fd178c8 ("Btrfs: Metadata ENOSPC handling for tree log")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana 
Signed-off-by: Josef Bacik 
Reviewed-by: David Sterba 
[ add note and comment about ENOENT ]
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: reset compression level for lzo on remount

2020-09-03T09:27:02+00:00

commit 282dd7d7718444679b046b769d872b188818ca35 upstream.

Currently a user can set mount "-o compress" which will set the
compression algorithm to zlib, and use the default compress level for
zlib (3):

  relatime,compress=zlib:3,space_cache

If the user remounts the fs using "-o compress=lzo", then the old
compress_level is used:

  relatime,compress=lzo:3,space_cache

But lzo does not expose any tunable compression level. The same happens
if we set any compress argument with different level, also with zstd.

Fix this by resetting the compress_level when compress=lzo is
specified.  With the fix applied, lzo is shown without compress level:

  relatime,compress=lzo,space_cache

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Marcos Paulo de Souza 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

fs: prevent BUG_ON in submit_bh_wbc()

2020-09-03T09:26:57+00:00

[ Upstream commit 377254b2cd2252c7c3151b113cbdf93a7736c2e9 ]

If a device is hot-removed --- for example, when a physical device is
unplugged from pcie slot or a nbd device's network is shutdown ---
this can result in a BUG_ON() crash in submit_bh_wbc().  This is
because the when the block device dies, the buffer heads will have
their Buffer_Mapped flag get cleared, leading to the crash in
submit_bh_wbc.

We had attempted to work around this problem in commit a17712c8
("ext4: check superblock mapped prior to committing").  Unfortunately,
it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
device dies between when the work-around check in ext4_commit_super()
and when submit_bh_wbh() is finally called:

Code path:
ext4_commit_super
    judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8
          lock_buffer(sbh)
          ...
          unlock_buffer(sbh)
               __sync_dirty_buffer(sbh,...
                    lock_buffer(sbh)
                        judge if 'buffer_mapped(sbh))' is false, return <== added by this patch
                            submit_bh(...,sbh)
                                submit_bh_wbc(...,sbh,...)

[100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
[100722.966503] invalid opcode: 0000 [#1] SMP
[100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000
[100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
[100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246
[100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000
[100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001
[100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d
[100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800
[100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000
[100722.966580] FS:  00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000
[100722.966580] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0
[100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[100722.966583] PKRU: 55555554
[100722.966583] Call Trace:
[100722.966588]  __sync_dirty_buffer+0x6e/0xd0
[100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
[100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
[100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
[100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
[100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966663]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
[100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966682]  __mark_inode_dirty+0x17f/0x350
[100722.966686]  generic_update_time+0x87/0xd0
[100722.966687]  touch_atime+0xa9/0xd0
[100722.966690]  generic_file_read_iter+0xa09/0xcd0
[100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
[100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
[100722.966707]  ? __inode_security_revalidate+0x4f/0x60
[100722.966709]  __vfs_read+0xec/0x160
[100722.966711]  vfs_read+0x8c/0x130
[100722.966712]  SyS_pread64+0x87/0xb0
[100722.966716]  do_syscall_64+0x67/0x1b0
[100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25

To address this, add the check of 'buffer_mapped(bh)' to
__sync_dirty_buffer().  This also has the benefit of fixing this for
other file systems.

With this addition, we can drop the workaround in ext4_commit_supper().

[ Commit description rewritten by tytso. ]

Signed-off-by: Xianting Tian 
Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com
Signed-off-by: Theodore Ts'o 
Signed-off-by: Sasha Levin

ext4: correctly restore system zone info when remount fails

2020-09-03T09:26:57+00:00

[ Upstream commit 0f5bde1db174f6c471f0bd27198575719dabe3e5 ]

When remounting filesystem fails late during remount handling and
block_validity mount option is also changed during the remount, we fail
to restore system zone information to a state matching the mount option.
This is mostly harmless, just the block validity checking will not match
the situation described by the mount option. Make sure these two are always
consistent.

Reported-by: Lukas Czerner 
Reviewed-by: Lukas Czerner 
Signed-off-by: Jan Kara 
Link: https://lore.kernel.org/r/20200728130437.7804-7-jack@suse.cz
Signed-off-by: Theodore Ts'o 
Signed-off-by: Sasha Levin