path: root/fs
Age | Commit message | Author | Files | Lines
2020-12-30 | inotify: convert to handle_inode_event() interface | Amir Goldstein | 3 files | -54/+14
commit 1a2620a99803ad660edc5d22fd9c66cce91ceb1c upstream. Convert inotify to use the simple handle_inode_event() interface to get rid of the code duplication between the generic helper fsnotify_handle_event() and the inotify_handle_event() callback, which also happen to be buggy code. The bug will be fixed in the generic helper. Link: https://lore.kernel.org/r/20201202120713.702387-3-amir73il@gmail.com CC: stable@vger.kernel.org Fixes: b9a1b9772509 ("fsnotify: create method handle_inode_event() in fsnotify_operations") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | fsnotify: generalize handle_inode_event() | Amir Goldstein | 3 files | -9/+26
commit 950cc0d2bef078e1f6459900ca4d4b2a2e0e3c37 upstream. The handle_inode_event() interface was added as (quoting comment): "a simple variant of handle_event() for groups that only have inode marks and don't have ignore mask". In other words, all backends except fanotify. The inotify backend also falls under this category, but because it required extra arguments it was left out of the initial pass of backends conversion to the simple interface. This results in code duplication between the generic helper fsnotify_handle_event() and the inotify_handle_event() callback which also happen to be buggy code. Generalize the handle_inode_event() arguments and add the check for FS_EXCL_UNLINK flag to the generic helper, so inotify backend could be converted to use the simple interface. Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com CC: stable@vger.kernel.org Fixes: b9a1b9772509 ("fsnotify: create method handle_inode_event() in fsnotify_operations") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
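[Editor's note] For reference, the simplified backend interface these two patches converge on looks roughly like the sketch below, condensed from the commits above; the wrapper and the exact argument order follow the 5.10-era upstream code, but treat the details as illustrative rather than authoritative:

    /* Simple per-mark callback for backends that only have inode marks
     * and no ignore mask (now including inotify). */
    int (*handle_inode_event)(struct fsnotify_mark *mark, u32 mask,
                              struct inode *inode, struct inode *dir,
                              const struct qstr *file_name, u32 cookie);

    /* The generic helper now performs the FS_EXCL_UNLINK filtering that
     * inotify_handle_event() used to do itself. */
    static int fsnotify_handle_inode_event(struct fsnotify_group *group,
                                           struct fsnotify_mark *mark, u32 mask,
                                           const void *data, int data_type,
                                           struct inode *dir,
                                           const struct qstr *name, u32 cookie)
    {
            const struct path *path = fsnotify_data_path(data, data_type);
            struct inode *inode = fsnotify_data_inode(data, data_type);

            /* Skip events for unlinked children if the mark asked for it. */
            if ((mark->mask & FS_EXCL_UNLINK) && path && d_unlinked(path->dentry))
                    return 0;

            return group->ops->handle_inode_event(mark, mask, inode, dir,
                                                  name, cookie);
    }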
2020-12-30 | jffs2: Fix ignoring mounting options problem during remounting | lizhe | 1 file | -0/+17
commit 08cd274f9b8283a1da93e2ccab216a336da83525 upstream.

The jffs2 mount options are ignored when remounting jffs2. This is easily reproduced with:

    1. mount -t jffs2 -o compr=none /dev/mtdblockx /mnt
    2. mount -o remount compr=zlib /mnt

Since ec10a24f10c8, option parsing happens before fill_super, and fc, which contains the parsing results, is passed to jffs2_reconfigure during remounting. But jffs2_reconfigure does not update c->mount_opts. This patch adds a function jffs2_update_mount_opts to fix the problem. As an aside, tmpfs updates remount options the same way; it may be worth unifying the two. Cc: <stable@vger.kernel.org> Fixes: ec10a24f10c8 ("vfs: Convert jffs2 to use the new mount API") Signed-off-by: lizhe <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
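[Editor's note] The shape of the fix is to copy the freshly parsed options from the fs_context into the live superblock info during reconfigure. A minimal sketch, with field and helper names taken from the 5.10-era jffs2 code but worth verifying against your tree:

    static void jffs2_update_mount_opts(struct fs_context *fc)
    {
            struct jffs2_sb_info *new_c = fc->s_fs_info;
            struct jffs2_sb_info *c = JFFS2_SB_INFO(fc->root->d_sb);

            mutex_lock(&c->alloc_sem);
            /* Only overwrite options the user actually specified. */
            if (new_c->mount_opts.override_compr) {
                    c->mount_opts.override_compr = new_c->mount_opts.override_compr;
                    c->mount_opts.compr = new_c->mount_opts.compr;
            }
            mutex_unlock(&c->alloc_sem);
    }

    static int jffs2_reconfigure(struct fs_context *fc)
    {
            struct super_block *sb = fc->root->d_sb;

            sync_filesystem(sb);
            jffs2_update_mount_opts(fc);  /* previously the parsed options were dropped */
            return jffs2_do_remount_fs(sb, fc);
    }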
2020-12-30 | jffs2: Fix GC exit abnormally | Zhe Li | 1 file | -0/+16
commit 9afc9a8a4909fece0e911e72b1060614ba2f7969 upstream.

The log of this problem is:

    jffs2: Error garbage collecting node at 0x***!
    jffs2: No space for garbage collection. Aborting GC thread

GC aborts because it believes it has made no progress. Examining the jffs2 image reveals a scenario that triggers the problem reliably: a normal dirent node in the summary area whose counterpart in the non-summary area is corrupt, with a bad name_crc. GC exits abnormally because it picks that corrupt dirent node to collect, but in jffs2_add_fd_to_list the node never satisfies the condition below:

    if ((*prev)->nhash == new->nhash && !strcmp((*prev)->name, new->name))

So no node is marked obsolete and the erase_block statistics never change, which makes GC exit abnormally. The root cause is that the name_crc of the corrupt dirent node is not checked when summary is enabled. Note that jffs2_scan_dirent_node uses jffs2_scan_dirty_space to deal with dirent nodes whose name_crc is wrong. This patch therefore adds a check in read_direntry to verify the dirent node; if the check fails, the node is marked obsolete, so GC skips it and the problem is fixed. Cc: <stable@vger.kernel.org> Signed-off-by: Zhe Li <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
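[Editor's note] In other words, read_direntry() gains the same name CRC validation the scan path already had. A hedged sketch of the added check; helper names are real jffs2 APIs, but the surrounding variables (c, ref, rd, fd) are assumed from the function's context:

    /* Validate the name against rd->name_crc before trusting the dirent;
     * a corrupt name would otherwise make GC spin without progress. */
    crc = crc32(0, fd->name, rd->nsize);
    if (crc != je32_to_cpu(rd->name_crc)) {
            JFFS2_NOTICE("name CRC failed on dirent node at %#08x: read %#08x, calculated %#08x\n",
                         ref_offset(ref), je32_to_cpu(rd->name_crc), crc);
            jffs2_mark_node_obsolete(c, ref);  /* GC will now skip this node */
            jffs2_free_full_dirent(fd);
            return 0;
    }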
2020-12-30 | ubifs: wbuf: Don't leak kernel memory to flash | Richard Weinberger | 1 file | -2/+11
commit 20f1431160c6b590cdc269a846fc5a448abf5b98 upstream. Write buffers use a kmalloc()'ed buffer, they can leak up to seven bytes of kernel memory to flash if writes are not aligned. So use ubifs_pad() to fill these gaps with padding bytes. This was never a problem while scanning because the scanner logic manually aligns node lengths and skips over these gaps. Cc: <stable@vger.kernel.org> Fixes: 1e51764a3c2ac05a2 ("UBIFS: add new flash file system") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
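[Editor's note] Conceptually the fix is a short padding step before the buffer reaches flash. A sketch under the assumption that len is the payload length and aligned_len the length actually written; ubifs_pad() is the real helper, the variable names are illustrative:

    /* The kmalloc()'ed write-buffer may contain stale kernel data in the
     * alignment gap (up to 7 bytes); overwrite it with padding bytes
     * instead of letting it leak to flash. */
    if (aligned_len > len)
            ubifs_pad(c, wbuf->buf + wbuf->used + len, aligned_len - len);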
2020-12-30 | SMB3.1.1: do not log warning message if server doesn't populate salt | Steve French | 2 files | -5/+16
commit 7955f105afb6034af344038d663bc98809483cdd upstream. In the negotiate protocol preauth context, the server is not required to populate the salt (although it is done by most servers) so do not warn on mount. We retain the checks (warn) that the preauth context is the minimum size and that the salt does not exceed DataLength of the SMB response. Although we use the defaults in the case that the preauth context response is invalid, these checks may be useful in the future as servers add support for additional mechanisms. CC: Stable <stable@vger.kernel.org> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | SMB3.1.1: remove confusing mount warning when no SPNEGO info on negprot rsp | Steve French | 1 file | -4/+12
commit bc7c4129d4cdc56d1b5477c1714246f27df914dd upstream. Azure does not send an SPNEGO blob in the negotiate protocol response, so we shouldn't assume that it is there when validating the location of the first negotiate context. This avoids the potential confusing mount warning: CIFS: Invalid negotiate context offset CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | SMB3: avoid confusing warning message on mount to Azure | Steve French | 1 file | -1/+2
commit ebcd6de98754d9b6a5f89d7835864b1c365d432f upstream. Mounts to Azure cause an unneeded warning message in dmesg "CIFS: VFS: parse_server_interfaces: incomplete interface info" Azure rounds up the size (by 8 additional bytes, to a 16 byte boundary) of the structure returned on the query of the server interfaces at mount time. This is permissible even though different than other servers so do not log a warning if query network interfaces response is only rounded up by 8 bytes or fewer. CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | ceph: fix race in concurrent __ceph_remove_cap invocations | Luis Henriques | 1 file | -2/+9
commit e5cafce3ad0f8652d6849314d951459c2bff7233 upstream. A NULL pointer dereference may occur in __ceph_remove_cap with some of the callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and remove_session_caps_cb. Those callers hold the session->s_mutex, so they are prevented from concurrent execution, but ceph_evict_inode does not. Since the callers of this function hold the i_ceph_lock, the fix is simply a matter of returning immediately if caps->ci is NULL. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/43272 Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
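[Editor's note] The fix itself is a guard at the top of __ceph_remove_cap(), valid because all callers hold i_ceph_lock; roughly:

    void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
    {
            struct ceph_inode_info *ci = cap->ci;

            /* A racing ceph_evict_inode() may already have detached the
             * cap; callers hold i_ceph_lock, so if ci is NULL there is
             * nothing left to do. */
            if (!ci)
                    return;

            /* ... existing removal logic continues unchanged ... */
    }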
2020-12-30 | ovl: make ioctl() safe | Miklos Szeredi | 1 file | -71/+16
commit 89bdfaf93d9157499c3a0d61f489df66f2dead7f upstream. ovl_ioctl_set_flags() does a capability check using flags, but then the real ioctl double-fetches flags and uses a potentially different value. The "Check the capability before cred override" comment is misleading: the user can skip this check by presenting benign flags first and then overwriting them to non-benign flags. Just remove the cred override for now, hoping this doesn't cause a regression. The proper solution is to create a new setxflags i_op (patches are in the works). Xfstests don't show a regression. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: dab5ca8fd9dd ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | ext4: don't remount read-only with errors=continue on reboot | Jan Kara | 1 file | -8/+6
commit b08070eca9e247f60ab39d79b2c25d274750441f upstream. ext4_handle_error() with errors=continue mount option can accidentally remount the filesystem read-only when the system is rebooting. Fix that. Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | ext4: fix deadlock with fs freezing and EA inodes | Jan Kara | 1 file | -5/+14
commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.

Xattr code using inodes with large xattr data can end up dropping the last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with a transaction started, so ext4_evict_inode() can deadlock against fs freezing like:

    CPU1                                    CPU2
    removexattr()                           freeze_super()
      vfs_removexattr()
        ext4_xattr_set()
          handle = ext4_journal_start()
          ...
          ext4_xattr_set_entry()
            iput(old_ea_inode)
              ext4_evict_inode(old_ea_inode)
                                            sb->s_writers.frozen =
                                                SB_FREEZE_FS;
                                            sb_wait_write(sb, SB_FREEZE_FS);
                                            ext4_freeze()
                                              jbd2_journal_lock_updates()
                                              -> blocks waiting for all
                                                 handles to stop
                sb_start_intwrite()
                -> blocks as sb is already in SB_FREEZE_FS state

Generally it is advisable to delete inodes from a separate transaction, as deletion can consume quite a few credits; however, in this case that would be quite clumsy, and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have a transaction already started, since the protection is not really needed then anyway. Cc: stable@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
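[Editor's note] The tweak amounts to making the intwrite protection in ext4_evict_inode() conditional. A condensed sketch based on the upstream change, simplified:

    /* If we are called from within a running transaction, this task
     * already blocks freezing, so taking sb_start_intwrite() again
     * would deadlock against a concurrent freeze_super(). */
    bool freeze_protected = !ext4_journal_current_handle();

    if (freeze_protected)
            sb_start_intwrite(inode->i_sb);

    /* ... evict the inode ... */

    if (freeze_protected)
            sb_end_intwrite(inode->i_sb);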
2020-12-30 | ext4: fix a memory leak of ext4_free_data | Chunguang Xu | 1 file | -0/+1
commit cca415537244f6102cbb09b5b90db6ae2c953bdd upstream. When freeing metadata, we will create an ext4_free_data and insert it into the pending free list. After the current transaction is committed, the object will be freed. ext4_mb_free_metadata() will check whether the area to be freed overlaps with the pending free list. If true, return directly. At this time, ext4_free_data is leaked. Fortunately, the probability of this problem is small, since it only occurs if the file system is corrupted such that a block is claimed by more than one inode and those inodes are deleted within a single jbd2 transaction. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
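[Editor's note] The one-line fix frees the just-allocated entry on that early-return path, approximately as below; the overlap condition is paraphrased (the real code compares block ranges of neighbouring rb-tree entries), while the error message and free call follow the upstream patch:

    /* In ext4_mb_free_metadata(): the range overlaps an entry already
     * queued for freeing, so bail out -- but free new_entry first
     * instead of leaking it. */
    if (overlaps_pending_entry) {
            ext4_grp_locked_error(sb, group, 0, 0,
                            "Block already on to-be-freed list");
            kmem_cache_free(ext4_free_data_cachep, new_entry);
            return 0;
    }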
2020-12-30 | ext4: fix an IS_ERR() vs NULL check | Dan Carpenter | 1 file | -2/+2
commit bc18546bf68e47996a359d2533168d5770a22024 upstream. The ext4_find_extent() function never returns NULL, it returns error pointers. Fixes: 44059e503b03 ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201023112232.GB282278@mwanda Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
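[Editor's note] The pattern of the fix is the standard ERR_PTR idiom: since ext4_find_extent() returns error pointers, the NULL test could never fire. Sketched with an illustrative error label:

    path = ext4_find_extent(inode, start, NULL, 0);
    if (IS_ERR(path)) {             /* was: if (!path) -- never true */
            ret = PTR_ERR(path);    /* propagate the real error code */
            goto out;
    }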
2020-12-30 | btrfs: fix race when defragmenting leads to unnecessary IO | Filipe Manana | 1 file | -0/+39
commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream. When defragmenting we skip ranges that have holes or inline extents, so that we don't do unnecessary IO and waste space. We do this check when calling should_defrag_range() at btrfs_defrag_file(). However we do it without holding the inode's lock. The reason we do it like this is to avoid blocking other tasks for too long, that possibly want to operate on other file ranges, since after the call to should_defrag_range() and before locking the inode, we trigger a synchronous page cache readahead. However before we were able to lock the inode, some other task might have punched a hole in our range, or we may now have an inline extent there, in which case we should not set the range for defrag anymore since that would cause unnecessary IO and make us waste space (i.e. allocating extents to contain zeros for a hole). So after we locked the inode and the range in the iotree, check again if we have holes or an inline extent, and if we do, just skip the range. I hit this while testing my next patch that fixes races when updating an inode's number of bytes (subject "btrfs: update the number of bytes used by an inode atomically"), and it depends on this change in order to work correctly. Alternatively I could rework that other patch to detect holes and flag their range with the 'new delalloc' bit, but this itself fixes an efficiency problem due to a race that from a functional point of view is not harmful (it could be triggered with btrfs/062 from fstests). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | btrfs: update last_byte_to_unpin in switch_commit_roots | Josef Bacik | 3 files | -28/+40
commit 27d56e62e4748c2135650c260024e9904b8c1a0a upstream.

While writing an explanation for the need of the commit_root_sem for btrfs_prepare_extent_commit, I realized we have a slight hole that could result in leaked space if we have to do the old style caching. Consider the following scenario:

    commit root
    +----+----+----+----+----+----+----+
    |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
    +----+----+----+----+----+----+----+
    0    1    2    3    4    5    6    7

    new commit root
    +----+----+----+----+----+----+----+
    |    |    |    |\\\\|    |    |\\\\|
    +----+----+----+----+----+----+----+
    0    1    2    3    4    5    6    7

Prior to this patch, we run btrfs_prepare_extent_commit, which updates the last_byte_to_unpin, and then we subsequently run switch_commit_roots. In this example let's assume that caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which means that cache->last_byte_to_unpin == 1. Then we go and do the switch_commit_roots(), but in the meantime the caching thread has made some more progress, because we drop the commit_root_sem and re-acquire it. Now caching_ctl->progress == 3. We swap out the commit root and carry on to unpin. The race can happen like:

    1) The caching thread was running using the old commit root when it
       found the extent for [2, 3);

    2) Then it released the commit_root_sem because it was in the last
       item of a leaf and the semaphore was contended, and set ->progress
       to 3 (value of 'last'), as the last extent item in the current
       leaf was for the extent for range [2, 3);

    3) Next time it gets the commit_root_sem, it will start using the new
       commit root and search for a key with offset 3, so it never finds
       the hole for [2, 3).

So the caching thread never saw [2, 3) as free space in any of the commit roots, and by the time finish_extent_commit() was called for the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the subrange [0, 1) to the free space cache, skipping [2, 3). In the unpin code we have last_byte_to_unpin == 1, so we unpin [0, 1), but do not unpin [2, 3). However because caching_ctl->progress == 3 we do not see the newly freed section of [2, 3), and thus do not add it to our free space cache. This results in us missing a chunk of free space in memory (on disk too, unless we have a power failure before writing the free space cache to disk). Fix this by making sure the ->last_byte_to_unpin is set at the same time that we swap the commit roots; this ensures that we will always be consistent. CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ update changelog with Filipe's review comments ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | btrfs: do not shorten unpin len for caching block groups | Josef Bacik | 1 file | -4/+4
commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream. While fixing up our ->last_byte_to_unpin locking I noticed that we will shorten len based on ->last_byte_to_unpin if we're caching when we're adding back the free space. This is correct for the free space, as we cannot unpin more than ->last_byte_to_unpin, however we use len to adjust the ->bytes_pinned counters and such, which need to track the actual pinned usage. This could result in WARN_ON(space_info->bytes_pinned) triggering at unmount time. Fix this by using a local variable for the amount to add to free space cache, and leave len untouched in this case. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: make ctx cancel on exit targeted to actual ctx | Jens Axboe | 1 file | -1/+8
commit 00c18640c2430c4bafaaeede1f9dd6f7ec0e4b25 upstream. Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the io-wq when exiting. But that's not the case if they are shared, so cancel for the specific ctx instead. Cc: stable@vger.kernel.org Fixes: 24369c2e3bb0 ("io_uring: add io-wq workqueue sharing") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: fix double io_uring free | Pavel Begunkov | 1 file | -32/+39
commit 9faadcc8abe4b83d0263216dc3a6321d5bbd616b upstream. Once we created a file for the current context during setup, we should not call io_ring_ctx_wait_and_kill() directly, as it'll be done by fput(file). Cc: stable@vger.kernel.org # 5.10 Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix unused 'ret' for !CONFIG_UNIX] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: fix ignoring xa_store errors | Pavel Begunkov | 1 file | -3/+7
commit a528b04ea40690ff40501f50d618a62a02b19620 upstream. xa_store() may fail, check the result. Cc: stable@vger.kernel.org # 5.10 Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
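[Editor's note] xa_store() reports failure by returning an xa_err()-encoded entry, so the check follows the standard xarray idiom. A sketch in the context of 5.10's io_uring task-file tracking, simplified:

    /* Register the ring file in the task's xarray; xa_store() can fail
     * (e.g. -ENOMEM), and the old code silently ignored that. */
    ret = xa_err(xa_store(&tctx->xa, (unsigned long)file,
                          file, GFP_KERNEL));
    if (ret)
            return ret;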
2020-12-30 | io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work() | Xiaoguang Wang | 1 file | -10/+19
commit c07e6719511e77c4b289f62bfe96423eb6ea061d upstream. io_iopoll_complete() does not hold completion_lock to complete polled io, so in io_wq_submit_work() we cannot call io_req_complete() directly to complete polled io; otherwise there may be concurrent access to the cqring, defer_list, etc., which is not safe. Commit dad1b1242fd5 ("io_uring: always let io_iopoll_complete() complete polled io") fixed this issue, but as Pavel reported, IOPOLL can handle more than just rw: buffer reg/unreg requests (IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS) are possible too, so that fix was not complete. Given that io_iopoll_complete() is always called under uring_lock, here we can also take uring_lock for polled io to fix this issue. Fixes: dad1b1242fd5 ("io_uring: always let io_iopoll_complete() complete polled io") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: don't deref 'req' after completing it'] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: fix 0-iov read buffer select | Pavel Begunkov | 1 file | -3/+1
commit dd20166236953c8cd14f4c668bf972af32f0c6be upstream. Doing vectored buf-select read with 0 iovec passed is meaningless and utterly broken, forbid it. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: fix io_wqe->work_list corruption | Xiaoguang Wang | 1 file | -0/+1
commit 0020ef04e48571a88d4f482ad08f71052c5c5a08 upstream.

The first time a req is punted to io-wq, we initialize io_wq_work's list to NULL, then insert the req into io_wqe->work_list. If this req is not inserted at the tail of io_wqe->work_list, its io_wq_work list will point to another req's io_wq_work. In the split-bio case, the req may be inserted into io_wqe->work_list repeatedly; once we insert it at the tail of io_wqe->work_list for the second time, io_wq_work->list->next becomes an invalid pointer, which then results in many strange errors: panics, kernel soft-lockups, rcu stalls, etc. In my VM the kernel does not have commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"), and the fio job below reproduces this bug steadily:

    [global]
    name=iouring-sqpoll-iopoll-1
    ioengine=io_uring
    iodepth=128
    numjobs=1
    thread
    rw=randread
    direct=1
    registerfiles=1
    hipri=1
    bs=4m
    size=100M
    runtime=120
    time_based
    group_reporting
    randrepeat=0

    [device]
    directory=/home/feiman.wxg/mntpoint/  # an ext4 mount point

With commit cc29e1bf0d63f7 ("block: disable iopoll for split bio") there is no split-bio case for polled io, but I think we still need to fix this list corruption, and it should also go to stable branches. To fix the corruption, when a req is inserted at the tail of io_wqe->work_list, initialize req->io_wq_work->list->next to NULL. Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
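[Editor's note] The corruption and its fix are easiest to see in the tail-insert helper. A sketch of wq_list_add_tail() from io-wq.h with the added initialization, lightly condensed:

    static inline void wq_list_add_tail(struct io_wq_work_node *node,
                                        struct io_wq_work_list *list)
    {
            if (!list->first) {
                    list->last = node;
                    WRITE_ONCE(list->first, node);
            } else {
                    list->last->next = node;
                    list->last = node;
            }
            /* The fix: a re-punted req may still carry a stale ->next
             * from its previous stint on the list; clear it so the tail
             * never points into another request. */
            node->next = NULL;
    }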
2020-12-30 | io_uring: always let io_iopoll_complete() complete polled io | Xiaoguang Wang | 1 file | -2/+13
commit dad1b1242fd5717af18ae4ac9d12b9f65849e13a upstream.

Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():

    [ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
    [ 95.505921]
    [ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1
    [ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    [ 95.508248] Call Trace:
    [ 95.508683] dump_stack+0x107/0x163
    [ 95.509323] ? io_commit_cqring+0x3ec/0x8e0
    [ 95.509982] print_address_description.constprop.0+0x3e/0x60
    [ 95.510814] ? vprintk_func+0x98/0x140
    [ 95.511399] ? io_commit_cqring+0x3ec/0x8e0
    [ 95.512036] ? io_commit_cqring+0x3ec/0x8e0
    [ 95.512733] kasan_report_invalid_free+0x51/0x80
    [ 95.513431] ? io_commit_cqring+0x3ec/0x8e0
    [ 95.514047] __kasan_slab_free+0x141/0x160
    [ 95.514699] kfree+0xd1/0x390
    [ 95.515182] io_commit_cqring+0x3ec/0x8e0
    [ 95.515799] __io_req_complete.part.0+0x64/0x90
    [ 95.516483] io_wq_submit_work+0x1fa/0x260
    [ 95.517117] io_worker_handle_work+0xeac/0x1c00
    [ 95.517828] io_wqe_worker+0xc94/0x11a0
    [ 95.518438] ? io_worker_handle_work+0x1c00/0x1c00
    [ 95.519151] ? __kthread_parkme+0x11d/0x1d0
    [ 95.519806] ? io_worker_handle_work+0x1c00/0x1c00
    [ 95.520512] ? io_worker_handle_work+0x1c00/0x1c00
    [ 95.521211] kthread+0x396/0x470
    [ 95.521727] ? _raw_spin_unlock_irq+0x24/0x30
    [ 95.522380] ? kthread_mod_delayed_work+0x180/0x180
    [ 95.523108] ret_from_fork+0x22/0x30
    [ 95.523684]
    [ 95.523985] Allocated by task 4035:
    [ 95.524543] kasan_save_stack+0x1b/0x40
    [ 95.525136] __kasan_kmalloc.constprop.0+0xc2/0xd0
    [ 95.525882] kmem_cache_alloc_trace+0x17b/0x310
    [ 95.533930] io_queue_sqe+0x225/0xcb0
    [ 95.534505] io_submit_sqes+0x1768/0x25f0
    [ 95.535164] __x64_sys_io_uring_enter+0x89e/0xd10
    [ 95.535900] do_syscall_64+0x33/0x40
    [ 95.536465] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 95.537199]
    [ 95.537505] Freed by task 4035:
    [ 95.538003] kasan_save_stack+0x1b/0x40
    [ 95.538599] kasan_set_track+0x1c/0x30
    [ 95.539177] kasan_set_free_info+0x1b/0x30
    [ 95.539798] __kasan_slab_free+0x112/0x160
    [ 95.540427] kfree+0xd1/0x390
    [ 95.540910] io_commit_cqring+0x3ec/0x8e0
    [ 95.541516] io_iopoll_complete+0x914/0x1390
    [ 95.542150] io_do_iopoll+0x580/0x700
    [ 95.542724] io_iopoll_try_reap_events.part.0+0x108/0x200
    [ 95.543512] io_ring_ctx_wait_and_kill+0x118/0x340
    [ 95.544206] io_uring_release+0x43/0x50
    [ 95.544791] __fput+0x28d/0x940
    [ 95.545291] task_work_run+0xea/0x1b0
    [ 95.545873] do_exit+0xb6a/0x2c60
    [ 95.546400] do_group_exit+0x12a/0x320
    [ 95.546967] __x64_sys_exit_group+0x3f/0x50
    [ 95.547605] do_syscall_64+0x33/0x40
    [ 95.548155] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is that once we get a non-EAGAIN error in io_wq_submit_work(), we complete the req by calling io_req_complete(), which holds completion_lock to call io_commit_cqring(). But for polled io, io_iopoll_complete() doesn't hold completion_lock to call io_commit_cqring(), so there may be concurrent access to ctx->defer_list, and a double free may happen. To fix this bug, we always let io_iopoll_complete() complete polled io. Cc: <stable@vger.kernel.org> # 5.5+ Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: fix racy IOPOLL completions | Pavel Begunkov | 1 file | -5/+18
commit 31bff9a51b264df6d144931a6a5f1d6cc815ed4b upstream. IOPOLL allows buffer remove/provide requests, but they don't synchronise by the rules of IOPOLL; namely, they have to hold uring_lock. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: fix io_cqring_events()'s noflush | Pavel Begunkov | 1 file | -1/+1
commit 59850d226e4907a6f37c1d2fe5ba97546a8691a4 upstream. Checking !list_empty(&ctx->cq_overflow_list) around noflush in io_cqring_events() is racy: if it fails but a request overflows just after that, io_cqring_overflow_flush() will still be called. Remove the second check; it shouldn't be a problem for performance, because there is a cq_check_overflow bit check just above. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | proc mountinfo: make splice available again | Linus Torvalds | 1 file | -3/+6
[ Upstream commit 14e3e989f6a5d9646b6cf60690499cc8bdc11f7d ] Since commit 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops") we've required that file operation structures explicitly enable splice support, rather than falling back to the default handlers. Most /proc files use the indirect 'struct proc_ops' to describe their file operations, and were fixed up to support splice earlier in commits 40be821d627c..b24c30c67863, but the mountinfo files interact with the VFS directly using their own 'struct file_operations' and got missed as a result. This adds the necessary support for splice to work for /proc/*/mountinfo and friends. Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Reported-by: Jussi Kivilinna <jussi.kivilinna@iki.fi> Link: https://bugzilla.kernel.org/show_bug.cgi?id=209971 Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 | io_uring: cancel reqs shouldn't kill overflow list | Pavel Begunkov | 1 file | -4/+2
[ Upstream commit cda286f0715c82f8117e166afd42cca068876dde ] io_uring_cancel_task_requests() doesn't imply that the ring is going away; it may continue to work well after that. The problem is that it sets ->cq_overflow_flushed, effectively disabling the CQ overflow feature. Split setting cq_overflow_flushed from flush, and do the first one only on exit. It's ok in terms of cancellations because there is an io_uring->in_idle check in __io_cqring_fill_event(). It also fixes a race with setting ->cq_overflow_flushed in io_uring_cancel_task_requests, which is not atomic and is part of a bitmask with other flags. Though, the only other flag that's not set during init is drain_next, so it's not as bad for sane architectures. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | io_uring: fix racy IOPOLL flush overflow | Pavel Begunkov | 1 file | -4/+6
[ Upstream commit 634578f800652035debba3098d8ab0d21af7c7a5 ] It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without holding uring_lock, because it does synchronisation differently. Make sure we have it. As for io_ring_exit_work(), we don't even need it there because io_ring_ctx_wait_and_kill() already set the force flag, making all overflowed requests be dropped. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | epoll: check for events when removing a timed out thread from the wait queue | Soheil Hassas Yeganeh | 1 file | -9/+16
[ Upstream commit 289caf5d8f6c61c6d2b7fd752a7f483cd153f182 ]

Patch series "simplify ep_poll". This patch series is a followup based on the suggestions and feedback by Linus: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com The first patch in the series is a fix for the epoll race in presence of timeouts, so that it can be cleanly backported to all affected stable kernels. The rest of the patch series simplifies the ep_poll() implementation; some of these simplifications result in minor performance enhancements as well. We have kept these changes under self tests and internal benchmarks for a few days, and there are minor (1-2%) performance enhancements as a result.

This patch (of 8): After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout"), we break out of the ep_poll loop upon timeout without checking whether there are any new events available. Prior to that patch series we always called ep_events_available() after exiting the loop. This can cause races and missed wakeups. For example, consider the following scenario reported by Guantao Liu. Suppose we have an eventfd added using EPOLLET to an epollfd:

    Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
    Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
              event of the eventfd, it will write back on that fd.
    Thread 3: Calls epoll_wait with a negative timeout.

Prior to abc610e01c66, it is guaranteed that Thread 3 will wake up, either by Thread 1 or Thread 2. After abc610e01c66, Thread 3 can be blocked indefinitely if Thread 2 sees a timeout right before the write to the eventfd by Thread 1: Thread 2 will be woken up from schedule_hrtimeout_range and, with eavail 0, it will not call ep_send_events(). To fix this issue:

    1) Simplify the timed_out case as suggested by Linus.
    2) While holding the lock, recheck whether the thread was woken up
       after its timeout was reached.

Note that (2) is different from Linus' original suggestion: it does not set "eavail = ep_events_available(ep)", to avoid unnecessary contention (when there are too many timed-out threads and a small number of events) as well as races mentioned in the discussion thread. This is the first patch in the series so that the backport to stable releases is straightforward. Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Guantao Liu <guantaol@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Guantao Liu <guantaol@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
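[Editor's note] Condensed from the patch, the recheck after a timeout works by observing whether we are still on the wait queue: if a waker already removed us right at the deadline, there are events to report. Names follow the upstream ep_poll(), but this is a sketch of the tail of that function, not the full diff:

    if (!list_empty_careful(&wait.entry)) {
            write_lock_irq(&ep->lock);
            /* If we timed out but a waker already took us off the
             * queue, events arrived right at the deadline: report
             * them instead of returning 0. */
            if (timed_out)
                    eavail = list_empty(&wait.entry);
            __remove_wait_queue(&ep->wq, &wait);
            write_unlock_irq(&ep->lock);
    }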
2020-12-30 | io_uring: cancel only requests of current task | Pavel Begunkov | 1 file | -18/+5
[ Upstream commit df9923f96717d0aebb0a73adbcf6285fa79e38cb ] io_uring_cancel_files() cancels all requests that match files, regardless of task. There is no real need for that; cancel only the requests of the specified task. That also handles the SQPOLL case, as it already changes task to it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read() | Trond Myklebust | 1 file | -1/+1
[ Upstream commit 52104f274e2d7f134d34bab11cada8913d4544e2 ] Don't bump the index twice. Fixes: 563c53e73b8b ("NFS: Fix flexfiles read failover") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | proc: fix lookup in /proc/net subdirectories after setns(2) | Alexey Dobriyan | 3 files | -18/+29
[ Upstream commit c6c75deda81344c3a95d1d1f606d5cee109e5d54 ]

Commit 1fde6f21d90f ("proc: fix /proc/net/* after setns(2)") only forced revalidation of regular files under /proc/net/. However, /proc/net/ is unusual in that /proc/net/foo handlers take the netns pointer from the parent directory, which is the old netns. Steps to reproduce:

    (void)open("/proc/net/sctp/snmp", O_RDONLY);
    unshare(CLONE_NEWNET);

    int fd = open("/proc/net/sctp/snmp", O_RDONLY);
    read(fd, &c, 1);

The read returns wrong data, from the original netns. The patch forces a lookup on every directory under /proc/net. Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain Fixes: 1da4d377f943 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | ubifs: Fix error return code in ubifs_init_authentication() | Wang ShaoBo | 1 file | -1/+3
[ Upstream commit 3cded66330591cfd2554a3fd5edca8859ea365a2 ] Fix to return PTR_ERR() error code from the error handling case where ubifs_hash_get_desc() failed instead of 0 in ubifs_init_authentication(), as done elsewhere in this function. Fixes: 49525e5eecca5 ("ubifs: Add helper functions for authentication support") Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
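[Editor's note] This is the classic pattern behind this class of fix: assign PTR_ERR() before jumping to the error path, instead of leaving err at its initial 0. Sketched with an illustrative variable name and label:

    desc = ubifs_hash_get_desc(c);
    if (IS_ERR(desc)) {
            err = PTR_ERR(desc);   /* previously 'err' was left at 0 */
            goto out_free;         /* label illustrative */
    }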
2020-12-30 | fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode() | Hao Li | 1 file | -1/+3
[ Upstream commit 88149082bb8ef31b289673669e080ec6a00c2e59 ] If generic_drop_inode() returns true, it means iput_final() can evict this inode regardless of whether it is dirty or not. If we check I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be evicted unconditionally. This is not the desired behavior because I_DONTCACHE only means the inode shouldn't be cached on the LRU list. As for whether we need to evict this inode, this is what generic_drop_inode() should do. This patch corrects the usage of I_DONTCACHE. This patch was proposed in [1]. [1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/ Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer") Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
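[Editor's note] After the patch, the policy split looks roughly like this: generic_drop_inode() answers only "may this inode be evicted?", while iput_final() separately consults I_DONTCACHE to decide whether to keep the inode on the LRU. Condensed from the upstream diff:

    static inline int generic_drop_inode(struct inode *inode)
    {
            /* No I_DONTCACHE test here any more: that flag only means
             * "don't cache on the LRU", not "evict unconditionally". */
            return !inode->i_nlink || inode_unhashed(inode);
    }

    /* ... in iput_final(): */
    if (!drop &&
        !(inode->i_state & I_DONTCACHE) &&
        (sb->s_flags & SB_ACTIVE)) {
            inode_add_lru(inode);   /* keep the inode cached */
            spin_unlock(&inode->i_lock);
            return;
    }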
2020-12-30 | erofs: avoid using generic_block_bmap | Huang Jianan | 1 file | -19/+7
[ Upstream commit d8b3df8b1048405e73558b88cba2adf29490d468 ] Surprisingly, `block' in sector_t indicates the number of i_blkbits-sized blocks rather than sectors for bmap. In addition, since buffer_head limits the mapped size to 32 bits, we should avoid using generic_block_bmap. Link: https://lore.kernel.org/r/20201209115740.18802-1-huangjianan@oppo.com Fixes: 9da681e017a3 ("staging: erofs: support bmap") Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> [ Gao Xiang: slightly update the commit message description. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | nfs_common: need lock during iterate through the list | Cheng Lin | 1 file | -1/+5
[ Upstream commit 4a9d81caf841cd2c0ae36abec9c2963bf21d0284 ]

If an element is deleted from the list while the list is being iterated, the iteration falls into an endless loop:

    kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [nfsd:17137]

    PID: 17137  TASK: ffff8818d93c0000  CPU: 4  COMMAND: "nfsd"
        [exception RIP: __state_in_grace+76]
        RIP: ffffffffc00e817c  RSP: ffff8818d3aefc98  RFLAGS: 00000246
        RAX: ffff881dc0c38298  RBX: ffffffff81b03580  RCX: ffff881dc02c9f50
        RDX: ffff881e3fce8500  RSI: 0000000000000001  RDI: ffffffff81b03580
        RBP: ffff8818d3aefca0   R8: 0000000000000020   R9: ffff8818d3aefd40
        R10: ffff88017fc03800  R11: ffff8818e83933c0  R12: ffff8818d3aefd40
        R13: 0000000000000000  R14: ffff8818e8391068  R15: ffff8818fa6e4000
        CS: 0010  SS: 0018
     #0 [ffff8818d3aefc98] opens_in_grace at ffffffffc00e81e3 [grace]
     #1 [ffff8818d3aefca8] nfs4_preprocess_stateid_op at ffffffffc02a3e6c [nfsd]
     #2 [ffff8818d3aefd18] nfsd4_write at ffffffffc028ed5b [nfsd]
     #3 [ffff8818d3aefd80] nfsd4_proc_compound at ffffffffc0290a0d [nfsd]
     #4 [ffff8818d3aefdd0] nfsd_dispatch at ffffffffc027b800 [nfsd]
     #5 [ffff8818d3aefe08] svc_process_common at ffffffffc02017f3 [sunrpc]
     #6 [ffff8818d3aefe70] svc_process at ffffffffc0201ce3 [sunrpc]
     #7 [ffff8818d3aefe98] nfsd at ffffffffc027b117 [nfsd]
     #8 [ffff8818d3aefec8] kthread at ffffffff810b88c1
     #9 [ffff8818d3aeff50] ret_from_fork at ffffffff816d1607

The troublemaker element:

    crash> lock_manager ffff881dc0c38298
    struct lock_manager {
      list = {
        next = 0xffff881dc0c38298,
        prev = 0xffff881dc0c38298
      },
      block_opens = false
    }

Fixes: c87fb4a378f9 ("lockd: NLM grace period shouldn't block NFSv4 opens") Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
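[Editor's note] The fix takes the file's existing grace_lock around the iteration so a concurrent deletion cannot strand the iterator. Roughly, matching the helper named in the backtrace (fs/nfs_common/grace.c, lightly condensed):

    static bool __state_in_grace(struct net *net, bool open)
    {
            struct list_head *grace_list = net_generic(net, grace_net_id);
            struct lock_manager *lm;
            bool ret = false;

            if (!open)
                    return !list_empty(grace_list);

            spin_lock(&grace_lock);          /* previously iterated unlocked */
            list_for_each_entry(lm, grace_list, list) {
                    if (lm->block_opens) {
                            ret = true;
                            break;
                    }
            }
            spin_unlock(&grace_lock);
            return ret;
    }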
2020-12-30 | NFSD: Fix 5 seconds delay when doing inter server copy | Dai Ngo | 1 file | -0/+1
[ Upstream commit ca9364dde50daba93eff711b4b945fd08beafcc2 ] Since commit b4868b44c5628 ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE"), every inter-server copy operation suffers a 5 second delay regardless of the size of the copy. The delay comes from nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential fails because the seqid in both nfs4_state and nfs4_stateid is 0. Fix by modifying nfs4_init_cp_state to return the stateid with seqid 1 instead of 0. This also conforms with section 4.8 of RFC 7862; here is the relevant paragraph:

    A copy offload stateid's seqid MUST NOT be zero.  In the context of
    a copy offload operation, it is inappropriate to indicate "the most
    recent copy offload operation" using a stateid with a seqid of zero
    (see Section 8.2.2 of [RFC5661]).  It is inappropriate because the
    stateid refers to internal state in the server and there may be
    several asynchronous COPY operations being performed in parallel on
    the same file by the server.  Therefore, a copy offload stateid with
    a seqid of zero MUST be considered invalid.

Fixes: ce0887ac96d3 ("NFSD add nfs4 inter ssc to nfsd4_copy") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | nfsd: Fix message level for normal termination | kazuo ito | 1 file | -2/+1
[ Upstream commit 4420440c57892779f265108f46f83832a88ca795 ] The warning message from nfsd terminating normally can confuse system administrators or monitoring software. Though it's not exactly fair to pin-point a commit where it originated, the current form in the current place started to appear in: Fixes: e096bbc6488d ("knfsd: remove special handling for SIGHUP") Signed-off-by: kazuo ito <kzpn200@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 | f2fs: fix double free of unicode map | Hyeongseok Kim | 1 file | -0/+1
[ Upstream commit 89ff6005039a878afac87889fee748fa3f957c3a ] In case of retrying fill_super with skip_recovery, s_encoding for casefold would not be loaded again even though it's already been freed because it's not NULL. Set NULL after free to prevent double freeing when unmount. Fixes: eca4873ee1b6 ("f2fs: Use generic casefolding support") Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
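[Editor's note] The fix is the usual free-then-NULL discipline so the retry path can tell the map needs reloading. Approximately, in fill_super()'s error/free path:

    #ifdef CONFIG_UNICODE
            utf8_unload(sb->s_encoding);
            /* The fill_super() retry (skip_recovery) checks this pointer
             * to decide whether to load the casefold map again; leaving
             * it dangling caused the double free at unmount. */
            sb->s_encoding = NULL;
    #endif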
2020-12-30 | NFS: switch nfsiod to be an UNBOUND workqueue. | NeilBrown | 1 file | -1/+1
[ Upstream commit bf701b765eaa82dd164d65edc5747ec7288bb5c3 ] nfsiod is currently a concurrency-managed workqueue (CMWQ). This means that work items scheduled to nfsiod on a given CPU are queued behind all other work items queued on any CMWQ on the same CPU. This can introduce unexpected latency. Occasionally nfsiod can even cause excessive latency. If the work item to complete a CLOSE request calls the final iput() on an inode, the address_space of that inode will be dismantled. This takes time proportional to the number of in-memory pages, which on a large host working on large files (e.g. 5TB) can be a large number of pages, resulting in a noticeable number of seconds. We can avoid these latency problems by switching nfsiod to WQ_UNBOUND. This causes each concurrent work item to get a dedicated thread w