summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2025-08-20btrfs: don't skip remaining extrefs if dir not found during log replayFilipe Manana1-4/+14
commit 24e066ded45b8147b79c7455ac43a5bff7b5f378 upstream. During log replay, at add_inode_ref(), if we have an extref item that contains multiple extrefs and one of them points to a directory that does not exist in the subvolume tree, we are supposed to ignore it and process the remaining extrefs encoded in the extref item, since each extref can point to a different parent inode. However when that happens we just return from the function and ignore the remaining extrefs. The problem has been around since extrefs were introduced, in commit f186373fef00 ("btrfs: extended inode refs"), but it's hard to hit in practice because getting extref items encoding multiple extref requires getting a hash collision when computing the offset of the extref's key. The offset if computed like this: key.offset = btrfs_extref_hash(dir_ino, name->name, name->len); and btrfs_extref_hash() is just a wrapper around crc32c(). Fix this by moving to next iteration of the loop when we don't find the parent directory that an extref points to. Fixes: f186373fef00 ("btrfs: extended inode refs") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: qgroup: fix qgroup create ioctl returning success after quotas disabledFilipe Manana2-5/+4
commit 08530d6e638427e7e1344bd67bacc03882ba95b9 upstream. When quotas are disabled qgroup ioctls are supposed to return -ENOTCONN, but the qgroup create ioctl stopped doing that when it races with a quota disable operation, returning 0 instead. This change of behaviour happened in commit 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation"). The issue happens as follows: 1) Task A enters btrfs_ioctl_qgroup_create(), qgroups are enabled and so qgroup_enabled() returns true since fs_info->quota_root is not NULL; 2) Task B enters btrfs_ioctl_quota_ctl() -> btrfs_quota_disable() and disables qgroups, so now fs_info->quota_root is NULL; 3) Task A enters btrfs_create_qgroup() and calls btrfs_qgroup_mode(), which returns BTRFS_QGROUP_MODE_DISABLED since quotas are disabled, and then btrfs_create_qgroup() returns 0 to the caller, which makes the ioctl return 0 instead of -ENOTCONN. The check for fs_info->quota_root and returning -ENOTCONN if it's NULL is made only after the call btrfs_qgroup_mode(). Fix this by moving the check for disabled quotas with btrfs_qgroup_mode() into transaction.c:create_pending_snapshot(), so that we don't abort the transaction if btrfs_create_qgroup() returns -ENOTCONN and quotas are disabled. Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: populate otime when logging an inode itemQu Wenruo1-0/+3
commit 1ef94169db0958d6de39f9ea6e063ce887342e2d upstream. [TEST FAILURE WITH EXPERIMENTAL FEATURES] When running test case generic/508, the test case will fail with the new btrfs shutdown support: generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad) # --- tests/generic/508.out 2022-05-11 11:25:30.806666664 +0930 # +++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.401824212 +0930 # @@ -1,2 +1,6 @@ # QA output created by 508 # Silence is golden # +Before: # +After : stat.btime = Thu Jan 1 09:30:00 1970 # +Before: # +After : stat.btime = Wed Jul 2 14:53:22 2025 # ... # (Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff) Ran: generic/508 Failures: generic/508 Failed 1 of 1 tests Please note that the test case requires shutdown support, thus the test case will be skipped using the current upstream kernel, as it doesn't have shutdown ioctl support. [CAUSE] The direct cause the 0 time stamp in the log tree: leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG leaf 30507008 flags 0x1(WRITTEN) backref revision 1 checksum stored e522548d checksum calced e522548d fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 9 transid 9 size 0 nbytes 0 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) atime 1751432947.492000000 (2025-07-02 14:39:07) ctime 1751432947.492000000 (2025-07-02 14:39:07) mtime 1751432947.492000000 (2025-07-02 14:39:07) otime 0.0 (1970-01-01 09:30:00) <<< But the old fs tree has all the correct time stamp: btrfs-progs v6.12 fs tree key (FS_TREE ROOT_ITEM 0) leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE leaf 30425088 flags 0x1(WRITTEN) backref revision 1 checksum stored 48f6c57e checksum calced 48f6c57e fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 3 transid 0 size 0 nbytes 16384 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 0 flags 0x0(none) atime 1751432947.0 (2025-07-02 14:39:07) ctime 1751432947.0 (2025-07-02 14:39:07) mtime 1751432947.0 (2025-07-02 14:39:07) otime 1751432947.0 (2025-07-02 14:39:07) <<< The root cause is that fill_inode_item() in tree-log.c is only populating a/c/m time, not the otime (or btime in statx output). Part of the reason is that, the vfs inode only has a/c/m time, no native btime support yet. [FIX] Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and btrfs_inode::i_otime_nsec. So what we really need is just fill the otime time stamp in fill_inode_item() of tree-log.c There is another fill_inode_item() in inode.c, which is doing the proper otime population. Fixes: 94edf4ae43a5 ("Btrfs: don't bother committing delayed inode updates when fsyncing") CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: fix ssd_spread overallocationBoris Burkov1-16/+17
commit 807d9023e75fc20bfd6dd2ac0408ce4af53f1648 upstream. If the ssd_spread mount option is enabled, then we run the so called clustered allocator for data block groups. In practice, this results in creating a btrfs_free_cluster which caches a block_group and borrows its free extents for allocation. Since the introduction of allocation size classes in 6.1, there has been a bug in the interaction between that feature and ssd_spread. find_free_extent() has a number of nested loops. The loop going over the allocation stages, stored in ffe_ctl->loop and managed by find_free_extent_update_loop(), the loop over the raid levels, and the loop over all the block_groups in a space_info. The size class feature relies on the block_group loop to ensure it gets a chance to see a block_group of a given size class. However, the clustered allocator uses the cached cluster block_group and breaks that loop. Each call to do_allocation() will really just go back to the same cached block_group. Normally, this is OK, as the allocation either succeeds and we don't want to loop any more or it fails, and we clear the cluster and return its space to the block_group. But with size classes, the allocation can succeed, then later fail, outside of do_allocation() due to size class mismatch. That latter failure is not properly handled due to the highly complex multi loop logic. The result is a painful loop where we continue to allocate the same num_bytes from the cluster in a tight loop until it fails and releases the cluster and lets us try a new block_group. But by then, we have skipped great swaths of the available block_groups and are likely to fail to allocate, looping the outer loop. In pathological cases like the reproducer below, the cached block_group is often the very last one, in which case we don't perform this tight bg loop but instead rip through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which is now the last one, and we enter the tight inner loop until an allocation failure. Then allocation succeeds on the final block_group and if the next allocation is a size mismatch, the exact same thing happens again. Triggering this is as easy as mounting with -o ssd_spread and then running: mount -o ssd_spread $dev $mnt dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null sync if you do the two writes + sync in a loop, you can force btrfs to spin an excessive amount on semi-successful clustered allocations, before ultimately failing and advancing to the stage where we force a chunk allocation. This results in 2G of data allocated per iteration, despite only using ~20M of data. By using a small size classed extent, the inner loop takes longer and we can spin for longer. The simplest, shortest term fix to unbreak this is to make the clustered allocator size_class aware in the dumbest way, where it fails on size class mismatch. This may hinder the operation of the clustered allocator, but better hindered than completely broken and terribly overallocating. Further re-design improvements are also in the works. Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator") CC: stable@vger.kernel.org # 6.1+ Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: don't ignore inode missing when replaying log treeFilipe Manana1-2/+12
commit 7ebf381a69421a88265d3c49cd0f007ba7336c9d upstream. During log replay, at add_inode_ref(), we return -ENOENT if our current inode isn't found on the subvolume tree or if a parent directory isn't found. The error comes from btrfs_iget_logging() <- btrfs_iget() <- btrfs_read_locked_inode(). The single caller of add_inode_ref(), replay_one_buffer(), ignores an -ENOENT error because it expects that error to mean only that a parent directory wasn't found and that is ok. Before commit 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") we were converting any error when getting a parent directory to -ENOENT and any error when getting the current inode to -EIO, so our caller would fail log replay in case we can't find the current inode. After that commit however in case the current inode is not found we return -ENOENT to the caller and therefore it ignores the critical fact that the current inode was not found in the subvolume tree. Fix this by converting -ENOENT to 0 when we don't find a parent directory, returning -ENOENT when we don't find the current inode and making the caller, replay_one_buffer(), not ignore -ENOENT anymore. Fixes: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") CC: stable@vger.kernel.org # 6.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: qgroup: set quota enabled bit if quota disable fails flushing ↵Filipe Manana1-2/+5
reservations commit e41c75ca3189341e76e6af64b857c05b68a1d7db upstream. Before waiting for the rescan worker to finish and flushing reservations, we clear the BTRFS_FS_QUOTA_ENABLED flag from fs_info. If we fail flushing reservations we leave with the flag not set which is not correct since quotas are still enabled - we must set back the flag on error paths, such as when we fail to start a transaction, except for error paths that abort a transaction. The reservation flushing happens very early before we do any operation that actually disables quotas and before we start a transaction, so set back BTRFS_FS_QUOTA_ENABLED if it fails. Fixes: af0e2aab3b70 ("btrfs: qgroup: flush reservations during quota disable") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: zoned: do not remove unwritten non-data block groupNaohiro Aota1-2/+25
commit 3061801420469610c8fa6080a950e56770773ef1 upstream. There are some reports of "unable to find chunk map for logical 2147483648 length 16384" error message appears in dmesg. This means some IOs are occurring after a block group is removed. When a metadata tree node is cleaned on a zoned setup, we keep that node still dirty and write it out not to create a write hole. However, this can make a block group's used bytes == 0 while there is a dirty region left. Such an unused block group is moved into the unused_bg list and processed for removal. When the removal succeeds, the block group is removed from the transaction->dirty_bgs list, so the unused dirty nodes in the block group are not sent at the transaction commit time. It will be written at some later time e.g, sync or umount, and causes "unable to find chunk map" errors. This can happen relatively easy on SMR whose zone size is 256MB. However, calling do_zone_finish() on such block group returns -EAGAIN and keep that block group intact, which is why the issue is hidden until now. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: abort transaction during log replay if walk_log_tree() failedFilipe Manana1-2/+5
commit 2a5898c4aac67494c2f0f7fe38373c95c371c930 upstream. If we failed walking a log tree during replay, we have a missing transaction abort to prevent committing a transaction where we didn't fully replay all the changes from a log tree and therefore can leave the respective subvolume tree in some inconsistent state. So add the missing transaction abort. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20btrfs: zoned: use filesystem size not disk size for reclaim decisionJohannes Thumshirn1-2/+1
commit 55f7c65b2f69c7e4cb7aa7c1654a228ccf734fd8 upstream. When deciding if a zoned filesystem is reaching the threshold to reclaim data block groups, look at the size of the filesystem not to potentially total available size of all drives in the filesystem. Especially if a filesystem was created with mkfs' -b option, constraining it to only a portion of the block device, the numbers won't match and potentially garbage collection is kicking in too late. Fixes: 3687fcb0752a ("btrfs: zoned: make auto-reclaim less aggressive") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20ext4: initialize superblock fields in the kballoc-test.c kunit testsZhang Yi1-0/+9
commit 82e6381e23f1ea7a14f418215068aaa2ca046c84 upstream. Various changes in the "ext4: better scalability for ext4 block allocation" patch series have resulted in kunit test failures, most notably in the test_new_blocks_simple and the test_mb_mark_used tests. The root cause of these failures is that various in-memory ext4 data structures were not getting initialized, and while previous versions of the functions exercised by the unit tests didn't use these structure members, this was arguably a test bug. Since one of the patches in the block allocation scalability patches is a fix which is has a cc:stable tag, this commit also has a cc:stable tag. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20ext4: fix largest free orders lists corruption on mb_optimize_scan switchBaokun Li1-19/+14
commit 7d345aa1fac4c2ec9584fbd6f389f2c2368671d5 upstream. The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, if mb_optimize_scan is initially enabled, largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then, mb_optimize_scan is disabled via remount, block allocations occur, updating largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, more block allocations update largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, whenever grp->bb_largest_free_order changes, we now always attempt to remove the group from its old order list. However, we only insert the group into the new order list if `mb_optimize_scan` is enabled. This approach helps prevent lock inconsistencies and ensures the data in the order lists remains reliable. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20ext4: fix zombie groups in average fragment size listsBaokun Li1-18/+18
commit 1c320d8e92925bb7615f83a7b6e3f402a5c2ca63 upstream. Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated(i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely full, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20cifs: reset iface weights when we cannot find a candidateShyam Prasad N1-0/+9
commit 9d5eff7821f6d70f7d1b4d8a60680fba4de868a7 upstream. We now do a weighted selection of server interfaces when allocating new channels. The weights are decided based on the speed advertised. The fulfilled weight for an interface is a counter that is used to track the interface selection. It should be reset back to zero once all interfaces fulfilling their weight. In cifs_chan_update_iface, this reset logic was missing. As a result when the server interface list changes, the client may not be able to find a new candidate for other channels after all interfaces have been fulfilled. Fixes: a6d8fb54a515 ("cifs: distribute channels across interfaces based on speed") Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20cifs: Fix collect_sample() to handle any iterator typeDavid Howells1-50/+21
[ Upstream commit b63335fb3d32579c5ff0b7038b9cc23688fff528 ] collect_sample() is used to gather samples of the data in a Write op for analysis to try and determine if the compression algorithm is likely to achieve anything more quickly than actually running the compression algorithm. However, collect_sample() assumes that the data it is going to be sampling is stored in an ITER_XARRAY-type iterator (which it now should never be) and doesn't actually check that it is before accessing the underlying xarray directly. Fix this by replacing the code with a loop that just uses the standard iterator functions to sample every other 2KiB block, skipping the intervening ones. It's not quite the same as the previous algorithm as it doesn't necessarily align to the pages within an ordinary write from the pagecache. Note that the btrfs code from which this was derived samples the inode's pagecache directly rather than the iterator - but that doesn't necessarily work for network filesystems if O_DIRECT is in operation. Fixes: 94ae8c3fee94 ("smb: client: compress: LZ77 code improvements cleanup") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20pNFS: Fix uninited ptr deref in block/scsi layoutSergey Bashirov1-5/+15
[ Upstream commit 9768797c219326699778fba9cd3b607b2f1e7950 ] The error occurs on the third attempt to encode extents. When function ext_tree_prepare_commit() reallocates a larger buffer to retry encoding extents, the "layoutupdate_pages" page array is initialized only after the retry loop. But ext_tree_free_commitdata() is called on every iteration and tries to put pages in the array, thus dereferencing uninitialized pointers. An additional problem is that there is no limit on the maximum possible buffer_size. When there are too many extents, the client may create a layoutcommit that is larger than the maximum possible RPC size accepted by the server. During testing, we observed two typical scenarios. First, one memory page for extents is enough when we work with small files, append data to the end of the file, or preallocate extents before writing. But when we fill a new large file without preallocating, the number of extents can be huge, and counting the number of written extents in ext_tree_encode_commit() does not help much. Since this number increases even more between unlocking and locking of ext_tree, the reallocated buffer may not be large enough again and again. Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630183537.196479-2-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20pNFS: Handle RPC size limit for layoutcommitsSergey Bashirov1-3/+8
[ Upstream commit d897d81671bc4615c80f4f3bd5e6b218f59df50c ] When there are too many block extents for a layoutcommit, they may not all fit into the maximum-sized RPC. This patch allows the generic pnfs code to properly handle -ENOSPC returned by the block/scsi layout driver and trigger additional layoutcommits if necessary. Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630183537.196479-5-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20pNFS: Fix disk addr range check in block/scsi layoutSergey Bashirov1-2/+2
[ Upstream commit 7db6e66663681abda54f81d5916db3a3b8b1a13d ] At the end of the isect translation, disc_addr represents the physical disk offset. Thus, end calculated from disk_addr is also a physical disk offset. Therefore, range checking should be done using map->disk_offset, not map->start. Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250702133226.212537-1-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20pNFS: Fix stripe mapping in block/scsi layoutSergey Bashirov1-2/+3
[ Upstream commit 81438498a285759f31e843ac4800f82a5ce6521f ] Because of integer division, we need to carefully calculate the disk offset. Consider the example below for a stripe of 6 volumes, a chunk size of 4096, and an offset of 70000. chunk = div_u64(offset, dev->chunk_size) = 70000 / 4096 = 17 offset = chunk * dev->chunk_size = 17 * 4096 = 69632 disk_offset_wrong = div_u64(offset, dev->nr_children) = 69632 / 6 = 11605 disk_chunk = div_u64(chunk, dev->nr_children) = 17 / 6 = 2 disk_offset = disk_chunk * dev->chunk_size = 2 * 4096 = 8192 Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250701122341.199112-1-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20smb: client: don't call init_waitqueue_head(&info->conn_wait) twice in ↵Stefan Metzmacher1-1/+0
_smbd_get_connection [ Upstream commit 550a194c5998e4e77affc6235e80d3766dc2d27e ] It is already called long before we may hit this cleanup code path. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20f2fs: check the generic conditions firstJaegeuk Kim1-12/+12
[ Upstream commit e23ab8028de0d92df5921a570f5212c0370db3b5 ] Let's return errors caught by the generic checks. This fixes generic/494 where it expects to see EBUSY by setattr_prepare instead of EINVAL by f2fs for active swapfile. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20exfat: add cluster chain loop check for dirYuezhang Mo4-11/+48
[ Upstream commit 99f9a97dce39ad413c39b92c90393bbd6778f3fd ] An infinite loop may occur if the following conditions occur due to file system corruption. (1) Condition for exfat_count_dir_entries() to loop infinitely. - The cluster chain includes a loop. - There is no UNUSED entry in the cluster chain. (2) Condition for exfat_create_upcase_table() to loop infinitely. - The cluster chain of the root directory includes a loop. - There are no UNUSED entry and up-case table entry in the cluster chain of the root directory. (3) Condition for exfat_load_bitmap() to loop infinitely. - The cluster chain of the root directory includes a loop. - There are no UNUSED entry and bitmap entry in the cluster chain of the root directory. (4) Condition for exfat_find_dir_entry() to loop infinitely. - The cluster chain includes a loop. - The unused directory entries were exhausted by some operation. (5) Condition for exfat_check_dir_empty() to loop infinitely. - The cluster chain includes a loop. - The unused directory entries were exhausted by some operation. - All files and sub-directories under the directory are deleted. This commit adds checks to break the above infinite loop. Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20jfs: upper bound check of tree index in dbAllocAGArnaud Lecomte1-0/+6
[ Upstream commit c214006856ff52a8ff17ed8da52d50601d54f9ce ] When computing the tree index in dbAllocAG, we never check if we are out of bounds realative to the size of the stree. This could happen in a scenario where the filesystem metadata are corrupted. Reported-by: syzbot+cffd18309153948f3c3e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cffd18309153948f3c3e Tested-by: syzbot+cffd18309153948f3c3e@syzkaller.appspotmail.com Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20jfs: Regular file corruption checkEdward Adam Davis1-0/+3
[ Upstream commit 2d04df8116426b6c7b9f8b9b371250f666a2a2fb ] The reproducer builds a corrupted file on disk with a negative i_size value. Add a check when opening this file to avoid subsequent operation failures. Reported-by: syzbot+630f6d40b3ccabc8e96e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=630f6d40b3ccabc8e96e Tested-by: syzbot+630f6d40b3ccabc8e96e@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20jfs: truncate good inode pages when hard link is 0Lizhi Xu1-1/+1
[ Upstream commit 2d91b3765cd05016335cd5df5e5c6a29708ec058 ] The fileset value of the inode copy from the disk by the reproducer is AGGR_RESERVED_I. When executing evict, its hard link number is 0, so its inode pages are not truncated. This causes the bugon to be triggered when executing clear_inode() because nrpages is greater than 0. Reported-by: syzbot+6e516bb515d93230bc7b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6e516bb515d93230bc7b Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20fs/orangefs: use snprintf() instead of sprintf()Amir Mohammad Jahangirzad1-1/+1
[ Upstream commit cdfa1304657d6f23be8fd2bb0516380a3c89034e ] sprintf() is discouraged for use with bounded destination buffers as it does not prevent buffer overflows when the formatted output exceeds the destination buffer size. snprintf() is a safer alternative as it limits the number of bytes written and ensures NUL-termination. Replace sprintf() with snprintf() for copying the debug string into a temporary buffer, using ORANGEFS_MAX_DEBUG_STRING_LEN as the maximum size to ensure safe formatting and prevent memory corruption in edge cases. EDIT: After this patch sat on linux-next for a few days, Dan Carpenter saw it and suggested that I use scnprintf instead of snprintf. I made the change and retested. Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20ext4: do not BUG when INLINE_DATA_FL lacks system.data xattrTheodore Ts'o1-3/+16
[ Upstream commit 099b847ccc6c1ad2f805d13cfbcc83f5b6d4bc42 ] A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data() when an inode had the INLINE_DATA_FL flag set but was missing the system.data extended attribute. Since this can happen due to a maiciouly fuzzed file system, we shouldn't BUG, but rather, report it as a corrupted file system. Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii ext4_create_inline_data() and ext4_inline_data_truncate(). Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20cifs: Fix calling CIFSFindFirst() for root path without msearchPali Rohár1-0/+10
[ Upstream commit b460249b9a1dab7a9f58483e5349d045ad6d585c ] To query root path (without msearch wildcard) it is needed to send pattern '\' instead of '' (empty string). This allows to use CIFSFindFirst() to query information about root path which is being used in followup changes. This change fixes the stat() syscall called on the root path on the mount. It is because stat() syscall uses the cifs_query_path_info() function and it can fallback to the CIFSFindFirst() usage with msearch=false. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20tracefs: Add d_delete to remove negative dentriesSteven Rostedt1-0/+11
[ Upstream commit d9b13cdad80dc11d74408cf201939a946e9303a6 ] If a lookup in tracefs is done on a file that does not exist, it leaves a dentry hanging around until memory pressure removes it. But eventfs dentries should hang around as when their ref count goes to zero, it requires more work to recreate it. For the rest of the tracefs dentries, they hang around as their dentry is used as a descriptor for the tracing system. But if a file lookup happens for a file in tracefs that does not exist, it should be deleted. Add a .d_delete callback that checks if dentry->fsdata is set or not. Only eventfs dentries set fsdata so if it has content it should not be deleted and should hang around in the cache. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20ext2: Handle fiemap on empty files to prevent EINVALWei Gao1-1/+11
[ Upstream commit a099b09a3342a0b28ea330e405501b5b4d0424b4 ] Previously, ext2_fiemap would unconditionally apply "len = min_t(u64, len, i_size_read(inode));", When inode->i_size was 0 (for an empty file), this would reduce the requested len to 0. Passing len = 0 to iomap_fiemap could then result in an -EINVAL error, even for valid queries on empty files. Link: https://github.com/linux-test-project/ltp/issues/1246 Signed-off-by: Wei Gao <wegao@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250613152402.3432135-1-wegao@suse.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20pidfs: raise SB_I_NODEV and SB_I_NOEXECChristian Brauner1-0/+2
[ Upstream commit 1a1ad73aa1a66787f05f7f10f686b74bab77be72 ] Similar to commit 1ed95281c0c7 ("anon_inode: raise SB_I_NODEV and SB_I_NOEXEC"): it shouldn't be possible to execute pidfds via execveat(fd_anon_inode, "", NULL, NULL, AT_EMPTY_PATH) so raise SB_I_NOEXEC so that no one gets any creative ideas. Also raise SB_I_NODEV as we don't expect or support any devices on pidfs. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-1-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20fs/ntfs3: correctly create symlink for relative pathRong Zhang1-13/+18
[ Upstream commit b1e9d89408f402858c00103f9831b25ffa0994d3 ] After applying this patch, could correctly create symlink: ln -s "relative/path/to/file" symlink Signed-off-by: Rong Zhang <ulin0208@gmail.com> [almaz.alexandrovich@paragon-software.com: added cpu_to_le32 macro to rs->Flags assignment] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20fs/ntfs3: Add sanity check for file nameLizhi Xu1-0/+3
[ Upstream commit e841ecb139339602bc1853f5f09daa5d1ea920a2 ] The length of the file name should be smaller than the directory entry size. Reported-by: syzbot+598057afa0f49e62bd23@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=598057afa0f49e62bd23 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20better lockdep annotations for simple_recursive_removal()Al Viro1-2/+2
[ Upstream commit 2a8061ee5e41034eb14170ec4517b5583dbeff9f ] We want a class that nests outside of I_MUTEX_NORMAL (for the sake of callbacks that might want to lock the victim) and inside I_MUTEX_PARENT (so that a variant of that could be used with parent of the victim held locked by the caller). In reality, simple_recursive_removal() * never holds two locks at once * holds the lock on parent of dentry passed to callback * is used only on the trees with fixed topology, so the depths are not changing. So the locking order is actually fine. AFAICS, the best solution is to assign I_MUTEX_CHILD to the locks grabbed by that thing. Reported-by: syzbot+169de184e9defe7fe709@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20hfs: fix not erasing deleted b-tree node issueViacheslav Dubeyko1-0/+1
[ Upstream commit d3ed6d6981f4756f145766753c872482bc3b28d3 ] The generic/001 test of xfstests suite fails and corrupts the HFS volume: sudo ./check generic/001 FSTYP -- hfs PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0-rc2+ #3 SMP PREEMPT_DYNAMIC Fri Apr 25 17:13:00 PDT 2> MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/001 32s ... _check_generic_filesystem: filesystem on /dev/loop50 is inconsistent (see /home/slavad/XFSTESTS-2/xfstests-dev/results//generic/001.full for details) Ran: generic/001 Failures: generic/001 Failed 1 of 1 tests fsck.hfs -d -n ./test-image.bin ** ./test-image.bin (NO WRITE) Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. Unused node is not erased (node = 2) Unused node is not erased (node = 4) <skipped> Unused node is not erased (node = 253) Unused node is not erased (node = 254) Unused node is not erased (node = 255) Unused node is not erased (node = 256) ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. Verify Status: VIStat = 0x0000, ABTStat = 0x0000 EBTStat = 0x0000 CBTStat = 0x0004 CatStat = 0x00000000 ** The volume untitled was found corrupt and needs to be repaired. volume type is HFS primary MDB is at block 2 0x02 alternate MDB is at block 20971518 0x13ffffe primary VHB is at block 0 0x00 alternate VHB is at block 0 0x00 sector size = 512 0x200 VolumeObject flags = 0x19 total sectors for volume = 20971520 0x1400000 total sectors for embedded volume = 0 0x00 This patch adds logic of clearing the deleted b-tree node. sudo ./check generic/001 FSTYP -- hfs PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0-rc2+ #3 SMP PREEMPT_DYNAMIC Fri Apr 25 17:13:00 PDT 2025 MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/001 9s ... 32s Ran: generic/001 Passed all 1 tests fsck.hfs -d -n ./test-image.bin ** ./test-image.bin (NO WRITE) Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. ** The volume untitled appears to be OK. Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250430001211.1912533-1-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20udf: Verify partition map countJan Kara1-2/+11
[ Upstream commit 1a11201668e8635602577dcf06f2e96c591d8819 ] Verify that number of partition maps isn't insanely high which can lead to large allocation in udf_sb_alloc_partition_maps(). All partition maps have to fit in the LVD which is in a single block. Reported-by: syzbot+478f2c1a6f0f447a46bb@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aopsAndrew Price1-0/+2
[ Upstream commit 5c8f12cf1e64e0e8e6cb80b0c935389973e8be8d ] Clears up the warning added in 7ee3647243e5 ("migrate: Remove call to ->writepage") that occurs in various xfstests, causing "something found in dmesg" failures. [ 341.136573] gfs2_meta_aops does not implement migrate_folio [ 341.136953] WARNING: CPU: 1 PID: 36 at mm/migrate.c:944 move_to_new_folio+0x2f8/0x300 Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20gfs2: Validate i_depth for exhash directoriesAndrew Price2-4/+8
[ Upstream commit 557c024ca7250bb65ae60f16c02074106c2f197b ] A fuzzer test introduced corruption that ends up with a depth of 0 in dir_e_read(), causing an undefined shift by 32 at: index = hash >> (32 - dip->i_depth); As calculated in an open-coded way in dir_make_exhash(), the minimum depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time. So we can avoid the undefined behaviour by checking for depth values lower than the minimum in gfs2_dinode_in(). Values greater than the maximum are already being checked for there. Also switch the calculation in dir_make_exhash() to use ilog2() to clarify how the depth is calculated. Tested with the syzkaller repro.c and xfstests '-g quick'. Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20smb/server: avoid deadlock when linking with ReplaceIfExistsNeilBrown1-12/+4
[ Upstream commit d5fc1400a34b4ea5e8f2ce296ea12bf8c8421694 ] If smb2_create_link() is called with ReplaceIfExists set and the name does exist then a deadlock will happen. ksmbd_vfs_kern_path_locked() will return with success and the parent directory will be locked. ksmbd_vfs_remove_file() will then remove the file. ksmbd_vfs_link() will then be called while the parent is still locked. It will try to lock the same parent and will deadlock. This patch moves the ksmbd_vfs_kern_path_unlock() call to *before* ksmbd_vfs_link() and then simplifies the code, removing the file_present flag variable. Signed-off-by: NeilBrown <neil@brown.name> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()Tetsuo Handa1-1/+5
[ Upstream commit c7c6363ca186747ebc2df10c8a1a51e66e0e32d9 ] When the volume header contains erroneous values that do not reflect the actual state of the filesystem, hfsplus_fill_super() assumes that the attributes file is not yet created, which later results in hitting BUG_ON() when hfsplus_create_attributes_file() is called. Replace this BUG_ON() with -EIO error with a message to suggest running fsck tool. Reported-by: syzbot <syzbot+1107451c16b9eb9d29e6@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=1107451c16b9eb9d29e6 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/7b587d24-c8a1-4413-9b9a-00a33fbd849f@I-love.SAKURA.ne.jp Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()Viacheslav Dubeyko1-0/+7
[ Upstream commit 94458781aee6045bd3d0ad4b80b02886b9e2219b ] The hfsplus_readdir() method is capable to crash by calling hfsplus_uni2asc(): [ 667.121659][ T9805] ================================================================== [ 667.122651][ T9805] BUG: KASAN: slab-out-of-bounds in hfsplus_uni2asc+0x902/0xa10 [ 667.123627][ T9805] Read of size 2 at addr ffff88802592f40c by task repro/9805 [ 667.124578][ T9805] [ 667.124876][ T9805] CPU: 3 UID: 0 PID: 9805 Comm: