summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-02-13bcachefs: Clamp replicas_required to replicasKent Overstreet6-5/+21
This prevents going emergency read only when the user has specified replicas_required > replicas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-13btrfs: don't refill whole delayed refs block reserve when starting transactionFilipe Manana1-36/+2
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction") we started not only to reserve metadata space for the delayed refs a caller of btrfs_start_transaction() might generate but also to try to fully refill the delayed refs block reserve, because there are several case where we generate delayed refs and haven't reserved space for them, relying on the global block reserve. Relying too much on the global block reserve is not always safe, and can result in hitting -ENOSPC during transaction commits or worst, in rare cases, being unable to mount a filesystem that needs to do orphan cleanup or anything that requires modifying the filesystem during mount, and has no more unallocated space and the metadata space is nearly full. This was explained in detail in that commit's change log. However the gap between the reserved amount and the size of the delayed refs block reserve can be huge, so attempting to reserve space for such a gap can result in allocating many metadata block groups that end up not being used. After a recent patch, with the subject: "btrfs: add new unused block groups to the list of unused block groups" We started to add new block groups that are unused to the list of unused block groups, to avoid having them around for a very long time in case they are never used, because a block group is only added to the list of unused block groups when we deallocate the last extent or when mounting the filesystem and the block group has 0 bytes used. This is not a problem introduced by the commit mentioned earlier, it always existed as our metadata space reservations are, most of the time, pessimistic and end up not using all the space they reserved, so we can occasionally end up with one or two unused metadata block groups for a long period. However after that commit mentioned earlier, we are just more pessimistic in the metadata space reservations when starting a transaction and therefore the issue is more likely to happen. This however is not always enough because we might create unused metadata block groups when reserving metadata space at a high rate if there's always a gap in the delayed refs block reserve and the cleaner kthread isn't triggered often enough or is busy with other work (running delayed iputs, cleaning deleted roots, etc), not to mention the block group's allocated space is only usable for a new block group after the transaction used to remove it is committed. A user reported that he's getting a lot of allocated metadata block groups but the usage percentage of metadata space was very low compared to the total allocated space, specially after running a series of block group relocations. So for now stop trying to refill the gap in the delayed refs block reserve and reserve space only for the delayed refs we are expected to generate when starting a transaction. CC: stable@vger.kernel.org # 6.7+ Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/ Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Reported-by: Heddxh <g311571057@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13btrfs: zoned: fix chunk map leak when loading block group zone infoFilipe Manana1-0/+1
At btrfs_load_block_group_zone_info() we never drop a reference on the chunk map we have looked up, therefore leaking a reference on it. So add the missing btrfs_free_chunk_map() at the end of the function. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13btrfs: reject encoded write if inode has nodatasum flag setFilipe Manana1-0/+7
Currently we allow an encoded write against inodes that have the NODATASUM flag set, either because they are NOCOW files or they were created while the filesystem was mounted with "-o nodatasum". This results in having compressed extents without corresponding checksums, which is a filesystem inconsistency reported by 'btrfs check'. For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers this and 'btrfs check' errors out with: [1/7] checking root items [2/7] checking extents [3/7] checking free space tree [4/7] checking fs roots root 256 inode 257 errors 1040, bad file extent, some csum missing root 256 inode 258 errors 1040, bad file extent, some csum missing ERROR: errors found in fs roots (...) So reject encoded writes if the target inode has NODATASUM set. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13btrfs: don't reserve space for checksums when writing to nocow filesFilipe Manana1-10/+19
Currently when doing a write to a file we always reserve metadata space for inserting data checksums. However we don't need to do it if we have a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums are disabled (-o nodatasum mount option), as in that case we are only adding unnecessary pressure to metadata reservations. For example on x86_64, with the default node size of 16K, a 4K buffered write into a nodatacow file is reserving 655360 bytes of metadata space, as it's accounting for checksums. After this change, which stops reserving space for checksums if we have a nodatacow file or checksums are disabled, we only need to reserve 393216 bytes of metadata. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13ceph: add ceph_cap_unlink_work to fire check_caps() immediatelyXiubo Li3-1/+69
When unlinking a file the check caps could be delayed for more than 5 seconds, but in MDS side it maybe waiting for the clients to release caps. This will use the cap_wq work queue and a dedicated list to help fire the check_caps() and dirty buffer flushing immediately. Link: https://tracker.ceph.com/issues/50223 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13ceph: always queue a writeback when revoking the Fb capsXiubo Li1-24/+24
In case there is 'Fw' dirty caps and 'CHECK_CAPS_FLUSH' is set we will always ignore queue a writeback. Queue a writeback is very important because it will block kclient flushing the snapcaps to MDS and which will block MDS waiting for revoking the 'Fb' caps. Link: https://tracker.ceph.com/issues/50223 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-12smb: client: handle path separator of created SMB symlinksPaulo Alcantara1-2/+6
Convert path separator to CIFS_DIR_SEP(cifs_sb) from symlink target before sending it over the wire otherwise the created SMB symlink may become innaccesible from server side. Fixes: 514d793e27a3 ("smb: client: allow creating symlinks via reparse points") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-12smb: client: set correct id, uid and cruid for multiuser automountsPaulo Alcantara1-0/+16
When uid, gid and cruid are not specified, we need to dynamically set them into the filesystem context used for automounting otherwise they'll end up reusing the values from the parent mount. Fixes: 9fd29a5bae6e ("cifs: use fs_context for automounts") Reported-by: Shane Nehring <snehring@iastate.edu> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2259257 Cc: stable@vger.kernel.org # 6.2+ Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-12Merge tag 'vfs-6.8-rc5.fixes' of ↵Linus Torvalds3-33/+23
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix performance regression introduced by moving the security permission hook out of do_clone_file_range() and into its caller vfs_clone_file_range(). This causes the security hook to be called in situation were it wasn't called before as the fast permission checks were left in do_clone_file_range(). Fix this by merging the two implementations back together and restoring the old ordering: fast permission checks first, expensive ones later. - Tweak mount_setattr() permission checking so that mount properties on the real rootfs can be changed. When we added mount_setattr() we added additional checks compared to legacy mount(2). If the mount had a parent then verify that the caller and the mount namespace the mount is attached to match and if not make sure that it's an anonymous mount. But the real rootfs falls into neither category. It is neither an anoymous mount because it is obviously attached to the initial mount namespace but it also obviously doesn't have a parent mount. So that means legacy mount(2) allows changing mount properties on the real rootfs but mount_setattr(2) blocks this. This causes regressions (See the commit for details). Fix this by relaxing the check. If the mount has a parent or if it isn't a detached mount, verify that the mount namespaces of the caller and the mount are the same. Technically, we could probably write this even simpler and check that the mount namespaces match if it isn't a detached mount. But the slightly longer check makes it clearer what conditions one needs to think about. * tag 'vfs-6.8-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: relax mount_setattr() permission checks remap_range: merge do_clone_file_range() into vfs_clone_file_range()
2024-02-11cifs: update the same create_guid on replayShyam Prasad N4-2/+14
File open requests made to the server contain a CreateGuid, which is used by the server to identify the open request. If the same request needs to be replayed, it needs to be sent with the same CreateGuid in the durable handle v2 context. Without doing so, we could end up leaking handles on the server when: 1. multichannel is used AND 2. connection goes down, but not for all channels This is because the replayed open request would have a new CreateGuid and the server will treat this as a new request and open a new handle. This change fixes this by reusing the existing create_guid stored in the cached fid struct. REF: MS-SMB2 4.9 Replay Create Request on an Alternate Channel Fixes: 4f1fffa23769 ("cifs: commands that are retried should have replay flag set") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-11cifs: fix underflow in parse_server_interfaces()Dan Carpenter1-1/+1
In this loop, we step through the buffer and after each item we check if the size_left is greater than the minimum size we need. However, the problem is that "bytes_left" is type ssize_t while sizeof() is type size_t. That means that because of type promotion, the comparison is done as an unsigned and if we have negative bytes left the loop continues instead of ending. Fixes: fe856be475f7 ("CIFS: parse and store info on iface queries") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-10Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of ↵Linus Torvalds5-39/+69
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) nilfs2: fix potential bug in end_buffer_async_write mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() MAINTAINERS: Leo Yan has moved mm/zswap: don't return LRU_SKIP if we have dropped lru lock fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super mailmap: switch email address for John Moon mm: zswap: fix objcg use-after-free in entry destruction mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range() arch/arm/mm: fix major fault accounting when retrying under per-VMA lock selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page mm: memcg: optimize parent iteration in memcg_rstat_updated() nilfs2: fix data corruption in dsync block recovery for small block sizes mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get() exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand() getrusage: use sig->stats_lock rather than lock_task_sighand() getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() ...
2024-02-10bcachefs: fix missing endiannes conversion in sb_membersKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-10bcachefs: fix kmemleak in __bch2_read_super error handling pathSu Yue1-1/+1
During xfstest tests, there are some kmemleak reports e.g. generic/051 with if USE_KMEMLEAK=yes: ==================================================================== EXPERIMENTAL kmemleak reported some memory leaks! Due to the way kmemleak works, the leak might be from an earlier test, or something totally unrelated. unreferenced object 0xffff9ef905aaf778 (size 8): comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s) hex dump (first 8 bytes): a5 cc cc cc cc cc cc cc ........ backtrace: [<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0 [<ffffffff87f49b66>] kmalloc_trace+0x26/0xb0 [<ffffffffc0a3fefe>] __bch2_read_super+0xfe/0x4e0 [bcachefs] [<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs] [<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs] [<ffffffff88080c90>] legacy_get_tree+0x30/0x60 [<ffffffff8802c748>] vfs_get_tree+0x28/0xf0 [<ffffffff88061fe5>] path_mount+0x475/0xb60 [<ffffffff880627e5>] __x64_sys_mount+0x105/0x140 [<ffffffff88932642>] do_syscall_64+0x42/0xf0 [<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76 unreferenced object 0xffff9ef96cdc4fc0 (size 32): comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s) hex dump (first 32 bytes): 2f 64 65 76 2f 6d 61 70 70 65 72 2f 74 65 73 74 /dev/mapper/test 2d 31 00 cc cc cc cc cc cc cc cc cc cc cc cc cc -1.............. backtrace: [<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0 [<ffffffff87f4a081>] __kmalloc_node_track_caller+0x51/0x150 [<ffffffff87f3adc2>] kstrdup+0x32/0x60 [<ffffffffc0a3ff1a>] __bch2_read_super+0x11a/0x4e0 [bcachefs] [<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs] [<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs] [<ffffffff88080c90>] legacy_get_tree+0x30/0x60 [<ffffffff8802c748>] vfs_get_tree+0x28/0xf0 [<ffffffff88061fe5>] path_mount+0x475/0xb60 [<ffffffff880627e5>] __x64_sys_mount+0x105/0x140 [<ffffffff88932642>] do_syscall_64+0x42/0xf0 [<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76 ==================================================================== The leak happens if bdev_open_by_path() failed to open a block device then it goes label 'out' directly without call of bch2_free_super(). Fix it by going to label 'err' instead of 'out' if bdev_open_by_path() fails. Signed-off-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-10bcachefs: Fix missing bch2_err_class() callsKent Overstreet1-4/+5
We aren't supposed to be leaking our private error codes outside of fs/bcachefs/. Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-10Merge tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2-2/+7
Pull smb server fixes from Steve French: "Two ksmbd server fixes: - memory leak fix - a minor kernel-doc fix" * tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: free aux buffer if ksmbd_iov_pin_rsp_read fails ksmbd: Add kernel-doc for ksmbd_extract_sharename() function
2024-02-09Revert "get rid of DCACHE_GENOCIDE"Al Viro1-1/+4
This reverts commit 57851607326a2beef21e67f83f4f53a90df8445a. Unfortunately, while we only call that thing once, the callback *can* be called more than once for the same dentry - all it takes is rename_lock being touched while we are in d_walk(). For now let's revert it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-09Merge tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds7-11/+36
Pull smb client fixes from Steve French: - reconnect fix - multichannel channel selection fix - minor mount warning fix - reparse point fix - null pointer check improvement * tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: clarify mount warning cifs: handle cases where multiple sessions share connection cifs: change tcon status when need_reconnect is set on it smb: client: set correct d_type for reparse points under DFS mounts smb3: add missing null server pointer check
2024-02-09Merge tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-clientLinus Torvalds5-15/+9
Pull ceph fixes from Ilya Dryomov: "Some fscrypt-related fixups (sparse reads are used only for encrypted files) and two cap handling fixes from Xiubo and Rishabh" * tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-client: ceph: always check dir caps asynchronously ceph: prevent use-after-free in encode_cap_msg() ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT libceph: just wait for more data to be available on the socket libceph: rename read_sparse_msg_*() to read_partial_sparse_msg_*() libceph: fail sparse-read if the data length doesn't match
2024-02-09Merge tag 'ntfs3_for_6.8' of ↵Linus Torvalds16-247/+381
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 fixes from Konstantin Komarov: "Fixed: - size update for compressed file - some logic errors, overflows - memory leak - some code was refactored Added: - implement super_operations::shutdown Improved: - alternative boot processing - reduced stack usage" * tag 'ntfs3_for_6.8' of https://github.com/Paragon-Software-Group/linux-ntfs3: (28 commits) fs/ntfs3: Slightly simplify ntfs_inode_printk() fs/ntfs3: Add ioctl operation for directories (FITRIM) fs/ntfs3: Fix oob in ntfs_listxattr fs/ntfs3: Fix an NULL dereference bug fs/ntfs3: Update inode->i_size after success write into compressed file fs/ntfs3: Fixed overflow check in mi_enum_attr() fs/ntfs3: Correct function is_rst_area_valid fs/ntfs3: Use i_size_read and i_size_write fs/ntfs3: Prevent generic message "attempt to access beyond end of device" fs/ntfs3: use non-movable memory for ntfs3 MFT buffer cache fs/ntfs3: Use kvfree to free memory allocated by kvmalloc fs/ntfs3: Disable ATTR_LIST_ENTRY size check fs/ntfs3: Fix c/mtime typo fs/ntfs3: Add NULL ptr dereference checking at the end of attr_allocate_frame() fs/ntfs3: Add and fix comments fs/ntfs3: ntfs3_forced_shutdown use int instead of bool fs/ntfs3: Implement super_operations::shutdown fs/ntfs3: Drop suid and sgid bits as a part of fpunch fs/ntfs3: Add file_modified fs/ntfs3: Correct use bh_read ...
2024-02-09smb3: clarify mount warningSteve French1-1/+1
When a user tries to use the "sec=krb5p" mount parameter to encrypt data on connection to a server (when authenticating with Kerberos), we indicate that it is not supported, but do not note the equivalent recommended mount parameter ("sec=krb5,seal") which turns on encryption for that mount (and uses Kerberos for auth). Update the warning message. Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-09cifs: handle cases where multiple sessions share connectionShyam Prasad N2-1/+6
Based on our implementation of multichannel, it is entirely possible that a server struct may not be found in any channel of an SMB session. In such cases, we should be prepared to move on and search for the server struct in the next session. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-09cifs: change tcon status when need_reconnect is set on itShyam Prasad N3-1/+14
When a tcon is marked for need_reconnect, the intention is to have it reconnected. This change adjusts tcon->status in cifs_tree_connect when need_reconnect is set. Also, this change has a minor correction in resetting need_reconnect on success. It makes sure that it is done with tc_lock held. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-09btrfs: add new unused block groups to the list of unused block groupsFilipe Manana1-0/+31
Space reservations for metadata are, most of the time, pessimistic as we reserve space for worst possible cases - where tree heights are at the maximum possible height (8), we need to COW every extent buffer in a tree path, need to split extent buffers, etc. For data, we generally reserve the exact amount of space we are going to allocate. The exception here is when using compression, in which case we reserve space matching the uncompressed size, as the compression only happens at writeback time and in the worst possible case we need that amount of space in case the data is not compressible. This means that when there's not available space in the corresponding space_info object, we may need to allocate a new block group, and then that block group might not be used after all. In this case the block group is never added to the list of unused block groups and ends up never being deleted - except if we unmount and mount again the fs, as when reading block groups from disk we add unused ones to the list of unused block groups (fs_info->unused_bgs). Otherwise a block group is only added to the list of unused block groups when we deallocate the last extent from it, so if no extent is ever allocated, the block group is kept around forever. This also means that if we have a bunch of tasks reserving space in parallel we can end up allocating many block groups that end up never being used or kept around for too long without being used, which has the potential to result in ENOSPC failures in case for example we over allocate too many metadata block groups and then end up in a state without enough unallocated space to allocate a new data block group. This is more likely to happen with metadata reservations as of kernel 6.7, namely since commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction"), because we started to always reserve space for delayed references when starting a transaction handle for a non-zero number of items, and also to try to reserve space to fill the gap between the delayed block reserve's reserved space and its size. So to avoid this, when finishing the creation a new block group, add the block group to the list of unused block groups if it's still unused at that time. This way the next time the cleaner kthread runs, it will delete the block group if it's still unused and not needed to satisfy existing space reservations. Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09btrfs: do not delete unused block group if it may be used soonFilipe Manana1-0/+46
Before deleting a block group that is in the list of unused block groups (fs_info->unused_bgs), we check if the block group became used before deleting it, as extents from it may have been allocated after it was added to the list. However even if the block group was not yet used, there may be tasks that have only reserved space and have not yet allocated extents, and they might be relying on the availability of the unused block group in order to allocate extents. The reservation works first by increasing the "bytes_may_use" field of the corresponding space_info object (which may first require flushing delayed items, allocating a new block group, etc), and only later a task does the actual allocation of extents. For metadata we usually don't end up using all reserved space, as we are pessimistic and typically account for the worst cases (need to COW every single node in a path of a tree at maximum possible height, etc). For data we usually reserve the exact amount of space we're going to allocate later, except when using compression where we always reserve space based on the uncompressed size, as compression is only triggered when writeback starts so we don't know in advance how much space we'll actually need, or if the data is compressible. So don't delete an unused block group if the total size of its space_info object minus the block group's size is less then the sum of used space and space that may be used (space_info->bytes_may_use), as that means we have tasks that reserved space and may need to allocate extents from the block group. In this case, besides skipping the deletion, re-add the block group to the list of unused block groups so that it may be reconsidered later, in case the tasks that reserved space end up not needing to allocate extents from it. Allowing the deletion of the block group while we have reserved space, can result in tasks failing to allocate metadata extents (-ENOSPC) while under a transaction handle, resulting in a transaction abort, or failure during writeback for the case of data extents. CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09btrfs: add and use helper to check if block group is usedFilipe Manana2-2/+8
Add a helper function to determine if a block group is being used and make use of it at btrfs_delete_unused_bgs(). This helper will also be used in future code changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09btrfs: don't drop extent_map for free space inode on write errorJosef Bacik1-2/+17
While running the CI for an unrelated change I hit the following panic with generic/648 on btrfs_holes_spacecache. assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1385! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G W 6.8.0-rc2+ #1 RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0 Call Trace: <TASK> extent_write_cache_pages+0x2ac/0x8f0 extent_writepages+0x87/0x110 do_writepages+0xd5/0x1f0 filemap_fdatawrite_wbc+0x63/0x90 __filemap_fdatawrite_range+0x5c/0x80 btrfs_fdatawrite_range+0x1f/0x50 btrfs_write_out_cache+0x507/0x560 btrfs_write_dirty_block_groups+0x32a/0x420 commit_cowonly_roots+0x21b/0x290 btrfs_commit_transaction+0x813/0x1360 btrfs_sync_file+0x51a/0x640 __x64_sys_fdatasync+0x52/0x90 do_syscall_64+0x9c/0x190 entry_SYSCALL_64_after_hwframe+0x6e/0x76 This happens because we fail to write out the free space cache in one instance, come back around and attempt to write it again. However on the second pass through we go to call btrfs_get_extent() on the inode to get the extent mapping. Because this is a new block group, and with the free space inode we always search the commit root to avoid deadlocking with the tree, we find nothing and return a EXTENT_MAP_HOLE for the requested range. This happens because the first time we try to write the space cache out we hit an error, and on an error we drop the extent mapping. This is normal for normal files, but the free space cache inode is special. We always expect the extent map to be correct. Thus the second time through we end up with a bogus extent map. Since we're deprecating this feature, the most straightforward way to fix this is to simply skip dropping the extent map range for this failed range. I shortened the test by using error injection to stress the area to make it easier to reproduce. With this patch in place we no longer panic with my error injection test. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-08smb: client: set correct d_type for reparse points under DFS mountsPaulo Alcantara2-7/+14
Send query dir requests with an info level of SMB_FIND_FILE_FULL_DIRECTORY_INFO rather than SMB_FIND_FILE_DIRECTORY_INFO when the client is generating its own inode numbers (e.g. noserverino) so that reparse tags still can be parsed directly from the responses, but server won't send UniqueId (server inode number) Signed-off-by: Paulo Alcantara <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-08smb3: add missing null server pointer checkSteve French1-1/+1
Address static checker warning in cifs_ses_get_chan_index(): warn: variable dereferenced before check 'server' To be consistent, and reduce risk, we should add another check for null server pointer. Fixes: 88675b22d34e ("cifs: do not search for channel if server is terminating") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-07nilfs2: fix potential bug in end_buffer_async_writeRyusuke Konishi1-3/+5
According to a syzbot report, end_buffer_async_write(), which handles the completion of block device writes, may detect abnormal condition of the buffer async_write flag and cause a BUG_ON failure when using nilfs2. Nilfs2 itself does not use end_buffer_async_write(). But, the async_write flag is now used as a marker by commit 7f42ec394156 ("nilfs2: fix issue with race condition of competition between segments for dirty blocks") as a means of resolving double list insertion of dirty blocks in nilfs_lookup_dirty_data_buffers() and nilfs_lookup_node_buffers() and the resulting crash. This modification is safe as long as it is used for file data and b-tree node blocks where the page caches are independent. However, it was irrelevant and redundant to also introduce async_write for segment summary and super root blocks that share buffers with the backing device. This led to the possibility that the BUG_ON check in end_buffer_async_write would fail as described above, if independent writebacks of the backing device occurred in parallel. The use of async_write for segment summary buffers has already been removed in a previous change. Fix this issue by removing the manipulation of the async_write flag for the remaining super root block buffer. Link: https://lkml.kernel.org/r/20240203161645.4992-1-konishi.ryusuke@gmail.com Fixes: 7f42ec394156 ("nilfs2: fix issue with race condition of competition between segments for dirty blocks") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+5c04210f7c7f897c1e7f@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/00000000000019a97c05fd42f8c8@google.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()Ryusuke Konishi1-1/+7
Syzbot reported a hang issue in migrate_pages_batch() called by mbind() and nilfs_lookup_dirty_data_buffers() called in the log writer of nilfs2. While migrate_pages_batch() locks a folio and waits for the writeback to complete, the log writer thread that should bring the writeback to completion picks up the folio being written back in nilfs_lookup_dirty_data_buffers() that it calls for subsequent log creation and was trying to lock the folio. Thus causing a deadlock. In the first place, it is unexpected that folios/pages in the middle of writeback will be updated and become dirty. Nilfs2 adds a checksum to verify the validity of the log being written and uses it for recovery at mount, so data changes during writeback are suppressed. Since this is broken, an unclean shutdown could potentially cause recovery to fail. Investigation revealed that the root cause is that the wait for writeback completion in nilfs_page_mkwrite() is conditional, and if the backing device does not require stable writes, data may be modified without waiting. Fix these issues by making nilfs_page_mkwrite() wait for writeback to finish regardless of the stable write requirement of the backing device. Link: https://lkml.kernel.org/r/20240131145657.4209-1-konishi.ryusuke@gmail.com Fixes: 1d1d1a767206 ("mm: only enforce stable page writes if the backing device requires it") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+ee2ae68da3b22d04cd8d@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/00000000000047d819061004ad6c@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_superOscar Salvador1-2/+4
When configuring a hugetlb filesystem via the fsconfig() syscall, there is a possible NULL dereference in hugetlbfs_fill_super() caused by assigning NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize is non valid. E.g: Taking the following steps: fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC); fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced with NULL, losing its previous value, and we will print an error: ... ... case Opt_pagesize: ps = memparse(param->string, &rest); ctx->hstate = h; if (!ctx->hstate) { pr_err("Unsupported page size %lu MB\n", ps / SZ_1M); return -EINVAL; } return 0; ... ... This is a problem because later on, we will dereference ctxt->hstate in hugetlbfs_fill_super() ... ... sb->s_blocksize = huge_page_size(ctx->hstate); ... ... Causing below Oops. Fix this by replacing cxt->hstate value only when then pagesize is known to be valid. kernel: hugetlbfs: Unsupported page size 0 MB kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 kernel: Call Trace: kernel: <TASK> kernel: ? __die_body+0x1a/0x60 kernel: ? page_fault_oops+0x16f/0x4a0 kernel: ? search_bpf_extables+0x65/0x70 kernel: ? fixup_exception+0x22/0x310 kernel: ? exc_page_fault+0x69/0x150 kernel: ? asm_exc_page_fault+0x22/0x30 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: ? hugetlbfs_fill_super+0xb4/0x1a0 kernel: ? hugetlbfs_fill_super+0x28/0x1a0 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: vfs_get_super+0x40/0xa0 kernel: ? __pfx_bpf_lsm_capable+0x10/0x10 kernel: vfs_get_tree+0x25/0xd0 kernel: vfs_cmd_create+0x64/0xe0 kernel: __x64_sys_fsconfig+0x395/0x410 kernel: do_syscall_64+0x80/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? exc_page_fault+0x69/0x150 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 kernel: RIP: 0033:0x7ffbc0cb87c9 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000 kernel: </TASK> kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E) kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E) kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1 kernel: CR2: 0000000000000028 kernel: ---[ end trace 0000000000000000 ]--- kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07nilfs2: fix data corruption in dsync block recovery for small block sizesRyusuke Konishi1-3/+4
The helper function nilfs_recovery_copy_block() of nilfs_recovery_dsync_blocks(), which recovers data from logs created by data sync writes during a mount after an unclean shutdown, incorrectly calculates the on-page offset when copying repair data to the file's page cache. In environments where the block size is smaller than the page size, this flaw can cause data corruption and leak uninitialized memory bytes during the recovery process. Fix these issues by correcting this byte offset calculation on the page. Link: https://lkml.kernel.org/r/20240124121936.10575-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children statsOleg Nesterov1-26/+32
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call do_task_stat() at the same time and the process has NR_THREADS, it will spin with irqs disabled O(NR_CPUS * NR_THREADS) time. Change do_task_stat() to use sig->stats_lock to gather the statistics outside of ->siglock protected section, in the likely case this code will run lockless. Link: https://lkml.kernel.org/r/20240123153357.GA21857@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of ↵Oleg Nesterov1-4/+6
lock_task_sighand() Patch series "fs/proc: do_task_stat: use sig->stats_". do_task_stat() has the same problem as getrusage() had before "getrusage: use sig->stats_lock rather than lock_task_sighand()": a hard lockup. If NR_CPUS threads call lock_task_sighand() at the same time and the process has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS * NR_THREADS) time. This patch (of 3): thread_group_cputime() does its own locking, we can safely shift thread_group_cputime_adjusted() which does another for_each_thread loop outside of ->siglock protected section. Not only this removes for_each_thread() from the critical section with irqs disabled, this removes another case when stats_lock is taken with siglock held. We want to remove this dependency, then we can change the users of stats_lock to not disable irqs. Link: https://lkml.kernel.org/r/20240123153313.GA21832@redhat.com Link: https://lkml.kernel.org/r/20240123153355.GA21854@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07mm: hugetlb pages should not be reserved by shmat() if SHM_NORESERVEPrakash Sangappa1-1/+12
For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in shmget() call. If SHM_NORESERVE flags is specified then the hugetlb pages are not reserved. However when the shared memory is attached with the shmat() call the hugetlb pages are getting reserved incorrectly for SHM_HUGETLB shared memory created with SHM_NORESERVE which is a bug. ------------------------------- Following test shows the issue. $cat shmhtb.c int main() { int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE; int shmid; shmid = shmget(SKEY, SHMSZ, shmflags); if (shmid < 0) { printf("shmat: shmget() failed, %d\n", errno); return 1; } printf("After shmget()\n"); system("cat /proc/meminfo | grep -i hugepages_"); shmat(shmid, NULL, 0); printf("\nAfter shmat()\n"); system("cat /proc/meminfo | grep -i hugepages_"); shmctl(shmid, IPC_RMID, NULL); return 0; } #sysctl -w vm.nr_hugepages=20 #./shmhtb