summaryrefslogtreecommitdiff
path: root/fs/fuse
AgeCommit message (Collapse)AuthorFilesLines
2024-11-01fuse: update inode size after extending passthrough writeAmir Goldstein1-1/+1
[ Upstream commit 20121d3f58f06e977ca43eb6efe1fb23b1d2f6d9 ] yangyun reported that libfuse test test_copy_file_range() copies zero bytes from a newly written file when fuse passthrough is enabled. The reason is that extending passthrough write is not updating the fuse inode size and when vfs_copy_file_range() observes a zero size inode, it returns without calling the filesystem copy_file_range() method. Fix this by adjusting the fuse inode size after an extending passthrough write. This does not provide cache coherency of fuse inode attributes and backing inode attributes, but it should prevent situations where fuse inode size is too small, causing read/copy to be wrongly shortened. Reported-by: yangyun <yangyun50@huawei.com> Closes: https://github.com/libfuse/libfuse/issues/1048 Fixes: 57e1176e6086 ("fuse: implement read/write passthrough") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01fs: pass offset and result to backing_file end_write() callbackAmir Goldstein1-3/+3
[ Upstream commit f03b296e8b516dbd63f57fc9056c1b0da1b9a0ff ] This is needed for extending fuse inode size after fuse passthrough write. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/linux-fsdevel/CAJfpegs=cvZ_NYy6Q_D42XhYS=Sjj5poM1b5TzXzOVvX=R36aA@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Stable-dep-of: 20121d3f58f0 ("fuse: update inode size after extending passthrough write") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-04fuse: use exclusive lock when FUSE_I_CACHE_IO_MODE is setyangyun1-1/+1
commit 2f3d8ff457982f4055fe8f7bf19d3821ba22c376 upstream. This may be a typo. The comment has said shared locks are not allowed when this bit is set. If using shared lock, the wait in `fuse_file_cached_io_open` may be forever. Fixes: 205c1d802683 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP") CC: stable@vger.kernel.org # v6.9 Signed-off-by: yangyun <yangyun50@huawei.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-03Merge tag 'fuse-fixes-6.11-rc7' of ↵Linus Torvalds5-9/+26
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: - Fix EIO if splice and page stealing are enabled on the fuse device - Disable problematic combination of passthrough and writeback-cache - Other bug fixes found by code review * tag 'fuse-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: disable the combination of passthrough and writeback cache fuse: update stats for pages in dropped aux writeback list fuse: clear PG_uptodate when using a stolen page fuse: fix memory leak in fuse_create_open fuse: check aborted connection before adding requests to pending list for resending fuse: use unsigned type for getxattr/listxattr size truncation
2024-08-29fuse: disable the combination of passthrough and writeback cacheBernd Schubert1-1/+6
Current design and handling of passthrough is without fuse caching and with that FUSE_WRITEBACK_CACHE is conflicting. Fixes: 7dc4e97a4f9a ("fuse: introduce FUSE_PASSTHROUGH capability") Cc: stable@kernel.org # v6.9 Signed-off-by: Bernd Schubert <bschubert@ddn.com> Acked-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: update stats for pages in dropped aux writeback listJoanne Koong1-1/+7
In the case where the aux writeback list is dropped (e.g. the pages have been truncated or the connection is broken), the stats for its pages and backing device info need to be updated as well. Fixes: e2653bd53a98 ("fuse: fix leaked aux requests") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Cc: <stable@vger.kernel.org> # v5.1 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: clear PG_uptodate when using a stolen pageMiklos Szeredi1-4/+1
Originally when a stolen page was inserted into fuse's page cache by fuse_try_move_page(), it would be marked uptodate. Then fuse_readpages_end() would call SetPageUptodate() again on the already uptodate page. Commit 413e8f014c8b ("fuse: Convert fuse_readpages_end() to use folio_end_read()") changed that by replacing the SetPageUptodate() + unlock_page() combination with folio_end_read(), which does mostly the same, except it sets the uptodate flag with an xor operation, which in the above scenario resulted in the uptodate flag being cleared, which in turn resulted in EIO being returned on the read. Fix by clearing PG_uptodate instead of setting it in fuse_try_move_page(), conforming to the expectation of folio_end_read(). Reported-by: Jürg Billeter <j@bitron.ch> Debugged-by: Matthew Wilcox <willy@infradead.org> Fixes: 413e8f014c8b ("fuse: Convert fuse_readpages_end() to use folio_end_read()") Cc: <stable@vger.kernel.org> # v6.10 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: fix memory leak in fuse_create_openyangyun1-1/+1
The memory of struct fuse_file is allocated but not freed when get_create_ext return error. Fixes: 3e2b6fdbdc9a ("fuse: send security context of inode on file") Cc: stable@vger.kernel.org # v5.17 Signed-off-by: yangyun <yangyun50@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: check aborted connection before adding requests to pending list for ↵Joanne Koong1-0/+9
resending There is a race condition where inflight requests will not be aborted if they are in the middle of being re-sent when the connection is aborted. If fuse_resend has already moved all the requests in the fpq->processing lists to its private queue ("to_queue") and then the connection starts and finishes aborting, these requests will be added to the pending queue and remain on it indefinitely. Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: <stable@vger.kernel.org> # v6.9 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: use unsigned type for getxattr/listxattr size truncationJann Horn1-2/+2
The existing code uses min_t(ssize_t, outarg.size, XATTR_LIST_MAX) when parsing the FUSE daemon's response to a zero-length getxattr/listxattr request. On 32-bit kernels, where ssize_t and outarg.size are the same size, this is wrong: The min_t() will pass through any size values that are negative when interpreted as signed. fuse_listxattr() will then return this userspace-supplied negative value, which callers will treat as an error value. This kind of bug pattern can lead to fairly bad security bugs because of how error codes are used in the Linux kernel. If a caller were to convert the numeric error into an error pointer, like so: struct foo *func(...) { int len = fuse_getxattr(..., NULL, 0); if (len < 0) return ERR_PTR(len); ... } then it would end up returning this userspace-supplied negative value cast to a pointer - but the caller of this function wouldn't recognize it as an error pointer (IS_ERR_VALUE() only detects values in the narrow range in which legitimate errno values are), and so it would just be treated as a kernel pointer. I think there is at least one theoretical codepath where this could happen, but that path would involve virtio-fs with submounts plus some weird SELinux configuration, so I think it's probably not a concern in practice. Cc: stable@vger.kernel.org # v4.9 Fixes: 63401ccdb2ca ("fuse: limit xattr returned size") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-18fuse: Initialize beyond-EOF page contents before setting uptodateJann Horn1-2/+4
fuse_notify_store(), unlike fuse_do_readpage(), does not enable page zeroing (because it can be used to change partial page contents). So fuse_notify_store() must be more careful to fully initialize page contents (including parts of the page that are beyond end-of-file) before marking the page uptodate. The current code can leave beyond-EOF page contents uninitialized, which makes these uninitialized page contents visible to userspace via mmap(). This is an information leak, but only affects systems which do not enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the corresponding kernel command line parameter). Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574 Cc: stable@kernel.org Fixes: a1d75f258230 ("fuse: add store request") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-19Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-31/+31
Pull virtio updates from Michael Tsirkin: "Several new features here: - Virtio find vqs API has been reworked (required to fix the scalability issue we have with adminq, which I hope to merge later in the cycle) - vDPA driver for Marvell OCTEON - virtio fs performance improvement - mlx5 migration speedups Fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (56 commits) virtio: rename virtio_find_vqs_info() to virtio_find_vqs() virtio: remove unused virtio_find_vqs() and virtio_find_vqs_ctx() helpers virtio: convert the rest virtio_find_vqs() users to virtio_find_vqs_info() virtio_balloon: convert to use virtio_find_vqs_info() virtiofs: convert to use virtio_find_vqs_info() scsi: virtio_scsi: convert to use virtio_find_vqs_info() virtio_net: convert to use virtio_find_vqs_info() virtio_crypto: convert to use virtio_find_vqs_info() virtio_console: convert to use virtio_find_vqs_info() virtio_blk: convert to use virtio_find_vqs_info() virtio: rename find_vqs_info() op to find_vqs() virtio: remove the original find_vqs() op virtio: call virtio_find_vqs_info() from virtio_find_single_vq() directly virtio: convert find_vqs() op implementations to find_vqs_info() virtio_pci: convert vp_*find_vqs() ops to find_vqs_info() virtio: introduce virtio_queue_info struct and find_vqs_info() config op virtio: make virtio_find_single_vq() call virtio_find_vqs() virtio: make virtio_find_vqs() call virtio_find_vqs_ctx() caif_virtio: use virtio_find_single_vq() for single virtqueue finding vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready() ...
2024-07-17virtio: rename virtio_find_vqs_info() to virtio_find_vqs()Jiri Pirko1-1/+1
Since the original virtio_find_vqs() is no longer present, rename virtio_find_vqs_info() back to virtio_find_vqs(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Message-Id: <20240708074814.1739223-20-jiri@resnulli.us> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-17virtiofs: convert to use virtio_find_vqs_info()Jiri Pirko1-13/+9
Instead of passing separate names and callbacks arrays to virtio_find_vqs(), allocate one of virtual_queue_info structs and pass it to virtio_find_vqs_info(). Suggested-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Message-Id: <20240708074814.1739223-16-jiri@resnulli.us> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-15Merge tag 'vfs-6.11.mount.api' of ↵Linus Torvalds1-6/+18
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API updates from Christian Brauner: - Add a generic helper to parse uid and gid mount options. Currently we open-code the same logic in various filesystems which is error prone, especially since the verification of uid and gid mount options is a sensitive operation in the face of idmappings. Add a generic helper and convert all filesystems over to it. Make sure that filesystems that are mountable in unprivileged containers verify that the specified uid and gid can be represented in the owning namespace of the filesystem. - Convert hostfs to the new mount api. * tag 'vfs-6.11.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fuse: Convert to new uid/gid option parsing helpers fuse: verify {g,u}id mount options correctly fat: Convert to new uid/gid option parsing helpers fat: Convert to new mount api fat: move debug into fat_mount_options vboxsf: Convert to new uid/gid option parsing helpers tracefs: Convert to new uid/gid option parsing helpers smb: client: Convert to new uid/gid option parsing helpers tmpfs: Convert to new uid/gid option parsing helpers ntfs3: Convert to new uid/gid option parsing helpers isofs: Convert to new uid/gid option parsing helpers hugetlbfs: Convert to new uid/gid option parsing helpers ext4: Convert to new uid/gid option parsing helpers exfat: Convert to new uid/gid option parsing helpers efivarfs: Convert to new uid/gid option parsing helpers debugfs: Convert to new uid/gid option parsing helpers autofs: Convert to new uid/gid option parsing helpers fs_parse: add uid & gid option option parsing helpers hostfs: Add const qualifier to host_root in hostfs_fill_super() hostfs: convert hostfs to use the new mount API
2024-07-09virtio-fs: improved request latencies when Virtio queue is fullPeter-Jan Gootzen1-15/+19
Currently, when the Virtio queue is full, a work item is scheduled to execute in 1ms that retries adding the request to the queue. This is a large amount of time on the scale on which a virtio-fs device can operate. When using a DPU this is around 30-40us baseline without going to a remote server (4k, QD=1). This patch changes the retrying behavior to immediately filling the Virtio queue up again when a completion has been received. This reduces the 99.9th percentile latencies in our tests by 60x and slightly increases the overall throughput, when using a workload IO depth 2x the size of the Virtio queue and a DPU-powered virtio-fs device (NVIDIA BlueField DPU). Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Yoray Zack <yorayz@nvidia.com> Message-Id: <20240517190435.152096-3-pgootzen@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-07-09virtio-fs: let -ENOMEM bubble up or burst gentlyPeter-Jan Gootzen1-3/+3
Currently, when the enqueueing of a request or forget operation fails with -ENOMEM, the enqueueing is retried after a timeout. This patch removes this behavior and treats -ENOMEM in these scenarios like any other error. By bubbling up the error to user space in the case of a request, and by dropping the operation in case of a forget. This behavior matches that of the FUSE layer above, and also simplifies the error handling. The latter will come in handy for upcoming patches that optimize the retrying of operations in case of -ENOSPC. Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Yoray Zack <yorayz@nvidia.com> Message-Id: <20240517190435.152096-2-pgootzen@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-07-03fuse: Convert to new uid/gid option parsing helpersEric Sandeen1-8/+4
Convert to new uid/gid option parsing helpers Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/4e1a4efa-4ca5-4358-acee-40efd07c3c44@redhat.com Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03fuse: verify {g,u}id mount options correctlyEric Sandeen1-4/+20
As was done in 0200679fc795 ("tmpfs: verify {g,u}id mount options correctly") we need to validate that the requested uid and/or gid is representable in the filesystem's idmapping. Cribbing from the above commit log, The contract for {g,u}id mount options and {g,u}id values in general set from userspace has always been that they are translated according to the caller's idmapping. In so far, fuse has been doing the correct thing. But since fuse is mountable in unprivileged contexts it is also necessary to verify that the resulting {k,g}uid is representable in the namespace of the superblock. Fixes: c30da2e981a7 ("fuse: convert to use the new mount API") Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/8f07d45d-c806-484d-a2e3-7a2199df1cd2@redhat.com Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-25fuse: Use in_group_or_capable() helperYouling Tang1-2/+2
Use the in_group_or_capable() helper function to simplify the code. Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Link: https://lore.kernel.org/r/20240620032335.147136-3-youling.tang@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-1/+0
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-22fuse: virtio: drop owner assignmentKrzysztof Kozlowski1-1/+0
virtio core already sets the .owner, so driver does not need to. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-24-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-05-10virtio-fs: add multi-queue supportPeter-Jan Gootzen1-8/+62
This commit creates a multi-queue mapping at device bring-up. The driver first attempts to use the existing MSI-X interrupt affinities (previously disabled), and if not present, will distribute the request queues evenly over the CPUs. If the latter fails as well, all CPUs are mapped to request queue zero. When a request is handed from FUSE to the virtio-fs device driver, the driver will use the current CPU to index into the multi-queue mapping and determine the optimal request queue to use. We measured the performance of this patch with the fio benchmarking tool, increasing the number of queues results in a significant speedup for both read and write operations, demonstrating the effectiveness of multi-queue support. Host: - Dell PowerEdge R760 - CPU: Intel(R) Xeon(R) Gold 6438M, 128 cores - VM: KVM with 32 cores Virtio-fs device: - BlueField-3 DPU - CPU: ARM Cortex-A78AE, 16 cores - One thread per queue, each busy polling on one request queue - Each queue is 1024 descriptors deep Workload: - fio, sequential read or write, ioengine=libaio, numjobs=32, 4GiB file per job, iodepth=8, bs=256KiB, runtime=30s Performance Results: +===========================+==========+===========+ | Number of queues | Fio read | Fio write | +===========================+==========+===========+ | 1 request queue (GiB/s) | 6.1 | 4.6 | +---------------------------+----------+-----------+ | 8 request queues (GiB/s) | 25.8 | 10.3 | +---------------------------+----------+-----------+ | 16 request queues (GiB/s) | 30.9 | 19.5 | +---------------------------+----------+-----------+ | 32 request queue (GiB/s) | 33.2 | 22.6 | +---------------------------+----------+-----------+ | Speedup | 5.5x | 5x | +---------------=-----------+----------+-----------+ Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-10virtio-fs: limit number of request queuesPeter-Jan Gootzen1-0/+3
Virtio-fs devices might allocate significant resources to virtio queues such as CPU cores that busy poll on the queue. The device indicates how many request queues it can support and the driver should initialize the number of queues that they want to utilize. In this patch we limit the number of initialized request queues to the number of CPUs, to limit the resource consumption on the device-side and to prepare for the upcoming multi-queue patch. Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Suggested-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-10fuse: clear FR_SENT when re-adding requests into pending listHou Tao1-0/+1
The following warning was reported by lee bruce: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 8264 at fs/fuse/dev.c:300 fuse_request_end+0x685/0x7e0 fs/fuse/dev.c:300 Modules linked in: CPU: 0 PID: 8264 Comm: ab2 Not tainted 6.9.0-rc7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:fuse_request_end+0x685/0x7e0 fs/fuse/dev.c:300 ...... Call Trace: <TASK> fuse_dev_do_read.constprop.0+0xd36/0x1dd0 fs/fuse/dev.c:1334 fuse_dev_read+0x166/0x200 fs/fuse/dev.c:1367 call_read_iter include/linux/fs.h:2104 [inline] new_sync_read fs/read_write.c:395 [inline] vfs_read+0x85b/0xba0 fs/read_write.c:476 ksys_read+0x12f/0x260 fs/read_write.c:619 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xce/0x260 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f ...... </TASK> The warning is due to the FUSE_NOTIFY_RESEND notify sent by the write() syscall in the reproducer program and it happens as follows: (1) calls fuse_dev_read() to read the INIT request The read succeeds. During the read, bit FR_SENT will be set on the request. (2) calls fuse_dev_write() to send an USE_NOTIFY_RESEND notify The resend notify will resend all processing requests, so the INIT request is moved from processing list to pending list again. (3) calls fuse_dev_read() with an invalid output address fuse_dev_read() will try to copy the same INIT request to the output address, but it will fail due to the invalid address, so the INIT request is ended and triggers the warning in fuse_request_end(). Fix it by clearing FR_SENT when re-adding requests into pending list. Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/linux-fsdevel/58f13e47-4765-fce4-daf4-dffcc5ae2330@huaweicloud.com/T/#m091614e5ea2af403b259e7cea6a49e51b9ee07a7 Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-10fuse: set FR_PENDING atomically in fuse_resend()Hou Tao1-1/+1
When fuse_resend() moves the requests from processing lists to pending list, it uses __set_bit() to set FR_PENDING bit in req->flags. Using __set_bit() is not safe, because other functions may update req->flags concurrently (e.g., request_wait_answer() may call set_bit(FR_INTERRUPTED, &flags)). Fix it by using set_bit() instead. Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-08fuse: Add initial support for fs-verityRichard Fung1-0/+60
This adds support for the FS_IOC_ENABLE_VERITY and FS_IOC_MEASURE_VERITY ioctls. The FS_IOC_READ_VERITY_METADATA is missing but from the documentation, "This is a fairly specialized use case, and most fs-verity users won’t need this ioctl." Signed-off-by: Richard Fung <richardfung@google.com> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-08fuse: Convert fuse_readpages_end() to use folio_end_read()Matthew Wilcox (Oracle)1-7/+3
Nobody checks the error flag on fuse folios, so stop setting it. Optimise the (optional) setting of the uptodate flag and clearing of the lock flag by using folio_end_read(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-08virtiofs: include a newline in sysfs tagBrian Foster1-1/+1
The internal tag string doesn't contain a newline. Append one when emitting the tag via sysfs. [Stefan] Orthogonal to the newline issue, sysfs_emit(buf, "%s", fs->tag) is needed to prevent format string injection. Signed-off-by: Brian Foster <bfoster@redhat.com> Fixes: a8f62f50b4e4 ("virtiofs: export filesystem tags through sysfs") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-22fuse: verify zero padding in fuse_backing_mapAmir Goldstein1-1/+1
To allow us extending the interface in the future. Fixes: 44350256ab94 ("fuse: implement ioctls to manage backing files") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15cuse: add kernel-doc comments to cuse_process_init_reply()Yang Li1-0/+4
This commit adds kernel-doc style comments with complete parameter descriptions for the function cuse_process_init_reply. Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15fuse: fix leaked ENOSYS error on first statx callDanny Lin1-0/+1
FUSE attempts to detect server support for statx by trying it once and setting no_statx=1 if it fails with ENOSYS, but consider the following scenario: - Userspace (e.g. sh) calls stat() on a file * succeeds - Userspace (e.g. lsd) calls statx(BTIME) on the same file - request_mask = STATX_BASIC_STATS | STATX_BTIME - first pass: sync=true due to differing cache_mask - statx fails and returns ENOSYS - set no_statx and retry - retry sets mask = STATX_BASIC_STATS - now mask == cache_mask; sync=false (time_before: still valid) - so we take the "else if (stat)" path - "err" is still ENOSYS from the failed statx call Fix this by zeroing "err" before retrying the failed call. Fixes: d3045530bdd2 ("fuse: implement statx") Cc: stable@vger.kernel.org # v6.6 Signed-off-by: Danny Lin <danny@orbstack.dev> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15fuse: fix parallel dio write on file open in passthrough modeAmir Goldstein1-2/+2
Parallel dio write takes a negative refcount of fi->iocachectr and so does open of file in passthrough mode. The refcount of passthrough mode is associated with attach/detach of a fuse_backing object to fuse inode. For parallel dio write, the backing file is irrelevant, so the call to fuse_inode_uncached_io_start() passes a NULL fuse_backing object. Passing a NULL fuse_backing will result in false -EBUSY error if the file is already open in passthrough mode. Allow taking negative fi->iocachectr refcount with NULL fuse_backing, because it does not conflict with an already attached fuse_backing object. Fixes: 4a90451bbc7f ("fuse: implement open in passthrough mode") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15fuse: fix wrong ff->iomode state changes from parallel dio writeAmir Goldstein4-25/+51
There is a confusion with fuse_file_uncached_io_{start,end} interface. These helpers do two things when called from passthrough open()/release(): 1. Take/drop negative refcount of fi->iocachectr (inode uncached io mode) 2. State change ff->iomode IOM_NONE <-> IOM_UNCACHED (file uncached open) The calls from parallel dio write path need to take a reference on fi->iocachectr, but they should not be changing ff->iomode state, because in this case, the fi->iocachectr reference does not stick around until file release(). Factor out helpers fuse_inode_uncached_io_{start,end}, to be used from parallel dio write path and rename fuse_file_*cached_io_{start,end} helpers to fuse_file_*cached_io_{open,release} to clarify the difference. Fixes: 205c1d802683 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-15Merge tag 'fuse-update-6.9' of ↵Linus Torvalds12-274/+1375
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add passthrough mode for regular file I/O. This allows performing read and write (also via memory maps) on a backing file without incurring the overhead of roundtrips to userspace. For now this is only allowed to privileged servers, but this limitation will go away in the future (Amir Goldstein) - Fix interaction of direct I/O mode with memory maps (Bernd Schubert) - Export filesystem tags through sysfs for virtiofs (Stefan Hajnoczi) - Allow resending queued requests for server crash recovery (Zhao Chen) - Misc fixes and cleanups * tag 'fuse-update-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (38 commits) fuse: get rid of ff->readdir.lock fuse: remove unneeded lock which protecting update of congestion_threshold fuse: Fix missing FOLL_PIN for direct-io fuse: remove an unnecessary if statement fuse: Track process write operations in both direct and writethrough modes fuse: Use the high bit of request ID for indicating resend requests fuse: Introduce a new notification type for resend pending requests fuse: add support for explicit export disabling fuse: __kuid_val/__kgid_val helpers in fuse_fill_attr_from_inode() fuse: fix typo for fuse_permission comment fuse: Convert fuse_writepage_locked to take a folio fuse: Remove fuse_writepage virtio_fs: remove duplicate check if queue is broken fuse: use FUSE_ROOT_ID in fuse_get_root_inode() fuse: don't unhash root fuse: fix root lookup with nonzero generation fuse: replace remaining make_bad_inode() with fuse_make_bad() virtiofs: drop __exit from virtio_fs_sysfs_exit() fuse: implement passthrough for mmap fuse: implement splice read/write passthrough ...
2024-03-14Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds1-4/+11
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-11Merge tag 'vfs-6.9.file' of ↵Linus Torvalds1-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull file locking updates from Christian Brauner: "A few years ago struct file_lock_context was added to allow for separate lists to track different types of file locks instead of using a singly-linked list for all of them. Now leases no longer need to be tracked using struct file_lock. However, a lot of the infrastructure is identical for leases and locks so separating them isn't trivial. This splits a group of fields used by both file locks and leases into a new struct file_lock_core. The new core struct is embedded in struct file_lock. Coccinelle was used to convert a lot of the callers to deal with the move, with the remaining 25% or so converted by hand. Afterwards several internal functions in fs/locks.c are made to work with struct file_lock_core. Ultimately this allows to split struct file_lock into struct file_lock and struct file_lease. The file lease APIs are then converted to take struct file_lease" * tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits) filelock: fix deadlock detection in POSIX locking filelock: always define for_each_file_lock() smb: remove redundant check filelock: don't do security checks on nfsd setlease calls filelock: split leases out of struct file_lock filelock: remove temporary compatibility macros smb/server: adapt to breakup of struct file_lock smb/client: adapt to breakup of struct file_lock ocfs2: adapt to breakup of struct file_lock nfsd: adapt to breakup of struct file_lock nfs: adapt to breakup of struct file_lock lockd: adapt to breakup of struct file_lock fuse: adapt to breakup of struct file_lock gfs2: adapt to breakup of struct file_lock dlm: adapt to breakup of struct file_lock ceph: adapt to breakup of struct file_lock afs: adapt to breakup of struct file_lock 9p: adapt to breakup of struct file_lock filelock: convert seqfile handling to use file_lock_core filelock: convert locks_translate_pid to take file_lock_core ...
2024-03-06fuse: get rid of ff->readdir.lockMiklos Szeredi3-12/+0
The same protection is provided by file->f_pos_lock. Note, this relies on the fact that file->f_mode has FMODE_ATOMIC_POS. This flag is cleared by stream_open(), which would prevent locking of f_pos_lock. Prior to commit 7de64d521bf9 ("fuse: break up fuse_open_common()") FOPEN_STREAM on a directory would cause stream_open() to be called. After this commit this is not done anymore, so f_pos_lock will always be locked. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: remove unneeded lock which protecting update of congestion_thresholdKemeng Shi1-5/+1
Commit 670d21c6e17f6 ("fuse: remove reliance on bdi congestion") change how congestion_threshold is used and lock in fuse_conn_congestion_threshold_write is not needed anymore. 1. Access to supe_block is removed along with removing of bdi congestion. Then down_read(&fc->killsb) which protecting access to super_block is no needed. 2. Compare num_background and congestion_threshold without holding bg_lock. Then there is no need to hold bg_lock to update congestion_threshold. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: Fix missing FOLL_PIN for direct-ioLei Huang2-5/+11
Our user space filesystem relies on fuse to provide POSIX interface. In our test, a known string is written into a file and the content is read back later to verify correct data returned. We observed wrong data returned in read buffer in rare cases although correct data are stored in our filesystem. Fuse kernel module calls iov_iter_get_pages2() to get the physical pages of the user-space read buffer passed in read(). The pages are not pinned to avoid page migration. When page migration occurs, the consequence are two-folds. 1) Applications do not receive correct data in read buffer. 2) fuse kernel writes data into a wrong place. Using iov_iter_extract_pages() to pin pages fixes the issue in our test. An auxiliary variable "struct page **pt_pages" is used in the patch to prepare the 2nd parameter for iov_iter_extract_pages() since iov_iter_get_pages2() uses a different type for the 2nd parameter. [SzM] add iov_iter_extract_will_pin(ii) and unpin only if true. Signed-off-by: Lei Huang <lei.huang@linux.intel.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: remove an unnecessary if statementJiachen Zhang1-4/+0
FUSE remote locking code paths never add any locking state to inode->i_flctx, so the locks_remove_posix() function called on file close will return without calling fuse_setlk(). Therefore, as the if statement to be removed in this commit will always be false, remove it for clearness. Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: Track process write operations in both direct and writethrough modesZhou Jifeng1-2/+6
Due to the fact that fuse does not count the