summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)AuthorFilesLines
2024-11-18Merge tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linuxLinus Torvalds36-1171/+1644
Pull io_uring updates from Jens Axboe: - Cleanups of the eventfd handling code, making it fully private. - Support for sending a sync message to another ring, without having a ring available to send a normal async message. - Get rid of the separate unlocked hash table, unify everything around the single locked one. - Add support for ring resizing. It can be hard to appropriately size the CQ ring upfront, if the application doesn't know how busy it will be. This results in applications sizing rings for the most busy case, which can be wasteful. With ring resizing, they can start small and grow the ring, if needed. - Add support for fixed wait regions, rather than needing to copy the same wait data tons of times for each wait operation. - Rewrite the resource node handling, which before was serialized per ring. This caused issues with particularly fixed files, where one file waiting on IO could hold up putting and freeing of other unrelated files. Now each node is handled separately. New code is much simpler too, and was a net 250 line reduction in code. - Add support for just doing partial buffer clones, rather than always cloning the entire buffer table. - Series adding static NAPI support, where a specific NAPI instance is used rather than having a list of them available that need lookup. - Add support for mapped regions, and also convert the fixed wait support mentioned above to that concept. This avoids doing special mappings for various planned features, and folds the existing registered wait into that too. - Add support for hybrid IO polling, which is a variant of strict IOPOLL but with an initial sleep delay to avoid spinning too early and wasting resources on devices that aren't necessarily in the < 5 usec category wrt latencies. - Various cleanups and little fixes. * tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux: (79 commits) io_uring/region: fix error codes after failed vmap io_uring: restore back registered wait arguments io_uring: add memory region registration io_uring: introduce concept of memory regions io_uring: temporarily disable registered waits io_uring: disable ENTER_EXT_ARG_REG for IOPOLL io_uring: fortify io_pin_pages with a warning switch io_msg_ring() to CLASS(fd) io_uring: fix invalid hybrid polling ctx leaks io_uring/uring_cmd: fix buffer index retrieval io_uring/rsrc: add & apply io_req_assign_buf_node() io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node' io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers io_uring: avoid normal tw intermediate fallback io_uring/napi: add static napi tracking strategy io_uring/napi: clean up __io_napi_do_busy_loop io_uring/napi: Use lock guards io_uring/napi: improve __io_napi_add io_uring/napi: fix io_napi_entry RCU accesses io_uring/napi: protect concurrent io_napi_entry timeout accesses ...
2024-11-18Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linuxLinus Torvalds1-2/+2
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen) - MD updates via Song: - Maintainers update - raid5 sync IO fix - Enhance handling of faulty and blocked devices - raid5-ppl atomic improvement - md-bitmap fix - Support for manually defining embedded partition tables - Zone append fixes and cleanups - Stop sending the queued requests in the plug list to the driver ->queue_rqs() handle in reverse order. - Zoned write plug cleanups - Cleanups disk stats tracking and add support for disk stats for passthrough IO - Add preparatory support for file system atomic writes - Add lockdep support for queue freezing. Already found a bunch of issues, and some fixes for that are in here. More will be coming. - Fix race between queue stopping/quiescing and IO queueing - ublk recovery improvements - Fix ublk mmap for 64k pages - Various fixes and cleanups * tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits) MAINTAINERS: Update git tree for mdraid subsystem block: make struct rq_list available for !CONFIG_BLOCK block/genhd: use seq_put_decimal_ull for diskstats decimal values block: don't reorder requests in blk_mq_add_to_batch block: don't reorder requests in blk_add_rq_to_plug block: add a rq_list type block: remove rq_list_move virtio_blk: reverse request order in virtio_queue_rqs nvme-pci: reverse request order in nvme_queue_rqs btrfs: validate queue limits block: export blk_validate_limits nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present md/raid5: Increase r5conf.cache_name size block: remove the ioprio field from struct request block: remove the write_hint field from struct request nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available ...
2024-11-18Merge tag 'for-6.13-tag' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Changes outside of btrfs: add io_uring command flag to track a dying task (the rest will go via the block git tree). User visible changes: - wire encoded read (ioctl) to io_uring commands, this can be used on itself, in the future this will allow 'send' to be asynchronous. As a consequence, the encoded read ioctl can also work in non-blocking mode - new ioctl to wait for cleaned subvolumes, no need to use the generic and root-only SEARCH_TREE ioctl, will be used by "btrfs subvol sync" - recognize different paths/symlinks for the same devices and don't report them during rescanning, this can be observed with LVM or DM - seeding device use case change, the sprout device (the one capturing new writes) will not clear the read-only status of the super block; this prevents accumulating space from deleted snapshots Performance improvements: - reduce lock contention when traversing extent buffers - reduce extent tree lock contention when searching for inline backref - switch from rb-trees to xarray for delayed ref tracking, improvements due to better cache locality, branching factors and more compact data structures - enable extent map shrinker again (prevent memory exhaustion under some types of IO load), reworked to run in a single worker thread (there used to be problems causing long stalls under memory pressure) Core changes: - raid-stripe-tree feature updates: - make device replace and scrub work - implement partial deletion of stripe extents - new selftests - split the config option BTRFS_DEBUG and add EXPERIMENTAL for features that are experimental or with known problems so we don't misuse debugging config for that - subpage mode updates (sector < page): - update compression implementations - update writepage, writeback - continued folio API conversions: - buffered writes - make buffered write copy one page at a time, preparatory work for future integration with large folios, may cause performance drop - proper locking of root item regarding starting send - error handling improvements - code cleanups and refactoring: - dead code removal - unused parameter reduction - lockdep assertions" * tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits) btrfs: send: check for read-only send root under critical section btrfs: send: check for dead send root under critical section btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap() btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl() btrfs: fix a typo in btrfs_use_zone_append btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read() btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot() btrfs: remove hole from struct btrfs_delayed_node btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list btrfs: add new ioctl to wait for cleaned subvolumes btrfs: simplify range tracking in cow_file_range() btrfs: remove conditional path allocation in btrfs_read_locked_inode() btrfs: push cleanup into btrfs_read_locked_inode() io_uring/cmd: let cmds to know about dying task btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu() btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl) btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages() btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set btrfs: change btrfs_encoded_read() so that reading of extent is done by caller btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read() ...
2024-11-18Merge tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+1
Pull statx updates from Al Viro: "Sanitize struct filename and lookup flags handling in statx and friends" * tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: libfs: kill empty_dir_getattr() fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag fs/stat.c: switch to CLASS(fd_raw) kill getname_statx_lookup_flags() io_statx_prep(): use getname_uflags()
2024-11-18Merge tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-72/+25
Pull xattr updates from Al Viro: "Sanitize xattr and io_uring interactions with it, add *xattrat() syscalls, sanitize struct filename handling in there" * tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xattr: remove redundant check on variable err fs/xattr: add *at family syscalls new helpers: file_removexattr(), filename_removexattr() new helpers: file_listxattr(), filename_listxattr() replace do_getxattr() with saner helpers. replace do_setxattr() with saner helpers. new helper: import_xattr_name() fs: rename struct xattr_ctx to kernel_xattr_ctx xattr: switch to CLASS(fd) io_[gs]etxattr_prep(): just use getname() io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE getname_maybe_null() - the third variant of pathname copy-in teach filename_lookup() to treat NULL filename as ""
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-21/+8
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-18Merge tag 'vfs-6.13.file' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This contains changes the changes for files for this cycle: - Introduce a new reference counting mechanism for files. As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has O(N^2) behaviour under contention with N concurrent operations and it is in a hot path in __fget_files_rcu(). The rcuref infrastructures remedies this problem by using an unconditional increment relying on safe- and dead zones to make this work and requiring rcu protection for the data structure in question. This not just scales better it also introduces overflow protection. However, in contrast to generic rcuref, files require a memory barrier and thus cannot rely on *_relaxed() atomic operations and also require to be built on atomic_long_t as having massive amounts of reference isn't unheard of even if it is just an attack. This adds a file specific variant instead of making this a generic library. This has been tested by various people and it gives consistent improvement up to 3-5% on workloads with loads of threads. - Add a fastpath for find_next_zero_bit(). Skip 2-levels searching via find_next_zero_bit() when there is a free slot in the word that contains the next fd. This improves pts/blogbench-1.1.0 read by 8% and write by 4% on Intel ICX 160. - Conditionally clear full_fds_bits since it's very likely that a bit in full_fds_bits has been cleared during __clear_open_fds(). This improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on Intel ICX 160. - Get rid of all lookup_*_fdget_rcu() variants. They were used to lookup files without taking a reference count. That became invalid once files were switched to SLAB_TYPESAFE_BY_RCU and now we're always taking a reference count. Switch to an already existing helper and remove the legacy variants. - Remove pointless includes of <linux/fdtable.h>. - Avoid cmpxchg() in close_files() as nobody else has a reference to the files_struct at that point. - Move close_range() into fs/file.c and fold __close_range() into it. - Cleanup calling conventions of alloc_fdtable() and expand_files(). - Merge __{set,clear}_close_on_exec() into one. - Make __set_open_fd() set cloexec as well instead of doing it in two separate steps" * tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor fs: port files to file_ref fs: add file_ref expand_files(): simplify calling conventions make __set_open_fd() set cloexec state as well fs: protect backing files with rcu file.c: merge __{set,clear}_close_on_exec() alloc_fdtable(): change calling conventions. fs/file.c: add fast path in find_next_fd() fs/file.c: conditionally clear full_fds fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() move close_range(2) into fs/file.c, fold __close_range() into it close_files(): don't bother with xchg() remove pointless includes of <linux/fdtable.h> get rid of ...lookup...fdget_rcu() family
2024-11-17io_uring/region: fix error codes after failed vmapPavel Begunkov1-1/+3
io_create_region() jumps after a vmap failure without setting the return code, it could be 0 or just uninitialised. Fixes: dfbbfbf191878 ("io_uring: introduce concept of memory regions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0abac19dbf81c061cffaa9534a2471ed5460ad3e.1731803848.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15io_uring: restore back registered wait argumentsPavel Begunkov2-2/+28
Now we've got a more generic region registration API, place IORING_ENTER_EXT_ARG_REG and re-enable it. First, the user has to register a region with the IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races with already running waiters. With that we should have stable constant values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no READ_ONCE required. The other API difference is that we're now passing byte offsets instead of indexes. The user _must_ align all offsets / pointers to the native word size, failing to do so might but not necessarily has to lead to a failure usually returned as -EFAULT. liburing will be hiding this details from users. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15io_uring: add memory region registrationPavel Begunkov2-0/+38
Regions will serve multiple purposes. First, with it we can decouple ring/etc. object creation from registration / mapping of the memory they will be placed in. We already have hacks that allow to put both SQ and CQ into the same huge page, in the future we should be able to: region = create_region(io_ring); create_pbuf_ring(io_uring, region, offset=0); create_pbuf_ring(io_uring, region, offset=N); The second use case is efficiently passing parameters. The following patch enables back on top of regions IORING_ENTER_EXT_ARG_REG, which optimises wait arguments. It'll also be useful for request arguments replacing iovecs, msghdr, etc. pointers. Eventually it would also be handy for BPF as well if it comes to fruition. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15io_uring: introduce concept of memory regionsPavel Begunkov2-0/+81
We've got a good number of mappings we share with the userspace, that includes the main rings, provided buffer rings, upcoming rings for zerocopy rx and more. All of them duplicate user argument parsing and some internal details as well (page pinnning, huge page optimisations, mmap'ing, etc.) Introduce a notion of regions. For userspace for now it's just a new structure called struct io_uring_region_desc which is supposed to parameterise all such mapping / queue creations. A region either represents a user provided chunk of memory, in which case the user_addr field should point to it, or a request for the kernel to allocate the memory, in which case the user would need to mmap it after using the offset returned in the mmap_offset field. With a uniform userspace API we can avoid additional boiler plate code and apply future optimisation to all of them at once. Internally, there is a new structure struct io_mapped_region holding all relevant runtime information and some helpers to work with it. This patch limits it to user provided regions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e6fe25818dfbaebd1bd90b870a6cac503fe1a24.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15io_uring: temporarily disable registered waitsPavel Begunkov3-93/+0
Disable wait argument registration as it'll be replaced with a more generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing in a few commits so leave it be. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70b1d1d218c41ba77a76d1789c8641dab0b0563e.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15io_uring: disable ENTER_EXT_ARG_REG for IOPOLLPavel Begunkov1-6/+2
IOPOLL doesn't use the extended arguments, no need for it to support IORING_ENTER_EXT_ARG_REG. Let's disable it for IOPOLL, if anything it leaves more space for future extensions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a35ecd919dbdc17bd5b7932273e317832c531b45.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15io_uring: fortify io_pin_pages with a warningPavel Begunkov1-0/+2
We're a bit too frivolous with types of nr_pages arguments, converting it to long and back to int, passing an unsigned int pointer as an int pointer and so on. Shouldn't cause any problem but should be carefully reviewed, but until then let's add a WARN_ON_ONCE check to be more confident callers don't pass poorely checked arguents. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d48e0c097cbd90fb47acaddb6c247596510d8cfc.1731689588.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15switch io_msg_ring() to CLASS(fd)Al Viro1-11/+7
Use CLASS(fd) to get the file for sync message ring requests, rather than open-code the file retrieval dance. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20241115034902.GP3387508@ZenIV [axboe: make a more coherent commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13block: add a rq_list typeChristoph Hellwig1-2/+2
Replace the semi-open coded request list helpers with a proper rq_list type that mirrors the bio_list and has head and tail pointers. Besides better type safety this actually allows to insert at the tail of the list, which will be useful soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13io_statx_prep(): use getname_uflags()Al Viro1-2/+1
the only thing in flags getname_flags() ever cares about is LOOKUP_EMPTY; anything else is none of its damn business. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-13io_uring: fix invalid hybrid polling ctx leaksPavel Begunkov1-5/+5
It has already allocated the ctx by the point where it checks the hybrid poll configuration, plain return leaks the memory. Fixes: 01ee194d1aba1 ("io_uring: add support for hybrid IOPOLL") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/b57f2608088020501d352fcdeebdb949e281d65b.1731468230.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11io_uring/uring_cmd: fix buffer index retrievalMing Lei1-1/+2
Add back buffer index retrieval for IORING_URING_CMD_FIXED. Reported-by: Guangwu Zhang <guazhang@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Fixes: b54a14041ee6 ("io_uring/rsrc: add io_rsrc_node_lookup() helper") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Tested-by: Guangwu Zhang <guazhang@redhat.com> Link: https://lore.kernel.org/r/20241111101318.1387557-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11io_uring/cmd: let cmds to know about dying taskPavel Begunkov1-1/+5
When the taks that submitted a request is dying, a task work for that request might get run by a kernel thread or even worse by a half dismantled task. We can't just cancel the task work without running the callback as the cmd might need to do some clean up, so pass a flag instead. If set, it's not safe to access any task resources and the callback is expected to cancel the cmd ASAP. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07io_uring/rsrc: add & apply io_req_assign_buf_node()Ming Lei5-7/+11
The following pattern becomes more and more: + io_req_assign_rsrc_node(&req->buf_node, node); + req->flags |= REQ_F_BUF_NODE; so make it a helper, which is less fragile to use than above code, for example, the BUF_NODE flag is even missed in current io_uring_cmd_prep(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'Ming Lei2-10/+3
Remove '->ctx_ptr' of 'struct io_rsrc_node', and add 'type' field, meantime remove io_rsrc_node_type(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpersMing Lei5-35/+30
`io_rsrc_node` instance won't be shared among different io_uring ctxs, and its allocation 'ctx' is always same with the user's 'ctx', so it is safe to pass user 'ctx' reference to rsrc helpers. Even in io_clone_buffers(), `io_rsrc_node` instance is allocated actually for destination io_uring_ctx. Then io_rsrc_node_ctx() can be removed, and the 8 bytes `ctx` pointer will be removed from `io_rsrc_node` in the following patch. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: avoid normal tw intermediate fallbackPavel Begunkov2-12/+11
When a DEFER_TASKRUN io_uring is terminating it requeues deferred task work items as normal tw, which can further fallback to kthread execution. Avoid this extra step and always push them to the fallback kthread. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d1cd472cec2230c66bd1c8d412a5833f0af75384.1730772720.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: add static napi tracking strategyOlivier Langlois3-24/+129
Add the static napi tracking strategy. That allows the user to manually manage the napi ids list for busy polling, and eliminate the overhead of dynamically updating the list from the fast path. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/96943de14968c35a5c599352259ad98f3c0770ba.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: clean up __io_napi_do_busy_loopOlivier Langlois1-7/+8
__io_napi_do_busy_loop now requires to have loop_end in its parameters. This makes the code cleaner and also has the benefit of removing a branch since the only caller not passing NULL for loop_end_arg is also setting the value conditionally. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/d5b9bb91b1a08fff50525e1c18d7b4709b9ca100.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: Use lock guardsOlivier Langlois1-19/+21
Convert napi locks to use the shiny new Scope-Based Resource Management machinery. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/2680ca47ee183cfdb89d1a40c84d349edeb620ab.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: improve __io_napi_addOlivier Langlois2-16/+9
1. move the sock->sk pointer validity test outside the function to avoid the function call overhead and to make the function more more reusable 2. change its name to __io_napi_add_id to be more precise about it is doing 3. return an error code to report errors Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/d637fa3b437d753c0f4e44ff6a7b5bf2c2611270.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: fix io_napi_entry RCU accessesOlivier Langlois1-6/+11
correct 3 RCU structures modifications that were not using the RCU functions to make their update. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/9f53b5169afa8c7bf3665a0b19dc2f7061173530.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: protect concurrent io_napi_entry timeout accessesOlivier Langlois1-3/+3
io_napi_entry timeout value can be updated while accessed from the poll functions. Its concurrent accesses are wrapped with READ_ONCE()/WRITE_ONCE() macros to avoid incorrect compiler optimizations. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/3de3087563cf98f75266fd9f85fdba063a8720db.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: prevent speculating sq_array indexingPavel Begunkov1-0/+1
The SQ index array consists of user provided indexes, which io_uring then uses to index the SQ, and so it's susceptible to speculation. For all other queues io_uring tracks heads and tails in kernel, and they shouldn't need any special care. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6c7a25962924a55869e317e4fdb682dfdc6b279.1730687889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: move struct io_kiocb from task_struct to io_uring_taskJens Axboe12-34/+45
Rather than store the task_struct itself in struct io_kiocb, store the io_uring specific task_struct. The life times are the same in terms of io_uring, and this avoids doing some dereferences through the task_struct. For the hot path of putting local task references, we can deref req->tctx instead, which we'll need anyway in that function regardless of whether it's local or remote references. This is mostly straight forward, except the original task PF_EXITING check needs a bit of tweaking. task_work is _always_ run from the originating task, except in the fallback case, where it's run from a kernel thread. Replace the potentially racy (in case of fallback work) checks for req->task->flags with current->flags. It's either the still the original task, in which case PF_EXITING will be sane, or it has PF_KTHREAD set, in which case it's fallback work. Both cases should prevent moving forward with the given request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: remove task ref helpersJens Axboe1-21/+10
They are only used right where they are defined, just open-code them inside io_put_task(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: move cancelations to be io_uring_task basedJens Axboe12-40/+40
Right now the task_struct pointer is used as the key to match a task, but in preparation for some io_kiocb changes, move it to using struct io_uring_task instead. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/rsrc: split io_kiocb node type assignmentsJens Axboe7-16/+23
Currently the io_rsrc_node assignment in io_kiocb is an array of two pointers, as two nodes may be assigned to a request - one file node, and one buffer node. However, the buffer node can co-exist with the provided buffers, as currently it's not supported to use both provided and registered buffers at the same time. This crucially brings struct io_kiocb down to 4 cache lines again, as before it spilled into the 5th cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/rsrc: encode node type and ctx togetherJens Axboe2-9/+19
Rather than keep the type field separate rom ctx, use the fact that we can encode up to 4 types of nodes in the LSB of the ctx pointer. Doesn't reclaim any space right now on 64-bit archs, but it leaves a full int for future use. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06replace do_getxattr() with saner helpers.Al Viro1-24/+7
similar to do_setxattr() in the previous commit... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06replace do_setxattr() with saner helpers.Al Viro1-34/+7
io_uring setxattr logics duplicates stuff from fs/xattr.c; provide saner helpers (filename_setxattr() and file_setxattr() resp.) and use them. NB: putname(ERR_PTR()) is a no-op Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06new helper: import_xattr_name()Al Viro1-5/+2
common logics for marshalling xattr names. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06fs: rename struct xattr_ctx to kernel_xattr_ctxChristian Göttsche1-1/+1
Rename the struct xattr_ctx to increase distinction with the about to be added user API struct xattr_args. No functional change. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://lore.kernel.org/r/20240426162042.191916-2-cgoettsche@seltendoof.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03io_[gs]etxattr_prep(): just use getname()Al Viro1-2/+2
getname_flags(pathname, LOOKUP_FOLLOW) is obviously bogus - following trailing symlinks has no impact on how to copy the pathname from userland... Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03fdget(), trivial conversionsAl Viro1-21/+8
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-02io_uring: add support for hybrid IOPOLLhexue2-12/+88
A new hybrid poll is implemented on the io_uring layer. Once an IO is issued, it will not poll immediately, but rather block first and re-run before IO complete, then poll to reap IO. While this poll method could be a suboptimal solution when running on a single thread, it offers performance lower than regular polling but higher than IRQ, and CPU utilization is also lower than polling. To use hybrid polling, the ring must be setup with both the IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID)IOPOLL flags set. Hybrid polling has the same restrictions as IOPOLL, in that commands must explicitly support it. Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/rsrc: allow cloning with node replacementsJens Axboe1-14/+52
Currently cloning a buffer table will fail if the destination already has a table. But it should be possible to use it to replace existing elements. Add a IORING_REGISTER_DST_REPLACE cloning flag, which if set, will allow the destination to already having a buffer table. If that is the case, then entries designated by offset + nr buffers will be replaced if they already exist. Note that it's allowed to use IORING_REGISTER_DST_REPLACE and not have an existing table, in which case it'll work just like not having the flag set and an empty table - it'll just assign the newly created table for that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/rsrc: allow cloning at an offsetJens Axboe1-6/+26
Right now buffer cloning is an all-or-nothing kind of thing - either the whole table is cloned from a source to a destination ring, or nothing at all. However, it's not always desired to clone the whole thing. Allow for the application to specify a source and destination offset, and a number of buffers to clone. If the destination offset is non-zero, then allocate sparse nodes upfront. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/rsrc: get rid of the empty node and dummy_ubufJens Axboe6-50/+40
The empty node was used as a placeholder for a sparse entry, but it didn't really solve any issues. The caller still has to check for whether it's the empty node or not, it may as well just check for a NULL return instead. The dummy_ubuf was used for a sparse buffer entry, but NULL will serve the same purpose there of ensuring an -EFAULT on attempted import. Just use NULL for a sparse node, regardless of whether or not it's a file or buffer resource. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/rsrc: add io_reset_rsrc_node() helperJens Axboe3-16/+17
Puts and reset an existing node in a slot, if one exists. Returns true if a node was there, false if not. This helps cleanup some of the code that does a lookup just to clear an existing node. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/filetable: kill io_reset_alloc_hint() helperJens Axboe1-6/+1
It's only used internally, and in one spot, just open-code ti. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/filetable: remove io_file_from_index() helperJens Axboe2-11/+3
It's only used in fdinfo, nothing really gained from having this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02io_uring/rsrc: add io_rsrc_node_lookup() helperJens Axboe12-59/+57
There are lots of spots open-coding this functionality, add a generic helper that does the node lookup in a speculation safe way. Signed-off-by: Jens Axboe <axboe@kernel.dk>