path: root/block
Age | Commit message | Author | Files | Lines
5 days | block: Introduce bio_needs_zone_write_plugging() | Damien Le Moal | 2 files | -21/+5
commit f70291411ba20d50008db90a6f0731efac27872c upstream. In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and is used to protect calls to it for blk-mq devices and DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 days | block: avoid possible overflow for chunk_sectors check in blk_stack_limits() | John Garry | 1 file | -1/+1
[ Upstream commit 448dfecc7ff807822ecd47a5c052acedca7d09e8 ] In blk_stack_limits(), we check that the t->chunk_sectors value is a multiple of the t->physical_block_size value. However, by finding the chunk_sectors value in bytes, we may overflow the unsigned int which holds chunk_sectors, so change the check to be based on sectors. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250729091448.1691334-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
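The overflow this avoids is easy to reproduce in a few lines of standalone C (an illustrative sketch, not the kernel code): converting a large chunk_sectors value to bytes wraps a 32-bit unsigned int, while comparing in sector units needs no conversion at all.

    #include <stdio.h>

    #define SECTOR_SHIFT 9

    int main(void)
    {
        unsigned int chunk_sectors = 1U << 23;    /* 8M sectors == 4 GiB, too big for 32-bit bytes */
        unsigned int physical_block_size = 4096;  /* bytes */

        /* Byte-based form: the intermediate value wraps around and is meaningless. */
        unsigned int chunk_bytes = chunk_sectors << SECTOR_SHIFT;
        printf("chunk_bytes after the shift = %u (wrapped)\n", chunk_bytes);

        /* Sector-based form: same alignment question, no conversion, no overflow. */
        unsigned int pbs_sectors = physical_block_size >> SECTOR_SHIFT;
        printf("chunk_sectors %% pbs_sectors = %u (0 means aligned)\n",
               chunk_sectors % pbs_sectors);
        return 0;
    }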
5 days | lib/sbitmap: convert shallow_depth from one word to the whole sbitmap | Yu Kuai | 4 files | -43/+20
[ Upstream commit 42e6c6ce03fd3e41e39a0f93f9b1a1d9fa664338 ] Currently elevators record an internal 'async_depth' to throttle asynchronous requests, and they calculate shallow_depth based on sb->shift, on the assumption that sb->shift is the number of available tags in one word. However, sb->shift is not the number of available tags in the last word, see __map_depth: if (index == sb->map_nr - 1) return sb->depth - (index << sb->shift); As a consequence, if the last word is used, more tags can be obtained than expected. For example, assume nr_requests=256 and there are four words; in the worst case, if the user sets nr_requests=32, then the first word is also the last word, and still using the bits-per-word value, which is 64, to calculate async_depth is wrong. On the other hand, due to cgroup QoS, bfq may allow only one request to be allocated, but setting shallow_depth=1 will still allow as many requests as there are words to be allocated. Fix these problems by applying shallow_depth to the whole sbitmap instead of per word, and change kyber, mq-deadline and bfq to follow this; a new helper __map_depth_with_shallow() is introduced to calculate the available bits in each word. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
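The quoted __map_depth() behaviour is easy to see in isolation; the sketch below (plain userspace C, simplified from lib/sbitmap.c, not the kernel code itself) prints the usable bits per word for a full four-word sbitmap and for the nr_requests=32 case from the example above.

    #include <stdio.h>

    /* Simplified __map_depth(): usable bits in word 'index' of an sbitmap. */
    static unsigned int map_depth(unsigned int depth, unsigned int shift,
                                  unsigned int map_nr, unsigned int index)
    {
        if (index == map_nr - 1)
            return depth - (index << shift);  /* the last word may be partial */
        return 1U << shift;                   /* full words hold 1 << shift bits */
    }

    int main(void)
    {
        unsigned int shift = 6;               /* 64 bits per word */
        unsigned int depths[] = { 256, 32 };

        for (unsigned int d = 0; d < 2; d++) {
            unsigned int depth = depths[d];
            unsigned int map_nr = (depth + (1U << shift) - 1) >> shift;

            printf("depth=%u, %u word(s):", depth, map_nr);
            for (unsigned int w = 0; w < map_nr; w++)
                printf(" %u", map_depth(depth, shift, map_nr, w));
            printf("\n");  /* depth=32 is a single word with only 32 usable bits, not 64 */
        }
        return 0;
    }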
10 days | block: ensure discard_granularity is zero when discard is not supported | Christoph Hellwig | 1 file | -3/+10
[ Upstream commit fad6551fcf537375702b9af012508156a16a1ff7 ] Documentation/ABI/stable/sysfs-block states: What: /sys/block/<disk>/queue/discard_granularity [...] A discard_granularity of 0 means that the device does not support discard functionality. but this got broken when sorting out the block limits updates. Fix this by setting the discard_granularity limit to zero when the combined max_discard_sectors is zero. Fixes: 3c407dc723bb ("block: default the discard granularity to sector size") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250731152228.873923-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
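The rule being restored can be expressed in a few lines; this is a toy model with a simplified limits struct, not the kernel's queue_limits handling.

    #include <stdio.h>

    /* Toy model of the relevant queue limits fields (not the kernel struct). */
    struct limits {
        unsigned int max_discard_sectors;
        unsigned int discard_granularity;
    };

    /* Sketch of the rule the fix enforces after the limits have been combined. */
    static void finalize_discard_limits(struct limits *lim)
    {
        if (!lim->max_discard_sectors)
            lim->discard_granularity = 0;  /* 0 == "discard not supported" per the sysfs ABI */
    }

    int main(void)
    {
        struct limits lim = { .max_discard_sectors = 0, .discard_granularity = 512 };

        finalize_discard_limits(&lim);
        printf("discard_granularity = %u\n", lim.discard_granularity);  /* prints 0 */
        return 0;
    }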
10 days | block: sanitize chunk_sectors for atomic write limits | John Garry | 1 file | -0/+6
[ Upstream commit 1de67e8e28fc47d71ee06ffa0185da549b378ffb ] Currently we just ensure that a non-zero value in chunk_sectors aligns with any atomic write boundary, as the blk boundary functionality uses both these values. However it is also improper to have atomic write unit max > chunk_sectors (for non-zero chunk_sectors), as this would lead to splitting of atomic write bios (which is disallowed). Sanitize atomic write unit max against chunk_sectors to avoid any potential problems. Fixes: d00eea91deaf3 ("block: Add extra checks in blk_validate_atomic_write_limits()") Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
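A rough sketch of the constraint in standalone C (the kernel's actual sanitization in blk_validate_atomic_write_limits() may differ in detail): the atomic write unit max is clamped so it never exceeds the chunk size, and kept a power of two so no atomic write bio ever needs splitting.

    #include <stdio.h>

    #define SECTOR_SHIFT 9

    /* Round down to a power of two (atomic write units are power-of-two sized). */
    static unsigned int rounddown_pow2(unsigned int v)
    {
        while (v & (v - 1))
            v &= v - 1;
        return v;
    }

    int main(void)
    {
        unsigned int chunk_sectors = 128;                 /* 64 KiB chunk/boundary */
        unsigned int atomic_write_unit_max = 128 * 1024;  /* bytes, larger than the chunk */
        unsigned int chunk_bytes = chunk_sectors << SECTOR_SHIFT;

        if (chunk_sectors && atomic_write_unit_max > chunk_bytes)
            atomic_write_unit_max = rounddown_pow2(chunk_bytes);

        printf("atomic_write_unit_max = %u bytes\n", atomic_write_unit_max);  /* 65536 */
        return 0;
    }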
2025-07-24 | block: fix kobject leak in blk_unregister_queue | Ming Lei | 1 file | -0/+1
[ Upstream commit 3051247e4faa32a3d90c762a243c2c62dde310db ] The kobject for the queue, `disk->queue_kobj`, is initialized with a reference count of 1 via `kobject_init()` in `blk_register_queue()`. While `kobject_del()` is called during the unregister path to remove the kobject from sysfs, the initial reference is never released. Add a call to `kobject_put()` in `blk_unregister_queue()` to properly decrement the reference count and fix the leak. Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250711083009.2574432-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
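The fix is purely about balancing the kobject reference counts; roughly, as a kernel-style sketch with a hypothetical helper name, not the literal upstream diff:

    /* Hypothetical helper illustrating the pairing in blk_unregister_queue(). */
    static void queue_kobj_teardown(struct gendisk *disk)
    {
        kobject_del(&disk->queue_kobj);  /* removes the sysfs entry, keeps the init reference */
        kobject_put(&disk->queue_kobj);  /* drops the reference taken by kobject_init() */
    }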
2025-06-27 | block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion | Damien Le Moal | 1 file | -0/+1
commit f705d33c2f0353039d03e5d6f18f70467d86080e upstream. When blk_zone_write_plug_bio_endio() is called for a regular write BIO used to emulate a zone append operation, that is, a BIO flagged with BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not cleared. Clear it to fully return the BIO to its original definition. Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
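In completion-path terms the fix amounts to clearing the flag at the same point where the operation code is restored; roughly, as a sketch rather than the exact upstream hunk:

    if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
        /* Restore the operation the submitter originally issued... */
        bio->bi_opf &= ~REQ_OP_MASK;
        bio->bi_opf |= REQ_OP_ZONE_APPEND;
        /* ...and drop the emulation marker so the BIO is fully back to normal. */
        bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
    }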
2025-06-27 | block: use plug request list tail for one-shot backmerge attempt | Jens Axboe | 1 file | -13/+13
commit 961296e89dc3800e6a3abc3f5d5bb4192cf31e98 upstream. Previously, the block layer stored the requests in the plug list in LIFO order. For this reason, blk_attempt_plug_merge() would check just the head entry for a back merge attempt, and abort after that unless requests for multiple queues existed in the plug list. Now that the plug list is kept in FIFO order (see the Fixes tag below), the head entry is the oldest request, so if more than one request is present in the plug list the one-shot back merging is less useful than before, as it'll always fail to find a quick merge candidate. Use the tail entry for the one-shot merge attempt, which is the last added request in the list. If that fails, abort immediately unless there are multiple queues available. If multiple queues are available, then scan the list. Ideally the latter scan would be a backwards scan of the list, but as it currently stands, the plug list is singly linked and hence this isn't easily feasible. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/ Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-19 | block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work | Christoph Hellwig | 1 file | -2/+5
[ Upstream commit cf625013d8741c01407bbb4a60c111b61b9fa69d ] Bios queued up in the zone write plug have already gone through all preparation in the submit_bio path, including the freeze protection. Submitting them through submit_bio_noacct_nocheck duplicates the work and can cause deadlocks when freezing a queue with pending bio write plugs. Go straight to ->submit_bio or blk_mq_submit_bio to bypass the superfluous extra freeze protection and checks. Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 | block: use q->elevator with ->elevator_lock held in elv_iosched_show() | Ming Lei | 1 file | -2/+1
[ Upstream commit 94209d27d14104ed828ca88cd5403a99162fe51a ] Use q->elevator with ->elevator_lock held in elv_iosched_show(), since the locally cached elevator reference may become stale after getting ->elevator_lock. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 | block: flip iter directions in blk_rq_integrity_map_user() | Caleb Sander Mateos | 1 file | -6/+1
[ Upstream commit 43a67dd812c5d3de163c0b6971046b4a4b633d3f ] blk_rq_integrity_map_user() creates the ubuf iter with ITER_DEST for write-direction operations and ITER_SOURCE for read-direction ones. This is backwards; writes use the user buffer as a source for metadata and reads use it as a destination. Switch to the rq_data_dir() helper, which maps writes to ITER_SOURCE (WRITE) and reads to ITER_DEST (READ). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: fe8f4ca7107e ("block: modify bio_integrity_map_user to accept iov_iter as argument") Link: https://lore.kernel.org/r/20250603184752.1185676-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
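As a kernel-style sketch of the idea (simplified; the exact arguments used in blk_rq_integrity_map_user() differ, and ubuf/bytes here are placeholders):

    struct iov_iter iter;
    int ddir = rq_data_dir(rq);  /* WRITE for writes, READ for reads */

    /*
     * WRITE maps to ITER_SOURCE: the user buffer supplies the metadata.
     * READ maps to ITER_DEST: the user buffer receives the metadata.
     */
    ret = import_ubuf(ddir, ubuf, bytes, &iter);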
2025-06-19 | blk-throttle: Fix wrong tg->[bytes/io]_disp update in __tg_update_carryover() | Zizhi Wo | 1 file | -6/+16
[ Upstream commit f66cf69eb8765341bbeff0e92a7d0d2027f62452 ] In commit 6cc477c36875 ("blk-throttle: carry over directly"), the carryover bytes/ios were carried over to [bytes/io]_disp. However, its update mechanism has some issues. In __tg_update_carryover(), we calculate "bytes" and "ios" to represent the carryover, but the computation when updating [bytes/io]_disp is incorrect. And if sq->nr_queued is empty, we may not update tg->[bytes/io]_disp to 0 in tg_update_carryover(). We should set it to 0 in the non-carryover case. This patch fixes the issue. Fixes: 6cc477c36875 ("blk-throttle: carry over directly") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250417132054.2866409-2-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-16 | Merge tag 'scsi-fixes' of | Linus Torvalds | 1 file | -1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "Fix to zoned block devices to make the maximum segment count match what the block layer is capable of" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer
2025-05-16 | Merge tag 'block-6.15-20250515' of git://git.kernel.dk/linux | Linus Torvalds | 1 file | -15/+47
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - fixes for atomic writes (Alan Adamson) - fixes for polled CQs in nvmet-epf (Damien Le Moal) - fix for polled CQs in nvme-pci (Keith Busch) - fix compile on odd configs that need to be forced to inline (Kees Cook) - one more quirk (Ilya Guterman) - Fix for missing allocation of an integrity buffer for some cases - Fix for a regression with ublk command cancelation * tag 'block-6.15-20250515' of git://git.kernel.dk/linux: ublk: fix dead loop when canceling io command nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro nvme: all namespaces in a subsystem must adhere to a common atomic write size nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ nvmet: pci-epf: improve debug message nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq() nvmet: pci-epf: do not fall back to using INTX if not supported nvmet: pci-epf: clear completion queue IRQ flag on delete nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable nvme-pci: make nvme_pci_npages_prp() __always_inline block: always allocate integrity buffer when required
2025-05-12 | scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer | Steve Siwinski | 1 file | -1/+1
The REPORT ZONES buffer size is currently limited by the HBA's maximum segment count to ensure the buffer can be mapped. However, the block layer further limits the number of iovec entries to 1024 when allocating a bio. To avoid allocation of buffers too large to be mapped, further restrict the maximum buffer size to BIO_MAX_INLINE_VECS. Replace the UIO_MAXIOV symbolic name with the more contextually appropriate BIO_MAX_INLINE_VECS. Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation") Cc: stable@vger.kernel.org Signed-off-by: Steve Siwinski <ssiwinski@atto.com> Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-05-12 | block: always allocate integrity buffer when required | Keith Busch | 1 file | -15/+47
Many nvme metadata formats can not strip or generate the metadata on the controller side. For these formats, a host provided integrity buffer is mandatory even if it isn't checked. The block integrity read_verify and write_generate attributes prevent allocating the metadata buffer, but we need it when the format requires it, otherwise reads and writes will be rejected by the driver with IO errors. Assume the integrity buffer can be offloaded to the controller if the metadata size is the same as the protection information size. Otherwise provide an unchecked host buffer when the read verify or write generation attributes are disabled. This fixes the following nvme warning: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 371 at drivers/nvme/host/core.c:1036 nvme_setup_rw+0x122/0x210 ... RIP: 0010:nvme_setup_rw+0x122/0x210 ... Call Trace: <TASK> nvme_setup_cmd+0x1b4/0x280 nvme_queue_rqs+0xc4/0x1f0 [nvme] blk_mq_dispatch_queue_requests+0x24a/0x430 blk_mq_flush_plug_list+0x50/0x140 __blk_flush_plug+0xc1/0x100 __submit_bio+0x1c1/0x360 ? submit_bio_noacct_nocheck+0x2d6/0x3c0 submit_bio_noacct_nocheck+0x2d6/0x3c0 ? submit_bio_noacct+0x47/0x4c0 submit_bio_wait+0x48/0xa0 __blkdev_direct_IO_simple+0xee/0x210 ? current_time+0x1d/0x100 ? current_time+0x1d/0x100 ? __bio_clone+0xb0/0xb0 blkdev_read_iter+0xbb/0x140 vfs_read+0x239/0x310 ksys_read+0x58/0xc0 do_syscall_64+0x6c/0x180 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250509153802.3482493-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
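The resulting policy can be summarised as a small decision table; the sketch below is a toy standalone model of the logic described above, not the kernel code.

    #include <stdbool.h>
    #include <stdio.h>

    /* Does the host have to supply a metadata buffer for this format? */
    static bool need_host_buffer(unsigned int metadata_size, unsigned int pi_size,
                                 bool read_verify, bool write_generate)
    {
        if (metadata_size == pi_size)
            return read_verify || write_generate;  /* offloadable when checks are off */
        return true;  /* extra metadata beyond PI: always hand the device a buffer */
    }

    int main(void)
    {
        printf("%d\n", need_host_buffer(16, 8, false, false)); /* 1: unchecked buffer still needed */
        printf("%d\n", need_host_buffer(8, 8, false, false));  /* 0: controller can handle it */
        printf("%d\n", need_host_buffer(8, 8, true, false));   /* 1: checks enabled, buffer as usual */
        return 0;
    }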
2025-05-09 | Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux | Linus Torvalds | 2 files | -6/+3
Pull block fixes from Jens Axboe: - Fix for a regression in this series for loop and read/write iterator handling - zone append block update tweak - remove a broken IO priority test - NVMe pull request via Christoph: - unblock ctrl state transition for firmware update (Daniel Wagner) * tag 'block-6.15-20250509' of git://git.kernel.dk/linux: block: remove test of incorrect io priority level nvme: unblock ctrl state transition for firmware update block: only update request sector if needed loop: Add sanity check for read/write_iter
2025-05-08 | block: remove test of incorrect io priority level | Aaron Lu | 1 file | -5/+1
Ever since commit eca2040972b4 ("scsi: block: ioprio: Clean up interface definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to something between 0 and 7, so necessarily level will always be lower than IOPRIO_NR_LEVELS (8). Remove this obsolete check. Reported-by: Kexin Wei <ys.weikexin@h3c.com> Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance Signed-off-by: Jens Axboe <axboe@kernel.dk>
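Why the check is dead code follows directly from the masking; the sketch below uses simplified local copies of the relevant uapi definitions rather than including the kernel header.

    #include <stdio.h>

    /* Simplified local copies of the uapi ioprio definitions. */
    #define IOPRIO_NR_LEVELS      8
    #define IOPRIO_LEVEL_MASK     (IOPRIO_NR_LEVELS - 1)        /* 0x7 */
    #define IOPRIO_PRIO_LEVEL(p)  ((p) & IOPRIO_LEVEL_MASK)

    int main(void)
    {
        unsigned int p;

        /* Even deliberately bogus input can never reach IOPRIO_NR_LEVELS. */
        for (p = 0; p < 32; p++)
            if (IOPRIO_PRIO_LEVEL(p) >= IOPRIO_NR_LEVELS)
                printf("level out of range for %u\n", p);        /* never prints */

        printf("max masked level = %u\n", IOPRIO_PRIO_LEVEL(~0u)); /* 7 */
        return 0;
    }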
2025-05-06 | block: only update request sector if needed | Johannes Thumshirn | 1 file | -1/+2
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the emulation layer in the zone write plugging code, the sector the data got written to by the device needs to be updated in the bio. At the moment, this is done for every native ZONE APPEND write and every request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this superfluously updates the sector for regular writes to a zoned block device. Check if a bio is a native ZONE APPEND write or if the bio is flagged as 'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging code handles the ZONE APPEND and translates it into a regular write and back. Only if one of these two criteria is met, update the sector in the bio upon completion. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
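The completion-path condition boils down to something like this kernel-style sketch; names follow the upstream code but details may differ.

    static void blk_zone_update_request_bio(struct request *rq, struct bio *bio)
    {
        /*
         * Only a native zone append, or a write that zone write plugging
         * translated from a zone append (BIO_EMULATES_ZONE_APPEND), needs
         * its sector updated to where the device actually wrote the data.
         */
        if (req_op(rq) == REQ_OP_ZONE_APPEND ||
            bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
            bio->bi_iter.bi_sector = blk_rq_pos(rq);
    }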
2025-04-25 | Merge tag 'block-6.15-20250424' of git://git.kernel.dk/linux | Linus Torvalds | 7 files | -20/+89
Pull block fixes from Jens Axboe: - Fix autoloading of drivers from stat*(2) - Fix losing the read-ahead setting on suspend/resume, when a device is re-probed. - Fix race between setting the block size and page cache updates. Includes a helper that a coming XFS fix will use as well. - ublk cancelation fixes. - ublk selftest additions and fixes. - NVMe pull via Christoph: - fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger) * tag 'block-6.15-20250424' of git://git.kernel.dk/linux: ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA block: don't autoload drivers on blk-cgroup configuration block: don't autoload drivers on stat block: remove the backing_inode variable in bdev_statx block: move blkdev_{get,put}_no_open prototypes out of blkdev.h block: never reduce ra_pages in blk_apply_bdi_limits selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx' selftests: ublk: fix recover test block: hoist block size validation code to a separate function block: fix race between set_blocksize and read paths nvmet: fix out-of-bounds access in nvmet_enable_port
2025-04-24 | block: don't autoload drivers on blk-cgroup configuration | Christoph Hellwig | 1 file | -1/+1
Loading a driver just to configure blk-cgroup doesn't make sense, as that assumes an already existing device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 | block: don't autoload drivers on stat | Christoph Hellwig | 4 files | -7/+7
blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: 9abcfbd235f5 ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 | block: remove the backing_inode variable in bdev_statx | Christoph Hellwig | 1 file | -7/+4
backing_inode is only used once, so remove it and update the comment describing the bdev lookup to be a bit more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 | block: move blkdev_{get,put}_no_open prototypes out of blkdev.h | Christoph Hellwig | 1 file | -0/+3
These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 | block: never reduce ra_pages in blk_apply_bdi_limits | Christoph Hellwig | 1 file | -1/+7
When the user increased the read-ahead size through sysfs this value currently gets lost if the device is reprobed, including on a resume from suspend. As there is no hardware limitation for the read-ahead size there is no real need to reset it or track a separate hardware limitation like for max_sectors. This restores the pre-atomic queue limit behavior in the sd driver as sd did not use blk_queue_io_opt and thus never updated the read ahead size to the value based on the optimal I/O, but changes behavior for all other drivers. As the new behavior seems useful and sd is the driver for which the readahead size tweaks are most useful, that seems like a worthwhile trade-off. Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API") Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23 | block: hoist block size validation code to a separate function | Darrick J. Wong | 1 file | -6/+27
Hoist the block size validation code to bdev_validate_blocksize so that we can call it from filesystems that don't care about the bdev pagecache manipulations of set_blocksize. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>
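The kinds of checks such a helper performs look roughly like this standalone sketch; the kernel's bdev_validate_blocksize() is the authoritative version and the 64k upper bound here is only an assumed example.

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified validation rules: bounded, power of two, not below the device's needs. */
    static bool blocksize_valid(unsigned int size, unsigned int logical_block_size,
                                unsigned int max_supported)
    {
        if (size < 512 || size > max_supported)
            return false;
        if (size & (size - 1))                 /* must be a power of two */
            return false;
        return size >= logical_block_size;     /* cannot be below the logical block size */
    }

    int main(void)
    {
        printf("%d\n", blocksize_valid(4096, 512, 65536));  /* 1 */
        printf("%d\n", blocksize_valid(3072, 512, 65536));  /* 0: not a power of two */
        printf("%d\n", blocksize_valid(512, 4096, 65536));  /* 0: below the logical block size */
        return 0;
    }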
2025-04-23 | block: fix race between set_blocksize and read paths | Darrick J. Wong | 4 files | -1/+43
With the new large sector size support, it's now the case that set_blocksize can change i_blksize and the folio order in a manner that conflicts with a concurrent reader and causes a kernel crash. Specifically, let's say that udev-worker calls libblkid to detect the labels on a block device. The read call can create an order-0 folio to read the first 4096 bytes from the disk. But then udev is preempted. Next, someone tries to mount an 8k-sectorsize filesystem from the same block device. The filesystem calls set_blocksize, which sets i_blksize to 8192 and the minimum folio order to 1. Now udev resumes, still holding the order-0 folio it allocated. It then tries to schedule a read bio and do_mpage_readahead tries to create bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because the page size is 4096 but the blocksize is 8192 so no bufferheads are attached and the bh walk never sets bdev. We then submit the bio with a NULL block device and crash. Therefore, truncate the page cache after flushing but before updating i_blksize. However, that's not enough -- we also need to lock out file IO and page faults during the update. Take both the i_rwsem and the invalidate_lock in exclusive mode for invalidations, and in shared mode for read/write operations. I don't know if this is the correct fix, but xfs/259 found it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>
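The shape of the fix is an ordering one; roughly, as a kernel-style sketch of the ordering only (the real set_blocksize() differs in detail, and the helper name here is hypothetical):

    /* Illustrative ordering only; not the real set_blocksize(). */
    static void change_bdev_blocksize(struct block_device *bdev, struct inode *inode, int size)
    {
        inode_lock(inode);                              /* lock out file I/O on the bdev */
        filemap_invalidate_lock(inode->i_mapping);      /* lock out page faults */

        sync_blockdev(bdev);                            /* flush dirty folios at the old size */
        truncate_inode_pages(inode->i_mapping, 0);      /* drop cached folios before switching */

        inode->i_blkbits = blksize_bits(size);          /* no reader can now hold a stale-order folio */

        filemap_invalidate_unlock(inode->i_mapping);
        inode_unlock(inode);
    }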
2025-04-19 | Merge tag 'vfs-6.15-rc3.fixes.2' of | Linus Torvalds | 1 file | -2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Revert the hfs{plus} deprecation warning that's also included in this pull request. The commit introducing the deprecation warning resides rather early in this branch. So simply dropping it would've rebased all other commits which I decided to avoid. Hence the revert in the same branch [ Background - the deprecation warning discussion resulted in people stepping up, and so hfs{plus} will have a maintainer taking care of it after all.. - Linus ] - Switch CONFIG_SYSFS_SYCALL default to n and decouple from CONFIG_EXPERT - Fix an audit bug caused by changes to our kernel path lookup helpers this cycle. Audit needs the parent path even if the dentry it tried to look up is negative - Ensure that the kernel path lookup helpers leave the passed in path argument clean when they return an error. This is consistent with all our other helpers - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant information is available to kernel consumers as well - Don't set a timer and call schedule() if the timer will expire immediately in epoll - Make netfs lookup tables with __nonstring * tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Revert "hfs{plus}: add deprecation warning" fs: move the bdex_statx call to vfs_getattr_nosec netfs: Mark __nonstring lookup tables eventpoll: Set epoll timeout if it's in the future fs: ensure that *path_locked*() helpers leave passed path pristine fs: add kern_path_locked_negative() hfs{plus}: add deprecation warning Kconfig: switch CONFIG_SYSFS_SYCALL default to n
2025-04-18 | Merge tag 'block-6.15-20250417' of git://git.kernel.dk/linux | Linus Torvalds | 3 files | -11/+9
Pull block fixes from Jens Axboe: - MD pull via Yu: - fix raid10 missing discard IO accounting (Yu Kuai) - fix bitmap stats for bitmap file (Zheng Qixing) - fix oops while reading all member disks failed during check/repair (Meir Elisha) - NVMe pull via Christoph: - fix scan failure for non-ANA multipath controllers (Hannes Reinecke) - fix multipath sysfs links creation for some cases (Hannes Reinecke) - PCIe endpoint fixes (Damien Le Moal) - use NULL instead of 0 in the auth code (Damien Le Moal) - Various ublk fixes: - Slew of selftest additions - Improvements and fixes for IO cancelation - Tweak to Kconfig verbiage - Fix for page dirtying for blk integrity mapped pages - loop fixes: - buffered IO fix - uevent fixes - request priority inheritance fix - Various little fixes * tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits) selftests: ublk: add generic_06 for covering fault inject ublk: simplify aborting ublk request ublk: remove __ublk_quiesce_dev() ublk: improve detection and handling of ublk server exit ublk: move device reset into ublk_ch_release() ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io ublk: add ublk_force_abort_dev() ublk: properly serialize all FETCH_REQs selftests: ublk: move creating UBLK_TMP into _prep_test() selftests: ublk: add test_stress_05.sh selftests: ublk: support user recovery selftests: ublk: support target specific command line selftests: ublk: increase max nr_queues and queue depth selftests: ublk: set queue pthread's cpu affinity selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN selftests: ublk: add two stress tests for zero copy feature selftests: ublk: run stress tests in parallel selftests: ublk: make sure _add_ublk_dev can return in sub-shell selftests: ublk: cleanup backfile automatically selftests: ublk: add io_uring uapi header ...
2025-04-17 | fs: move the bdex_statx call to vfs_getattr_nosec | Christoph Hellwig | 1 file | -2/+1
Currently bdev_statx is only called from the very high-level vfs_statx_path function, and is thus bypassed for in-kernel calls to vfs_getattr or vfs_getattr_nosec. This breaks querying the block size of the underlying device in the loop driver and also is a pitfall for any other new kernel caller. Move the call into the lowest level helper to ensure all callers get the right results. Fixes: 2d985f8c6b91 ("vfs: support STATX_DIOALIGN on block devices") Fixes: f4774e92aab8 ("loop: take the file system minimum dio alignment into account") Reported-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-16 | block: integrity: Do not call set_page_dirty_lock() | Martin K. Petersen | 1 file | -11/+6
Placing multiple protection information buffers inside the same page can lead to oopses because set_page_dirty_lock() can't be called from interrupt context. Since a protection information buffer is not backed by a file there is no point in setting its page dirty, there is nothing to synchronize. Drop the call to set_page_dirty_lock() and remove the last argument to bio_integrity_unpin_bvec(). Cc: stable@vger.kernel.org Fixes: 492c5d455969 ("block: bio-integrity: directly map user buffers") Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
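After the change, releasing the pinned PI pages is just an unpin with no dirtying; roughly, as a sketch rather than the exact upstream function:

    static void bio_integrity_unpin_bvec(struct bio_vec *bv, int nr_vecs)
    {
        int i;

        for (i = 0; i < nr_vecs; i++)
            unpin_user_page(bv[i].bv_page);  /* no set_page_dirty_lock(): nothing to sync back */
    }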
2025-04-14 | block: fix resource leak in blk_register_queue() error path | Zheng Qixing | 1 file | -0/+2
When registering a queue fails after blk_mq_sysfs_register() is successful but the function later encounters an error, we need to clean up the blk_mq_sysfs resources. Add the missing blk_mq_sysfs_unregister() call in the error path to properly clean up these resources and prevent a memory leak. Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250412092554.475218-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-14 | block: add SPDX header line to blk-throttle.h | Bird, Tim | 1 file | -0/+1
Add an SPDX license identifier line to blk-throttle.h Use 'GPL-2.0' as the identifier, since blk-throttle.c uses that, and blk.h (from which some material was copied when blk-throttle.h was created) also uses that identifier. Signed-off-by: Tim Bird <tim.bird@sony.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/MW5PR13MB5632EE4645BCA24ED111EC0EFDB62@MW5PR13MB5632.namprd13.prod.outlook.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-05 | treewide: Switch/rename to timer_delete[_sync]() | Thomas Gleixner | 4 files | -5/+5
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-03 | block: don't grab elevator lock during queue initialization | Ming Lei | 1 file | -7/+17
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. queue freeze lock depends on fs_reclaim. So don't grab elevator lock during queue initialization which needs to call kmalloc(GFP_KERNEL), and we can cut the dependency between ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. This way is safe because elevator setting isn't ready to run during queue initialization. There isn't such issue in __blk_mq_update_nr_hw_queues() because memalloc_noio_save() is called before acquiring elevator lock. Fixes the following lockdep warning: https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/ Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01 | block: remove unused nseg parameter | Nitesh Shetty | 1 file | -3/+2
We are no longer using nr_segs, after blk_mq_attempt_bio_merge was moved out of blk_mq_get_new_request. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Link: https://lore.kernel.org/r/20250401044348.15588-1-nj.shetty@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-26 | Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux | Linus Torvalds | 39 files | -800/+1180
Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...
2025-03-25 | Merge tag 'crc-for-linus' of | Linus Torvalds | 2 files | -2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: "Another set of improvements to the kernel's CRC (cyclic redundancy check) code: - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect - Add kunit test cases for crc64_nvme and crc7 - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c() - Remove unnecessary prompts from some of the CRC kconfig options - Further optimize the x86 crc32c code" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits) x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512 lib/crc: remove unnecessary prompt for CONFIG_CRC64 lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C lib/crc: remove unnecessary prompt for CONFIG_CRC8 lib/crc: remove unnecessary prompt for CONFIG_CRC7 lib/crc: remove unnecessary prompt for CONFIG_CRC4 lib/crc7: unexport crc7_be_syndrome_table lib/crc_kunit.c: update comment in crc_benchmark() lib/crc_kunit.c: add test and benchmark for crc7_be() x86/crc32: optimize tail handling for crc32c short inputs riscv/crc64: add Zbc optimized CRC64 functions riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function riscv/crc32: reimplement the CRC32 functions using new template riscv/crc: add "template" for Zbc optimized CRC functions x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings x86/crc32: improve crc32c_arch() code generation with clang x86/crc64: implement crc64_be and crc64_nvme using new template x86/crc-t10dif: implement crc_t10dif using new template x86/crc32: implement crc32_le using new template x86/crc: add "template" for [V]PCLMULQDQ based CRC functions ...
2025-03-25 | Merge tag 'timers-cleanups-2025-03-23' of | Linus Torvalds | 2 files | -5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-24 | Merge tag 'cgroup-for-6.15' of | Linus Torvalds | 2 files | -18/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Add deprecation info messages to cgroup1-only features - rstat updates including a bug fix and breaking up a critical section to reduce interrupt latency impact - Other misc and doc updates * tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: Cleanup flushing functions and locking cgroup/rstat: avoid disabling irqs for O(num_cpu) mm: Fix a build breakage in memcontrol-v1.c blk-cgroup: Simplify policy files registration cgroup: Update file naming comment cgroup: Add deprecation message to legacy freezer controller mm: Add transformation message for per-memcg swappiness RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level cgroup/cpuset-v1: Add deprecation messages to memory_migrate cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall cgroup: Print message when /proc/cgroups is read on v2-only system cgroup/blkio: Add deprecation messages to reset_stats cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers cgroup/rstat: Fix forceidle time in cpu.stat cgroup/misc: Remove unused misc_cg_res_total_usage cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c cgroup: update comment about dropping cgroup kn refs
2025-03-24 | Merge tag 'vfs-6.15-rc1.pagesize' of | Linus Torvalds | 1 file | -5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-19 | block/blk-iocost: ensure 'ret' is set on error | Jens Axboe | 1 file | -1/+3
In case blkg_conf_open_bdev_frozen() fails, ioc_qos_write() jumps to the error path without assigning a value to 'ret'. Ensure that 'ret' inherits the error value passed back from that call. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503200454.QWpwKeJu-lkp@intel.com/ Fixes: 9730763f4756 ("block: correct locking order for protecting blk-wbt parameters") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 | block: correct locking order for protecting blk-wbt parameters | Nilay Shroff | 3 files | -13/+58
The commit '245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock")' introduced q->elevator_lock to protect updates to blk-wbt parameters when writing to the sysfs attribute wbt_lat_usec and the cgroup attribute io.cost.qos. However, both these attributes also acquire q->rq_qos_mutex, leading