path: root/block
Age | Commit message | Author | Files | Lines
2020-09-23block: allow 'chunk_sectors' to be non-power-of-2Mike Snitzer1-6/+4
It is possible, albeit unlikely, for a block device to have a non-power-of-2 chunk_sectors (e.g. a 10+2 RAID6 with 128K chunk_sectors, which results in a full-stripe size of 1280K. This causes the RAID6's io_opt to be advertised as 1280K, and a stacked device _could_ then be made to use a blocksize, aka chunk_sectors, that matches the non-power-of-2 io_opt of the underlying RAID6 -- resulting in the stacked device's chunk_sectors being a non-power-of-2). Update blk_queue_chunk_sectors() and blk_max_size_offset() to accommodate drivers that need a non-power-of-2 chunk_sectors. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
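Below is a minimal sketch (not the verbatim kernel code) of how blk_max_size_offset() can accommodate a non-power-of-2 chunk_sectors: the fast power-of-2 mask is kept, and a division is used otherwise.

    static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                   sector_t offset)
    {
        unsigned int chunk_sectors = q->limits.chunk_sectors;

        if (!chunk_sectors)
            return q->limits.max_sectors;

        /* remaining sectors in the chunk that 'offset' falls into */
        if (likely(is_power_of_2(chunk_sectors)))
            chunk_sectors -= offset & (chunk_sectors - 1);
        else
            chunk_sectors -= sector_div(offset, chunk_sectors);

        return min(q->limits.max_sectors, chunk_sectors);
    }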
2020-09-23block: use lcm_not_zero() when stacking chunk_sectorsMike Snitzer1-4/+8
Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using lcm_not_zero() rather than min_not_zero() -- otherwise the final 'chunk_sectors' could result in sub-optimal alignment of IO to component devices in the IO stack. Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size' then it is a bug in the driver and the device should be flagged as 'misaligned'. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
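A hedged sketch of the stacking rule described above, as it might appear inside blk_stack_limits() (t = stacked/top limits, b = bottom device limits, ret = the function's return value; details are illustrative):

    /* stack chunk_sectors the same way as io_opt: lcm, ignoring zeroes */
    t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);

    /*
     * A chunk_sectors that is not a multiple of the physical block size
     * indicates a driver bug; flag the stacked device as misaligned.
     */
    if ((t->chunk_sectors << 9) & (t->physical_block_size - 1)) {
        t->chunk_sectors = 0;
        t->misaligned = 1;
        ret = -1;
    }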
2020-09-23block: fix bmd->is_null_mapped initializationChristoph Hellwig1-2/+1
bmd is allocated using kmalloc in bio_alloc_map_data, so make sure is_null_mapped is properly initialized to false for the !null_mapped case. Fixes: f3256075ba49 ("block: remove the BIO_NULL_MAPPED flag") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: drop double zeroingJulia Lawall1-1/+1
sg_init_table() zeroes its first argument, so the allocation of that argument doesn't have to. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/)

    // <smpl>
    @@
    expression x;
    @@
    x =
    - kzalloc
    + kmalloc
      (...)
    ...
    sg_init_table(x,...)
    // </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
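An illustrative before/after of the C pattern the semantic patch rewrites (variable names are hypothetical):

    /* before: the zeroing done by kzalloc() is redundant */
    sgl = kzalloc(sizeof(*sgl) * nents, GFP_KERNEL);
    if (!sgl)
        return -ENOMEM;
    sg_init_table(sgl, nents);    /* memsets the table anyway */

    /* after */
    sgl = kmalloc(sizeof(*sgl) * nents, GFP_KERNEL);
    if (!sgl)
        return -ENOMEM;
    sg_init_table(sgl, nents);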
2020-09-15scsi: sd: sd_zbc: Fix handling of host-aware ZBC disksDamien Le Moal1-0/+46
When CONFIG_BLK_DEV_ZONED is disabled, allow using host-aware ZBC disks as regular disks. In this case, ensure that command completion is correctly executed by changing sd_zbc_complete() to return good_bytes instead of 0, which was causing a hang during device probe (endless retries). When CONFIG_BLK_DEV_ZONED is enabled and a host-aware disk is detected to have partitions, it will be used as a regular disk. In this case, make sure not to do anything in sd_zbc_revalidate_zones() as that triggers warnings. Since all these different cases result in subtle settings of the disk queue zoned model, introduce the block layer helper function blk_queue_set_zoned() to generically implement setting up the effective zoned model according to the disk type, the presence of partitions on the disk and the CONFIG_BLK_DEV_ZONED configuration. Link: https://lore.kernel.org/r/20200915073347.832424-2-damien.lemoal@wdc.com Fixes: b72053072c0b ("block: allow partitions on host aware zone devices") Cc: <stable@vger.kernel.org> Reported-by: Borislav Petkov <bp@alien8.de> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
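A hedged sketch of the kind of helper the commit describes; the exact in-tree implementation and helper names may differ:

    void blk_queue_set_zoned(struct gendisk *disk, enum blk_zoned_model model)
    {
        switch (model) {
        case BLK_ZONED_HM:
            /* host-managed only makes sense with CONFIG_BLK_DEV_ZONED */
            WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED));
            break;
        case BLK_ZONED_HA:
            /*
             * Host-aware disks are treated as regular disks when zoned
             * support is compiled out or partitions are present.
             */
            if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) ||
                disk_has_partitions(disk))
                model = BLK_ZONED_NONE;
            break;
        case BLK_ZONED_NONE:
        default:
            break;
        }
        disk->queue->limits.zoned = model;
    }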
2020-09-14blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimitedBaolin Wang1-0/+12
There is no need to check the bps or iops limitation if bps or iops is unlimited. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Avoid calculating bps/iops limitation repeatedlyBaolin Wang1-9/+10
The tg_may_dispatch() will call tg_with_in_bps_limit() and tg_with_in_iops_limit() to check if we can dispatch a bio or not, which will calculate bps/iops limitation multiple times. But tg_may_dispatch() is always called under queue lock, which means the bps/iops limitation will not change in tg_may_dispatch(). So we can calculate the bps/iops limitation only once, and pass them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to avoid calculating bps/iops limitation repeatedly. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
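A condensed sketch of the refactoring: the limits are read once in tg_may_dispatch(), which runs under the queue lock, and then passed down (the real functions carry more state than shown):

    /* inside tg_may_dispatch(), rw = bio_data_dir(bio) */
    u64 bps_limit = tg_bps_limit(tg, rw);
    u32 iops_limit = tg_iops_limit(tg, rw);

    /* the helpers no longer re-derive the limits themselves */
    if (tg_with_in_bps_limit(tg, bio, bps_limit, &bps_wait) &&
        tg_with_in_iops_limit(tg, bio, iops_limit, &iops_wait)) {
        if (wait)
            *wait = 0;
        return true;
    }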
2020-09-14blk-throttle: Define readable macros instead of static variablesBaolin Wang1-5/+5
The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only variables, so it is better to use readable macros instead of static variables, which can also save some space in the .bss area. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
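The change amounts to the following shape (values as used by blk-throttle; shown purely as an illustration):

    /* before: read-only values still occupy storage */
    static int throtl_grp_quantum = 8;
    static int throtl_quantum = 32;

    /* after: compile-time constants */
    #define THROTL_GRP_QUANTUM	8
    #define THROTL_QUANTUM		32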
2020-09-14blk-throttle: Use readable READ/WRITE macrosBaolin Wang1-2/+2
Use readable READ/WRITE macros instead of magic numbers. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Fix some comments' typosBaolin Wang1-7/+7
Fix some comments' typos. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()Tejun Heo1-3/+9
adjust_inuse_and_calc_cost() is responsible for reducing the amount of donated weights dynamically in period as the budget runs low. Because we don't want to do full donation calculation in period, we keep latching up inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the resulting hweight_inuse is satisfactory. Unfortunately, the adj_step calculation was reading the active weight before acquiring ioc->lock. Because the current thread could have lost race to activate the iocg to another thread before entering this function, it may read the active weight as zero before acquiring ioc->lock. When this happens, the adj_step is calculated as zero and the incremental adjustment loop becomes an infinite one. Fix it by fetching the active weight after acquiring ioc->lock. Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
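A hedged sketch of the ordering problem and fix (field and constant names follow the commit message; surrounding code omitted):

    /* before (buggy): iocg->active may still read as zero here */
    u32 adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);
    spin_lock_irq(&ioc->lock);

    /* after: sample the active weight only once ioc->lock is held */
    spin_lock_irq(&ioc->lock);
    adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);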
2020-09-11blk-iocost: fix divide-by-zero in transfer_surpluses()Tejun Heo1-4/+10
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but all hweight calculations round up and thus it may end up >= WEIGHT_ONE triggering divide-by-zero and other issues. Bound the value to avoid surprises. Fixes: e08d02aa5fc9 ("blk-iocost: implement Andy's method for donation weight updates") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11block: introduce part_[begin|end]_io_acctSong Liu1-6/+33
These functions can be used to enable iostat for partitions on devices like md, bcache. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-blockLinus Torvalds4-15/+5
Pull block fixes from Jens Axboe:

 - Fix a regression in bdev partition locking (Christoph)

 - NVMe pull request from Christoph:
     - cancel async events before freeing them (David Milburn)
     - revert a broken race fix (James Smart)
     - fix command processing during resets (Sagi Grimberg)

 - Fix a kyber crash with requeued flushes (Omar)

 - Fix __bio_try_merge_page() same_page error for no merging (Ritesh)

* tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
  block: Set same_page to false in __bio_try_merge_page if ret is false
  nvme-fabrics: allow to queue requests for live queues
  block: only call sched requeue_request() for scheduled requests
  nvme-tcp: cancel async events before freeing event struct
  nvme-rdma: cancel async events before freeing event struct
  nvme-fc: cancel async events before freeing event struct
  nvme: Revert: Fix controller creation races with teardown flow
  block: restore a specific error code in bdev_del_partition
2020-09-11blk-mq: always allow reserved allocation in hctx_may_queueMing Lei2-3/+5
NVMe shares a tagset between the fabric queue and the admin queue, or between connect_q and the NS queue, so hctx_may_queue() can be called to allocate requests for these queues. Tags can be reserved in these tagsets. Before error recovery, there are often lots of in-flight requests which can't be completed, and a new reserved request may be needed in the error recovery path. However, hctx_may_queue() can always return false because there are too many in-flight requests which can't be completed during error handling. Finally, nothing can proceed. Fix this issue by always allowing reserved tag allocation in hctx_may_queue(). This is reasonable because reserved tags are supposed to always be available. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: David Milburn <dmilburn@redhat.com> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
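The shape of the fix, sketched at the tag allocation path (the exact placement of the check in the kernel may differ):

    /*
     * Reserved tags are expected to always be available, so skip the
     * hctx fairness check for BLK_MQ_REQ_RESERVED allocations.
     */
    if (!(data->flags & BLK_MQ_REQ_RESERVED) &&
        !hctx_may_queue(data->hctx, bt))
        return BLK_MQ_NO_TAG;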
2020-09-11block: remove duplicate include statement in scsi_ioctl.cTian Tao1-2/+0
scsi/sg.h is included more than once; remove the duplicate include. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10blkcg: add plugging support for punt bioXianting Tian1-0/+9
The test and the explanation of the patch are as below.

Before the test we added more debug code in blkg_async_bio_workfn():

    int count = 0;

    if (bios.head && bios.head->bi_next) {
        need_plug = true;
        blk_start_plug(&plug);
    }
    while ((bio = bio_list_pop(&bios))) {
        /* io_punt is a sysctl user interface to control the print */
        if (io_punt) {
            printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
                current->comm, current->pid, bio->bi_iter.bi_sector,
                (bio->bi_iter.bi_size)>>9, count++, need_plug);
        }
        submit_bio(bio);
    }
    if (need_plug)
        blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:

    mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
    mount -t cgroup2 nodev /cgroup2
    mkdir /cgroup2/cg3
    echo "+io" > /cgroup2/cgroup.subtree_control
    echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
    echo $$ > /cgroup2/cg3/cgroup.procs

Then use the dd command to test btrfs PUNT io in the current shell:

    dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:

    [root@localhost btrfs]# lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              32
    On-line CPU(s) list: 0-31
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel

With the above debug code, test command and test environment, I did the tests under 3 different system loads, which are triggered by stress:

1, Run 64 threads by command "stress -c 64 &"

    [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
    [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
    [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
    [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
    [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
    [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
    ... ...
    [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
    [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
    [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
    [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
    [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
    [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
    [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"

    [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
    [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
    [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
    [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
    [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
    [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
    ... ...
    [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
    [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
    [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
    [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
    [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
    [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
    [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
    [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Don't run any threads by stress

    [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
    [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
    [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
    [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
    [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
    [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
    [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
    [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
    [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
    [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
    [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
    [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
    [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
    [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
    [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
    [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
    [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
    [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
    [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
    [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
    [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of the above 3 test results with different system load:

From the above tests, we can see that more and more continuous bios can be plugged as the system load increases. When running "stress -c 64 &", 310 continuous bios are plugged; when running "stress -c 32 &", 260 continuous bios are plugged; when stress is not run, at most only 2 continuous bios are plugged, and in most cases bio_list only contains one single bio.

How to explain the above phenomenon: we know that in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will queue a work to the workqueue blkcg_punt_bio_wq. But when the workqueue is scheduled depends on the system load. When the system load is low, the workqueue will be quickly scheduled, and the bio in bio_list will be quickly processed in blkg_async_bio_workfn(), so there is less chance that the same io submit thread can add multiple continuous bios to bio_list before the workqueue is scheduled to run. This analysis is aligned with test "3" above. When the system load is high, there is some delay before the workqueue can be scheduled to run; the higher the system load, the greater the delay. So there is more chance that the same io submit thread can add multiple continuous bios to bio_list. Then when the workqueue is scheduled to run, there are more continuous bios in bio_list, which will be processed in blkg_async_bio_workfn(). This analysis is aligned with tests "1" and "2" above.

According to the tests, we get io performance improvements with the patch, especially when the system load is higher. Another optimization is to use the plug only when bio_list contains at least 2 bios. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10block: add a bdev_check_media_change helperChristoph Hellwig1-1/+28
Like check_disk_changed, except that it does not call ->revalidate_disk but leaves that to the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-09block: Set same_page to false in __bio_try_merge_page if ret is falseRitesh Harjani1-1/+3
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway not merging this page in this bio, then it makes sense to also set same_page to false before returning. Without this patch, we hit the below WARNING in iomap. This mostly happens with a very large memory system and/or after tweaking the vm dirty threshold params to delay writeback of dirty data.

    WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
    CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
    Call Trace:
    __remove_mapping+0x154/0x320 (unreliable)
    iomap_releasepage+0x80/0x180
    try_to_release_page+0x94/0xe0
    invalidate_inode_page+0xc8/0x110
    invalidate_mapping_pages+0x1dc/0x540
    generic_fadvise+0x3c8/0x450
    xfs_file_fadvise+0x2c/0xe0 [xfs]
    vfs_fadvise+0x3c/0x60
    ksys_fadvise64_64+0x68/0xe0
    sys_fadvise64+0x28/0x40
    system_call_exception+0xf8/0x1c0
    system_call_common+0xf0/0x278

Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"") Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
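The fix boils down to a fragment like the following inside __bio_try_merge_page() (shown as a sketch):

    /* not merging because bi_size would overflow: also clear same_page */
    if (bio->bi_iter.bi_size > UINT_MAX - len) {
        *same_page = false;
        return false;
    }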
2020-09-08block: only call sched requeue_request() for scheduled requestsOmar Sandoval2-13/+1
Yang Yang reported the following crash caused by requeueing a flush request in Kyber:

    [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
    ...
    [ 2.517468] pc : clear_bit+0x18/0x2c
    [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
    [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
    ...
    [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
    [ 2.517602] Call trace:
    [ 2.517606] clear_bit+0x18/0x2c
    [ 2.517619] kyber_finish_request+0x74/0x80
    [ 2.517627] blk_mq_requeue_request+0x3c/0xc0
    [ 2.517637] __scsi_queue_insert+0x11c/0x148
    [ 2.517640] scsi_softirq_done+0x114/0x130
    [ 2.517643] blk_done_softirq+0x7c/0xb0
    [ 2.517651] __do_softirq+0x208/0x3bc
    [ 2.517657] run_ksoftirqd+0x34/0x60
    [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
    [ 2.517667] kthread+0x110/0x120
    [ 2.517669] ret_from_fork+0x10/0x18

This happens because Kyber doesn't track flush requests, so kyber_finish_request() reads a garbage domain token. Only call the scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for the finish_request() hook in blk_mq_free_request()). Now that we're handling it in blk-mq, also remove the check from BFQ. Reported-by: Yang Yang <yang.yang@vivo.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
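A sketch of the resulting requeue hook gating (close to, but not necessarily identical to, the in-tree helper):

    static void blk_mq_sched_requeue_request(struct request *rq)
    {
        struct request_queue *q = rq->q;
        struct elevator_queue *e = q->elevator;

        /* flush requests carry no scheduler data, so skip them */
        if ((rq->rq_flags & RQF_ELVPRIV) && e &&
            e->type->ops.requeue_request)
            e->type->ops.requeue_request(rq);
    }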
2020-09-08block: make QUEUE_SYSFS_BIT_FNS more usefulChristoph Hellwig1-19/+5
Switch to the naming used by the other entries so that we can use the QUEUE_RW_ENTRY helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08block: add helper macros for queue sysfs entriesChristoph Hellwig1-190/+58
Add two helper macros to avoid boilerplate code for the queue sysfs entries. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
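The helpers look roughly like this (a sketch; the in-tree macros may differ slightly):

    #define QUEUE_RO_ENTRY(_prefix, _name)				\
    static struct queue_sysfs_entry _prefix##_entry = {		\
        .attr	= { .name = _name, .mode = 0444 },		\
        .show	= _prefix##_show,				\
    };

    #define QUEUE_RW_ENTRY(_prefix, _name)				\
    static struct queue_sysfs_entry _prefix##_entry = {		\
        .attr	= { .name = _name, .mode = 0644 },		\
        .show	= _prefix##_show,				\
        .store	= _prefix##_store,				\
    };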
2020-09-08block: restore a specific error code in bdev_del_partitionChristoph Hellwig1-1/+1
mdadm relies on the fact that deleting an invalid partition returns -ENXIO or -ENOTTY to detect if a block device is a partition or a whole device. Fixes: 08fc1ab6d748 ("block: fix locking in bdev_del_partition") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07block: Remove unused blk_mq_sched_free_hctx_data()Baolin Wang2-18/+0
We now usually free hctx->sched_data via e->type->ops.exit_hctx(), and no users of the blk_mq_sched_free_hctx_data() function remain. Remove it. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07block: Do not discard buffers under a mounted filesystemJan Kara1-6/+10
Discarding blocks and buffers under a mounted filesystem is hardly anything an admin wants to do. Usually it will confuse the filesystem and sometimes the loss of buffer_head state (including the b_private field) can even cause crashes like:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
    Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
    RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
    ...
    Call Trace:
    __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
    jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
    kjournald2+0xbd/0x270 [jbd2]

So if we don't have the block device open with O_EXCL already, claim the block device while we truncate the buffer cache. This makes sure any exclusive block device user (such as a filesystem) cannot operate on the device while we are discarding the buffer cache. Reported-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-04Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-blockLinus Torvalds4-23/+37
Pull block fixes from Jens Axboe:
 "A bit larger than usual this week, mostly due to the NVMe fixes arriving late for -rc3 and hence didn't make last week's pull request.

  - NVMe:
      - instance leak and io boundary fixes from Keith
      - fc locking fix from Christophe
      - various tcp/rdma reset during traffic fixes from Sagi
      - pci use-after-free fix from Tong
      - tcp target null deref fix from Ziye
  - Locking fix for partition removal (Christoph)
  - Ensure bdi->io_pages is always set (me)
  - Fixup for hd struct reference (Ming)
  - Fix for zero length bvecs (Ming)
  - Two small blk-iocost fixes (Tejun)"

* tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
  block: allow for_each_bvec to support zero len bvec
  blk-stat: make q->stats->lock irqsafe
  blk-iocost: ioc_pd_free() shouldn't assume irq disabled
  block: fix locking in bdev_del_partition
  block: release disk reference in hd_struct_free_work
  block: ensure bdi->io_pages is always initialized
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-09-03blk-mq, elevator: Count requests per hctx to improve performanceKashyap Desai3-0/+12
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock contention is possible for the mq-deadline and bfq IO schedulers when nr_hw_queues is more than one. This is because the kblockd work queue can submit IO from all online CPUs (through blk_mq_run_hw_queues()) even though only one hctx has pending commands. The elevator callback .has_work for the mq-deadline and bfq schedulers considers there to be pending work if there are any IOs on the request queue, but it does not account for the hctx context. Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering the elevator even though there are no requests queued. [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap] Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
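A hedged sketch of how the per-hctx counter can short-circuit .has_work for mq-deadline (details condensed; the bfq side is analogous):

    static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
    {
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;

        /* nothing queued on this hctx: don't poke the elevator at all */
        if (!atomic_read(&hctx->elevator_queued))
            return false;

        return !list_empty_careful(&dd->dispatch) ||
               !list_empty_careful(&dd->fifo_list[0]) ||
               !list_empty_careful(&dd->fifo_list[1]);
    }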
2020-09-03blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmapJohn Garry3-11/+40
For when using a shared sbitmap, no longer should the number of active request queues per hctx be relied on for when judging how to share the tag bitmap. Instead maintain the number of active request queues per tag_set, and make the judgement based on that. Originally-from: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Record nr_active_requests per queue for when using shared sbitmapJohn Garry3-4/+28
The per-hctx nr_active value can no longer be used to fairly assign a share of tag depth per request queue for when using a shared sbitmap, as it does not consider that the tags are shared tags over all hctx's. For this case, record the nr_active_requests per request_queue, and make the judgement based on that value. Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Relocate hctx_may_queue()John Garry2-33/+32
blk-mq.h and blk-mq-tag.h include each other, which is less than ideal. Relocate hctx_may_queue() to blk-mq.h, as it is not really tag-specific code. In this way, we can drop the blk-mq-tag.h include of blk-mq.h. Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Facilitate a shared sbitmap per tagsetJohn Garry5-7/+79
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support multiple reply queues with single hostwide tags. In addition, these drivers want to use interrupt assignment in pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0], CPU hotplug may cause in-flight IO completion to not be serviced when an interrupt is shutdown. That problem is solved in commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline"). However, to take advantage of that blk-mq feature, the HBA HW queues are required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW queues need to be exposed to the upper layer. In making that transition, the per-SCSI command request tags are no longer unique per Scsi host - they are just unique per hctx. As such, the HBA LLDD would have to generate this tag internally, which has a certain performance overhead. However another problem is that blk-mq assumes the host may accept (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy counter was removed, which would stop the LLDD being sent more than .can_queue commands; however, it should still be ensured that the block layer does not issue more than .can_queue commands to the Scsi host. To solve this problem, introduce a shared sbitmap per blk_mq_tag_set, which may be requested at init time. New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the tagset to indicate whether the shared sbitmap should be used. Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests are still allocated per hctx; the reason for this is that if tags and requests were only allocated for a single hctx - like hctx0 - it may break block drivers which expect a request to be associated with a specific hctx, i.e. not always hctx0. This will introduce extra memory usage. This change is based on work originally from Ming Lei in [1] and from Bart's suggestion in [2]. [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/ [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Use pointers for blk_mq_tags bitmap tagsJohn Garry6-33/+39
Introduce pointers for the blk_mq_tags regular and reserved bitmap tags, with the goal of later being able to use a common shared tag bitmap across all HW contexts in a set. Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Pass flags for tag init/freeJohn Garry5-23/+35
Pass hctx/tagset flags argument down to blk_mq_init_tags() and blk_mq_free_tags() for selective init/free. For now, make it include the alloc policy flag, which can be evaluated when needed (in blk_mq_init_tags()). Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Free tags in blk_mq_init_tags() upon errorHannes Reinecke1-8/+10
Since the tags are allocated in blk_mq_init_tags(), it's better practice to free them in that same function upon error, rather than in a callee whose job is to init the bitmap tags (blk_mq_init_bitmap_tags()). [jpg: Split from an earlier patch with a new commit message] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Rename blk_mq_update_tag_set_depth()Hannes Reinecke1-4/+4
The function does not set the depth, but rather transitions from shared to non-shared queues and vice versa. So rename it to blk_mq_update_tag_set_shared() to better reflect its purpose. [jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHAREDMing Lei3-14/+14
BLK_MQ_F_TAG_SHARED actually means that tags is shared among request queues, all of which should belong to LUNs attached to same HBA. So rename it to make the point explicitly. [jpg: rebase a few times, add rnbd-clt.c change] Suggested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02block: use revalidate_disk_size in set_capacity_revalidate_and_notifyChristoph Hellwig1-4/+3
Only virtio_blk and xen-blkfront set the revalidate argument to true, and both do not implement the ->revalidate_disk method. So switch to the helper that just updates the size instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02block: rename bd_invalidatedChristoph Hellwig1-1/+1
Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags variable to better describe the condition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01block: Remove a duplicative conditionBaolin Wang1-4/+2
Remove a duplicative condition to fix the following cppcheck warning: "warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is equivalent to '!A || B' [redundantCondition]" Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
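An illustration of the cppcheck pattern with a hypothetical merge helper (try_merge() is not a real kernel function, it just stands in for the second operand B):

    /* before: '!A || (A && B)' */
    if (!sched_allow_merge || (sched_allow_merge && try_merge(q, rq, bio)))
        merged = true;

    /* after: the equivalent '!A || B' */
    if (!sched_allow_merge || try_merge(q, rq, bio))
        merged = true;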
2020-09-01block: better deal with the delayed not supported case in blk_cloned_rq_check_limitsRitika Srivastava1-5/+22
If a WRITE_ZERO/WRITE_SAME operation is not supported by the storage, blk_cloned_rq_check_limits() will return an IO error, which will cause device-mapper to fail the paths. Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP. BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the paths. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
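A sketch of the resulting check in blk_cloned_rq_check_limits() (comments paraphrase the rationale above):

    unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));

    if (blk_rq_sectors(rq) > max_sectors) {
        /*
         * A zero limit means the LLDD turned the operation off after
         * learning it is unsupported; report "not supported" so that
         * device-mapper does not fail the path.
         */
        if (max_sectors == 0)
            return BLK_STS_NOTSUPP;
        return BLK_STS_IOERR;
    }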
2020-09-01block: Return blk_status_t instead of errno codesRitika Srivastava1-4/+4
Replace returning legacy errno codes with blk_status_t in blk_cloned_rq_check_limits(). Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01block: grant IOPRIO_CLASS_RT to CAP_SYS_NICEKhazhismel Kumykov1-1/+1
CAP_SYS_ADMIN is too broad, and ionice fits into CAP_SYS_NICE's grouping. Retain CAP_SYS_ADMIN permission for backwards compatibility. Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
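The permission check for requesting IOPRIO_CLASS_RT then looks roughly like this (sketch):

    case IOPRIO_CLASS_RT:
        /* CAP_SYS_NICE is now enough; CAP_SYS_ADMIN kept for compat */
        if (!capable(CAP_SYS_NICE) && !capable(CAP_SYS_ADMIN))
            return -EPERM;
        break;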
2020-09-01blk-iocost: add three debug stat - cost.wait, indebt and indelayTejun Heo1-5/+72
These are really cheap to collect and can be useful in debugging iocost behavior. Add them as debug stats for now. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: restore inuse update tracepointsTejun Heo1-0/+16
Update and restore the inuse update tracepoints. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: implement vtime loss compensationTejun Heo1-42/+90
When an iocg accumulates too much vtime or gets deactivated, we throw away some vtime, which lowers the overall device utilization. As the exact amount which is being thrown away is known, we can compensate by accelerating the vrate accordingly so that the extra vtime generated in the current period matches what got lost. This significantly improves work conservation when involving high weight cgroups with intermittent and bursty IO patterns. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: halve debts if device stays idleTejun Heo1-1/+48
A low weight iocg can amass a large amount of debt, for example, when anonymous memory gets reclaimed aggressively. If the system has a lot of memory paired with a slow IO device, the debt can span multiple seconds or more. If there are no other subsequent IO issuers, the in-debt iocg may end up blocked paying its debt while the IO device is idle. This patch implements a mechanism to protect against such pathological cases. If the device has been sufficiently idle for a substantial amount of time, the debts are halved. The criteria are on the conservative side as we want to resolve the rare extreme cases without impacting regular operation by forgiving debts too readily. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: implement delay adjustment hysteresisTejun Heo2-56/+86
Currently, iocost syncs the delay duration to the outstanding debt amount, which seemed enough to protect the system from anon memory hogs. However, that was mostly because the delay calculation was using hweight_inuse which quickly converges towards zero under debt for delay duration calculation, often punishing debtors overly harshly for longer than deserved. The previous patch fixed the delay calculation and now the protection against anonymous memory hogs isn't enough because the effect of delay is indirect and non-linear and a huge amount of future debt can accumulate abruptly while unthrottled. This patch implements delay hysteresis so that delay is decayed exponentially over time instead of getting cleared immediately as debt is paid off. While the overall behavior is similar to the blk-cgroup implementation used by blk-iolatency, a lot of the details are different and due to the empirical nature of the mechanism, it's challenging to adapt the mechanism for one controller without negatively impacting the other. As the delay is gradually decayed now, there's no point in running it from its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and the dedicated hrtimer is removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: revamp debt handlingTejun Heo1-24/+93
Debt handling had several issues. * How much inuse a debtor carries wasn't clearly defined. inuse would be driven down over time from not issuing IOs but it'd be better to clamp it to minimum immediately once in debt. * How much can be paid off was determined by hweight_inuse. As inuse was driven down, the payment amount would fall together regardless of the debtor's active weight. This means that the debtors were punished harshly. * ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after iocg_kick_delay(). This patch revamps debt handling so that * Debt handling owns inuse for iocgs in debt and keeps them at zero. * Payment amount is determined by hweight_active. This is more deterministic and safer than hweight_inuse but still far from ideal in that it doesn't factor in possible donations from other iocgs for debt payments. This likely needs further improvements in the future. * iocg_rqos_merge() now calls blkcg_schedule_throttle() as necessary. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: revamp in-period donation snapbacksTejun Heo1-37/+96
When the margin drops below the minimum on a donating iocg, donation is immediately canceled in full. There are a couple shortcomings with the current behavior. * It's abrupt. A small temporary budget deficit can lead to a wide swing in weight allocation and a large surplus. * It's open coded in the issue path but not implemented for the merge path. A series of merges at a low inuse can make the iocg incur debts and stall incorrectly. This patch reimplements in-period donation snapbacks so that * inuse adjustment and cost calculations are factored into adjust_inuse_and_calc_cost() which is called from both the issue and merge paths. * Snapbacks are more gradual. It occurs in quarter steps. * A snapback triggers if the margin goes below the low threshold and is lower than the budget at the time of the last adjustment. * For the above, __propagate_weights() stores the margin in iocg->saved_margin. Move iocg->last_inuse storing together into __propagate_weights() for consistency. * Full snapback is guaranteed when there are waiters. * With precise donation and gradual snapbacks, inuse adjustments are now a lot more effective and the value of scaling inuse on weight changes isn't clear. Removed inuse scaling from weight_update(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01blk-iocost: revamp donation amount determinationTejun Heo1-82