linux.git/block/blk-mq.c, branch v4.4.122

blk-mq: Avoid memory reclaim when remapping queues

2017-04-18T05:14:37+00:00

commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream.

While stressing memory and IO and, at the same time, changing SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path use
GFP_NOIO, so that any reclaim they trigger cannot start IO or re-enter
the filesystem.  With this patch, we could no longer hit the issue.
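
As a rough userspace illustration of why the flag choice matters, the
sketch below models an allocation mask as a set of permission bits. The
bit values and the names MODEL_GFP_* and reclaim_may_*() are hypothetical
and do not match the kernel's definitions. The only property carried over
is that GFP_NOIO leaves out the IO and filesystem permissions that
GFP_KERNEL grants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
enum {
    MODEL_GFP_RECLAIM = 1u << 0, /* allocator may run direct reclaim */
    MODEL_GFP_IO      = 1u << 1, /* reclaim may start block IO */
    MODEL_GFP_FS      = 1u << 2, /* reclaim may call into filesystems */
};

#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOIO   (MODEL_GFP_RECLAIM)

/* The property the fix relies on, checked at compile time. */
static_assert((MODEL_GFP_NOIO & (MODEL_GFP_IO | MODEL_GFP_FS)) == 0,
              "NOIO must not permit IO or filesystem re-entry");

/* May reclaim triggered by this allocation call a filesystem shrinker? */
static bool reclaim_may_enter_fs(unsigned int gfp)
{
    return (gfp & MODEL_GFP_RECLAIM) && (gfp & MODEL_GFP_FS);
}

/* May reclaim triggered by this allocation submit block IO? */
static bool reclaim_may_start_io(unsigned int gfp)
{
    return (gfp & MODEL_GFP_RECLAIM) && (gfp & MODEL_GFP_IO);
}
```

In this model, an allocation made with MODEL_GFP_KERNEL could end up in a
filesystem shrinker such as the XFS one in the trace, while one made with
MODEL_GFP_NOIO could not.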

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: Douglas Miller 
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sumit Semwal 
Signed-off-by: Greg Kroah-Hartman

blk-mq: really fix plug list flushing for nomerge queues

2017-02-26T10:07:49+00:00

commit 87c279e613f848c691111b29d49de8df3f4f56da upstream.

Commit 0809e3ac6231 ("block: fix plug list flushing for nomerge queues")
updated blk_mq_make_request() to set request_count even when
blk_queue_nomerges() returns true. However, blk_mq_make_request() only
does limited plugging and doesn't use request_count;
blk_sq_make_request() is the one that should have been fixed. Do that
and get rid of the unnecessary work in the mq version.
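
The effect of a missing request_count can be sketched in userspace as
follows. This is a simplified model, not the kernel code: the struct, the
function name, and the threshold value of 16 are assumptions made for
illustration, and the merge attempt is reduced to "it counts the plugged
requests as a side effect".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_REQUEST_COUNT 16 /* assumed flush threshold */

struct model_plug {
    unsigned int queued; /* requests currently on the plug list */
};

/*
 * Decide whether the plug list must be flushed before adding a request.
 * count_when_nomerges selects the behaviour with (true) or without
 * (false) the fix.
 */
static bool model_should_flush(const struct model_plug *plug,
                               bool nomerges, bool count_when_nomerges)
{
    unsigned int request_count = 0;

    assert(plug != NULL);
    if (!nomerges)
        request_count = plug->queued; /* counted during the merge attempt */
    else if (count_when_nomerges)
        request_count = plug->queued; /* the fix: count explicitly */

    return request_count >= MODEL_MAX_REQUEST_COUNT;
}
```

Without the explicit count, a nomerge queue in this model never reaches the
threshold, so its plug list grows without being flushed.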

Fixes: 0809e3ac6231 ("block: fix plug list flushing for nomerge queues")
Signed-off-by: Omar Sandoval 
Reviewed-by: Ming Lei 
Reviewed-by: Jeff Moyer 
Signed-off-by: Jens Axboe 
Cc: Sumit Semwal 
Signed-off-by: Greg Kroah-Hartman

blk-mq: Always schedule hctx->next_cpu

2017-01-19T19:17:22+00:00

commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3 upstream.

Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead.  Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu.  This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
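
A minimal userspace sketch of the selection rule described above, assuming
a 32-bit CPU mask and a batch size of 8. The names model_hctx and
model_next_cpu() are hypothetical, and the single-hardware-queue shortcut
is left out. The point is that the CPU is advanced first and the stored
value is what gets returned, so the scheduled CPU and hctx->next_cpu can
never differ.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_CPU_WORK_BATCH 8  /* assumed batch size */
#define MODEL_NR_CPUS        32 /* width of the model CPU mask */

struct model_hctx {
    uint32_t cpumask;    /* bit n set: CPU n is mapped to this hctx */
    int next_cpu;        /* CPU the next run will be scheduled on */
    int next_cpu_batch;  /* runs left before moving to another CPU */
};

/* First CPU in mask at or after 'from', or MODEL_NR_CPUS if none. */
static int model_first_cpu(uint32_t mask, int from)
{
    for (int cpu = from; cpu < MODEL_NR_CPUS; cpu++)
        if (mask & (UINT32_C(1) << cpu))
            return cpu;
    return MODEL_NR_CPUS;
}

/* Always returns the value stored in hctx->next_cpu. */
static int model_next_cpu(struct model_hctx *hctx)
{
    assert(hctx->cpumask != 0);

    if (--hctx->next_cpu_batch <= 0) {
        int next = model_first_cpu(hctx->cpumask, hctx->next_cpu + 1);

        if (next >= MODEL_NR_CPUS)
            next = model_first_cpu(hctx->cpumask, 0); /* wrap around */

        hctx->next_cpu = next;
        hctx->next_cpu_batch = MODEL_CPU_WORK_BATCH;
    }

    return hctx->next_cpu;
}
```

With CPUs 0 and 2 in the mask, repeated calls stay on one CPU for a batch
and then move to the other, and every returned CPU matches the stored
next_cpu.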

Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
Signed-off-by: Gabriel Krisman Bertazi 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: Do not invoke .queue_rq() for a stopped queue

2017-01-06T10:16:14+00:00

commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream.

The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.
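
The intended decision can be sketched as below. This is a userspace model
with hypothetical names (model_hctx_state, model_submit()); it only shows
that a stopped queue forces the "queue it" branch even when direct issue
would otherwise be allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_action {
    MODEL_ISSUE_DIRECT,  /* hand the request to the driver right away */
    MODEL_INSERT_QUEUE,  /* put the request on the queue for a later run */
};

struct model_hctx_state {
    bool stopped; /* models the BLK_MQ_S_STOPPED flag */
};

static enum model_action model_submit(const struct model_hctx_state *h,
                                      bool direct_issue_allowed)
{
    assert(h != NULL);

    /* A stopped queue must never reach the driver's queue_rq hook. */
    if (direct_issue_allowed && !h->stopped)
        return MODEL_ISSUE_DIRECT;

    return MODEL_INSERT_QUEUE;
}
```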

Reported-by: Ming Lei 
Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Sagi Grimberg 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: actually hook up defer list when running requests

2016-10-07T13:23:44+00:00

commit 52b9c330c6a8a4b5a1819bdaddf4ec76ab571e81 upstream.

If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
over the rest of the loop body. However, dptr is assigned later in the
loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
want it for.

NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
in-tree driver, but if the code's going to be there, it might as well
work.
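
The control-flow problem can be reproduced in a small userspace model.
Everything here is a simplification with hypothetical names: the request
list is reduced to a counter, the driver hook always succeeds, and the
"fixed" flag selects whether the success case falls through to the dptr
assignment or skips it with continue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_ret { MODEL_QUEUE_OK, MODEL_QUEUE_BUSY };

struct model_stats {
    int calls;           /* times the driver hook was called */
    int calls_with_list; /* times it was handed a defer list */
};

/* Stand-in for a driver hook; records whether a defer list was passed. */
static enum model_ret model_queue_rq(int *list, struct model_stats *st)
{
    st->calls++;
    if (list)
        st->calls_with_list++;
    return MODEL_QUEUE_OK;
}

static void model_run(int nr_requests, bool fixed, struct model_stats *st)
{
    int driver_list = 0; /* placeholder for the defer list */
    int *dptr = NULL;

    assert(st != NULL);
    for (int remaining = nr_requests; remaining > 0; remaining--) {
        enum model_ret ret = model_queue_rq(dptr, st);

        if (ret == MODEL_QUEUE_OK && !fixed)
            continue; /* buggy: skips the dptr assignment below */
        if (ret == MODEL_QUEUE_BUSY)
            break;

        /* First request done; defer the rest if more than one is left. */
        if (!dptr && remaining - 1 > 1)
            dptr = &driver_list;
    }
}
```

In this model the buggy variant never passes a defer list to the hook,
while the fixed variant passes it for every request after the first.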

Fixes: 74c450521dd8 ("blk-mq: add a 'list' parameter to ->queue_rq()")
Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: End unstarted requests on dying queue

2016-09-15T06:27:47+00:00

[ Upstream commit a59e0f5795fe52dad42a99c00287e3766153b312 ]

Go directly to ending a request if it wasn't started. Previously, completing
such a request could invoke a driver callback for a request the driver never
initialized.

Signed-off-by: Keith Busch 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Johannes Thumshirn 
Acked-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix calling unplug callbacks with preempt disabled

2015-11-21T03:29:45+00:00

Liu reported that running certain parts of xfstests threw the
following error:

BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
 #0:  ("writeback"){++++.+}, at: [] process_one_work+0x173/0x730
 #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x173/0x730
 #2:  (&type->s_umount_key#44){+++++.}, at: [] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G           OE   4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
 ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
 0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
 ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
 [] dump_stack+0x4f/0x74
 [] ___might_sleep+0x185/0x240
 [] __might_sleep+0x52/0x90
 [] __alloc_pages_nodemask+0x268/0x410
 [] ? sched_clock_local+0x1c/0x90
 [] ? local_clock+0x21/0x40
 [] ? __lock_release+0x420/0x510
 [] ? __lock_acquired+0x16c/0x3c0
 [] alloc_pages_current+0xc5/0x210
 [] ? rbio_is_full+0x55/0x70 [btrfs]
 [] ? mark_held_locks+0x78/0xa0
 [] ? _raw_spin_unlock_irqrestore+0x40/0x60
 [] full_stripe_write+0x5a/0xc0 [btrfs]
 [] __raid56_parity_write+0x39/0x60 [btrfs]
 [] run_plug+0x11b/0x140 [btrfs]
 [] btrfs_raid_unplug+0x23/0x70 [btrfs]
 [] blk_flush_plug_list+0x82/0x1f0
 [] blk_sq_make_request+0x1f9/0x740
 [] ? generic_make_request_checks+0x222/0x7c0
 [] ? blk_queue_enter+0x124/0x310
 [] ? blk_queue_enter+0x92/0x310
 [] generic_make_request+0x172/0x2c0
 [] ? generic_make_request+0x164/0x2c0
 [] submit_bio+0x70/0x140
 [] ? rbio_add_io_page+0x99/0x150 [btrfs]
 [] finish_rmw+0x4d9/0x600 [btrfs]
 [] full_stripe_write+0x9c/0xc0 [btrfs]
 [] raid56_parity_write+0xef/0x160 [btrfs]
 [] btrfs_map_bio+0xe3/0x2d0 [btrfs]
 [] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
 [] submit_one_bio+0x74/0xb0 [btrfs]
 [] submit_extent_page+0xe5/0x1c0 [btrfs]
 [] __extent_writepage_io+0x408/0x4c0 [btrfs]
 [] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
 [] __extent_writepage+0x218/0x3a0 [btrfs]
 [] ? mark_held_locks+0x78/0xa0
 [] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
 [] extent_writepages+0x52/0x70 [btrfs]
 [] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
 [] btrfs_writepages+0x27/0x30 [btrfs]
 [] do_writepages+0x23/0x40
 [] __writeback_single_inode+0x89/0x4d0
 [] ? writeback_sb_inodes+0x260/0x480
 [] ? writeback_sb_inodes+0x260/0x480
 [] ? writeback_sb_inodes+0x15f/0x480
 [] writeback_sb_inodes+0x2d2/0x480
 [] ? down_read_trylock+0x57/0x60
 [] ? trylock_super+0x25/0x60
 [] ? rcu_read_lock_sched_held+0x4f/0x90
 [] __writeback_inodes_wb+0x8c/0xc0
 [] wb_writeback+0x2b5/0x500
 [] ? mark_held_locks+0x78/0xa0
 [] ? __local_bh_enable_ip+0x68/0xc0
 [] ? wb_do_writeback+0x62/0x310
 [] wb_do_writeback+0xc1/0x310
 [] ? set_worker_desc+0x79/0x90
 [] wb_workfn+0x92/0x330
 [] process_one_work+0x223/0x730
 [] ? process_one_work+0x173/0x730
 [] ? worker_thread+0x18f/0x430
 [] worker_thread+0x11d/0x430
 [] ? maybe_create_worker+0xf0/0xf0
 [] ? maybe_create_worker+0xf0/0xf0
 [] kthread+0xef/0x110
 [] ? schedule_tail+0x1e/0xd0
 [] ? __init_kthread_worker+0x70/0x70
 [] ret_from_fork+0x3f/0x70
 [] ? __init_kthread_worker+0x70/0x70

The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid have such callbacks.

Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.

This only affects blk-mq driven devices, and only those that
register a single queue.
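
The reordering can be sketched with a userspace model in which pinning the
software context raises a preempt counter and the plug flush counts as a
violation whenever it runs with that counter above zero. All names below
are hypothetical; the model only captures the ordering of "drop the
context" and "flush the plug list".

```c
#include <assert.h>
#include <stdbool.h>

struct model_cpu {
    int preempt_count;   /* > 0 means preemption is disabled */
    int sleep_in_atomic; /* flushes that ran with preemption disabled */
};

static void model_get_ctx(struct model_cpu *c)
{
    c->preempt_count++;
}

static void model_put_ctx(struct model_cpu *c)
{
    assert(c->preempt_count > 0);
    c->preempt_count--;
}

/* Plug callbacks may sleep, so they need preemption enabled. */
static void model_flush_plug(struct model_cpu *c)
{
    if (c->preempt_count > 0)
        c->sleep_in_atomic++;
}

/* Old ordering: flush while the context is still pinned. */
static void model_submit_before_fix(struct model_cpu *c, bool must_flush)
{
    model_get_ctx(c);
    if (must_flush)
        model_flush_plug(c);
    model_put_ctx(c);
}

/* New ordering: drop the context first, then flush. */
static void model_submit_after_fix(struct model_cpu *c, bool must_flush)
{
    model_get_ctx(c);
    model_put_ctx(c);
    if (must_flush)
        model_flush_plug(c);
}
```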

Reported-by: Liu Bo 
Tested-by: Liu Bo 
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

blk-mq: mark __blk_mq_complete_request() static

2015-11-11T16:36:56+00:00

It's no longer used outside of blk-mq core.

Signed-off-by: Jens Axboe

Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block

2015-11-11T01:23:49+00:00

Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  For now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie

blk-mq: return tag/queue combo in the make_request_fn handlers

2015-11-07T17:40:47+00:00

Return a cookie, blk_qc_t, from the blk-mq make request functions, that
allows a later caller to uniquely identify a specific IO. The cookie
doesn't mean anything to the caller, but the caller can use it to later
pass back to the block layer. The block layer can then identify the
hardware queue and request from that cookie.
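
One way to pack a tag and a hardware queue number into a single opaque
value is sketched below. The 16-bit split, the all-ones "no cookie" value,
and the model_* names are assumptions made for illustration and are not
taken from this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t model_qc_t;

#define MODEL_QC_NONE  UINT32_MAX /* assumed "no cookie" marker */
#define MODEL_QC_SHIFT 16         /* assumed split: queue above, tag below */

/* Combine a request tag and a hardware queue number into one cookie. */
static model_qc_t model_tag_to_qc(unsigned int tag, unsigned int queue_num)
{
    assert(tag < (1u << MODEL_QC_SHIFT));
    assert(queue_num < (1u << MODEL_QC_SHIFT));
    return tag | (queue_num << MODEL_QC_SHIFT);
}

static bool model_qc_valid(model_qc_t cookie)
{
    return cookie != MODEL_QC_NONE;
}

/* Recover the hardware queue number from a cookie. */
static unsigned int model_qc_to_queue_num(model_qc_t cookie)
{
    return cookie >> MODEL_QC_SHIFT;
}

/* Recover the request tag from a cookie. */
static unsigned int model_qc_to_tag(model_qc_t cookie)
{
    return cookie & ((1u << MODEL_QC_SHIFT) - 1);
}
```

The caller only stores and returns the cookie; decoding it back into a
queue and a tag is left to the layer that created it.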

Signed-off-by: Jens Axboe 
Acked-by: Christoph Hellwig 
Acked-by: Keith Busch