linux.git/block/blk-mq-tag.c, branch v4.4.117

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

2015-11-07T01:50:42+00:00

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.
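
The flag split described above can be sketched in userspace C. This is a
minimal illustration, not the kernel header: the bit values are invented
for the example, the composition of the GFP_* masks is one reading of this
changelog, and mempool_backed_gfp() is a hypothetical helper that only
exists here to show a caller clearing __GFP_DIRECT_RECLAIM while keeping
__GFP_KSWAPD_RECLAIM.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real ones live in include/linux/gfp.h. */
typedef unsigned int gfp_t;

#define __GFP_HIGH            0x01u  /* may use the high priority reserve */
#define __GFP_ATOMIC          0x02u  /* truly atomic: cannot sleep, no fallback */
#define __GFP_IO              0x04u
#define __GFP_FS              0x08u
#define __GFP_DIRECT_RECLAIM  0x10u  /* caller may sleep in direct reclaim */
#define __GFP_KSWAPD_RECLAIM  0x20u  /* caller wants kswapd woken */

/* Composition as read from the changelog above (an assumption of this sketch). */
#define __GFP_WAIT   (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_ATOMIC   (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL   (__GFP_WAIT | __GFP_IO | __GFP_FS)

/* Stand-in for the kernel helper of the same name: only direct reclaim
 * implies that the caller is allowed to block. */
static inline bool gfpflags_allow_blocking(gfp_t gfp)
{
	return (gfp & __GFP_DIRECT_RECLAIM) != 0;
}

/* Hypothetical helper: a mempool-backed caller that must not block but
 * still wants kswapd woken for background reclaim. */
static inline gfp_t mempool_backed_gfp(gfp_t gfp)
{
	return gfp & ~__GFP_DIRECT_RECLAIM;
}
```

With these definitions, gfpflags_allow_blocking(GFP_KERNEL) is true, while
both GFP_ATOMIC and mempool_backed_gfp(GFP_KERNEL) report non-blocking, and
the latter still carries __GFP_KSWAPD_RECLAIM.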

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-block

2015-11-05T04:28:10+00:00

Pull core block updates from Jens Axboe:
 "This is the core block pull request for 4.4.  I've got a few more
  topic branches this time around, some of them will layer on top of the
  core+drivers changes and will come in a separate round.  So not a huge
  chunk of changes in this round.

  This pull request contains:

   - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

   - Unused prototype removal in blk-mq from Christoph.

   - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
     xchg()'s, from Davidlohr.

   - A plug flush fix from Jeff.

   - Also from Jeff, a fix that means we don't have to update shared tag
     sets at init time unless we do a state change.  This cuts down boot
     times on thousands of devices a lot with scsi/blk-mq.

   - blk-mq waitqueue barrier fix from Kosuke.

   - Various fixes from Ming:

        - Fixes for segment merging and splitting, and checks, for
          the old core and blk-mq.

        - Potential blk-mq speedup by marking ctx pending at the end
          of a plug insertion batch in blk-mq.

        - direct-io no page dirty on kernel direct reads.

   - A WRITE_SYNC fix for mpage from Roman"

* 'for-4.4/core' of git://git.kernel.dk/linux-block:
  blk-mq: avoid excessive boot delays with large lun counts
  blktrace: re-write setting q->blk_trace
  blk-mq: mark ctx as pending at batch in flush plug path
  blk-mq: fix for trace_block_plug()
  block: check bio_mergeable() early before merging
  blk-mq: check bio_mergeable() early before merging
  block: avoid to merge splitted bio
  block: setup bi_phys_segments after splitting
  block: fix plug list flushing for nomerge queues
  blk-mq: remove unused blk_mq_clone_flush_request prototype
  blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
  fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
  fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  block: kmemleak: Track the page allocations for struct request

blk-mq: fix use-after-free in blk_mq_free_tag_set()

2015-10-15T14:45:58+00:00

tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is then a nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counterpart, blk_mq_free_tags().
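
The ownership rule can be sketched in userspace C. This is not the blk-mq
code: struct tags_sketch, tags_init(), tags_free() and the live_allocs
counter are hypothetical stand-ins, used only to show a member being
released by the counterpart of the function that allocated it, before the
container itself is freed.

```c
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct blk_mq_tags and its cpumask. */
struct tags_sketch {
	unsigned long *cpumask;  /* owned by the tags object */
	unsigned int nr_tags;
};

/* Counts live allocations so leaks or double frees show up in a test. */
static int live_allocs;

static struct tags_sketch *tags_init(unsigned int nr_tags)
{
	struct tags_sketch *tags = calloc(1, sizeof(*tags));

	if (!tags)
		return NULL;
	live_allocs++;

	tags->cpumask = calloc(1, sizeof(*tags->cpumask));
	if (!tags->cpumask) {
		free(tags);
		live_allocs--;
		return NULL;
	}
	live_allocs++;
	tags->nr_tags = nr_tags;
	return tags;
}

/* Counterpart of tags_init(): release the member first, then the container,
 * so nothing touches tags after it has been freed. */
static void tags_free(struct tags_sketch *tags)
{
	if (!tags)
		return;
	free(tags->cpumask);
	live_allocs--;
	free(tags);
	live_allocs--;
}
```

After a tags_init()/tags_free() pair, live_allocs is back to zero and no
caller needs to know about the cpumask at all.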

Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura 
Cc: Keith Busch 
Reviewed-by: Jeff Moyer 
Signed-off-by: Jens Axboe

blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c

2015-10-09T16:52:46+00:00

blk_mq_tag_update_depth() seems to be missing a memory barrier which
might cause the waker to not notice the waiter and fail to send a
wake_up as in the following figure.

	blk_mq_tag_update_depth			bt_get
------------------------------------------------------------------------
if (waitqueue_active(&bs->wait))
/* The CPU might reorder the test for
   the waitqueue up here, before
   prior writes complete */
					prepare_to_wait(&bs->wait, &wait,
					  TASK_UNINTERRUPTIBLE);
					tag = __bt_get(hctx, bt, last_tag,
					  tags);
					/* Value set in bt_update_count not
					   visible yet */
bt_update_count(&tags->bitmap_tags, tdepth);
/* blk_mq_tag_wakeup_all(tags, false); */
 bt = &tags->bitmap_tags;
 wake_index = atomic_read(&bt->wake_index);
					...
					io_schedule();
------------------------------------------------------------------------

This patch adds the missing memory barrier.
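
The ordering requirement can be illustrated with a userspace analogy in
C11 atomics. This is a sketch under stated assumptions, not the kernel
code: depth stands in for the value written by bt_update_count(), waiters
stands in for the wait queue being non-empty, and
atomic_thread_fence(memory_order_seq_cst) plays the role of a full memory
barrier. The function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, see the paragraph above. */
static atomic_uint depth;
static atomic_bool waiters;

/* Waker side: publish the new depth, then decide whether to wake anyone.
 * Returns true if a wakeup must be issued. */
static bool waker_update_depth(unsigned int new_depth)
{
	atomic_store_explicit(&depth, new_depth, memory_order_relaxed);
	/* Full barrier: the store above is ordered before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&waiters, memory_order_relaxed);
}

/* Waiter side: register as a waiter, then re-check the depth before
 * sleeping. Returns the depth that the waiter observed. */
static unsigned int waiter_prepare_and_check(void)
{
	atomic_store_explicit(&waiters, true, memory_order_relaxed);
	/* Full barrier: the registration is ordered before the re-check. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&depth, memory_order_relaxed);
}
```

With a full barrier on both sides, at least one party observes the other:
either the waker sees the registered waiter and sends a wakeup, or the
waiter sees the new depth and does not need one. Without the barrier on
the waker side, both loads could return stale values, which is the lost
wakeup shown in the figure.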

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa 
Signed-off-by: Jens Axboe

blk-mq: factor out a helper to iterate all tags for a request_queue

2015-10-01T08:10:57+00:00

And replace blk_mq_tag_busy_iter with it - the driver use was
replaced with a new helper a while ago, and internal to the block layer
we only need the new version.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe

blk-mq: fix race between timeout and freeing request

2015-08-15T15:45:21+00:00

Inside the timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from a tag. This is wrong
because the request can be freed at any time, so some
fields of the request can't be trusted, and a kernel oops
may be triggered[1].

Currently, the only special case for blk_mq_tag_to_rq() is
that the flush request can share a tag with the request it was
cloned from, and the two requests can't be active at the same
time. This patch therefore fixes the above issue by updating
tags->rqs[tag] with the active request of the tag (either the
flush rq or the request it was cloned from).

blk_mq_tag_to_rq() also becomes much simpler with this patch.

blk_mq_tag_to_rq() is mainly for drivers, and its caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].
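
The idea of keeping tags->rqs[tag] pointed at whichever request is
currently active can be sketched in userspace C. The types and functions
below (request_sketch, tags_sketch, set_active_rq(), tag_to_rq()) are
hypothetical and heavily trimmed; they only illustrate why the lookup
reduces to a table read once the table is updated on every switch between
the flush request and the request it was cloned from.

```c
#include <stddef.h>

#define NR_TAGS 4

/* Hypothetical, trimmed-down stand-ins for struct request and blk_mq_tags. */
struct request_sketch {
	int tag;
	int is_flush;
};

struct tags_sketch {
	struct request_sketch *rqs[NR_TAGS];
};

/* Record which request currently owns the tag: the flush rq or the
 * request it was cloned from, never both at the same time. */
static void set_active_rq(struct tags_sketch *tags, struct request_sketch *rq)
{
	tags->rqs[rq->tag] = rq;
}

/* The lookup becomes a bounds check plus a table read; no field of a
 * possibly freed request is inspected. */
static struct request_sketch *tag_to_rq(struct tags_sketch *tags, int tag)
{
	if (tag < 0 || tag >= NR_TAGS)
		return NULL;
	return tags->rqs[tag];
}
```

The bounds check is an addition of this sketch; the point is only that
the table, not the request, decides which request a tag maps to.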

[1] kernel oops log
[  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
[  439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e
[  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0
[  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  439.700653] Dumping ftrace buffer:
[  439.700653]    (ftrace buffer empty)
[  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
[  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265
[  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000
[  439.730500] RIP: 0010:[]  [] blk_mq_tag_to_rq+0x21/0x6e
[  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283
[  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002
[  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
[  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000
[  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202
[  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00
[  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000
[  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0
[  439.730500] Stack:
[  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104
[  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80
[  439.755663] Call Trace:
[  439.755663]
[  439.755663]  [] bt_for_each+0x6e/0xc8
[  439.755663]  [] ? blk_mq_rq_timed_out+0x6a/0x6a
[  439.755663]  [] ? blk_mq_rq_timed_out+0x6a/0x6a
[  439.755663]  [] blk_mq_tag_busy_iter+0x55/0x5e
[  439.755663]  [] ? blk_mq_bio_to_request+0x38/0x38
[  439.755663]  [] blk_mq_rq_timer+0x5d/0xd4
[  439.755663]  [] call_timer_fn+0xf7/0x284
[  439.755663]  [] ? call_timer_fn+0x5/0x284
[  439.755663]  [] ? blk_mq_bio_to_request+0x38/0x38
[  439.755663]  [] run_timer_softirq+0x1ce/0x1f8
[  439.755663]  [] __do_softirq+0x181/0x3a4
[  439.755663]  [] irq_exit+0x40/0x94
[  439.755663]  [] smp_apic_timer_interrupt+0x33/0x3e
[  439.755663]  [] apic_timer_interrupt+0x84/0x90
[  439.755663]
[  439.755663]  [] ? _raw_spin_unlock_irq+0x32/0x4a
[  439.755663]  [] finish_task_switch+0xe0/0x163
[  439.755663]  [] ? finish_task_switch+0xa2/0x163
[  439.755663]  [] __schedule+0x469/0x6cd
[  439.755663]  [] schedule+0x82/0x9a
[  439.789267]  [] signalfd_read+0x186/0x49a
[  439.790911]  [] ? wake_up_q+0x47/0x47
[  439.790911]  [] __vfs_read+0x28/0x9f
[  439.790911]  [] ? __fget_light+0x4d/0x74
[  439.790911]  [] vfs_read+0x7a/0xc6
[  439.790911]  [] SyS_read+0x49/0x7f
[  439.790911]  [] entry_SYSCALL_64_fastpath+0x12/0x6f
[  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
[  439.790911] RIP  [] blk_mq_tag_to_rq+0x21/0x6e
[  439.790911]  RSP
[  439.790911] CR2: 0000000000000158
[  439.790911] ---[ end trace d40af58949325661 ]---

Cc: 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe

blk-mq: Shared tag enhancements

2015-06-01T20:35:56+00:00

Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.

Signed-off-by: Keith Busch 
Signed-off-by: Jens Axboe

blkmq: Fix NULL pointer deref when all reserved tags in

2015-03-18T23:06:18+00:00

When allocating from the reserved tags pool, bt_get() is called with
a NULL hctx.  If all tags are in use, the hw queue is kicked to push
out any pending IO, potentially freeing tags, and tag allocation is
retried.  The problem is that blk_mq_run_hw_queue() doesn't check for
a NULL hctx.  So we avoid it with a simple NULL hctx test.
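
The guard can be sketched in userspace C. The names below (hctx_sketch,
run_hw_queue(), kick_queue_before_retry()) are hypothetical stand-ins and
not the blk-mq functions; the sketch only shows the callee dereferencing
its argument unconditionally and the caller skipping the call when it has
no hardware context, as is the case for the reserved pool.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct blk_mq_hw_ctx. */
struct hctx_sketch {
	int runs;  /* how many times the queue was kicked */
};

/* Stand-in for the run-queue call: dereferences hctx unconditionally,
 * so passing NULL would crash. */
static void run_hw_queue(struct hctx_sketch *hctx)
{
	hctx->runs++;
}

/* Slow path taken when no tag is free. hctx is NULL when allocating from
 * the reserved pool. Returns 1 if the queue was kicked, 0 otherwise. */
static int kick_queue_before_retry(struct hctx_sketch *hctx)
{
	if (!hctx)
		return 0;
	run_hw_queue(hctx);
	return 1;
}
```

In this sketch a reserved-pool caller passes NULL and simply proceeds to
the retry without kicking any queue.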

Tested by hammering mtip32xx with concurrent smartctl/hdparm.

Signed-off-by: Sam Bradshaw 
Signed-off-by: Selvan Mani 
Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
Cc: stable@kernel.org

Added appropriate comment.

Signed-off-by: Jens Axboe

blk-mq: fix double-free in error path

2015-02-11T16:35:21+00:00

If the allocation of bt->bs fails, then bt->map can be freed twice, once
in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in
blk_mq_init_bitmap_tags() -> bt_free().  Fix by setting the pointer to
NULL after the first free.
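
The pattern can be sketched in userspace C, with free() standing in for
kfree(). struct bt_sketch and the fail_bs_alloc switch are hypothetical;
the switch exists only so that the error path can be exercised on demand.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct blk_mq_bitmap_tags. */
struct bt_sketch {
	unsigned long *map;
	int *bs;
};

/* Test hook (an assumption of this sketch): force the second allocation
 * to fail so the error path runs. */
static int fail_bs_alloc;

static int bt_alloc(struct bt_sketch *bt, size_t nr)
{
	bt->map = calloc(nr, sizeof(*bt->map));
	if (!bt->map)
		return -1;

	bt->bs = fail_bs_alloc ? NULL : calloc(nr, sizeof(*bt->bs));
	if (!bt->bs) {
		free(bt->map);
		bt->map = NULL;  /* a later bt_free() must not free it again */
		return -1;
	}
	return 0;
}

/* Called by the owner on its own error path and on teardown. */
static void bt_free(struct bt_sketch *bt)
{
	free(bt->map);  /* free(NULL) is a no-op */
	free(bt->bs);
	bt->map = NULL;
	bt->bs = NULL;
}
```

Without the assignment to NULL in the error path, the bt_free() call that
follows a failed bt_alloc() would free bt->map a second time.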

Cc: 
Signed-off-by: Tony Battersby 
Signed-off-by: Jens Axboe

blk-mq: add tag allocation policy

2015-01-23T21:18:00+00:00

This is the blk-mq part to support tag allocation policy. The default
allocation policy isn't changed (though it's not a strict FIFO). The new
policy is round-robin for libata, but it is a best-effort implementation. If
multiple tasks are competing, the tags returned will be mixed (which is
unavoidable even with !mq, as requests from different tasks can be
mixed in the queue).
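
A round-robin tag search can be sketched in userspace C as follows. This
is not the blk-mq implementation: tag_pool_sketch, tag_get() and tag_put()
are hypothetical, the pool is a plain array rather than a bitmap, and the
non-round-robin branch simply returns the lowest free tag, which is a
deliberate simplification and not the real default policy.

```c
#include <stdbool.h>

#define NR_TAGS 8

/* Hypothetical stand-in for a tag map plus the round-robin position. */
struct tag_pool_sketch {
	bool busy[NR_TAGS];
	unsigned int next;  /* where the next round-robin search starts */
};

/* Returns a free tag or -1 if all tags are busy. With round_robin set,
 * the search starts right after the previously allocated tag and wraps
 * around; otherwise it starts at tag 0. */
static int tag_get(struct tag_pool_sketch *pool, bool round_robin)
{
	unsigned int start = round_robin ? pool->next : 0;

	for (unsigned int i = 0; i < NR_TAGS; i++) {
		unsigned int tag = (start + i) % NR_TAGS;

		if (!pool->busy[tag]) {
			pool->busy[tag] = true;
			pool->next = (tag + 1) % NR_TAGS;
			return (int)tag;
		}
	}
	return -1;
}

static void tag_put(struct tag_pool_sketch *pool, int tag)
{
	pool->busy[tag] = false;
}
```

With round-robin enabled, a tag that was just released is not handed out
again until the search has wrapped around. When several tasks allocate
concurrently, each still gets the next free tag, so the sequence seen by
any one task is interleaved with the others.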

Cc: Jens Axboe 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Signed-off-by: Shaohua Li 
Signed-off-by: Jens Axboe