| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 7ecd2cd4fae3e8410c0a6620f3a83dcdbb254f02 ]
Otherwise it can create unaligned writes on zoned devices.
Fixes: a805a4fa4fa3 ("block: introduce zone_write_granularity limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b402328a24ee7193a8ab84277c0c90ae16768126 ]
elevator_get_default() and elv_support_iosched() both check for whether
or not q->tag_set is non-NULL, however it's not possible for them to be
NULL. This messes up some static checkers, as the checking of tag_set
isn't consistent.
Remove the checks, which both simplifies the logic and avoids checker
errors.
Signed-off-by: SurajSonawane2415 <surajsonawane0215@gmail.com>
Link: https://lore.kernel.org/r/20241007111416.13814-1-surajsonawane0215@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2ff949441802a8d076d9013c7761f63e8ae5a9bd ]
blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
causes unnecessary failures in NVMe passthrough I/O, reproducible as
follows:
- register a 2 page, page-aligned buffer against a ring
- use that buffer to do a 1 page io_uring NVMe passthrough read
The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
fail. This failure is unnecessary, as when the check succeeds, it means
we've checked the entire buffer that will be used by the request - i.e.
blk_rq_map_user_bvec should complete successfully. Therefore, terminate
the loop early and return successfully when the check bytes + bv->bv_len
> nr_iter succeeds.
While we're at it, also remove the check that all segments in the bvec
are single-page. While this seems to be true for all users of the
function, it doesn't appear to be required anywhere downstream.
CC: stable@vger.kernel.org
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Co-developed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 37987547932c ("block: extend functionality to map bvec iterator")
Link: https://lore.kernel.org/r/20241023211519.4177873-1-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit e972b08b91ef48488bae9789f03cfedb148667fb upstream.
We're seeing crashes from rq_qos_wake_function that look like this:
BUG: unable to handle page fault for address: ffffafe180a40084
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
Oops: Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
try_to_wake_up+0x5a/0x6a0
rq_qos_wake_function+0x71/0x80
__wake_up_common+0x75/0xa0
__wake_up+0x36/0x60
scale_up.part.0+0x50/0x110
wb_timer_fn+0x227/0x450
...
So rq_qos_wake_function() calls wake_up_process(data->task), which calls
try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).
p comes from data->task, and data comes from the waitqueue entry, which
is stored on the waiter's stack in rq_qos_wait(). Analyzing the core
dump with drgn, I found that the waiter had already woken up and moved
on to a completely unrelated code path, clobbering what was previously
data->task. Meanwhile, the waker was passing the clobbered garbage in
data->task to wake_up_process(), leading to the crash.
What's happening is that in between rq_qos_wake_function() deleting the
waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding
that it already got a token and returning. The race looks like this:
rq_qos_wait() rq_qos_wake_function()
==============================================================
prepare_to_wait_exclusive()
data->got_token = true;
list_del_init(&curr->entry);
if (data.got_token)
break;
finish_wait(&rqw->wait, &data.wq);
^- returns immediately because
list_empty_careful(&wq_entry->entry)
is true
... return, go do something else ...
wake_up_process(data->task)
(NO LONGER VALID!)-^
Normally, finish_wait() is supposed to synchronize against the waker.
But, as noted above, it is returning immediately because the waitqueue
entry has already been removed from the waitqueue.
The bug is that rq_qos_wake_function() is accessing the waitqueue entry
AFTER deleting it. Note that autoremove_wake_function() wakes the waiter
and THEN deletes the waitqueue entry, which is the proper order.
Fix it by swapping the order. We also need to use
list_del_init_careful() to match the list_empty_careful() in
finish_wait().
Fixes: 38cfb5a45ee0 ("blk-wbt: improve waking of tasks")
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/d3bee2463a67b1ee597211823bf7ad3721c26e41.1729014591.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9bce8005ec0dcb23a58300e8522fe4a31da606fa ]
Recently running UBSAN caught few out of bound shifts in the
ioc_forgive_debts() function:
UBSAN: shift-out-of-bounds in block/blk-iocost.c:2142:38
shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long
long')
...
UBSAN: shift-out-of-bounds in block/blk-iocost.c:2144:30
shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long
long')
...
Call Trace:
<IRQ>
dump_stack_lvl+0xca/0x130
__ubsan_handle_shift_out_of_bounds+0x22c/0x280
? __lock_acquire+0x6441/0x7c10
ioc_timer_fn+0x6cec/0x7750
? blk_iocost_init+0x720/0x720
? call_timer_fn+0x5d/0x470
call_timer_fn+0xfa/0x470
? blk_iocost_init+0x720/0x720
__run_timer_base+0x519/0x700
...
Actual impact of this issue was not identified but I propose to fix the
undefined behaviour.
The proposed fix to prevent those out of bound shifts consist of
precalculating exponent before using it the shift operations by taking
min value from the actual exponent and maximum possible number of bits.
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Konstantin Ovsepian <ovs@ovs.to>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240822154137.2627818-1-ovs@ovs.to
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 26e197b7f9240a4ac301dd0ad520c0c697c2ea7d ]
The blk_add_partition() function initially used a single if-condition
(IS_ERR(part)) to check for errors when adding a partition. This was
modified to handle the specific case of -ENXIO separately, allowing the
function to proceed without logging the error in this case. However,
this change unintentionally left a path where md_autodetect_dev()
could be called without confirming that part is a valid pointer.
This commit separates the error handling logic by splitting the
initial if-condition, improving code readability and handling specific
error scenarios explicitly. The function now distinguishes the general
error case from -ENXIO without altering the existing behavior of
md_autodetect_dev() calls.
Fixes: b72053072c0b (block: allow partitions on host aware zone devices)
Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240911132954.5874-1-riyandhiman14@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 25c1772a0493463408489b1fae65cf77fe46cac1 ]
Utilize the %pe print specifier to get the symbolic error name as a
string (i.e "-ENOMEM") in the log message instead of the error code to
increase its readablility.
This change was suggested in
https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/
Signed-off-by: Christian Heusel <christian@heusel.eu>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240111231521.1596838-1-christian@heusel.eu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 26e197b7f924 ("block: fix potential invalid pointer dereference in blk_add_partition")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 73aeab373557fa6ee4ae0b742c6211ccd9859280 ]
Original state:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:
Without the patch:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\------------------------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;
With the patch:
-------------------------------------------
| |
Process 1 Process 2 Process 3 | Process 4
(BIC1) (BIC2) (BIC3) | (BIC4)
| | | |
\-------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.
Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.
Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1ba0403ac6447f2d63914fb760c44a3b19c44eaf ]
After commit 42c306ed7233 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.
Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.
Fixes: 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 42c306ed723321af4003b2a41bb73728cab54f85 ]
Consider the following scenario:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\-------------\ \-------------\ \--------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.
Expected result: caller will allocate a new bfqq for BIC1
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
| | |
\-------------\ \--------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 1 3
Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0e456dba86c7f9a19792204a044835f1ca2c8dbb ]
Consider the following merge chain:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.
Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 18ad4df091dd5d067d2faa8fce1180b79f7041a7 ]
1) initial state, three tasks:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
| Λ | Λ | Λ
| | | | | |
V | V | V |
bfqq1 bfqq2 bfqq3
process ref: 1 1 1
2) bfqq1 merged to bfqq2:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
| | | Λ
\--------------\| | |
V V |
bfqq1--------->bfqq2 bfqq3
process ref: 0 2 1
3) bfqq2 merged to bfqq3:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
here -> Λ | |
\--------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3
process ref: 0 1 3
In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then
get bfqq3 through merge chain, and finially handle IO by bfqq3.
Howerver, current code will think bfqq2 is owned by BIC1, like initial
state, and set bfqq2->bic to BIC1.
bfq_insert_request
-> by Process 1
bfqq = bfq_init_rq(rq)
bfqq = bfq_get_bfqq_handle_split
bfqq = bic_to_bfqq
-> get bfqq2 from BIC1
bfqq->ref++
rq->elv.priv[0] = bic
rq->elv.priv[1] = bfqq
if (bfqq_process_refs(bfqq) == 1)
bfqq->bic = bic
-> record BIC1 to bfqq2
__bfq_insert_request
new_bfqq = bfq_setup_cooperator
-> get bfqq3 from bfqq2->new_bfqq
bfqq_request_freed(bfqq)
new_bfqq->ref++
rq->elv.priv[1] = new_bfqq
-> handle IO by bfqq3
Fix the problem by checking bfqq is from merge chain fist. And this
might fix a following problem reported by our syzkaller(unreproducible):
==================================================================
BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595
CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x10d/0x610 mm/kasan/report.c:475
kasan_report+0x8e/0xc0 mm/kasan/report.c:588
bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
bfq_init_rq block/bfq-iosched.c:6876 [inline]
bfq_insert_request block/bfq-iosched.c:6254 [inline]
bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
</TASK>
Allocated by task 20776:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:763 [inline]
slab_alloc_node mm/slub.c:3458 [inline]
kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
ioc_create_icq block/blk-ioc.c:370 [inline]
ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
__blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
__submit_bio+0x331/0x6f0 block/blk-core.c:624
__submit_bio_noacct_mq block/blk-core.c:703 [inline]
submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
Freed by task 946:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1815 [inline]
slab_free_freelist_hook mm/slub.c:1841 [inline]
slab_free mm/slub.c:3786 [inline]
kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
__do_softirq+0x1b0/0x7a2 kernel/softirq.c:553
Last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
Second to last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
The buggy address belongs to the object at ffff888123839d68
which belongs to the cache bfq_io_cq of size 1360
The buggy address is located 336 bytes inside of
freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)
The buggy address belongs to the physical page:
page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f3c89983cb4fc00be64eb0d5cbcfcdf2cacb965e ]
Commit 82b74cac2849 ("blk-ioprio: Convert from rqos policy to direct
call") pushed setting bio I/O priority down into blk_mq_submit_bio()
-- which is too low within block core's submit_bio() because it
skips setting I/O priority for block drivers that implement
fops->submit_bio() (e.g. DM, MD, etc).
Fix this by moving bio_set_ioprio() up from blk-mq.c to blk-core.c and
call it from submit_bio(). This ensures all block drivers call
bio_set_ioprio() during initial bio submission.
Fixes: a78418e6a04c ("block: Always initialize bio IO priority on submit")
Co-developed-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
[snitzer: revised commit header]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240130202638.62600-2-snitzer@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e8bc14d116aeac8f0f133ec8d249acf4e0658da7 ]
Now that there are no indirect calls for PI processing there is no
way to dereference a NULL pointer here. Additionally drivers now always
freeze the queue (or in case of stacking drivers use their internal
equivalent) around changing the integrity profile.
This is effectively a revert of commit 3df49967f6f1 ("block: flush the
integrity workqueue in blk_integrity_unregister").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240613084839.1044015-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b313a8c835516bdda85025500be866ac8a74e022 ]
Lockdep reported a warning in Linux version 6.6:
[ 414.344659] ================================
[ 414.345155] WARNING: inconsistent lock state
[ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[ 414.346221] --------------------------------
[ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.351204] {IN-SOFTIRQ-W} state was registered at:
[ 414.351751] lock_acquire+0x18d/0x460
[ 414.352218] _raw_spin_lock_irqsave+0x39/0x60
[ 414.352769] __wake_up_common_lock+0x22/0x60
[ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0
[ 414.353829] sbitmap_queue_clear+0xdd/0x270
[ 414.354338] blk_mq_put_tag+0xdf/0x170
[ 414.354807] __blk_mq_free_request+0x381/0x4d0
[ 414.355335] blk_mq_free_request+0x28b/0x3e0
[ 414.355847] __blk_mq_end_request+0x242/0xc30
[ 414.356367] scsi_end_request+0x2c1/0x830
[ 414.345155] WARNING: inconsistent lock state
[ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[ 414.346221] --------------------------------
[ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.351204] {IN-SOFTIRQ-W} state was registered at:
[ 414.351751] lock_acquire+0x18d/0x460
[ 414.352218] _raw_spin_lock_irqsave+0x39/0x60
[ 414.352769] __wake_up_common_lock+0x22/0x60
[ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0
[ 414.353829] sbitmap_queue_clear+0xdd/0x270
[ 414.354338] blk_mq_put_tag+0xdf/0x170
[ 414.354807] __blk_mq_free_request+0x381/0x4d0
[ 414.355335] blk_mq_free_request+0x28b/0x3e0
[ 414.355847] __blk_mq_end_request+0x242/0xc30
[ 414.356367] scsi_end_request+0x2c1/0x830
[ 414.356863] scsi_io_completion+0x177/0x1610
[ 414.357379] scsi_complete+0x12f/0x260
[ 414.357856] blk_complete_reqs+0xba/0xf0
[ 414.358338] __do_softirq+0x1b0/0x7a2
[ 414.358796] irq_exit_rcu+0x14b/0x1a0
[ 414.359262] sysvec_call_function_single+0xaf/0xc0
[ 414.359828] asm_sysvec_call_function_single+0x1a/0x20
[ 414.360426] default_idle+0x1e/0x30
[ 414.360873] default_idle_call+0x9b/0x1f0
[ 414.361390] do_idle+0x2d2/0x3e0
[ 414.361819] cpu_startup_entry+0x55/0x60
[ 414.362314] start_secondary+0x235/0x2b0
[ 414.362809] secondary_startup_64_no_verify+0x18f/0x19b
[ 414.363413] irq event stamp: 428794
[ 414.363825] hardirqs last enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[ 414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[ 414.365629] softirqs last enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[ 414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[ 414.367425]
other info that might help us debug this:
[ 414.368194] Possible unsafe locking scenario:
[ 414.368900] CPU0
[ 414.369225] ----
[ 414.369548] lock(&sbq->ws[i].wait);
[ 414.370000] <Interrupt>
[ 414.370342] lock(&sbq->ws[i].wait);
[ 414.370802]
*** DEADLOCK ***
[ 414.371569] 5 locks held by kworker/u10:3/1152:
[ 414.372088] #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[ 414.373180] #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[ 414.374384] #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[ 414.375342] #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.376377] #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[ 414.378607]
stack backtrace:
[ 414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[ 414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[ 414.381805] Call Trace:
[ 414.382136] <TASK>
[ 414.382429] dump_stack_lvl+0x91/0xf0
[ 414.382884] mark_lock_irq+0xb3b/0x1260
[ 414.383367] ? __pfx_mark_lock_irq+0x10/0x10
[ 414.383889] ? stack_trace_save+0x8e/0xc0
[ 414.384373] ? __pfx_stack_trace_save+0x10/0x10
[ 414.384903] ? graph_lock+0xcf/0x410
[ 414.385350] ? save_trace+0x3d/0xc70
[ 414.385808] mark_lock.part.20+0x56d/0xa90
[ 414.386317] mark_held_locks+0xb0/0x110
[ 414.386791] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 414.387320] lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.387901] ? _raw_spin_unlock_irq+0x28/0x50
[ 414.388422] trace_hardirqs_on+0x58/0x100
[ 414.388917] _raw_spin_unlock_irq+0x28/0x50
[ 414.389422] __blk_mq_tag_busy+0x1d6/0x2a0
[ 414.389920] __blk_mq_get_driver_tag+0x761/0x9f0
[ 414.390899] blk_mq_dispatch_rq_list+0x1780/0x1ee0
[ 414.391473] ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[ 414.392070] ? sbitmap_get+0x2b8/0x450
[ 414.392533] ? __blk_mq_get_driver_tag+0x210/0x9f0
[ 414.393095] __blk_mq_sched_dispatch_requests+0xd99/0x1690
[ 414.393730] ? elv_attempt_insert_merge+0x1b1/0x420
[ 414.394302] ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[ 414.394970] ? lock_acquire+0x18d/0x460
[ 414.395456] ? blk_mq_run_hw_queue+0x637/0xa00
[ 414.395986] ? __pfx_lock_acquire+0x10/0x10
[ 414.396499] blk_mq_sched_dispatch_requests+0x109/0x190
[ 414.397100] blk_mq_run_hw_queue+0x66e/0xa00
[ 414.397616] blk_mq_flush_plug_list.part.17+0x614/0x2030
[ 414.398244] ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[ 414.398897] ? writeback_sb_inodes+0x241/0xcc0
[ 414.399429] blk_mq_flush_plug_list+0x65/0x80
[ 414.399957] __blk_flush_plug+0x2f1/0x530
[ 414.400458] ? __pfx___blk_flush_plug+0x10/0x10
[ 414.400999] blk_finish_plug+0x59/0xa0
[ 414.401467] wb_writeback+0x7cc/0x920
[ 414.401935] ? __pfx_wb_writeback+0x10/0x10
[ 414.402442] ? mark_held_locks+0xb0/0x110
[ 414.402931] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 414.403462] ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.404062] wb_workfn+0x2b3/0xcf0
[ 414.404500] ? __pfx_wb_workfn+0x10/0x10
[ 414.404989] process_scheduled_works+0x432/0x13f0
[ 414.405546] ? __pfx_process_scheduled_works+0x10/0x10
[ 414.406139] ? do_raw_spin_lock+0x101/0x2a0
[ 414.406641] ? assign_work+0x19b/0x240
[ 414.407106] ? lock_is_held_type+0x9d/0x110
[ 414.407604] worker_thread+0x6f2/0x1160
[ 414.408075] ? __kthread_parkme+0x62/0x210
[ 414.408572] ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.409168] ? __kthread_parkme+0x13c/0x210
[ 414.409678] ? __pfx_worker_thread+0x10/0x10
[ 414.410191] kthread+0x33c/0x440
[ 414.410602] ? __pfx_kthread+0x10/0x10
[ 414.411068] ret_from_fork+0x4d/0x80
[ 414.411526] ? __pfx_kthread+0x10/0x10
[ 414.411993] ret_from_fork_asm+0x1b/0x30
[ 414.412489] </TASK>
When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.
blk_mq_prep_dispatch_rq
blk_mq_get_driver_tag
__blk_mq_get_driver_tag
__blk_mq_alloc_driver_tag
blk_mq_tag_busy -> tag is already busy
// failed to get driver tag
blk_mq_mark_tag_wait
spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
__add_wait_queue(wq, wait) -> wait queue active
blk_mq_get_driver_tag
__blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.
As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.
Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.
Fixes: 4f1731df60f9 ("blk-mq: fix potential io hang by wrong 'wake_batch'")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 7e04da2dc7013af50ed3a2beb698d5168d1e594b upstream.
Our test report the following hung task:
[ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds.
[ 2538.459427] Call trace:
[ 2538.459430] __switch_to+0x174/0x338
[ 2538.459436] __schedule+0x628/0x9c4
[ 2538.459442] schedule+0x7c/0xe8
[ 2538.459447] schedule_preempt_disabled+0x24/0x40
[ 2538.459453] __mutex_lock+0x3ec/0xf04
[ 2538.459456] __mutex_lock_slowpath+0x14/0x24
[ 2538.459459] mutex_lock+0x30/0xd8
[ 2538.459462] del_gendisk+0xdc/0x350
[ 2538.459466] sd_remove+0x30/0x60
[ 2538.459470] device_release_driver_internal+0x1c4/0x2c4
[ 2538.459474] device_release_driver+0x18/0x28
[ 2538.459478] bus_remove_device+0x15c/0x174
[ 2538.459483] device_del+0x1d0/0x358
[ 2538.459488] __scsi_remove_device+0xa8/0x198
[ 2538.459493] scsi_forget_host+0x50/0x70
[ 2538.459497] scsi_remove_host+0x80/0x180
[ 2538.459502] usb_stor_disconnect+0x68/0xf4
[ 2538.459506] usb_unbind_interface+0xd4/0x280
[ 2538.459510] device_release_driver_internal+0x1c4/0x2c4
[ 2538.459514] device_release_driver+0x18/0x28
[ 2538.459518] bus_remove_device+0x15c/0x174
[ 2538.459523] device_del+0x1d0/0x358
[ 2538.459528] usb_disable_device+0x84/0x194
[ 2538.459532] usb_disconnect+0xec/0x300
[ 2538.459537] hub_event+0xb80/0x1870
[ 2538.459541] process_scheduled_works+0x248/0x4dc
[ 2538.459545] worker_thread+0x244/0x334
[ 2538.459549] kthread+0x114/0x1bc
[ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds.
[ 2538.461014] Call trace:
[ 2538.461016] __switch_to+0x174/0x338
[ 2538.461021] __schedule+0x628/0x9c4
[ 2538.461025] schedule+0x7c/0xe8
[ 2538.461030] blk_queue_enter+0xc4/0x160
[ 2538.461034] blk_mq_alloc_request+0x120/0x1d4
[ 2538.461037] scsi_execute_cmd+0x7c/0x23c
[ 2538.461040] ioctl_internal_command+0x5c/0x164
[ 2538.461046] scsi_set_medium_removal+0x5c/0xb0
[ 2538.461051] sd_release+0x50/0x94
[ 2538.461054] blkdev_put+0x190/0x28c
[ 2538.461058] blkdev_release+0x28/0x40
[ 2538.461063] __fput+0xf8/0x2a8
[ 2538.461066] __fput_sync+0x28/0x5c
[ 2538.461070] __arm64_sys_close+0x84/0xe8
[ 2538.461073] invoke_syscall+0x58/0x114
[ 2538.461078] el0_svc_common+0xac/0xe0
[ 2538.461082] do_el0_svc+0x1c/0x28
[ 2538.461087] el0_svc+0x38/0x68
[ 2538.461090] el0t_64_sync_handler+0x68/0xbc
[ 2538.461093] el0t_64_sync+0x1a8/0x1ac
T1: T2:
sd_remove
del_gendisk
__blk_mark_disk_dead
blk_freeze_queue_start
++q->mq_freeze_depth
bdev_release
mutex_lock(&disk->open_mutex)
sd_release
scsi_execute_cmd
blk_queue_enter
wait_event(!q->mq_freeze_depth)
mutex_lock(&disk->open_mutex)
SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in
this scenario. This is a classic ABBA deadlock. To fix the deadlock,
make sure we don't try to acquire disk->open_mutex after freezing
the queue.
Cc: stable@vger.kernel.org
Fixes: eec1be4c30df ("block: delete partitions later in del_gendisk")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: and Cc: stable tags are missing. Otherwise this patch looks fine
Link: https://lore.kernel.org/r/20240724070412.22521-1-yang.yang@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 39823b47bbd40502632ffba90ebb34fff7c8b5e8 ]
The current tag reservation code is based on a misunderstanding of the
meaning of data->shallow_depth. Fix the tag reservation code as follows:
* By default, do not reserve any tags for synchronous requests because
for certain use cases reserving tags reduces performance. See also
Harshit Mogalapalli, [bug-report] Performance regression with fio
sequential-write on a multipath setup, 2024-03-07
(https://lore.kernel.org/linux-block/5ce2ae5d-61e2-4ede-ad55-551112602401@oracle.com/)
* Reduce min_shallow_depth to one because min_shallow_depth must be less
than or equal any shallow_depth value.
* Scale dd->async_depth from the range [1, nr_requests] to [1,
bits_per_sbitmap_word].
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Zhiguo Niu <zhiguo.niu@unisoc.com>
Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240509170149.7639-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6259151c04d4e0085e00d2dcb471ebdd1778e72e ]
Call .limit_depth() after data->hctx has been set such that data->hctx can
be used in .limit_depth() implementations.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Zhiguo Niu <zhiguo.niu@unisoc.com>
Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240509170149.7639-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 899ee2c3829c5ac14bfc7d3c4a5846c0b709b78f ]
Metadata added by bio_integrity_prep is using plain kmalloc, which leads
to random kernel memory being written media. For PI metadata this is
limited to the app tag that isn't used by kernel generated metadata,
but for non-PI metadata the entire buffer leaks kernel memory.
Fix this by adding the __GFP_ZERO flag to allocations for writes.
Fixes: 7ba1ba12eeef ("block: Block layer data integrity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ccb326b5f9e623eb7f130fbbf2505ec0e2dcaff9 ]
Running syzkaller with the newly reintroduced signed integer overflow
sanitizer shows this report:
[ 62.982337] ------------[ cut here ]------------
[ 62.985692] cgroup: Invalid name
[ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46
[ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1
[ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long'
[ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1
[ 62.999369] random: crng reseeded on system resumption
[ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000)
[ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1
[ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 63.000682] Call Trace:
[ 63.000686] <TASK>
[ 63.000731] dump_stack_lvl+0x93/0xd0
[ 63.000919] __get_user_pages+0x903/0xd30
[ 63.001030] __gup_longterm_locked+0x153e/0x1ba0
[ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50
[ 63.001072] ? try_get_folio+0x29c/0x2d0
[ 63.001083] internal_get_user_pages_fast+0x1119/0x1530
[ 63.001109] iov_iter_extract_pages+0x23b/0x580
[ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220
[ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410
[ 63.001297] __iomap_dio_rw+0xab4/0x1810
[ 63.001316] iomap_dio_rw+0x45/0xa0
[ 63.001328] ext4_file_write_iter+0xdde/0x1390
[ 63.001372] vfs_write+0x599/0xbd0
[ 63.001394] ksys_write+0xc8/0x190
[ 63.001403] do_syscall_64+0xd4/0x1b0
[ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60
[ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 63.001535] RIP: 0033:0x7f7fd3ebf539
[ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
[ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539
[ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004
[ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000
[ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8
...
[ 63.018142] ---[ end trace ]---
Historically, the signed integer overflow sanitizer did not work in the
kernel due to its interaction with `-fwrapv` but this has since been
changed [1] in the newest version of Clang; It was re-enabled in the
kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow
sanitizer").
Let's rework this overflow checking logic to not actually perform an
overflow during the check itself, thus avoiding the UBSAN splat.
[1]: https://github.com/llvm/llvm-project/pull/82432
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d0321c812d89c5910d8da8e4b10c891c6b96ff70 ]
Friedrich Weber reported a kernel crash problem and bisected to commit
81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine").
The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch().
We don't initialize its queuelist just for this first request, although
the queuelist of all later popped requests will be initialized.
Fix it by changing to use "list_add_tail(&rq->queuelist, pending)" so
rq->queuelist doesn't need to be initialized. It should be ok since rq
can't be on any list when PREFLUSH or POSTFLUSH, has no move actually.
Please note the commit 81ada09cc25e ("blk-flush: reuse rq queuelist in
flush state machine") also has another requirement that no drivers would
touch rq->queuelist after blk_mq_end_request() since we will reuse it to
add rq to the post-flush pending list in POSTFLUSH. If this is not true,
we will have to revert that commit IMHO.
This updated version adds "list_del_init(&rq->queuelist)" in flush rq
callback since the dm layer may submit request of a weird invalid format
(REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH), which causes double list_add
if without this "list_del_init(&rq->queuelist)". The weird invalid format
problem should be fixed in dm layer.
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Closes: https://lore.kernel.org/lkml/14b89dfb-505c-49f7-aebb-01c54451db40@proxmox.com/
Closes: https://lore.kernel.org/lkml/c9d03ff7-27c5-4ebd-b3f6-5a90d96f35ba@proxmox.com/
Fixes: 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine")
Cc: Christoph Hellwig <hch@lst.de>
Cc: ming.lei@redhat.com
Cc: bvanassche@acm.org
Tested-by: Friedrich Weber <f.weber@proxmox.com>
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240608143115.972486-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9b1ebce6a1fded90d4a1c6c57dc6262dac4c4c14 ]
Clang static checker (scan-build) warning:
block/sed-opal.c:line 317, column 3
Value stored to 'ret' is never read.
Fix this problem by returning the error code when keyring_search() failed.
Otherwise, 'key' will have a wrong value when 'kerf' stores the error code.
Fixes: 3bfeb6125664 ("block: sed-opal: keyring support for SED keys")
Signed-off-by: Su Hui <suhui@nfschina.com>
Link: https://lore.kernel.org/r/20240611073659.429582-1-suhui@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9d230c09964e6e18c8f6e4f0d41ee90eef45ec1c ]
During a cgroup_rstat_flush() call, the lowest level of nodes are flushed
first before their parents. Since commit 3b8cc6298724 ("blk-cgroup:
Optimize blkcg_rstat_flush()"), iostat propagation was still done to
the parent. Grandparent, however, may not get the iostat update if the
parent has no blkg_iostat_set queued in its lhead lockless list.
Fix this iostat propagation problem by queuing the parent's global
blkg->iostat into one of its percpu lockless lists to make sure that
the delta will always be propagated up to the grandparent and so on
toward the root blkcg.
Note that successive calls to __blkcg_rstat_flush() are serialized by
the cgroup_rstat_lock. So no special barrier is used in the reading
and writing of blkg->iostat.lqueued.
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Closes: https://lore.kernel.org/lkml/ZkO6l%2FODzadSgdhC@dschatzberg-fedora-PF3DHTBV/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240515143059.276677-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d0aac2363549e12cc79b8e285f13d5a9f42fd08e ]
__blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start
is being executed.
If WRITE of `->lqueued` is re-ordered with READ of 'bisc->lnode.next' in
the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one
stat instance being added in blk_cgroup_bio_start(), then the local
list in __blkcg_rstat_flush() could be corrupted.
Fix the issue by adding one barrier.
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240515013157.443672-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin < |