| Age | Commit message | Author | Files | Lines |
|
Pull block fixes from Jens Axboe:
"A set of fixes for an issue with md array assembly and drbd for
devices supporting write zeros"
* tag 'block-6.17-20250918' of git://git.kernel.dk/linux:
drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mikulas Patocka:
- fix integer overflow in dm-stripe
- limit tag size in dm-integrity to 255 bytes
- fix 'alignment inconsistency' warning in dm-raid
* tag 'for-6.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-raid: don't set io_min and io_opt for raid1
dm-integrity: limit MAX_TAG_SIZE to 255
dm-stripe: fix a possible integer overflow
|
|
These commands
modprobe brd rd_size=1048576
vgcreate vg /dev/ram*
lvcreate -m4 -L10 -n lv vg
trigger the following warnings:
device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency
device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency
The warnings are caused by the fact that io_min is 512 and physical block
size is 4096.
If there's chunk-less raid, such as raid1, io_min shouldn't be set to zero
because it would be raised to 512 and it would trigger the warning.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: stable@vger.kernel.org
|
|
The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be
equal to max_write_zeroes_sectors if it is set to a non-zero value.
However, the stacked md drivers call md_init_stacking_limits() to
initialize this parameter to UINT_MAX but only adjust
max_write_zeroes_sectors when setting limits. Therefore, this
discrepancy triggers a value check failure in blk_validate_limits().
$ modprobe scsi_debug num_parts=2 dev_size_mb=8 lbprz=1 lbpws=1
$ mdadm --create /dev/md0 --level=0 --raid-device=2 /dev/sda1 /dev/sda2
mdadm: Defaulting to version 1.2 metadata
mdadm: RUN_ARRAY failed: Invalid argument
Fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to
max_write_zeroes_sectors. The linear and raid0 drivers support
write zeroes, so they can support the unmap write zeroes operation if all of
the backend devices support it. However, the raid1/10/5 drivers don't
support write zeroes, so we have to set it to zero.
Fixes: 0c40d7cb5ef3 ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits")
Reported-by: John Garry <john.g.garry@oracle.com>
Closes: https://lore.kernel.org/linux-block/803a2183-a0bb-4b7a-92f1-afc5097630d2@oracle.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/linux-raid/20250910111107.3247530-2-yi.zhang@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
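The invariant described above can be sketched in userspace C. This is a minimal model, not the kernel code: struct limits_model and all three helpers are hypothetical stand-ins for the two queue_limits fields and for the check that the message attributes to blk_validate_limits().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two queue_limits fields named above. */
struct limits_model {
	uint32_t max_write_zeroes_sectors;
	uint32_t max_hw_wzeroes_unmap_sectors;
};

/* Stacking init as described: the unmap limit starts out at UINT_MAX. */
static void init_stacking_model(struct limits_model *lim)
{
	lim->max_write_zeroes_sectors = UINT32_MAX;
	lim->max_hw_wzeroes_unmap_sectors = UINT32_MAX;
}

/*
 * Invariant from the commit message: a non-zero unmap limit must equal
 * the write zeroes limit.
 */
static bool wzeroes_limits_valid(const struct limits_model *lim)
{
	return lim->max_hw_wzeroes_unmap_sectors == 0 ||
	       lim->max_hw_wzeroes_unmap_sectors == lim->max_write_zeroes_sectors;
}

/*
 * Shape of the fix: set both limits together. Personalities without
 * write zeroes support (raid1/10/5 in the message) end up with zero.
 */
static void set_wzeroes_model(struct limits_model *lim, uint32_t sectors,
			      bool supports_write_zeroes)
{
	lim->max_write_zeroes_sectors = supports_write_zeroes ? sectors : 0;
	lim->max_hw_wzeroes_unmap_sectors = lim->max_write_zeroes_sectors;
}
```

Adjusting only max_write_zeroes_sectors after init_stacking_model() leaves the model invalid, which corresponds to the RUN_ARRAY failure shown above; going through set_wzeroes_model() keeps it valid in both the supported and unsupported cases.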
|
|
MAX_TAG_SIZE was 0x1a8 and it may be truncated in the "bi->metadata_size
= ic->tag_size" assignment. We need to limit it to 255.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
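The truncation can be illustrated with a small userspace sketch. It assumes the destination field is 8 bits wide, which the new limit of 255 suggests but the message does not state; store_metadata_size() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define OLD_MAX_TAG_SIZE 0x1a8	/* 424, the previous limit */
#define NEW_MAX_TAG_SIZE 255	/* largest value an 8-bit field can hold */

/* Assumption: the field being assigned is 8 bits wide. */
static uint8_t store_metadata_size(unsigned int tag_size)
{
	return (uint8_t)tag_size;	/* keeps only the low 8 bits */
}
```

Under that assumption, storing 0x1a8 yields 0xa8 (168), while any value up to 255 is stored unchanged.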
|
|
In md_do_sync(), when md_sync_action() returns ACTION_FROZEN, the subsequent
call to md_sync_position() will return MaxSector. This causes
'curr_resync' (and later 'recovery_offset') to be set to MaxSector too,
which incorrectly signals that recovery/resync has completed, even though
disk data has not actually been updated.
To fix this issue, skip updating any offset values when the sync action
is FROZEN. The same holds true for IDLE.
Fixes: 7d9f107a4e94 ("md: use new helpers in md_do_sync()")
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250904073452.3408516-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
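A simplified userspace model of the behaviour described above follows. The enum values, MAX_SECTOR and both functions are illustrative names chosen for this sketch; they only mirror the rule that FROZEN and IDLE must not overwrite a recorded offset.

```c
#include <stdint.h>

#define MAX_SECTOR UINT64_MAX	/* stand-in for the kernel's MaxSector */

enum sync_action_model { MODEL_RESYNC, MODEL_RECOVER, MODEL_FROZEN, MODEL_IDLE };

/* Model of the position lookup: FROZEN and IDLE have no real position. */
static uint64_t sync_position_model(enum sync_action_model action,
				    uint64_t saved)
{
	switch (action) {
	case MODEL_RESYNC:
	case MODEL_RECOVER:
		return saved;
	default:
		return MAX_SECTOR;
	}
}

/* Offset to record after the fix: FROZEN and IDLE keep the old value. */
static uint64_t next_recorded_offset(enum sync_action_model action,
				     uint64_t recorded, uint64_t saved)
{
	if (action == MODEL_FROZEN || action == MODEL_IDLE)
		return recorded;
	return sync_position_model(action, saved);
}
```

Without the early return, a frozen array would record MAX_SECTOR, which is the "sync completed" signal the message describes as incorrect.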
|
|
If writemostly is enabled, alloc_behind_master_bio() will allocate a new
bio for rdev, with bi_opf set to 0. Later, raid1_write_request() will
clone from this bio, hence bi_opf is still 0 for the cloned bio. Submitting
this cloned bio ends up as a read, so the write data is lost.
Fix this problem by inheriting bi_opf from the original bio for
behind_master_bio.
Fixes: e879a0d9cb08 ("md/raid1,raid10: don't ignore IO flags")
Reported-and-tested-by: Ian Dall <ian@beware.dropbear.id.au>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220507
Link: https://lore.kernel.org/linux-raid/20250903014140.3690499-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
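The effect of a zeroed bi_opf can be modelled in a few lines of userspace C. struct bio_model and both helpers are hypothetical; the only fact taken from the message is that an operation value of 0 is treated as a read.

```c
/* Operation codes for the model: value 0 is a read, as described above. */
enum { MODEL_OP_READ = 0, MODEL_OP_WRITE = 1 };

struct bio_model {
	unsigned int opf;
};

/*
 * Allocate the "behind" bio. Without inherit_opf the operation stays 0
 * (the bug); with it, the original operation is copied (the fix).
 */
static struct bio_model alloc_behind_model(const struct bio_model *orig,
					   int inherit_opf)
{
	struct bio_model behind = { .opf = 0 };

	if (inherit_opf)
		behind.opf = orig->opf;
	return behind;
}

/* A clone copies whatever operation its source carries. */
static struct bio_model clone_model(const struct bio_model *src)
{
	return *src;
}
```

Cloning from a behind bio that was allocated without inheriting the operation produces a read, even though the original request was a write.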
|
|
During raid resync, if a disk becomes faulty, the operation is
briefly interrupted. The MD_RECOVERY_RECOVER flag triggered by
the disk failure causes sync_action to incorrectly show "recover"
instead of "resync". The same issue affects reshape operations.
Reproduction steps:
mdadm -Cv /dev/md1 -l1 -n4 -e1.2 /dev/sd{a..d} // -> resync happened
mdadm -f /dev/md1 /dev/sda // -> resync interrupted
cat sync_action
-> recover
Add progress checks in md_sync_action() for resync/recover/reshape
to ensure the interface correctly reports the actual operation type.
Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-3-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Add a helper for checking if an rdev needs recovery.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-2-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
commit 907a99c314a5 ("md: rename recovery_cp to resync_offset") replaces
recovery_cp with resync_offset in mdp_superblock_s which is in md_p.h.
md_p.h is used in userspace too. So mdadm building fails because of this.
This patch reverts this change.
Fixes: 907a99c314a5 ("md: rename recovery_cp to resync_offset")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250815040028.18085-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
commit 9e59d609763f ("md: call del_gendisk in control path") changes
del_gendisk from being called asynchronously to being called synchronously.
But this breaks the mdadm --assemble command. The assemble command runs like this:
1. create the array
2. stop the array
3. access the sysfs files after stopping
The sync way calls del_gendisk in step 2, so all sysfs files are removed.
To avoid breaking the mdadm assemble command, this patch adds the parameter
legacy_async_del_gendisk, which selects between the two ways. The default
is the async way. We plan to change the default to the sync way in kernel
7.0. Users will then need to upgrade to mdadm 4.5+, which removes step 2.
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/linux-raid/CAMw=ZnQ=ET2St-+hnhsuq34rRPnebqcXqP1QqaHW5Bh4aaaZ4g@mail.gmail.com/T/#t
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20250813032929.54978-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
There's a possible integer overflow in stripe_io_hints if the chunk size
is too large. Test if the overflow happened, and if it did, don't set
limits->io_min and limits->io_opt.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Suggested-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Cc: stable@vger.kernel.org
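One way to express the overflow test is sketched below in userspace C. It assumes the hints are 32-bit byte counts derived from a chunk size in 512-byte sectors; struct io_hints_model and stripe_io_hints_model() are illustrative names, and the kernel patch may use different helpers to detect the overflow.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

struct io_hints_model {
	uint32_t io_min;
	uint32_t io_opt;
};

/*
 * Compute the hints in 64 bits and store them only if both fit in
 * 32 bits. On overflow the hints are left untouched and false is
 * returned.
 */
static bool stripe_io_hints_model(uint32_t chunk_sectors, uint32_t stripes,
				  struct io_hints_model *h)
{
	uint64_t chunk_bytes = (uint64_t)chunk_sectors << SECTOR_SHIFT;
	uint64_t opt_bytes;

	if (chunk_bytes > UINT32_MAX)
		return false;

	/* Both factors fit in 32 bits, so the product fits in 64 bits. */
	opt_bytes = chunk_bytes * stripes;
	if (opt_bytes > UINT32_MAX)
		return false;

	h->io_min = (uint32_t)chunk_bytes;
	h->io_opt = (uint32_t)opt_bytes;
	return true;
}
```

Doing the arithmetic in a wider type keeps the check simple; the alternative is to detect the wrap directly in 32-bit arithmetic.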
|
|
Pull more block updates from Jens Axboe:
- MD pull request via Yu:
- mddev null-ptr-dereference fix, by Erkun
- md-cluster fail to remove the faulty disk regression fix, by
Heming
- minor cleanup, by Li Nan and Jinchao
- mddev lifetime regression fix reported by syzkaller, by Yu Kuai
- NVMe pull request via Christoph
- add support for getting the FDP feature in fabrics passthru path
(Nitesh Shetty)
- add capability to connect to an administrative controller
(Kamaljit Singh)
- fix a leak on sgl setup error (Keith Busch)
- initialize discovery subsys after debugfs is initialized
(Mohamed Khalfella)
- fix various comment typos (Bjorn Helgaas)
- remove unneeded semicolons (Jiapeng Chong)
- nvmet debugfs ordering issue fix
- Fix UAF in the tag_set in zloop
- Ensure sbitmap shallow depth covers entire set
- Reduce lock roundtrips in io context lookup
- Move scheduler tags alloc/free out of elevator and freeze lock, to
fix some lockdep found issues
- Improve robustness of queue limits checking
- Fix a regression with IO priorities, if no io context exists
* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
lib/sbitmap: make sbitmap_get_shallow() internal
lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
nvmet: exit debugfs after discovery subsystem exits
block, bfq: Reorder struct bfq_iocq_bfqq_data
md: make rdev_addable usable for rcu mode
md/raid1: remove struct pool_info and related code
md/raid1: change r1conf->r1bio_pool to a pointer type
block: ensure discard_granularity is zero when discard is not supported
zloop: fix KASAN use-after-free of tag set
block: Fix default IO priority if there is no IO context
nvme: fix various comment typos
nvme-auth: remove unneeded semicolon
nvme-pci: fix leak on sgl setup error
nvmet: initialize discovery subsys after debugfs is initialized
nvme: add capability to connect to an administrative controller
nvmet: add support for FDP in fabrics passthru path
md: rename recovery_cp to resync_offset
md/md-cluster: handle REMOVE message earlier
md: fix create on open mddev lifetime regression
block: fix potential deadlock while running nr_hw_queue update
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- fix checking for request-based stackable devices (dm-table)
- fix corrupt_bio_byte setup checks (dm-flakey)
- add support for resync w/o metadata devices (dm raid)
- small code simplification (dm, dm-mpath, dm-vdo, dm-raid)
- remove support for asynchronous hashes (dm-verity)
- close smatch warning (dm-zoned-target)
- update the documentation and enable inline-crypto passthrough
(dm-thin)
* tag 'for-6.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: set DM_TARGET_PASSES_CRYPTO feature for dm-thin
dm-thin: update the documentation
dm-raid: do not include dm-core.h
vdo: omit need_resched() before cond_resched()
md: dm-zoned-target: Initialize return variable r to avoid uninitialized use
dm-verity: remove support for asynchronous hashes
dm-mpath: don't print the "loaded" message if registering fails
dm-mpath: make dm_unregister_path_selector return void
dm: ima: avoid extra calls to strlen()
dm: Simplify dm_io_complete()
dm: Remove unnecessary return in dm_zone_endio()
dm raid: add support for resync w/o metadata devices
dm-flakey: Fix corrupt_bio_byte setup checks
dm-table: fix checking for rq stackable devices
|
|
Our testcase triggers a panic:
BUG: kernel NULL pointer dereference, address: 00000000000000e0
...
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.16.0+ #94
PREEMPT(none)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
Workqueue: md_misc md_start_sync
RIP: 0010:rdev_addable+0x4d/0xf0
...
Call Trace:
<TASK>
md_start_sync+0x329/0x480
process_one_work+0x226/0x6d0
worker_thread+0x19e/0x340
kthread+0x10f/0x250
ret_from_fork+0x14d/0x180
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in: raid10
CR2: 00000000000000e0
---[ end trace 0000000000000000 ]---
RIP: 0010:rdev_addable+0x4d/0xf0
md_spares_need_change() in md_start_sync() calls rdev_addable() under
rcu_read_lock()/rcu_read_unlock(). The RCU read-side section keeps rdev
from being freed, but rdev->mddev is set to NULL before
synchronize_rcu() is called in md_kick_rdev_from_array(). Fix this by
reading rdev->mddev with READ_ONCE() and checking that it is still non-NULL.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Fixes: 570b9147deb6 ("md: use RCU lock to protect traversal in md_spares_need_change()")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250731114530.776670-1-yangerkun@huawei.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
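The pattern of loading the pointer once and testing it before use can be sketched in userspace C. The structs, the accepts_spares field and READ_ONCE_PTR() are hypothetical; the macro is only a volatile-load approximation of the kernel's READ_ONCE(), and the real rdev_addable() checks different state.

```c
#include <stdbool.h>
#include <stddef.h>

struct mddev_model {
	bool accepts_spares;	/* hypothetical field for this sketch */
};

struct rdev_model {
	struct mddev_model *mddev;	/* may be cleared concurrently */
};

/* Userspace approximation of READ_ONCE() for this pointer: one volatile load. */
#define READ_ONCE_PTR(p) (*(struct mddev_model * volatile *)&(p))

static bool rdev_addable_model(struct rdev_model *rdev)
{
	/* Load the pointer exactly once, then use only the local copy. */
	struct mddev_model *mddev = READ_ONCE_PTR(rdev->mddev);

	if (!mddev)
		return false;	/* rdev was already kicked from the array */

	return mddev->accepts_spares;
}
```

Using the local copy for every later access is the point: re-reading rdev->mddev after the NULL check would reopen the window in which it can become NULL.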
|
|
The struct pool_info was originally introduced mainly to support reshape
operations, serving as a parameter for mempool_init() when raid_disks
changes. Now that mempool_create_kmalloc_pool() is sufficient for this
purpose, struct pool_info and its related code are no longer needed.
Remove struct pool_info and all associated code.
Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Link: https://lore.kernel.org/linux-raid/20250707012711.376844-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
In raid1_reshape(), newpool is a stack variable.
mempool_init() initializes newpool->wait with the stack address.
After assigning newpool to conf->r1bio_pool, the wait queue
needs to be reinitialized, which is not ideal.
Change raid1_conf->r1bio_pool to a pointer type and
replace mempool_init() with mempool_create_kmalloc_pool() to
avoid referencing a stack-based wait queue.
Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Link: https://lore.kernel.org/linux-raid/20250707012711.376844-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implements some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top of Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what it claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters, rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
|
|
dm-thin obviously can pass through inline crypto support.
Signed-off-by: LongPing Wei <weilongping@oppo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
In commit 4cc96131afce ("dm: move request-based code out to dm-rq.[hc]")
we have a note: "DM targets should _never_ include dm-core.h!". And it
is not used in any DM targets except dm-raid now, so let's remove it
from dm-raid for consistency, also use special helpers instead of
accessing dm_table and mapper_device fields directly. This change is
merely a cleanup and should not affect functionality.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
There's no need to call need_resched() because cond_resched() will do
nothing if need_resched() returns false.
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Fix Smatch-detected error:
drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices()
error: uninitialized symbol 'r'.
Smatch detects a possible use of the uninitialized variable 'r' in
dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is
skipped and 'r' is returned without being set, leading to undefined
behavior.
Initialize 'r' to 0 before the loop. This ensures that if there are no
devices to iterate over, the function still returns a defined value.
Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
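The fix can be illustrated with a standalone C model of an iterate-style loop. iterate_fn, iterate_devices_model() and count_device() are hypothetical names; the sketch assumes the loop stops at the first non-zero callback result.

```c
/* Callback type for the model: returns non-zero to stop the iteration. */
typedef int (*iterate_fn)(int dev_index, void *data);

static int iterate_devices_model(int nr_devs, iterate_fn fn, void *data)
{
	int i;
	int r = 0;	/* defined result even when nr_devs is zero */

	for (i = 0; i < nr_devs; i++) {
		r = fn(i, data);
		if (r)
			break;
	}

	return r;
}

/* Example callback: counts the devices it is called for. */
static int count_device(int dev_index, void *data)
{
	(void)dev_index;
	(*(int *)data)++;
	return 0;
}
```

With zero devices the loop body never runs, so the initial value of r is what the caller sees.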
|
|
The support for asynchronous hashes in dm-verity has outlived its
usefulness. It adds significant code complexity and opportunity for
bugs. I don't know of anyone using it in practice. (The original
submitter of the code possibly was, but that was 8 years ago.) Data I
recently collected for en/decryption shows that using off-CPU crypto
"accelerators" is consistently much slower than the CPU
(https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org/),
even on CPUs that lack dedicated cryptographic instructions. Similar
results are likely to be seen for hashing.
I already removed support for asynchronous hashes from fsverity two
years ago, and no one ever complained.
Moreover, neither dm-verity, fsverity, nor fscrypt has ever actually
used the asynchronous crypto algorithms in a truly asynchronous manner.
The lack of interest in such optimizations provides further evidence
that it's only the CPU-based crypto that actually matters.
Historically, it's also been common for people to forget to enable the
optimized SHA-256 code, which could contribute to an off-CPU crypto
engine being perceived as more useful than it really is. In 6.16 I
fixed that: the optimized SHA-256 code is now enabled by default.
Therefore, let's drop the support for asynchronous hashes in dm-verity.
Tested with verity-compat-test.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
'recovery_cp' was used to represent the progress of sync, but its name
contains recovery, which can cause confusion. Replace 'recovery_cp'
with 'resync_offset' for clarity.
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250722033340.1933388-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Commit a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for
HOT_REMOVE_DISK ioctl") introduced a regression in the md_cluster
module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add)
Consider a 2-node cluster:
- node1 issues set-faulty and remove commands on a disk.
- node2 must correctly update the array metadata.
Before a1fd37f97808, on node1, the delay between msg:METADATA_UPDATED
(triggered by faulty) and msg:REMOVE was sufficient for node2 to
reload the disk info (written by node1).
After a1fd37f97808, node1 no longer waits between faulty and remove,
causing it to send msg:REMOVE while node2 is still reloading disk info.
This often results in node2 failing to remove the faulty disk.
== how to trigger ==
set up a 2-node cluster (node1 & node2) with disks vdc & vdd.
on node1:
mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd --assume-clean
ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd
mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc
check array status on both nodes with "mdadm -D /dev/md0".
node1 output:
Number Major Minor RaidDevice State
- 0 0 0 removed
1 254 48 1 active sync /dev/vdd
node2 output:
Number Major Minor RaidDevice State
- 0 0 0 removed
1 254 48 1 active sync /dev/vdd
0 254 32 - faulty /dev/vdc
Fixes: a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/linux-raid/20250728042145.9989-1-heming.zhao@suse.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Commit 9e59d609763f ("md: call del_gendisk in control path") moves
setting MD_DELETED from __mddev_put() to do_md_stop(), however, for the
case create on open, mddev can be freed without do_md_stop():
1) open
md_probe
md_alloc_and_put
md_alloc
mddev_alloc
atomic_set(&mddev->active, 1);
mddev->hold_active = UNTIL_IOCTL
mddev_put
atomic_dec_and_test(&mddev->active)
if (mddev->hold_active)
-> active is 0, hold_active is set
md_open
mddev_get
atomic_inc(&mddev->active);
2) ioctl that is not STOP_ARRAY, for example, GET_ARRAY_INFO:
md_ioctl
mddev->hold_active = 0
3) close
md_release
mddev_put(mddev);
atomic_dec_and_lock(&mddev->active, &all_mddevs_lock)
__mddev_put
-> hold_active is cleared, mddev will be freed
queue_work(md_misc_wq, &mddev->del_work)
Now that MD_DELETED is not set, before mddev is freed by
mddev_delayed_delete(), md_open can still succeed and break mddev
lifetime, causing mddev->kobj refcount underflow or mddev uaf
problem.
Fix this problem by setting MD_DELETED before queuing del_work.
Reported-by: syzbot+9921e319bd6168140b40@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0012.GAE@google.com/
Reported-by: syzbot+fa3a12519f0d3fd4ec16@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0013.GAE@google.com/
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Link: https://lore.kernel.org/linux-raid/20250730073321.2583158-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
|
|
Pull block updates from Jens Axboe:
- MD pull request via Yu:
- call del_gendisk synchronously (Xiao)
- cleanup unused variable (John)
- cleanup workqueue flags (Ryo)
- fix faulty rdev can't be removed during resync (Qixing)
- NVMe pull request via Christoph:
- try PCIe function level reset on init failure (Keith Busch)
- log TLS handshake failures at error level (Maurizio Lombardi)
- pci-epf: do not complete commands twice if nvmet_req_init()
fails (Rick Wertenbroek)
- misc cleanups (Alok Tiwari)
- Removal of the pktcdvd driver
This has been more than a decade coming at this point, and some
recently revealed breakages that had it causing issues even for cases
where it isn't required made me re-pull the trigger on this one. It's
known broken and nobody has stepped up to maintain the code
- Series for ublk supporting batch commands, enabling the use of
multishot where appropriate
- Speed up ublk exit handling
- Fix for the two-stage elevator fixing which could leak data
- Convert NVMe to use the new IOVA based API
- Increase default max transfer size to something more reasonable
- Series fixing write operations on zoned DM devices
- Add tracepoints for zoned block device operations
- Prep series working towards improving blk-mq queue management in the
presence of isolated CPUs
- Don't allow updating of the block size of a loop device that is
currently under exclusive ownership/open
- Set chunk sectors from stacked device stripe size and use it for the
atomic write size limit
- Switch to folios in bcache read_super()
- Fix for CD-ROM MRW exit flush handling
- Various tweaks, fixes, and cleanups
* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
block: restore two stage elevator switch while running nr_hw_queue update
cdrom: Call cdrom_mrw_exit from cdrom_release function
sunvdc: Balance device refcount in vdc_port_mpgroup_check
nvme-pci: try function level reset on init failure
dm: split write BIOs on zone boundaries when zone append is not emulated
block: use chunk_sectors when evaluating stacked atomic write limits
dm-stripe: limit chunk_sectors to the stripe size
md/raid10: set chunk_sectors limit
md/raid0: set chunk_sectors limit
block: sanitize chunk_sectors for atomic write limits
ilog2: add max_pow_of_two_factor()
nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
nvme-tcp: log TLS handshake failures at error level
docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
nvme: fix typo in status code constant for self-test in progress
nvmet: remove redundant assignment of error code in nvmet_ns_enable()
nvme: fix incorrect variable in io cqes error message
nvme: fix multiple spelling and grammar issues in host drivers
block: fix blk_zone_append_update_request_bio() kernel-doc
md/raid10: fix set but not used variable in sync_request_write()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs 'protection info' updates from Christian Brauner:
"This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and
protection info (PI) capabilities. This ioctl returns information
about the file's integrity profile. This is useful for userspace
applications to understand a file's end-to-end data protection support
and configure the I/O accordingly.
For now this interface is only supported by block devices. However the
design and placement of this ioctl in generic FS ioctl space allows us
to extend it to work over files as well. This may be useful when
filesystems start supporting PI-aware layouts.
A new structure struct logical_block_metadata_cap is introduced, which
contains the following fields:
- lbmd_flags:
bitmask of logical block metadata capability flags
- lbmd_interval:
the amount of data described by each unit of logical block metadata
- lbmd_size:
size in bytes of the logical block metadata associated with each
interval
- lbmd_opaque_size:
size in bytes of the opaque block tag associated with each interval
- lbmd_opaque_offset:
offset in bytes of the opaque block tag within the logical block
metadata
- lbmd_pi_size:
size in bytes of the T10 PI tuple associated with each interval
- lbmd_pi_offset:
offset in bytes of T10 PI tuple within the logical block metadata
- lbmd_pi_guard_tag_type:
T10 PI guard tag type
- lbmd_pi_app_tag_size:
size in bytes of the T10 PI application tag
- lbmd_pi_ref_tag_size:
size in bytes of the T10 PI reference tag
- lbmd_pi_storage_tag_size:
size in bytes of the T10 PI storage tag
The internal logic to fetch the capability is encapsulated in a helper
function blk_get_meta_cap(), which uses the blk_integrity profile
associated with the device. The ioctl returns -EOPNOTSUPP if
CONFIG_BLK_DEV_INTEGRITY is not enabled"
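To make the relationship between the fields listed above concrete, here is a minimal Python sketch that models a subset of them. It does not call the ioctl. The class name `LbmdCap`, both helper functions, and all numeric values are hypothetical and chosen only for illustration; the field names are taken from the description above, and the real field widths are not modelled.

```python
from dataclasses import dataclass


@dataclass
class LbmdCap:
    """Hypothetical user-space model of a few logical_block_metadata_cap fields."""
    lbmd_interval: int       # bytes of data described by one unit of metadata
    lbmd_size: int           # bytes of metadata per interval
    lbmd_opaque_size: int    # bytes of opaque block tag per interval
    lbmd_opaque_offset: int  # offset of the opaque tag within the metadata
    lbmd_pi_size: int        # bytes of T10 PI tuple per interval
    lbmd_pi_offset: int      # offset of the PI tuple within the metadata


def metadata_bytes(cap: LbmdCap, io_bytes: int) -> int:
    """Metadata needed for an I/O that covers a whole number of intervals."""
    if cap.lbmd_interval <= 0 or io_bytes % cap.lbmd_interval:
        raise ValueError("I/O size must be a multiple of lbmd_interval")
    return (io_bytes // cap.lbmd_interval) * cap.lbmd_size


def layout_is_consistent(cap: LbmdCap) -> bool:
    """Both the opaque tag and the PI tuple must fit inside the metadata."""
    opaque_end = cap.lbmd_opaque_offset + cap.lbmd_opaque_size
    pi_end = cap.lbmd_pi_offset + cap.lbmd_pi_size
    return opaque_end <= cap.lbmd_size and pi_end <= cap.lbmd_size


# Hypothetical device: 4096-byte intervals, 64 bytes of metadata each,
# with a 16-byte PI tuple placed at the end of the metadata.
cap = LbmdCap(lbmd_interval=4096, lbmd_size=64,
              lbmd_opaque_size=48, lbmd_opaque_offset=0,
              lbmd_pi_size=16, lbmd_pi_offset=48)
print(layout_is_consistent(cap))        # True
print(metadata_bytes(cap, 1 << 20))     # 16384
```

An application that supplies its own protection information would size its metadata buffer from lbmd_interval and lbmd_size, and place the PI tuple at lbmd_pi_offset within each unit of metadata.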
* tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP
block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl()
fs: add ioctl to query metadata and protection info capabilities
nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE
block: introduce pi_tuple_size field in blk_integrity
block: rename tuple_size field in blk_integrity to metadata_size
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fallocate updates from Christian Brauner:
"fallocate() currently supports creating preallocated files
efficiently. However, on most filesystems fallocate() will preallocate
blocks in an unwritten state even if FALLOC_FL_ZERO_RANGE is specified.
The extent state must later be converted to a written state when the
user writes data into this range, which can trigger numerous metadata
changes and journal I/O. This may lead to significant write
amplification and performance degradation in synchronous write mode.
At the moment, the only method to avoid this is to create an empty
file and write zero data into it (for example, using 'dd' with a large
block size). However, this method is slow and consumes a considerable
amount of disk bandwidth.
Now that more and more flash-based storage devices are available, it is
possible to efficiently write zeros to SSDs using the unmap write
zeroes command if the devices do not write physical zeroes to the
media.
For example, if SCSI SSDs support the UNMAP bit or NVMe SSDs support
the DEAC bit[1], the write zeroes command does not write actual data
to the device. Instead, NVMe converts the zeroed range to a
deallocated state, which works fast and consumes almost no disk write
bandwidth.
This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
device-mapper drivers, and adds the FALLOC_FL_WRITE_ZEROES and
STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.
fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES
flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a
way that subsequent writes to that range do not require further
changes to the file mapping metadata. This flag is beneficial for
subsequent pure overwriting within this range, as it can save on block
allocation and, consequently, significant metadata changes"
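As a rough illustration of how an application might use the new flag, the following Python sketch calls fallocate() through ctypes and falls back to writing zero-filled buffers when the call fails. The helper name `zero_range` is hypothetical. The numeric value 0x80 for FALLOC_FL_WRITE_ZEROES is an assumption and should be checked against the installed <linux/falloc.h>. The sketch also assumes a 64-bit Linux system, where off_t is 64 bits wide.

```python
import ctypes
import os

# Assumed value; verify against <linux/falloc.h> on the target system.
FALLOC_FL_WRITE_ZEROES = 0x80


def zero_range(fd: int, offset: int, length: int) -> str:
    """Zero [offset, offset + length) and report which method was used."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fallocate = libc.fallocate
    except (OSError, AttributeError, TypeError):
        fallocate = None  # no usable libc fallocate() on this platform

    if fallocate is not None:
        fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                              ctypes.c_int64, ctypes.c_int64]
        fallocate.restype = ctypes.c_int
        if fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, length) == 0:
            return "write_zeroes"
        # Simplification: any failure (typically EOPNOTSUPP on kernels,
        # filesystems, or devices without support) takes the slow path.
        # Real code should inspect ctypes.get_errno() first.

    # Slow path described above: write zero data explicitly.
    chunk = bytes(1 << 20)
    done = 0
    while done < length:
        n = min(len(chunk), length - done)
        done += os.pwrite(fd, chunk[:n], offset + done)
    return "fallback"
```

Either path leaves the range reading back as zeroes. The difference is that the fallocate() path lets the kernel use the device's unmap write zeroes command where supported, while the fallback consumes write bandwidth for every byte.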
* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ext4: add FALLOC_FL_WRITE_ZEROES support
block: add FALLOC_FL_WRITE_ZEROES support
block: factor out common part in blkdev_fallocate()
fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
dm: clear unmap write zeroes limits when disabling write zeroes
scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
nvmet: set WZDS and DRB if device enables unmap write zeroes operation
nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.17/block
Pull MD updates from Yu:
"- call del_gendisk synchronously, from Xiao
- cleanup unused variable, from John
- cleanup workqueue flags, from Ryo
- fix faulty rdev can't be removed during resync, from Qixing"
* tag 'md-6.17-20250722' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md/raid10: fix set but not used variable in sync_request_write()
md: allow removing faulty rdev during resync
md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueue
md: remove/add redundancy group only in level change
md: Don't clear MD_CLOSING until mddev is freed
md: call del_gendisk in control path
|
|
Commit 2df7168717b7 ("dm: Always split write BIOs to zoned device
limits") updates the device-mapper driver to perform splits for the
write BIOs. However, it did not address the cases where DM targets do
not emulate zone append, such as in the cases of dm-linear or dm-flakey.
For these targets, when the write BIOs span across zone boundaries, they
trigger WARN_ON_ONCE(bio_straddles_zones(bio)) in
blk_zone_wplug_handle_write(). This results in I/O errors. The errors
are reproduced by running blktests test case zbd/004 using zoned
dm-linear or dm-flakey devices.
To avoid the I/O errors, handle the write BIOs regardless of whether DM
targets emulate zone append or not, so that all write BIOs are split at
zone boundaries. For that purpose, drop the check for zone append
emulation in dm_zone_bio_needs_split(). Its argument 'md' is then no
longer used, so drop it as well.
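The effect of splitting at zone boundaries can be sketched with a small Python model. This is not the kernel implementation, which splits BIOs incrementally; `split_at_zone_boundaries` is a hypothetical helper that only shows which pieces a write ends up as, assuming a fixed zone size in sectors.

```python
def split_at_zone_boundaries(start, nr_sectors, zone_sectors):
    """Split a write into (start, length) pieces that never cross a zone."""
    if zone_sectors <= 0 or nr_sectors < 0 or start < 0:
        raise ValueError("invalid sector range or zone size")
    pieces = []
    while nr_sectors > 0:
        # Sectors left before the end of the zone containing 'start'.
        room = zone_sectors - (start % zone_sectors)
        n = min(nr_sectors, room)
        pieces.append((start, n))
        start += n
        nr_sectors -= n
    return pieces


# A 100-sector write starting 24 sectors before the end of a 1024-sector zone.
print(split_at_zone_boundaries(1000, 100, 1024))   # [(1000, 24), (1024, 76)]
```

A write that already lies within a single zone comes back as one piece, so the split costs nothing in the common case.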
Fixes: 2df7168717b7 ("dm: Always split write BIOs to zoned device limits")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Link: https://lore.kernel.org/r/20250717103539.37279-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Same as done for raid0, set chunk_sectors limit to appropriately set the
atomic write size limit.
Setting chunk_sectors limit in this way overrides the stacked limit
already calculated based on the bottom device limits. This is ok, as
when any bios are sent to the bottom devices, the block layer will still
respect the bottom device chunk_sectors.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Same as done for raid0, set chunk_sectors limit to appropriately set the
atomic write size limit.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we use min io size as the chunk size when deciding on the
atomic write size limits - see blk_stack_atomic_writes_head().
The min_io limit is not a reliable place to store the chunk size, as
it may be mutated by the block stacking code. For example, when the
min io size is less than the physical block size, it is raised to the
physical block size - see blk_stack_limits().
The block stacking limits will rely on chunk_sectors in future,
so set this value (to the chunk size).
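The problem can be illustrated with a simplified Python model of the one adjustment described above. It is a sketch, not the logic of blk_stack_limits(): both function names and the numbers are hypothetical, and the only behaviour modelled is min io being raised to the physical block size.

```python
SECTOR_SIZE = 512  # bytes per sector


def stacked_io_min(io_min, physical_block_size):
    """Model: a min io below the physical block size is raised to it."""
    return max(io_min, physical_block_size)


def chunk_sectors_for(chunk_bytes):
    """chunk_sectors holds the chunk size in 512-byte sectors."""
    return chunk_bytes // SECTOR_SIZE


chunk_bytes = 2048  # hypothetical chunk size, smaller than the physical block
io_min = stacked_io_min(chunk_bytes, physical_block_size=4096)

print(io_min)                          # 4096: no longer the chunk size
print(chunk_sectors_for(chunk_bytes))  # 4: chunk size kept as set
```

Once stacking has raised min io, reading the chunk size back from it gives the wrong value, whereas a dedicated chunk_sectors limit is set directly from the chunk size.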
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Building with W=1 reports the following:
drivers/md/raid10.c: In function ‘sync_request_write’:
drivers/md/raid10.c:2441:21: error: variable ‘d’ set but not used [-Werror=unused-but-set-variable]
2441 | int d;
| ^
cc1: all warnings being treated as errors
Remove the usage of that variable.
Fixes: 752d0464b78a ("md: clean up accounting for issued sync IO")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/linux-raid/20250709104814.2307276-1-john.g.garry@oracle.com
Signed-off-by: Yu Kuai <yukuai@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mikulas Patocka:
- dm-bufio: fix scheduling in atomic
* tag 'for-6.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-bufio: fix sched in atomic context
|
|
During RAID resync, a faulty rdev cannot be removed, and attempting hot
removal results in a "Device or resource busy" error.
Reproduction steps:
mdadm -Cv /dev/md0 -l1 -n3 -e1.2 /dev/sd{b..d}
mdadm /dev/md0 -f /dev/sdb
mdadm /dev/md0 -r /dev/sdb
-> mdadm: hot remove failed for /dev/sdb: Device or resource busy
After commit 4b10a3bc67c1 ("md: ensure resync is prioritized over
recovery"), when a device becomes faulty during resync, the
md_choose_sync_action() function returns early without calling
remove_and_add_spares(), preventing faulty device removal.
This patch extracts a helper function remove_spares() to support
removing faulty devices during RAID resync operations.
Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250707075412.150301-1-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
When specified with WQ_CPU_INTENSIVE, the |