summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
9 daysstring: add mem_is_zero() helper to check if memory area is all zerosJani Nikula1-0/+12
[ Upstream commit 3942bb49728ad9e1f94d953a88af169a8f5d8099 ] Almost two thirds of the memchr_inv() usages check if the memory area is all zeros, with no interest in where in the buffer the first non-zero byte is located. Checking for !memchr_inv(s, 0, n) is also not very intuitive or discoverable. Add an explicit mem_is_zero() helper for this use case. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Andy Shevchenko <andy@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240814100035.3100852-1-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com> Stable-dep-of: 3e6ccd790ed6 ("gpio: cdev: check if uAPI v2 config attributes are correctly zeroed") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayscdrom, scsi: sr: propagate read-only status to block layer via set_disk_ro()Daan De Meyer1-0/+1
[ Upstream commit 0898a817621a2f0cddca8122d9b974003fe5036d ] The cdrom core never calls set_disk_ro() for a registered device, so BLKROGET on a CD-ROM device always returns 0 (writable), even when the drive has no write capabilities and writes will inevitably fail. This causes problems for userspace that relies on BLKROGET to determine whether a block device is read-only. For example, systemd's loop device setup uses BLKROGET to decide whether to create a loop device with LO_FLAGS_READ_ONLY. Without the read-only flag, writes pass through the loop device to the CD-ROM and fail with I/O errors. systemd-fsck similarly checks BLKROGET to decide whether to run fsck in no-repair mode (-n). The write-capability bits in cdi->mask come from two different sources: CDC_DVD_RAM and CDC_CD_RW are populated by the driver from the MODE SENSE capabilities page (page 0x2A) before register_cdrom() is called, while CDC_MRW_W and CDC_RAM require the MMC GET CONFIGURATION command and were only probed by cdrom_open_write() at device open time. This meant that any attempt to compute the writable state from the full mask at probe time was incorrect, because the GET CONFIGURATION bits were still unset (and cdi->mask is initialized such that capabilities are assumed present). Fix this by factoring the GET CONFIGURATION probing out of cdrom_open_write() into a new exported helper, cdrom_probe_write_features(), and having sr call it from sr_probe() right after get_capabilities() has populated the MODE SENSE bits. register_cdrom() then calls set_disk_ro() based on the full write-capability mask (CDC_DVD_RAM | CDC_MRW_W | CDC_RAM | CDC_CD_RW) so the block layer reflects the drive's actual write support. The feature queries used (CDF_MRW and CDF_RWRT via GET CONFIGURATION with RT=00) report drive-level capabilities that are persistent across media, so a single probe before register_cdrom() is sufficient and the redundant probe at open time is dropped. With set_disk_ro() now accurate, the long-vestigial cd->writeable flag in sr can go: get_capabilities() used to set cd->writeable based on the same four mask bits, but because CDC_MRW_W and CDC_RAM default to "capability present" in cdi->mask and aren't touched by MODE SENSE, the condition that gated cd->writeable was always true, making it unconditionally 1. Replace the corresponding gate in sr_init_command() with get_disk_ro(cd->disk), which turns a previously no-op check into a real one and also catches kernel-internal bio writers that bypass blkdev_write_iter()'s bdev_read_only() check. The sd driver (SCSI disks) does not have this problem because it checks the MODE SENSE Write Protect bit and calls set_disk_ro() accordingly. The sr driver cannot use the same approach because the MMC specification does not define the WP bit in the MODE SENSE device-specific parameter byte for CD-ROM devices. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Daan De Meyer <daan@amutable.com> Reviewed-by: Phillip Potter <phil@philpotter.co.uk> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Link: https://patch.msgid.link/20260427210139.1400-2-phil@philpotter.co.uk Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysrtc: introduce features bitfieldAlexandre Belloni1-0/+2
[ Upstream commit 7ae41220ef5831674f446baef19bfe1b31358260 ] Introduce a bitfield to allow the drivers to announce the available features for an RTC. The main use case would be to better handle alarms, that could be present or not or have a minute resolution or may need a correct week day to be set. Use the newly introduced RTC_FEATURE_ALARM bit to then test whether alarms are available instead of relying on the presence of ops->set_alarm. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20210110231752.1418816-2-alexandre.belloni@bootlin.com Stable-dep-of: 0fedce7244e4 ("rtc: abx80x: Disable alarm feature if no interrupt attached") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayspppoe: drop PFC framesQingfang Deng1-0/+16
[ Upstream commit cc1ff87bce1ccd38410ab10960f576dcd17db679 ] RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT RECOMMENDED for PPPoE. In practice, pppd does not support negotiating PFC for PPPoE sessions, and the current PPPoE driver assumes an uncompressed (2-byte) protocol field. However, the generic PPP layer function ppp_input() is not aware of the negotiation result, and still accepts PFC frames. If a peer with a broken implementation or an attacker sends a frame with a compressed (1-byte) protocol field, the subsequent PPP payload is shifted by one byte. This causes the network header to be 4-byte misaligned, which may trigger unaligned access exceptions on some architectures. To reduce the attack surface, drop PPPoE PFC frames. Introduce ppp_skb_is_compressed_proto() helper function to be used in both ppp_generic.c and pppoe.c to avoid open-coding. Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer") Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysflow_dissector: Add PPPoE dissectorsWojciech Drewek1-0/+14
[ Upstream commit 46126db9c86110e5fc1e369b9bb89735ddefdae4 ] Allow to dissect PPPoE specific fields which are: - session ID (16 bits) - ppp protocol (16 bits) - type (16 bits) - this is PPPoE ethertype, for now only ETH_P_PPP_SES is supported, possible ETH_P_PPP_DISC in the future The goal is to make the following TC command possible: # tc filter add dev ens6f0 ingress prio 1 protocol ppp_ses \ flower \ pppoe_sid 12 \ ppp_proto ip \ action drop Note that only PPPoE Session is supported. Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Stable-dep-of: cc1ff87bce1c ("pppoe: drop PFC frames") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayslib/hexdump: print_hex_dump_bytes() calls print_hex_dump_debug()Geert Uytterhoeven1-2/+3
[ Upstream commit 36776b7f8a8955b4e75b5d490a75fee0c7a2a7ef ] print_hex_dump_bytes() claims to be a simple wrapper around print_hex_dump(), but it actally calls print_hex_dump_debug(), which means no output is printed if (dynamic) DEBUG is disabled. Update the documentation to match the implementation. Fixes: 091cb0994edd20d6 ("lib/hexdump: make print_hex_dump_bytes() a nop on !DEBUG builds") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/3d5c3069fd9102ecaf81d044b750cd613eb72a08.1774970392.git.geert+renesas@glider.be Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysdev_printk: add new dev_err_probe() helpersNuno Sa1-0/+8
[ Upstream commit dbbe7eaf0e4795bf003ac06872aaf52b6b6b1310 ] This is similar to dev_err_probe() but for cases where an ERR_PTR() or ERR_CAST() is to be returned simplifying patterns like: dev_err_probe(dev, ret, ...); return ERR_PTR(ret) or dev_err_probe(dev, PTR_ERR(ptr), ...); return ERR_CAST(ptr) Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240606-dev-add_dev_errp_probe-v3-1-51bb229edd79@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Stable-dep-of: 797cc011ae02 ("backlight: sky81452-backlight: Check return value of devm_gpiod_get_optional() in sky81452_bl_parse_dt()") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysdriver core: Move dev_err_probe() to where it belogsAndy Shevchenko2-2/+2
[ Upstream commit 9e0cace7a6254070159ebd86497eadc29ea307ca ] dev_err_probe() belongs to the printing API, hence move the definition from device.h to dev_printk.h. There is no change to the callers at all, since: 1) implementation is located in the same core.c; 2) dev_printk.h is guaranteed to be included by device.h. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Andi Shyti <andi.shyti@kernel.org> Link: https://lore.kernel.org/r/20230721131309.16821-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 797cc011ae02 ("backlight: sky81452-backlight: Check return value of devm_gpiod_get_optional() in sky81452_bl_parse_dt()") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysdriver core: device.h: remove extern from function prototypesGreg Kroah-Hartman1-2/+1
[ Upstream commit f43243c66e5e9ad839d235f82a58e73a7e7612af ] The kernel coding style does not require 'extern' in function prototypes in .h files, so remove them from include/linux/device.h as they are not needed. Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230324122711.2664537-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 797cc011ae02 ("backlight: sky81452-backlight: Check return value of devm_gpiod_get_optional() in sky81452_bl_parse_dt()") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysquota: Fix race of dquot_scan_active() with quota deactivationJan Kara1-8/+1
[ Upstream commit e93ab401da4b2e2c1b8ef2424de2f238d51c8b2d ] dquot_scan_active() can race with quota deactivation in quota_release_workfn() like: CPU0 (quota_release_workfn) CPU1 (dquot_scan_active) ============================== ============================== spin_lock(&dq_list_lock); list_replace_init( &releasing_dquots, &rls_head); /* dquot X on rls_head, dq_count == 0, DQ_ACTIVE_B still set */ spin_unlock(&dq_list_lock); synchronize_srcu(&dquot_srcu); spin_lock(&dq_list_lock); list_for_each_entry(dquot, &inuse_list, dq_inuse) { /* finds dquot X */ dquot_active(X) -> true atomic_inc(&X->dq_count); } spin_unlock(&dq_list_lock); spin_lock(&dq_list_lock); dquot = list_first_entry(&rls_head); WARN_ON_ONCE(atomic_read(&dquot->dq_count)); The problem is not only a cosmetic one as under memory pressure the caller of dquot_scan_active() can end up working on freed dquot. Fix the problem by making sure the dquot is removed from releasing list when we acquire a reference to it. Fixes: 869b6ea1609f ("quota: Fix slow quotaoff") Reported-by: Sam Sun <samsun1006219@gmail.com> Link: https://lore.kernel.org/all/CAEkJfYPTt3uP1vAYnQ5V2ZWn5O9PLhhGi5HbOcAzyP9vbXyjeg@mail.gmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysfirmware: dmi: Correct an indexing error in dmi.hMario Limonciello (AMD)1-0/+5
[ Upstream commit c064abc68e009d2cc18416e7132d9c25e03125b6 ] The entries later in enum dmi_entry_type don't match the SMBIOS specification¹. The entry for type 33: `64-Bit Memory Error Information` is not present and thus the index for all later entries is incorrect. Add it. Also, add missing entry types 43-46, while at it. ¹ Search for "System Management BIOS (SMBIOS) Reference Specification" [ bp: Drop the flaky SMBIOS spec URL. ] Fixes: 93c890dbe5287 ("firmware: Add DMI entry types to the headers") Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Jean Delvare <jdelvare@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://patch.msgid.link/20260307141024.819807-2-superm1@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayslocking: Fix rwlock support in <linux/spinlock_up.h>Bart Van Assche1-10/+10
[ Upstream commit 756a0e011cfca0b45a48464aa25b05d9a9c2fb0b ] Architecture support for rwlocks must be available whether or not CONFIG_DEBUG_SPINLOCK has been defined. Move the definitions of the arch_{read,write}_{lock,trylock,unlock}() macros such that these become visbile if CONFIG_DEBUG_SPINLOCK=n. This patch prepares for converting do_raw_{read,write}_trylock() into inline functions. Without this patch that conversion triggers a build failure for UP architectures, e.g. arm-ep93xx. I used the following kernel configuration to build the kernel for that architecture: CONFIG_ARCH_MULTIPLATFORM=y CONFIG_ARCH_MULTI_V7=n CONFIG_ATAGS=y CONFIG_MMU=y CONFIG_ARCH_MULTI_V4T=y CONFIG_CPU_LITTLE_ENDIAN=y CONFIG_ARCH_EP93XX=y Fixes: fb1c8f93d869 ("[PATCH] spinlock consolidation") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260313171510.230998-2-bvanassche@acm.org Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysfanotify: fix false positive on permission eventsMiklos Szeredi1-0/+1
commit 7746e3bd4cc19b5092e00d32d676e329bfcb6900 upstream. fsnotify_get_mark_safe() may return false for a mark on an unrelated group, which results in bypassing the permission check. Fix by skipping over detached marks that are not in the current group. CC: stable@vger.kernel.org Fixes: abc77577a669 ("fsnotify: Provide framework for dropping SRCU lock in ->handle_event") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://patch.msgid.link/20260410144950.156160-1-mszeredi@redhat.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysdriver core: Add kernel-doc for DEV_FLAG_COUNT enum valueDouglas Anderson1-0/+1
commit 5b484311507b5d403c1f7a45f6aa3778549e268b upstream. Even though nobody should use this value (except when declaring the "flags" bitmap), kernel-doc still gets upset that it's not documented. It reports: WARNING: ../include/linux/device.h:519 Enum value 'DEV_FLAG_COUNT' not described in enum 'struct_device_flags' Add the description of DEV_FLAG_COUNT. Fixes: a2225b6e834a ("driver core: Don't let a device probe until it's ready") Reported-by: Randy Dunlap <rdunlap@infradead.org> Closes: https://lore.kernel.org/f318cd43-81fd-48b9-abf7-92af85f12f91@infradead.org Signed-off-by: Douglas Anderson <dianders@chromium.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20260413195910.1.I23aca74fe2d3636a47df196a80920fecb2643220@changeid Signed-off-by: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daystpm: avoid -Wunused-but-set-variableArnd Bergmann1-2/+7
commit 6f1d4d2ecfcd1b577dc87350ea965fe81f272e83 upstream. Outside of the EFI tpm code, the TPM_MEMREMAP()/TPM_MEMUNMAP functions are defined as trivial macros, leading to the mapping_size variable ending up unused: In file included from drivers/char/tpm/tpm-sysfs.c:16: In file included from drivers/char/tpm/tpm.h:28: include/linux/tpm_eventlog.h:167:6: error: variable 'mapping_size' set but not used [-Werror,-Wunused-but-set-variable] 167 | int mapping_size; Turn the stubs into inline functions to avoid this warning. Cc: stable@vger.kernel.org # v5.3+ Fixes: c46f3405692d ("tpm: Reserve the TPM final events table") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysdriver core: Don't let a device probe until it's readyDouglas Anderson1-0/+44
[ Upstream commit a2225b6e834a838ae3c93709760edc0a169eb2f2 ] The moment we link a "struct device" into the list of devices for the bus, it's possible probe can happen. This is because another thread can load the driver at any time and that can cause the device to probe. This has been seen in practice with a stack crawl that looks like this [1]: really_probe() __driver_probe_device() driver_probe_device() __driver_attach() bus_for_each_dev() driver_attach() bus_add_driver() driver_register() __platform_driver_register() init_module() [some module] do_one_initcall() do_init_module() load_module() __arm64_sys_finit_module() invoke_syscall() As a result of the above, it was seen that device_links_driver_bound() could be called for the device before "dev->fwnode->dev" was assigned. This prevented __fw_devlink_pickup_dangling_consumers() from being called which meant that other devices waiting on our driver's sub-nodes were stuck deferring forever. It's believed that this problem is showing up suddenly for two reasons: 1. Android has recently (last ~1 year) implemented an optimization to the order it loads modules [2]. When devices opt-in to this faster loading, modules are loaded one-after-the-other very quickly. This is unlike how other distributions do it. The reproduction of this problem has only been seen on devices that opt-in to Android's "parallel module loading". 2. Android devices typically opt-in to fw_devlink, and the most noticeable issue is the NULL "dev->fwnode->dev" in device_links_driver_bound(). fw_devlink is somewhat new code and also not in use by all Linux devices. Even though the specific symptom where "dev->fwnode->dev" wasn't assigned could be fixed by moving that assignment higher in device_add(), other parts of device_add() (like the call to device_pm_add()) are also important to run before probe. Only moving the "dev->fwnode->dev" assignment would likely fix the current symptoms but lead to difficult-to-debug problems in the future. Fix the problem by preventing probe until device_add() has run far enough that the device is ready to probe. If somehow we end up trying to probe before we're allowed, __driver_probe_device() will return -EPROBE_DEFER which will make certain the device is noticed. In the race condition that was seen with Android's faster module loading, we will temporarily add the device to the deferred list and then take it off immediately when device_add() probes the device. Instead of adding another flag to the bitfields already in "struct device", instead add a new "flags" field and use that. This allows us to freely change the bit from different thread without worrying about corrupting nearby bits (and means threads changing other bit won't corrupt us). [1] Captured on a machine running a downstream 6.6 kernel [2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libmodprobe/libmodprobe.cpp?q=LoadModulesParallel Cc: stable@vger.kernel.org Fixes: 2023c610dc54 ("Driver core: add new device to bus's list before probing") Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://patch.msgid.link/20260406162231.v5.1.Id750b0fbcc94f23ed04b7aecabcead688d0d8c17@changeid Signed-off-by: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayspadata: Remove comment for reorder_workHerbert Xu1-1/+0
[ Upstream commit 82a0302e7167d0b7c6cde56613db3748f8dd806d ] Remove comment for reorder_work which no longer exists. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 71203f68c774 ("padata: Fix pd UAF once and for all") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Bin Lan <lanbincn@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayspadata: Fix pd UAF once and for allHerbert Xu1-3/+0
[ Upstream commit 71203f68c7749609d7fc8ae6ad054bdedeb24f91 ] There is a race condition/UAF in padata_reorder that goes back to the initial commit. A reference count is taken at the start of the process in padata_do_parallel, and released at the end in padata_serial_worker. This reference count is (and only is) required for padata_replace to function correctly. If padata_replace is never called then there is no issue. In the function padata_reorder which serves as the core of padata, as soon as padata is added to queue->serial.list, and the associated spin lock released, that padata may be processed and the reference count on pd would go away. Fix this by getting the next padata before the squeue->serial lock is released. In order to make this possible, simplify padata_reorder by only calling it once the next padata arrives. Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [ Adjust context of padata_find_next(). Replace cpumask_next_wrap(cpu, pd->cpumask.pcpu) with cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false) in padata_reorder() in v5.10 according to dc5bb9b769c9 ("cpumask: deprecate cpumask_next_wrap()") and f954a2d37637 ("padata: switch padata_find_next() to using cpumask_next_wrap()") . ] Signed-off-by: Bin Lan <lanbincn@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysusb: xhci: Make usb_host_endpoint.hcpriv survive endpoint_disable()Michal Pecio1-1/+2
commit 25e531b422dc2ac90cdae3b6e74b5cdeb081440d upstream. xHCI hardware maintains its endpoint state between add_endpoint() and drop_endpoint() calls followed by successful check_bandwidth(). So does the driver. Core may call endpoint_disable() during xHCI endpoint life, so don't clear host_ep->hcpriv then, because this breaks endpoint_reset(). If a driver calls usb_set_interface(), submits URBs which make host sequence state non-zero and calls usb_clear_halt(), the device clears its sequence state but xhci_endpoint_reset() bails out. The next URB malfunctions: USB2 loses one packet, USB3 gets Transaction Error or may not complete at all on some (buggy?) HCs from ASMedia and AMD. This is triggered by uvcvideo on bulk video devices. The code was copied from ehci_endpoint_disable() but it isn't needed here - hcpriv should only be NULL on emulated root hub endpoints. It might prevent resetting and inadvertently enabling a disabled and dropped endpoint, but core shouldn't try to reset dropped endpoints. Document xhci requirements regarding hcpriv. They are currently met. Fixes: 18b74067ac78 ("xhci: Fix use-after-free regression in xhci clear hub TT implementation") Cc: stable@vger.kernel.org Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://patch.msgid.link/20260402131342.2628648-26-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysx86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov1-0/+1
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ] This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysKVM: x86: Use scratch field in MMIO fragment to hold small write valuesSean Christopherson1-1/+2
commit 0b16e69d17d8c35c5c9d5918bf596c75a44655d3 upstream. When exiting to userspace to service an emulated MMIO write, copy the to-be-written value to a scratch field in the MMIO fragment if the size of the data payload is 8 bytes or less, i.e. can fit in a single chunk, instead of pointing the fragment directly at the source value. This fixes a class of use-after-free bugs that occur when the emulator initiates a write using an on-stack, local variable as the source, the write splits a page boundary, *and* both pages are MMIO pages. Because KVM's ABI only allows for physically contiguous MMIO requests, accesses that split MMIO pages are separated into two fragments, and are sent to userspace one at a time. When KVM attempts to complete userspace MMIO in response to KVM_RUN after the first fragment, KVM will detect the second fragment and generate a second userspace exit, and reference the on-stack variable. The issue is most visible if the second KVM_RUN is performed by a separate task, in which case the stack of the initiating task can show up as truly freed data. ================================================================== BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420 Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984 CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0xbe/0xfd print_address_description.constprop.0+0x19/0x170 __kasan_report.cold+0x6c/0x84 kasan_report+0x3a/0x50 check_memory_region+0xfd/0x1f0 memcpy+0x20/0x60 complete_emulated_mmio+0x305/0x420 kvm_arch_vcpu_ioctl_run+0x63f/0x6d0 kvm_vcpu_ioctl+0x413/0xb20 __se_sys_ioctl+0x111/0x160 do_syscall_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 RIP: 0033:0x42477d Code: <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007faa8e6890e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000004d7338 RCX: 000000000042477d RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005 RBP: 00000000004d7330 R08: 00007fff28d546df R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d733c R13: 0000000000000000 R14: 000000000040a200 R15: 00007fff28d54720 The buggy address belongs to the page: page:0000000029f6a428 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9c37 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000000 0000000000000000 ffffea0000270dc8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888009c37780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888009c37800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff888009c37880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888009c37900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888009c37980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== The bug can also be reproduced with a targeted KVM-Unit-Test by hacking KVM to fill a large on-stack variable in complete_emulated_mmio(), i.e. by overwrite the data value with garbage. Limit the use of the scratch fields to 8-byte or smaller accesses, and to just writes, as larger accesses and reads are not affected thanks to implementation details in the emulator, but add a sanity check to ensure those details don't change in the future. Specifically, KVM never uses on-stack variables for accesses larger that 8 bytes, e.g. uses an operand in the emulator context, and *all* reads are buffered through the mem_read cache. Note! Using the scratch field for reads is not only unnecessary, it's also extremely difficult to handle correctly. As above, KVM buffers all reads through the mem_read cache, and heavily relies on that behavior when re-emulating the instruction after a userspace MMIO read exit. If a read splits a page, the first page is NOT an MMIO page, and the second page IS an MMIO page, then the MMIO fragment needs to point at _just_ the second chunk of the destination, i.e. its position in the mem_read cache. Taking the "obvious" approach of copying the fragment value into the destination when re-emulating the instruction would clobber the first chunk of the destination, i.e. would clobber the data that was read from guest memory. Fixes: f78146b0f923 ("KVM: Fix page-crossing MMIO") Suggested-by: Yashu Zhang <zhangjiaji1@huawei.com> Reported-by: Yashu Zhang <zhangjiaji1@huawei.com> Closes: https://lore.kernel.org/all/369eaaa2b3c1425c85e8477066391bc7@huawei.com Cc: stable@vger.kernel.org Tested-by: Tom Lendacky <thomas.lendacky@gmail.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://patch.msgid.link/20260225012049.920665-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysACPI: property: Constify stubs for CONFIG_ACPI=n caseAndy Shevchenko1-3/+3
commit 5c1a72a0fbe1b02c3ce0537f85f92ea935e0beec upstream. There is a few stubs that left untouched during constification of the fwnode related APIs. Constify three more stubs here. Fixes: 8b9d6802583a ("ACPI: Constify acpi_bus helper functions, switch to macros") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [ rjw: Subject edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-15ptrace: slightly saner 'get_dumpable()' logicLinus Torvalds1-0/+3
commit 31e62c2ebbfdc3fe3dbdf5e02c92a9dc67087a3a upstream. The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a *bit* more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18device property: Allow error pointer to be passed to fwnode APIsAndy Shevchenko1-5/+5
[ Upstream commit 002752af7b89b74c64fe6bec8c5fde3d3a7810d8 ] Some of the fwnode APIs might return an error pointer instead of NULL or valid fwnode handle. The result of such API call may be considered optional and hence the test for it is usually done in a form of fwnode = fwnode_find_reference(...); if (IS_ERR(fwnode)) ...error handling... Nevertheless the resulting fwnode may have bumped the reference count and hence caller of the above API is obliged to call fwnode_handle_put(). Since fwnode may be not valid either as NULL or error pointer the check has to be performed there. This approach uglifies the code and adds a point of making a mistake, i.e. forgetting about error point case. To prevent this, allow an error pointer to be passed to the fwnode APIs. Fixes: 83b34afb6b79 ("device property: Introduce fwnode_find_reference()") Reported-by: Nuno Sá <nuno.sa@analog.com> Tested-by: Nuno Sá <nuno.sa@analog.com> Acked-by: Nuno Sá <nuno.sa@analog.com> Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 2692c614f8f0 ("device property: Allow secondary lookup in fwnode_get_next_child_node()") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18device property: Unify access to of_nodeAndy Shevchenko1-1/+1
[ Upstream commit fb38f314fbd173e2e9f9f0f2e720a5f4889562da ] Historically we have a few variants how we access dev->fwnode and dev->of_node. Some of the functions during development gained different versions of the getters. Unify access to of_node and as a side change slightly refactor ACPI specific branches. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 2692c614f8f0 ("device property: Allow secondary lookup in fwnode_get_next_child_node()") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18device property: Add fwnode_is_ancestor_of() and fwnode_get_next_parent_dev()Saravana Kannan1-0/+3
[ Upstream commit b5d3e2fbcb10957521af14c4256cd0e5f68b9234 ] Add fwnode_is_ancestor_of() helper function to check if a fwnode is an ancestor of another fwnode. Add fwnode_get_next_parent_dev() helper function that take as input a fwnode and finds the closest ancestor fwnode that has a corresponding struct device and returns that struct device. Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20201121020232.908850-11-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 2692c614f8f0 ("device property: Allow secondary lookup in fwnode_get_next_child_node()") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18mm/huge_memory: fix folio isn't locked in softleaf_to_folio()Jinjiang Tu1-0/+6
[ Upstream commit 4c5e7f0fcd592801c9cc18f29f80fbee84eb8669 ] On arm64 server, we found folio that get from migration entry isn't locked in softleaf_to_folio(). This issue triggers when mTHP splitting and zap_nonpresent_ptes() races, and the root cause is lack of memory barrier in softleaf_to_folio(). The race is as follows: CPU0 CPU1 deferred_split_scan() zap_nonpresent_ptes() lock folio split_folio() unmap_folio() change ptes to migration entries __split_folio_to_order() softleaf_to_folio() set flags(including PG_locked) for tail pages folio = pfn_folio(softleaf_to_pfn(entry)) smp_wmb() VM_WARN_ON_ONCE(!folio_test_locked(folio)) prep_compound_page() for tail pages In __split_folio_to_order(), smp_wmb() guarantees page flags of tail pages are visible before the tail page becomes non-compound. smp_wmb() should be paired with smp_rmb() in softleaf_to_folio(), which is missed. As a result, if zap_nonpresent_ptes() accesses migration entry that stores tail pfn, softleaf_to_folio() may see the updated compound_head of tail page before page->flags. This issue will trigger VM_WARN_ON_ONCE() in pfn_swap_entry_folio() because of the race between folio split and zap_nonpresent_ptes() leading to a folio incorrectly undergoing modification without a folio lock being held. This is a BUG_ON() before commit 93976a20345b ("mm: eliminate further swapops predicates"), which in merged in v6.19-rc1. To fix it, add missing smp_rmb() if the softleaf entry is migration entry in softleaf_to_folio() and softleaf_to_page(). [tujinjiang@huawei.com: update function name and comments] Link: https://lkml.kernel.org/r/20260321075214.3305564-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20260319012541.4158561-1-tujinjiang@huawei.com Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ adapted upstream leafops.h changes to swapops.h ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using ↵David Hildenbrand (Red Hat)2-5/+12
mmu_gather commit 8ce720d5bd91e9dc16db3604aa4b1bf76770a9a1 upstream. As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads. In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table. There are two optimizations to be had: (1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table. Locking prevents that any of these PMD tables could get reused before we drop the lock. (2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI will be broadcasted by the last sharer. Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc succeeded (and supressed an IPI). But there is not difference compared to GUP-fast just sleeping for a while after grabbing the page and re-enabling IRQs. Most importantly, GUP-fast will never walk into page tables that are no-longer shared, because the last sharer will issue an IPI broadcast. (if ever required, checking whether the PUD changed in GUP-fast after grabbing the page like we do in the PTE case could handle this) So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure so we can implement these optimizations and demystify the code at least a bit. Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation. To make initialization of the mmu_gather easier when working on a single VMA (in particular, when dealing with hugetlb), provide tlb_gather_mmu_vma(). We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables. Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables(). >From hugetlb code, we call huge_pmd_unshare_flush() where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held. Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier. Document it all properly. Notes about tlb_remove_table_sync_one() interaction with unsharing: There are two fairly tricky things: (1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE. Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs. Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set. The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels and it might have been problematic before this patch. Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine. Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy. This patch changes nothing in this regard. (2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync. Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one(). This optimized can be implemented on top of this, by checking e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs. Notes on TLB flushing changes: (1) Flushing for non-shared PMD tables We're converting from flush_hugetlb_tlb_range() to tlb_remove_huge_tlb_entry(). Given that we properly initialize the MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to __unmap_hugepage_range(), that should be fine. (2) Flushing for shared PMD tables We're converting from various things (flush_hugetlb_tlb_range(), tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range(). tlb_flush_pmd_range() achieves the same that tlb_remove_huge_tlb_entry() would achieve in these scenarios. Note that tlb_remove_huge_tlb_entry() also calls __tlb_remove_tlb_entry(), however that is only implemented on powerpc, which does not support PMD table sharing. Similar to (1), tlb_gather_mmu_vma() should make sure that TLB flushing keeps on working as expected. Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later. There are plenty more cleanups to be had, but they have to wait until this is fixed. [david@kernel.org: fix kerneldoc] Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reported-by: "Uschakow, Stanislav" <suschako@amazon.de> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/ Tested-by: Laurence Oberman <loberman@redhat.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 8ce720d5bd91e9dc16db3604aa4b1bf76770a9a1) [ David: We don't have ptdesc and the wrappers, so work directly on page->pt_share_count and pass "struct page" instead of "struct ptdesc". CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING is still called CONFIG_ARCH_WANT_HUGE_PMD_SHARE and is set even without CONFIG_HUGETLB_PAGE. We don't have 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma"), so move_hugetlb_page_tables() does not exist. We don't have 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization") and a98a2f0c8ce1 ("mm/rmap: split migration into its own so changes in mm/rmap.c looks quite different. We don't have 4ddb4d91b82f ("hugetlb: do not update address in huge_pmd_unshare"), so huge_pmd_unshare() still gets a pointer to an address. tlb_gather_mmu() + tlb_finish_mmu() still consume ranges, so also teach tlb_gather_mmu_vma() to forward ranges. Some smaller contextual stuff, in particular, around tlb_gather_mmu_full(). ] Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18mm/hugetlb: fix hugetlb_pmd_shared()David Hildenbrand (Red Hat)1-1/+1
commit ca1a47cd3f5f4c46ca188b1c9a27af87d1ab2216 upstream. Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)", v3. One functional fix, one performance regression fix, and two related comment fixes. I cleaned up my prototype I recently shared [1] for the performance fix, deferring most of the cleanups I had in the prototype to a later point. While doing that I identified the other things. The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch #1 and #4. Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing Patch #2 + #3 are simple comment fixes that patch #4 interacts with. Patch #4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit(). The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated There are plenty of cleanups