summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2025-10-29treewide: remove MIGRATEPAGE_SUCCESSDavid Hildenbrand1-1/+1
[ Upstream commit fb49a4425cfa163faccd91f913773d3401d3a7d4 ] At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users, and now that we remove MIGRATEPAGE_UNMAP, it's really the only "success" return value that the code uses and expects. Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0" for success. Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> [mm] Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Byungchul Park <byungchul@sk.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 4ba5a8a7faa6 ("vmw_balloon: indicate success when effectively deflating during migration") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29x86/microcode: Fix Entrysign revision check for Zen1/NaplesAndrew Cooper1-1/+1
commit 876f0d43af78639790bee0e57b39d498ae35adcf upstream. ... to match AMD's statement here: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html Fixes: 50cef76d5cb0 ("x86/microcode/AMD: Load only SHA256-checksummed patches") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://patch.msgid.link/20251020144124.2930784-1-andrew.cooper3@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29riscv: hwprobe: avoid uninitialized variable use in hwprobe_arch_id()Paul Walmsley1-0/+6
[ Upstream commit b7776a802f2f80139f96530a489dd00fd7089eda ] Resolve this smatch warning: arch/riscv/kernel/sys_hwprobe.c:50 hwprobe_arch_id() error: uninitialized symbol 'cpu_id'. This could happen if hwprobe_arch_id() was called with a key ID of something other than MVENDORID, MIMPID, and MARCHID. This does not happen in the current codebase. The only caller of hwprobe_arch_id() is a function that only passes one of those three key IDs. For the sake of reducing static analyzer warning noise, and in the unlikely event that hwprobe_arch_id() is someday called with some other key ID, validate hwprobe_arch_id()'s input to ensure that 'cpu_id' is always initialized before use. Fixes: ea3de9ce8aa280 ("RISC-V: Add a syscall for HW probing") Cc: Evan Green <evan@rivosinc.com> Signed-off-by: Paul Walmsley <pjw@kernel.org> Link: https://lore.kernel.org/r/cf5a13ec-19d0-9862-059b-943f36107bf3@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29RISC-V: Don't print details of CPUs disabled in DTAnup Patel1-3/+1
[ Upstream commit d2721bb165b3ee00dd23525885381af07fec852a ] Early boot stages may disable CPU DT nodes for unavailable CPUs based on SKU, pinstraps, eFuse, etc. Currently, the riscv_early_of_processor_hartid() prints details of a CPU if it is disabled in DT which has no value and gives a false impression to the users that there some issue with the CPU. Fixes: e3d794d555cd ("riscv: treat cpu devicetree nodes without status as enabled") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20251014163009.182381-1-apatel@ventanamicro.com Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29RISC-V: Define pgprot_dmacoherent() for non-coherent devicesAnup Patel1-0/+2
[ Upstream commit ca525d53f994d45c8140968b571372c45f555ac1 ] The pgprot_dmacoherent() is used when allocating memory for non-coherent devices and by default pgprot_dmacoherent() is same as pgprot_noncached() unless architecture overrides it. Currently, there is no pgprot_dmacoherent() definition for RISC-V hence non-coherent device memory is being mapped as IO thereby making CPU access to such memory slow. Define pgprot_dmacoherent() to be same as pgprot_writecombine() for RISC-V so that CPU access non-coherent device memory as NOCACHE which is better than accessing it as IO. Fixes: ff689fd21cb1 ("riscv: add RISC-V Svpbmt extension support") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Tested-by: Han Gao <rabenda.cn@gmail.com> Tested-by: Guo Ren (Alibaba DAMO Academy) <guoren@kernel.org> Link: https://lore.kernel.org/r/20250820152316.1012757-1-apatel@ventanamicro.com Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29arm64: dts: broadcom: bcm2712: Define VGIC interruptPeter Robinson1-0/+2
[ Upstream commit aa960b597600bed80fe171729057dd6aa188b5b5 ] Define the interrupt in the GICv2 for vGIC so KVM can be used, it was missed from the original upstream DTB for some reason. Signed-off-by: Peter Robinson <pbrobinson@gmail.com> Cc: Andrea della Porta <andrea.porta@suse.com> Cc: Phil Elwell <phil@raspberrypi.com> Fixes: faa3381267d0 ("arm64: dts: broadcom: Add minimal support for Raspberry Pi 5") Link: https://lore.kernel.org/r/20250924085612.1039247-1-pbrobinson@gmail.com Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29arm64: dts: broadcom: bcm2712: Add default GIC address cellsKrzysztof Kozlowski1-0/+1
[ Upstream commit 278b6cabf18bd804f956b98a2f1068717acdbfe3 ] Add missing address-cells 0 to GIC interrupt node to silence W=1 warning: bcm2712.dtsi:494.4-497.31: Warning (interrupt_map): /axi/pcie@1000110000:interrupt-map: Missing property '#address-cells' in node /soc@107c000000/interrupt-controller@7fff9000, using 0 as fallback Value '0' is correct because: 1. GIC interrupt controller does not have children, 2. interrupt-map property (in PCI node) consists of five components and the fourth component "parent unit address", which size is defined by '#address-cells' of the node pointed to by the interrupt-parent component, is not used (=0) Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20250822133407.312505-2-krzysztof.kozlowski@linaro.org Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Stable-dep-of: aa960b597600 ("arm64: dts: broadcom: bcm2712: Define VGIC interrupt") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29MIPS: Malta: Fix keyboard resource preventing i8042 driver from registeringMaciej W. Rozycki1-1/+1
commit bf5570590a981d0659d0808d2d4bcda21b27a2a5 upstream. MIPS Malta platform code registers the PCI southbridge legacy port I/O PS/2 keyboard range as a standard resource marked as busy. It prevents the i8042 driver from registering as it fails to claim the resource in a call to i8042_platform_init(). Consequently PS/2 keyboard and mouse devices cannot be used with this platform. Fix the issue by removing the busy marker from the standard reservation, making the driver register successfully: serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 and the resource show up as expected among the legacy devices: 00000000-00ffffff : MSC PCI I/O 00000000-0000001f : dma1 00000020-00000021 : pic1 00000040-0000005f : timer 00000060-0000006f : keyboard 00000060-0000006f : i8042 00000070-00000077 : rtc0 00000080-0000008f : dma page reg 000000a0-000000a1 : pic2 000000c0-000000df : dma2 [...] If the i8042 driver has not been configured, then the standard resource will remain there preventing any conflicting dynamic assignment of this PCI port I/O address range. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/alpine.DEB.2.21.2510211919240.8377@angie.orcam.me.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29arm64: mte: Do not warn if the page is already tagged in copy_highpage()Catalin Marinas1-3/+8
commit b98c94eed4a975e0c80b7e90a649a46967376f58 upstream. The arm64 copy_highpage() assumes that the destination page is newly allocated and not MTE-tagged (PG_mte_tagged unset) and warns accordingly. However, following commit 060913999d7a ("mm: migrate: support poisoned recover from migrate folio"), folio_mc_copy() is called before __folio_migrate_mapping(). If the latter fails (-EAGAIN), the copy will be done again to the same destination page. Since copy_highpage() already set the PG_mte_tagged flag, this second copy will warn. Replace the WARN_ON_ONCE(page already tagged) in the arm64 copy_highpage() with a comment. Reported-by: syzbot+d1974fc28545a3e6218b@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/68dda1ae.a00a0220.102ee.0065.GAE@google.com Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: stable@vger.kernel.org # 6.12.x Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29riscv: cpufeature: avoid uninitialized variable in has_thead_homogeneous_vlenb()Paul Walmsley1-2/+2
commit 2dc99ea2727640b2fe12f9aa0e38ea2fc3cbb92d upstream. In has_thead_homogeneous_vlenb(), smatch detected that the vlenb variable could be used while uninitialized. It appears that this could happen if no CPUs described in DT have the "thead,vlenb" property. Fix by initializing vlenb to 0, which will keep thead_vlenb_of set to 0 (as it was statically initialized). This in turn will cause riscv_v_setup_vsize() to fall back to CSR probing - the desired result if thead,vlenb isn't provided in the DT data. While here, fix a nearby comment typo. Cc: stable@vger.kernel.org Cc: Charlie Jenkins <charlie@rivosinc.com> Fixes: 377be47f90e41 ("riscv: vector: Use vlenb from DT for thead") Signed-off-by: Paul Walmsley <pjw@kernel.org> Link: https://lore.kernel.org/r/22674afb-2fe8-2a83-1818-4c37bd554579@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29riscv: hwprobe: Fix stale vDSO data for late-initialized keys at bootJingwei Wang5-15/+79
commit 5d15d2ad36b0f7afab83ca9fc8a2a6e60cbe54c4 upstream. The hwprobe vDSO data for some keys, like MISALIGNED_VECTOR_PERF, is determined by an asynchronous kthread. This can create a race condition where the kthread finishes after the vDSO data has already been populated, causing userspace to read stale values. To fix this race, a new 'ready' flag is added to the vDSO data, initialized to 'false' during arch_initcall_sync. This flag is checked by both the vDSO's user-space code and the riscv_hwprobe syscall. The syscall serves as a one-time gate, using a completion to wait for any pending probes before populating the data and setting the flag to 'true', thus ensuring userspace reads fresh values on its first request. Reported-by: Tsukasa OI <research_trasio@irq.a4lg.com> Closes: https://lore.kernel.org/linux-riscv/760d637b-b13b-4518-b6bf-883d55d44e7f@irq.a4lg.com/ Fixes: e7c9d66e313b ("RISC-V: Report vector unaligned access speed hwprobe") Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Olof Johansson <olof@lixom.net> Cc: stable@vger.kernel.org Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Co-developed-by: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Jingwei Wang <wangjingwei@iscas.ac.cn> Link: https://lore.kernel.org/r/20250811142035.105820-1-wangjingwei@iscas.ac.cn [pjw@kernel.org: fix checkpatch issues] Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29arm64, mm: avoid always making PTE dirty in pte_mkwrite()Huang Ying1-1/+2
[ Upstream commit 143937ca51cc6ae2fccc61a1cb916abb24cd34f5 ] Current pte_mkwrite_novma() makes PTE dirty unconditionally. This may mark some pages that are never written dirty wrongly. For example, do_swap_page() may map the exclusive pages with writable and clean PTEs if the VMA is writable and the page fault is for read access. However, current pte_mkwrite_novma() implementation always dirties the PTE. This may cause unnecessary disk writing if the pages are never written before being reclaimed. So, change pte_mkwrite_novma() to clear the PTE_RDONLY bit only if the PTE_DIRTY bit is set to make it possible to make the PTE writable and clean. The current behavior was introduced in commit 73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()"). Before that, pte_mkwrite() only sets the PTE_WRITE bit, while set_pte_at() only clears the PTE_RDONLY bit if both the PTE_WRITE and the PTE_DIRTY bits are set. To test the performance impact of the patch, on an arm64 server machine, run 16 redis-server processes on socket 1 and 16 memtier_benchmark processes on socket 0 with mostly get transactions (that is, redis-server will mostly read memory only). The memory footprint of redis-server is larger than the available memory, so swap out/in will be triggered. Test results show that the patch can avoid most swapping out because the pages are mostly clean. And the benchmark throughput improves ~23.9% in the test. Fixes: 73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()") Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29s390/mm: Use __GFP_ACCOUNT for user page table allocationsHeiko Carstens1-3/+10
[ Upstream commit 5671ce2a1fc6b4a16cff962423bc416b92cac3c8 ] Add missing kmemcg accounting of user page table allocations. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29riscv: cpufeature: add validation for zfa, zfh and zfhminClément Léger2-11/+9
[ Upstream commit 2e2cf5581fccc562f7faf174ffb9866fed5cafbd ] These extensions depends on the F one. Add a validation callback checking for the F extension to be present. Now that extensions are correctly reported using the F/D presence, we can remove the has_fpu() check in hwprobe_isa_ext0(). Signed-off-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20250527100001.33284-1-cleger@rivosinc.com Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29riscv: mm: Use mmu-type from FDT to limit SATP modeJunhui Liu3-3/+49
[ Upstream commit 17e9521044c9b3ee839f861d1ac35c5b5c20d16b ] Some RISC-V implementations may hang when attempting to write an unsupported SATP mode, even though the latest RISC-V specification states such writes should have no effect. To avoid this issue, the logic for selecting SATP mode has been refined: The kernel now determines the SATP mode limit by taking the minimum of the value specified by the kernel command line (noXlvl) and the "mmu-type" property in the device tree (FDT). If only one is specified, use that. - If the resulting limit is sv48 or higher, the kernel will probe SATP modes from this limit downward until a supported mode is found. - If the limit is sv39, the kernel will directly use sv39 without probing. This ensures SATP mode selection is safe and compatible with both hardware and user configuration, minimizing the risk of hangs. Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250722-satp-from-fdt-v1-2-5ba22218fa5f@pigmoral.tech Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29riscv: mm: Return intended SATP mode for noXlvl optionsJunhui Liu2-4/+4
[ Upstream commit f3243bed39c26ce0f13e6392a634f91d409b2d02 ] Change the return value of match_noXlvl() to return the SATP mode that will be used, rather than the mode being disabled. This enables unified logic for return value judgement with the function that obtains mmu-type from the fdt, avoiding extra conversion. This only changes the naming, with no functional impact. Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250722-satp-from-fdt-v1-1-5ba22218fa5f@pigmoral.tech Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29powerpc/32: Remove PAGE_KERNEL_TEXT to fix startup failureChristophe Leroy3-15/+3
[ Upstream commit 9316512b717f6f25c4649b3fdb0a905b6a318e9f ] PAGE_KERNEL_TEXT is an old macro that is used to tell kernel whether kernel text has to be mapped read-only or read-write based on build time options. But nowadays, with functionnalities like jump_labels, static links, etc ... more only less all kernels need to be read-write at some point, and some combinations of configs failed to work due to innacurate setting of PAGE_KERNEL_TEXT. On the other hand, today we have CONFIG_STRICT_KERNEL_RWX which implements a more controlled access to kernel modifications. Instead of trying to keep PAGE_KERNEL_TEXT accurate with all possible options that may imply kernel text modification, always set kernel text read-write at startup and rely on CONFIG_STRICT_KERNEL_RWX to provide accurate protection. Do this by passing PAGE_KERNEL_X to map_kernel_page() in __maping_ram_chunk() instead of passing PAGE_KERNEL_TEXT. Once this is done, the only remaining user of PAGE_KERNEL_TEXT is mmu_mark_initmem_nx() which uses it in a call to setibat(). As setibat() ignores the RW/RO, we can seamlessly replace PAGE_KERNEL_TEXT by PAGE_KERNEL_X here as well and get rid of PAGE_KERNEL_TEXT completely. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/342b4120-911c-4723-82ec-d8c9b03a8aef@mailbox.org/ Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/8e2d793abf87ae3efb8f6dce10f974ac0eda61b8.1757412205.git.christophe.leroy@csgroup.eu Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29m68k: bitops: Fix find_*_bit() signaturesGeert Uytterhoeven1-11/+14
[ Upstream commit 6d5674090543b89aac0c177d67e5fb32ddc53804 ] The function signatures of the m68k-optimized implementations of the find_{first,next}_{,zero_}bit() helpers do not match the generic variants. Fix this by changing all non-pointer inputs and outputs to "unsigned long", and updating a few local variables. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509092305.ncd9mzaZ-lkp@intel.com/ Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: "Yury Norov (NVIDIA)" <yury.norov@gmail.com> Link: https://patch.msgid.link/de6919554fbb4cd1427155c6bafbac8a9df822c8.1757517135.git.geert@linux-m68k.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29arm64: sysreg: Correct sign definitions for EIESB and DoubleLockFuad Tabba1-2/+2
[ Upstream commit f4d4ebc84995178273740f3e601e97fdefc561d2 ] The `ID_AA64MMFR4_EL1.EIESB` field, is an unsigned enumeration, but was incorrectly defined as a `SignedEnum` when introduced in commit cfc680bb04c5 ("arm64: sysreg: Add layout for ID_AA64MMFR4_EL1"). This is corrected to `UnsignedEnum`. Conversely, the `ID_AA64DFR0_EL1.DoubleLock` field, is a signed enumeration, but was incorrectly defined as an `UnsignedEnum`. This is corrected to `SignedEnum`, which wasn't correctly set when annotated as such in commit ad16d4cf0b4f ("arm64/sysreg: Initial unsigned annotations for ID registers"). Signed-off-by: Fuad Tabba <tabba@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29nios2: ensure that memblock.current_limit is set when setting pfn limitsSimon Schuster1-0/+15
[ Upstream commit a20b83cf45be2057f3d073506779e52c7fa17f94 ] On nios2, with CONFIG_FLATMEM set, the kernel relies on memblock_get_current_limit() to determine the limits of mem_map, in particular for max_low_pfn. Unfortunately, memblock.current_limit is only default initialized to MEMBLOCK_ALLOC_ANYWHERE at this point of the bootup, potentially leading to situations where max_low_pfn can erroneously exceed the value of max_pfn and, thus, the valid range of available DRAM. This can in turn cause kernel-level paging failures, e.g.: [ 76.900000] Unable to handle kernel paging request at virtual address 20303000 [ 76.900000] ea = c0080890, ra = c000462c, cause = 14 [ 76.900000] Kernel panic - not syncing: Oops [ 76.900000] ---[ end Kernel panic - not syncing: Oops ]--- This patch fixes this by pre-calculating memblock.current_limit based on the upper limits of the available memory ranges via adjust_lowmem_bounds, a simplified version of the equivalent implementation within the arm architecture. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Signed-off-by: Andreas Oetken <andreas.oetken@siemens-energy.com> Signed-off-by: Dinh Nguyen <dinguyen@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23x86/resctrl: Fix miscount of bandwidth event when reactivating previously ↵Babu Moger1-4/+10
unavailable RMID [ Upstream commit 15292f1b4c55a3a7c940dbcb6cb8793871ed3d92 ] Users can create as many monitoring groups as the number of RMIDs supported by the hardware. However, on AMD systems, only a limited number of RMIDs are guaranteed to be actively tracked by the hardware. RMIDs that exceed this limit are placed in an "Unavailable" state. When a bandwidth counter is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62). When such an RMID starts being tracked again the hardware counter is reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read after tracking re-starts and is clear on all subsequent reads as long as the RMID is tracked. resctrl miscounts the bandwidth events after an RMID transitions from the "Unavailable" state back to being tracked. This happens because when the hardware starts counting again after resetting the counter to zero, resctrl in turn compares the new count against the counter value stored from the previous time the RMID was tracked. This results in resctrl computing an event value that is either undercounting (when new counter is more than stored counter) or a mistaken overflow (when new counter is less than stored counter). Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to zero whenever the RMID is in the "Unavailable" state to ensure accurate counting after the RMID resets to zero when it starts to be tracked again. Example scenario that results in mistaken overflow ================================================== 1. The resctrl filesystem is mounted, and a task is assigned to a monitoring group. $mount -t resctrl resctrl /sys/fs/resctrl $mkdir /sys/fs/resctrl/mon_groups/test1/ $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes 21323 <- Total bytes on domain 0 "Unavailable" <- Total bytes on domain 1 Task is running on domain 0. Counter on domain 1 is "Unavailable". 2. The task runs on domain 0 for a while and then moves to domain 1. The counter starts incrementing on domain 1. $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes 7345357 <- Total bytes on domain 0 4545 <- Total bytes on domain 1 3. At some point, the RMID in domain 0 transitions to the "Unavailable" state because the task is no longer executing in that domain. $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes "Unavailable" <- Total bytes on domain 0 434341 <- Total bytes on domain 1 4. Since the task continues to migrate between domains, it may eventually return to domain 0. $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes 17592178699059 <- Overflow on domain 0 3232332 <- Total bytes on domain 1 In this case, the RMID on domain 0 transitions from "Unavailable" state to active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when the counter is read and begins tracking the RMID counting from 0. Subsequent reads succeed but return a value smaller than the previously saved MSR value (7345357). Consequently, the resctrl's overflow logic is triggered, it compares the previous value (7345357) with the new, smaller value and incorrectly interprets this as a counter overflow, adding a large delta. In reality, this is a false positive: the counter did not overflow but was simply reset when the RMID transitioned from "Unavailable" back to active state. Here is the text from APM [1] available from [2]. "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the first QM_CTR read when it begins tracking an RMID that it was not previously tracking. The U bit will be zero for all subsequent reads from that RMID while it is still tracked by the hardware. Therefore, a QM_CTR read with the U bit set when that RMID is in use by a processor can be considered 0 when calculating the difference with a subsequent read." [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory Bandwidth (MBM). [ bp: Split commit message into smaller paragraph chunks for better consumption. ] Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature") Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org # needs adjustments for <= v6.17 Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23x86/resctrl: Refactor resctrl_arch_rmid_read()Babu Moger1-15/+23
[ Upstream commit 7c9ac605e202c4668e441fc8146a993577131ca1 ] resctrl_arch_rmid_read() adjusts the value obtained from MSR_IA32_QM_CTR to account for the overflow for MBM events and apply counter scaling for all the events. This logic is common to both reading an RMID and reading a hardware counter directly. Refactor the hardware value adjustment logic into get_corrected_val() to prepare for support of reading a hardware counter. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Stable-dep-of: 15292f1b4c55 ("x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23arm64: errata: Apply workarounds for Neoverse-V3AEMark Rutland2-0/+2
commit 0c33aa1804d101c11ba1992504f17a42233f0e11 upstream. Neoverse-V3AE is also affected by erratum #3312417, as described in its Software Developer Errata Notice (SDEN) document: Neoverse V3AE (MP172) SDEN v9.0, erratum 3312417 https://developer.arm.com/documentation/SDEN-2615521/9-0/ Enable the workaround for Neoverse-V3AE, and document this. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23arm64: cputype: Add Neoverse-V3AE definitionsMark Rutland1-0/+2
commit 3bbf004c4808e2c3241e5c1ad6cc102f38a03c39 upstream. Add cputype definitions for Neoverse-V3AE. These will be used for errata detection in subsequent patches. These values can be found in the Neoverse-V3AE TRM: https://developer.arm.com/documentation/SDEN-2615521/9-0/ ... in section A.6.1 ("MIDR_EL1, Main ID Register"). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23arm64: debug: always unmask interrupts in el0_softstp()Ada Couprie Diaz1-3/+5
commit ea0d55ae4b3207c33691a73da3443b1fd379f1d2 upstream. We intend that EL0 exception handlers unmask all DAIF exceptions before calling exit_to_user_mode(). When completing single-step of a suspended breakpoint, we do not call local_daif_restore(DAIF_PROCCTX) before calling exit_to_user_mode(), leaving all DAIF exceptions masked. When pseudo-NMIs are not in use this is benign. When pseudo-NMIs are in use, this is unsound. At this point interrupts are masked by both DAIF.IF and PMR_EL1, and subsequent irq flag manipulation may not work correctly. For example, a subsequent local_irq_enable() within exit_to_user_mode_loop() will only unmask interrupts via PMR_EL1 (leaving those masked via DAIF.IF), and anything depending on interrupts being unmasked (e.g. delivery of signals) will not work correctly. This was detected by CONFIG_ARM64_DEBUG_PRIORITY_MASKING. Move the call to `try_step_suspended_breakpoints()` outside of the check so that interrupts can be unmasked even if we don't call the step handler. Fixes: 0ac7584c08ce ("arm64: debug: split single stepping exception entry") Cc: <stable@vger.kernel.org> # 6.17 Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> [catalin.marinas@arm.com: added Mark's rewritten commit log and some whitespace] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> [ada.coupriediaz@arm.com: Fix conflict for v6.17 stable] Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23x86/mm: Fix SMP ordering in switch_mm_irqs_off()Ingo Molnar1-2/+22
[ Upstream commit 83b0177a6c4889b3a6e865da5e21b2c9d97d0551 ] Stephen noted that it is possible to not have an smp_mb() between the loaded_mm store and the tlb_gen load in switch_mm(), meaning the ordering against flush_tlb_mm_range() goes out the window, and it becomes possible for switch_mm() to not observe a recent tlb_gen update and fail to flush the TLBs. [ dhansen: merge conflict fixed by Ingo ] Fixes: 209954cbc7d0 ("x86/mm/tlb: Update mm_cpumask lazily") Reported-by: Stephen Dolan <sdolan@janestreet.com> Closes: https://lore.kernel.org/all/CAHDw0oGd0B4=uuv8NGqbUQ_ZVmSheU2bN70e4QhFXWvuAZdt2w@mail.gmail.com/ Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23powerpc/fadump: skip parameter area allocation when fadump is disabledSourabh Jain1-0/+3
[ Upstream commit 0843ba458439f38efdc14aa359c14ad0127edb01 ] Fadump allocates memory to pass additional kernel command-line argument to the fadump kernel. However, this allocation is not needed when fadump is disabled. So avoid allocating memory for the additional parameter area in such cases. Fixes: f4892c68ecc1 ("powerpc/fadump: allocate memory for additional parameters early") Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Fixes: f4892c68ecc1 ("powerpc/fadump: allocate memory for additional parameters early") Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251008032934.262683-1-sourabhjain@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23riscv: kprobes: Fix probe address validationFabian Vogt1-4/+9
[ Upstream commit 9e68bd803fac49274fde914466fd3b07c4d602c8 ] When adding a kprobe such as "p:probe/tcp_sendmsg _text+15392192", arch_check_kprobe would start iterating all instructions starting from _text until the probed address. Not only is this very inefficient, but literal values in there (e.g. left by function patching) are misinterpreted in a way that causes a desync. Fix this by doing it like x86: start the iteration at the closest preceding symbol instead of the given starting point. Fixes: 87f48c7ccc73 ("riscv: kprobe: Fixup kernel panic when probing an illegal position") Signed-off-by: Fabian Vogt <fvogt@suse.de> Signed-off-by: Marvin Friedrich <marvin.friedrich@suse.com> Acked-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/6191817.lOV4Wx5bFT@fvogt-thinkpad Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23KVM: arm64: Prevent access to vCPU events before initOliver Upton1-0/+6
commit 0aa1b76fe1429629215a7c79820e4b96233ac4a3 upstream. Another day, another syzkaller bug. KVM erroneously allows userspace to pend vCPU events for a vCPU that hasn't been initialized yet, leading to KVM interpreting a bunch of uninitialized garbage for routing / injecting the exception. In one case the injection code and the hyp disagree on whether the vCPU has a 32bit EL1 and put the vCPU into an illegal mode for AArch64, tripping the BUG() in exception_target_el() during the next injection: kernel BUG at arch/arm64/kvm/inject_fault.c:40! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 3 UID: 0 PID: 318 Comm: repro Not tainted 6.17.0-rc4-00104-g10fd0285305d #6 PREEMPT Hardware name: linux,dummy-virt (DT) pstate: 21402009 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : exception_target_el+0x88/0x8c lr : pend_serror_exception+0x18/0x13c sp : ffff800082f03a10 x29: ffff800082f03a10 x28: ffff0000cb132280 x27: 0000000000000000 x26: 0000000000000000 x25: ffff0000c2a99c20 x24: 0000000000000000 x23: 0000000000008000 x22: 0000000000000002 x21: 0000000000000004 x20: 0000000000008000 x19: ffff0000c2a99c20 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000200000c0 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : ffff800082f03af8 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffff800080f621f0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 000000000040009b x1 : 0000000000000003 x0 : ffff0000c2a99c20 Call trace: exception_target_el+0x88/0x8c (P) kvm_inject_serror_esr+0x40/0x3b4 __kvm_arm_vcpu_set_events+0xf0/0x100 kvm_arch_vcpu_ioctl+0x180/0x9d4 kvm_vcpu_ioctl+0x60c/0x9f4 __arm64_sys_ioctl+0xac/0x104 invoke_syscall+0x48/0x110 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x34/0xf0 el0t_64_sync_handler+0xa0/0xe4 el0t_64_sync+0x198/0x19c Code: f946bc01 b4fffe61 9101e020 17fffff2 (d4210000) Reject the ioctls outright as no sane VMM would call these before KVM_ARM_VCPU_INIT anyway. Even if it did the exception would've been thrown away by the eventual reset of the vCPU's state. Cc: stable@vger.kernel.org # 6.17 Fixes: b7b27facc7b5 ("arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23x86/CPU/AMD: Prevent reset reasons from being retained across rebootRong Zhang1-2/+14
commit e6416c2dfe23c9a6fec881fda22ebb9ae486cfc5 upstream. The S5_RESET_STATUS register is parsed on boot and printed to kmsg. However, this could sometimes be misleading and lead to users wasting a lot of time on meaningless debugging for two reasons: * Some bits are never cleared by hardware. It's the software's responsibility to clear them as per the Processor Programming Reference (see [1]). * Some rare hardware-initiated platform resets do not update the register at all. In both cases, a previous reboot could leave its trace in the register, resulting in users seeing unrelated reboot reasons while debugging random reboots afterward. Write the read value back to the register in order to clear all reason bits since they are write-1-to-clear while the others must be preserved. [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206537#attach_303991 [ bp: Massage commit message. ] Fixes: ab8131028710 ("x86/CPU/AMD: Print the reason for the last reset") Signed-off-by: Rong Zhang <i@rong.moe> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/all/20250913144245.23237-1-i@rong.moe/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23rust: cfi: only 64-bit arm and x86 support CFI_CLANGConor Dooley1-0/+1
commit 812258ff4166bcd41c7d44707e0591f9ae32ac8c upstream. The kernel uses the standard rustc targets for non-x86 targets, and out of those only 64-bit arm's target has kcfi support enabled. For x86, the custom 64-bit target enables kcfi. The HAVE_CFI_ICALL_NORMALIZE_INTEGERS_RUSTC config option that allows CFI_CLANG to be used in combination with RUST does not check whether the rustc target supports kcfi. This breaks the build on riscv (and presumably 32-bit arm) when CFI_CLANG and RUST are enabled at the same time. Ordinarily, a rustc-option check would be used to detect target support but unfortunately rustc-option filters out the target for reasons given in commit 46e24a545cdb4 ("rust: kasan/kbuild: fix missing flags on first build"). As a result, if the host supports kcfi but the target does not, e.g. when building for riscv on x86_64, the build would remain broken. Instead, make HAVE_CFI_ICALL_NORMALIZE_INTEGERS_RUSTC depend on the only two architectures where the target used supports it to fix the build. CC: stable@vger.kernel.org Fixes: ca627e636551e ("rust: cfi: add support for CFI_CLANG with Rust") Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20250908-distill-lint-1ae78bcf777c@spud Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23arm64/sysreg: Fix GIC CDEOI instruction encodingLorenzo Pieralisi1-1/+10
commit e9ad390a4812fd60c1da46823f7a6f84f2411f0c upstream. The GIC CDEOI system instruction requires the Rt field to be set to 0b11111 otherwise the instruction behaviour becomes CONSTRAINED UNPREDICTABLE. Currenly, its usage is encoded as a system register write, with a constant 0 value: write_sysreg_s(0, GICV5_OP_GIC_CDEOI) While compiling with GCC, the 0 constant value, through these asm constraints and modifiers ('x' modifier and 'Z' constraint combo): asm volatile(__msr_s(r, "%x0") : : "rZ" (__val)); forces the compiler to issue the XZR register for the MSR operation (ie that corresponds to Rt == 0b11111) issuing the right instruction encoding. Unfortunately LLVM does not yet understand that modifier/constraint combo so it ends up issuing a different register from XZR for the MSR source, which in turns means that it encodes the GIC CDEOI instruction wrongly and the instruction behaviour becomes CONSTRAINED UNPREDICTABLE that we must prevent. Add a conditional to write_sysreg_s() macro that detects whether it is passed a constant 0 value and issues an MSR write with XZR as source register - explicitly doing what the asm modifier/constraint is meant to achieve through constraints/modifiers, fixing the LLVM compilation issue. Fixes: 7ec80fb3f025 ("irqchip/gic-v5: Add GICv5 PPI support") Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Cc: Sascha Bischoff <sascha.bischoff@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19arm64: dts: qcom: qcs615: add missing dt property in QUP SEsViken Dadhaniya1-0/+6
[ Upstream commit 6a5e9b9738a32229e2673d4eccfcbfe2ef3a1ab4 ] Add the missing required-opps and operating-points-v2 properties to several I2C, SPI, and UART nodes in the QUP SEs. Fixes: f6746dc9e379 ("arm64: dts: qcom: qcs615: Add QUPv3 configuration") Cc: stable@vger.kernel.org Signed-off-by: Viken Dadhaniya <viken.dadhaniya@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250630064338.2487409-1-viken.dadhaniya@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19s390: Add -Wno-pointer-sign to KBUILD_CFLAGS_DECOMPRESSORHeiko Carstens1-0/+1
commit fa7a0a53eeb7e16402f82c3d5a9ef4bf5efe9357 upstream. If the decompressor is compiled with clang this can lead to the following warning: In file included from arch/s390/boot/startup.c:4: ... In file included from ./include/linux/pgtable.h:6: ./arch/s390/include/asm/pgtable.h:2065:48: warning: passing 'unsigned long *' to parameter of type 'long *' converts between pointers to integer types with different sign [-Wpointer-sign] 2065 | value = __atomic64_or_barrier(PGSTE_PCL_BIT, ptr); Add -Wno-pointer-sign to the decompressor compile flags, like it is also done for the kernel. This is similar to what was done for x86 to address the same problem [1]. [1] commit dca5203e3fe2 ("x86/boot: Add -Wno-pointer-sign to KBUILD_CFLAGS") Cc: stable@vger.kernel.org Reported-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19x86/umip: Fix decoding of register forms of 0F 01 (SGDT and SIDT aliases)Sean Christopherson1-0/+11
commit 27b1fd62012dfe9d3eb8ecde344d7aa673695ecf upstream. Filter out the register forms of 0F 01 when determining whether or not to emulate in response to a potential UMIP violation #GP, as SGDT and SIDT only accept memory operands. The register variants of 0F 01 are used to encode instructions for things like VMX and SGX, i.e. not checking the Mod field would cause the kernel to incorrectly emulate on #GP, e.g. due to a CPL violation on VMLAUNCH. Fixes: 1e5db223696a ("x86/umip: Add emulation code for UMIP instructions") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19x86/umip: Check that the instruction opcode is at least two bytesSean Christopherson1-2/+2
commit 32278c677947ae2f042c9535674a7fff9a245dd3 upstream. When checking for a potential UMIP violation on #GP, verify the decoder found at least two opcode bytes to avoid false positives when the kernel encounters an unknown instruction that starts with 0f. Because the array of opcode.bytes is zero-initialized by insn_init(), peeking at bytes[1] will misinterpret garbage as a potential SLDT or STR instruction, and can incorrectly trigger emulation. E.g. if a VPALIGNR instruction 62 83 c5 05 0f 08 ff vpalignr xmm17{k5},xmm23,XMMWORD PTR [r8],0xff hits a #GP, the kernel emulates it as STR and squashes the #GP (and corrupts the userspace code stream). Arguably the check should look for exactly two bytes, but no three byte opcodes use '0f 00 xx' or '0f 01 xx' as an escape, i.e. it should be impossible to get a false positive if the first two opcode bytes match '0f 00' or '0f 01'. Go with a more conservative check with respect to the existing code to minimize the chances of breaking userspace, e.g. due to decoder weirdness. Analyzed by Nick Bray <ncbray@google.com>. Fixes: 1e5db223696a ("x86/umip: Add emulation code for UMIP instructions") Reported-by: Dan Snyder <dansnyder@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19x86/fred: Remove ENDBR64 from FRED entry pointsXin Li (Intel)1-1/+1
commit 3da01ffe1aeaa0d427ab5235ba735226670a80d9 upstream. The FRED specification has been changed in v9.0 to state that there is no need for FRED event handlers to begin with ENDBR64, because in the presence of supervisor indirect branch tracking, FRED event delivery does not enter the WAIT_FOR_ENDBRANCH state. As a result, remove ENDBR64 from FRED entry points. Then add ANNOTATE_NOENDBR to indicate that FRED entry points will never be used for indirect calls to suppress an objtool warning. This change implies that any indirect CALL/JMP to FRED entry points causes #CP in the presence of supervisor indirect branch tracking. Credit goes to Jennifer Miller <jmill@asu.edu> and other contributors from Arizona State University whose research shows that placing ENDBR at entry points has negative value thus led to this change. Note: This is obviously an incompatible change to the FRED architecture. But, it's OK because there no FRED systems out in the wild today. All production hardware and late pre-production hardware will follow the FRED v9 spec and be compatible with this approach. [ dhansen: add note to changelog about incompatibility ] Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/linux-hardening/Z60NwR4w%2F28Z7XUa@ubun/ Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250716063320.1337818-1-xin%40zytor.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19xtensa: simdisk: add input size check in proc_write_simdiskMiaoqian Lin1-1/+5
commit 5d5f08fd0cd970184376bee07d59f635c8403f63 upstream. A malicious user could pass an arbitrarily bad value to memdup_user_nul(), potentially causing kernel crash. This follows the same pattern as commit ee76746387f6 ("netdevsim: prevent bad user input in nsim_dev_health_break_write()") Fixes: b6c7e873daf7 ("xtensa: ISS: add host file-based simulated disk") Fixes: 16e5c1fc3604 ("convert a bunch of open-coded instances of memdup_user_nul()") Cc: stable@vger.kernel.org Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Message-Id: <20250829083015.1992751-1-linmq006@gmail.com> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19sparc: fix error handling in scan_one_device()Ma Ke2-0/+2
commit 302c04110f0ce70d25add2496b521132548cd408 upstream. Once of_device_register() failed, we should call put_device() to decrement reference count for cleanup. Or it could cause memory leak. So fix this by calling put_device(), then the name can be freed in kobject_cleanup(). Calling path: of_device_register() -> of_device_add() -> device_add(). As comment of device_add() says, 'if device_add() succeeds, you should call device_del() when you want to get rid of it. If device_add() has not succeeded, use only put_device() to drop the reference count'. Found by code review. Cc: stable@vger.kernel.org Fixes: cf44bbc26cf1 ("[SPARC]: Beginnings of generic of_device framework.") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Reviewed-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19sparc64: fix hugetlb for sun4uAnthony Yznaga1-0/+20
commit 6fd44a481b3c6111e4801cec964627791d0f3ec5 upstream. An attempt to exercise sparc hugetlb code in a sun4u-based guest running under qemu results in the guest hanging due to being stuck in a trap loop. This is due to invalid hugetlb TTEs being installed that do not have the expected _PAGE_PMD_HUGE and page size bits set. Although the breakage has gone apparently unnoticed for several years, fix it now so there is the option to exercise sparc hugetlb code under qemu. This can be useful because sun4v support in qemu does not support linux guests currently and sun4v-based hardware resources may not be readily available. Fix tested with a 6.15.2 and 6.16-rc6 kernels by running libhugetlbfs tests on a qemu guest running Debian 13. Fixes: c7d9f77d33a7 ("sparc64: Multi-page size support") Cc: stable@vger.kernel.org Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Reviewed-by: Andreas Larsson <andreas@gaisler.com> Link: https://lore.kernel.org/r/20250716012446.10357-1-anthony.yznaga@oracle.com Signed-off-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19powerpc/pseries/msi: Fix potential underflow and leak issueNam Cao1-1/+1
commit 3443ff3be6e59b80d74036bb39f5b6409eb23cc9 upstream. pseries_irq_domain_alloc() allocates interrupts at parent's interrupt domain. If it fails in the progress, all allocated interrupts are freed. The number of successfully allocated interrupts so far is stored "i". However, "i - 1" interrupts are freed. This is broken: - One interrupt is not be freed - If "i" is zero, "i - 1" wraps around Correct the number of freed interrupts to 'i'. Fixes: a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI domains") Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Cédric Le Goater <clg@redhat.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/a980067f2b256bf716b4cd713bc1095966eed8cd.1754300646.git.namcao@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19powerpc/powernv/pci: Fix underflow and leak issueNam Cao1-1/+1
commit a39087905af9ffecaa237a918a2c03a04e479934 upstream. pnv_irq_domain_alloc() allocates interrupts at parent's interrupt domain. If it fails in the progress, all allocated interrupts are freed. The number of successfully allocated interrupts so far is stored "i". However, "i - 1" interrupts are freed. This is broken: - One interrupt is not be freed - If "i" is zero, "i - 1" wraps around Correct the number of freed interrupts to "i". Fixes: 0fcfe2247e75 ("powerpc/powernv/pci: Add MSI domains") Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Cédric Le Goater <clg@redhat.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/70f8debe8688e0b467367db769b71c20146a836d.1754300646.git.namcao@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19parisc: Remove spurious if statement from raw_copy_from_user()John David Anglin1-1/+0
commit 16794e524d310780163fdd49d0bf7fac30f8dbc8 upstream. Accidently introduced in commit 91428ca9320e. Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> Fixes: 91428ca9320e ("parisc: Check region is readable by user in raw_copy_from_user()") Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19parisc: don't reference obsolete termio struct for TC* constantsSam James1-4/+4
commit 8ec5a066f88f89bd52094ba18792b34c49dcd55a upstream. Similar in nature to ab107276607af90b13a5994997e19b7b9731e251. glibc-2.42 drops the legacy termio struct, but the ioctls.h header still defines some TC* constants in terms of termio (via sizeof). Hardcode the values instead. This fixes building Python for example, which falls over like: ./Modules/termios.c:1119:16: error: invalid application of 'sizeof' to incomplete type 'struct termio' Link: https://bugs.gentoo.org/961769 Link: https://bugs.gentoo.org/962600 Co-authored-by: Stian Halseth <stian@itx.no> Cc: stable@vger.kernel.org Signed-off-by: Sam James <sam@gentoo.org> Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19KVM: TDX: Fix uninitialized error code for __tdx_bringup()Tony Lindgren1-7/+3
commit 510c47f165f0c1f0b57329a30a9a797795519831 upstream. Fix a Smatch static checker warning reported by Dan: arch/x86/kvm/vmx/tdx.c:3464 __tdx_bringup() warn: missing error code 'r' Initialize r to -EINVAL before tdx_get_sysinfo() to simplify the code and to prevent similar issues from sneaking in later on as suggested by Kai. Cc: stable@vger.kernel.org Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 61bb28279623 ("KVM: TDX: Get system-wide info about TDX module on initialization") Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Link: https://lore.kernel.org/r/20250918053226.802204-1-tony.lindgren@linux.intel.com [sean: tag for stable] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19KVM: SVM: Re-load current, not host, TSC_AUX on #VMEXIT from SEV-ES guestHou Wenlong3-19/+18
commit 29da8c823abffdacb71c7c07ec48fcf9eb38757c upstream. Prior to running an SEV-ES guest, set TSC_AUX in the host save area to the current value in hardware, as tracked by the user return infrastructure, instead of always loading the host's desired value for the CPU. If the pCPU is also running a non-SEV-ES vCPU, loading the host's value on #VMEXIT could clobber the other vCPU's value, e.g. if the SEV-ES vCPU preempted the non-SEV-ES vCPU, in which case KVM expects the other vCPU's TSC_AUX value to be resident in hardware. Note, unlike TDX, which blindly _zeroes_ TSC_AUX on TD-Exit, SEV-ES CPUs can load an arbitrary value. Stuff the current value in the host save area instead of refreshing the user return cache so that KVM doesn't need to track whether or not the vCPU actually enterred the guest and thus loaded TSC_AUX from the host save area. Opportunistically tag tsc_aux_uret_slot as read-only after init to guard against unexpected modifications, and to make it obvious that using the variable in sev_es_prepare_switch_to_guest() is safe. Fixes: 916e3e5f26ab ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX") Cc: stable@vger.kernel.org Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: handle the SEV-ES case in sev_es_prepare_switch_to_guest()] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250923153738.1875174-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19x86/kvm: Force legacy PCI hole to UC when overriding MTRRs for TDX/SNPSean Christopherson1-2/+19
commit 0dccbc75e18df85399a71933d60b97494110f559 upstream. When running as an SNP or TDX guest under KVM, force the legacy PCI hole, i.e. memory between Top of Lower Usable DRAM and 4GiB, to be mapped as UC via a forced variable MTRR range. In most KVM-based setups, legacy devices such as the HPET and TPM are enumerated via ACPI. ACPI enumeration includes a Memory32Fixed entry, and optionally a SystemMemory descriptor for an OperationRegion, e.g. if the device needs to be accessed via a Control Method. If a SystemMemory entry is present, then the kernel's ACPI driver will auto-ioremap the region so that it can be accessed at will. However, the ACPI spec doesn't provide a way to enumerate the memory type of SystemMemory regions, i.e. there's no way to tell software that a region must be mapped as UC vs. WB, etc. As a result, Linux's ACPI driver always maps SystemMemory regions using ioremap_cache(), i.e. as WB on x86. The dedicated device drivers however, e.g. the HPET driver and TPM driver, want to map their associated memory as UC or WC, as accessing PCI devices using WB is unsupported. On bare metal and non-CoCO, the conflicting requirements "work" as firmware configures the PCI hole (and other device memory) to be UC in the MTRRs. So even though the ACPI mappings request WB, they are forced to UC- in the kernel's tracking due to the kernel properly handling the MTRR overrides, and thus are compatible with the drivers' requested WC/UC-. With force WB MTRRs on SNP and TDX guests, the ACPI mappings get their requested WB if the ACPI mappings are established before the dedicated driver code attempts to initialize the device. E.g. if acpi_init() runs before the corresponding device driver is probed, ACPI's WB mapping will "win", and result in the driver's ioremap() failing because the existing WB mapping isn't compatible with the requested WC/UC-. E.g. when a TPM is emulated by the hypervisor (ignoring the security implications of relying on what is allegedly an untrusted entity to store measurements), the TPM driver will request UC and fail: [ 1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0 [ 1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12 Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC). E.g. tracing mapping requests for TPM TIS yields: Mapping TPM TIS with req_type = 0 WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460 Modules linked in: CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.16.0-rc7+ #2 VOLUNTARY Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025 RIP: 0010:memtype_reserve+0x2ab/0x460 __ioremap_caller+0x16d/0x3d0 ioremap_cache+0x17/0x30 x86_acpi_os_ioremap+0xe/0x20 acpi_os_map_iomem+0x1f3/0x240 acpi_os_map_memory+0xe/0x20 acpi_ex_system_memory_space_handler+0x273/0x440 acpi_ev_address_space_dispatch+0x176/0x4c0 acpi_ex_access_region+0x2ad/0x530 acpi_ex_field_datum_io+0xa2/0x4f0 acpi_ex_extract_from_field+0x296/0x3e0 acpi_ex_read_data_from_field+0xd1/0x460 acpi_ex_resolve_node_to_value+0x2ee/0x530 acpi_ex_resolve_to_value+0x1f2/0x540 acpi_ds_evaluate_name_path+0x11b/0x190 acpi_ds_exec_end_op+0x456/0x960 acpi_ps_parse_loop+0x27a/0xa50 acpi_ps_parse_aml+0x226/0x600 acpi_ps_execute_method+0x172/0x3e0 acpi_ns_evaluate+0x175/0x5f0 acpi_evaluate_object+0x213/0x490 acpi_evaluate_integer+0x6d/0x140 acpi_bus_get_status+0x93/0x150 acpi_add_single_object+0x43a/0x7c0 acpi_bus_check_add+0x149/0x3a0 acpi_bus_check_add_1+0x16/0x30 acpi_ns_walk_namespace+0x22c/0x360 acpi_walk_namespace+0x15c/0x170 acpi_bus_scan+0x1dd/0x200 acpi_scan_init+0xe5/0x2b0 acpi_init+0x264/0x5b0 do_one_initcall+0x5a/0x310 kernel_init_freeable+0x34f/0x4f0 kernel_init+0x1b/0x200 ret_from_fork+0x186/0x1b0 ret_from_fork_asm+0x1a/0x30 </TASK> The above traces are from a Google-VMM based VM, but the same behavior happens with a QEMU based VM that is modified to add a SystemMemory range for the TPM TIS address space. The only reason this doesn't cause problems for HPET, which appears to require a SystemMemory region, is because HPET gets special treatment via x86_init.timers.timer_init(), and so gets a chance to create its UC- mapping before acpi_init() clobbers things. Disabling the early call to hpet_time_init() yields the same behavior for HPET: [ 0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0 Hack around the ACPI gap by forcing the legacy PCI hole to UC when overriding the (virtual) MTRRs for CoCo guest, so that ioremap handling of MTRRs naturally kicks in and forces the ACPI mappings to be UC. Note, the requested/mapped memtype doesn't actually matter in terms of accessing the device. In practically every setup, legacy PCI devices are emulated by the hypervisor, and accesses are intercepted and handled as emulated MMIO, i.e. never access physical memory and thus don't have an effective memtype. Even in a theoretical setup where such devices are passed through by the host, i.e. point at real MMIO memory, it is KVM's (as the hypervisor) responsibility to force the memory to be WC/UC, e.g. via EPT memtype under TDX or real hardware MTRRs under SNP. Not doing so cannot work, and the hypervisor is highly motivated to do the right thing as letting the guest access hardware MMIO with WB would likely result in a variety of fatal #MCs. In other words, forcing the range to be UC is all about coercing the kernel's tracking into thinking that it has established UC mappings, so that the ioremap code doesn't reject mappings from e.g. the TPM driver and thus prevent the driver from loading and the device from functioning. Note #2, relying on guest firmware to handle this scenario, e.g. by setting virtual MTRRs and then consuming them in Linux, is not a viable option, as the virtual MTRR state is managed by the untrusted hypervisor, and because OVMF at least has stopped programming virtual MTRRs when running as a TDX guest. Link: https://lore.kernel.org/all/8137d98e-8825-415b-9282-1d2a115bb51a@linux.intel.com Fixes: 8e690b817e38 ("x86/kvm: Override default caching mode for SEV-SNP and TDX") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Jürgen Groß <jgross@suse.com> Cc: Korakit Seemakhupt <korakit@google.com> Cc: Jianxiong Gao <jxgao@google.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Suggested-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Korakit Seemakhupt <korakit@google.com> Link: https://lore.kernel.org/r/20250828005249.39339-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19KVM: arm64: Fix page leak in user_mem_abort()Fuad Tabba1-2/+7
commit 5f9466b50c1b4253d91abf81780b90a722133162 upstream. The user_mem_abort() function acquires a page reference via __kvm_faultin_pfn() early in its execution. However, the subsequent checks for mismatched attributes between stage 1 and stage 2 mappings would return an error code directly, bypassing the corresponding page release. Fix this by storing the error and releasing the unused page before returning the error. Fixes: 6d674e28f642 ("KVM: arm/arm64: Properly handle faulting of device mappings") Fixes: 2a8dfab26677 ("KVM: arm64: Block cacheable PFNMAP mapping") Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19KVM: arm64: Fix debug checking for np-guests using huge mappingsBen Horgan1-3/+6
commit 2ba972bf71cb71d2127ec6c3db1ceb6dd0c73173 upstream. When running with transparent huge pages and CONFIG_NVHE_EL2_DEBUG then the debug checking in assert_host_shared_guest() fails on the launch of an np-guest. This WARN_ON() causes a panic and generates the stack below. In __pkvm_host_relax_perms_guest() the debug checking assumes the mapping is a single page but it may be a block map. Update the checking so that the size is not checked and just assumes the correct size. While we're here make the same fix in __pkvm_host_mkyoung_guest(). Info: # lkvm run -k /share/arch/arm64/boot/Image -m 704 -c 8 --name guest-128 Info: Removed ghost socket file "/.lkvm//guest-128.sock". [ 1406.521757] kvm [141]: nVHE hyp BUG at: arch/arm64/kvm/hyp/nvhe/mem_protect.c:1088! [ 1406.521804] kvm [141]: nVHE call trace: [ 1406.521828] kvm [141]: [<ffff8000811676b4>] __kvm_nvhe_hyp_panic+0xb4/0xe8 [ 1406.521946] kvm [141]: [<ffff80008116d12c>] __kvm_nvhe_assert_host_shared_guest+0xb0/0x10c [ 1406.522049] kvm [141]: [<ffff80008116f068>] __kvm_nvhe___pkvm_host_relax_perms_guest+0x48/0x104 [ 1406.522157] kvm [141]: [<ffff800081169df8>] __kvm_nvhe_handle___pkvm_host_relax_perms_guest+0x64/0x7c [ 1406.522250] kvm [141]: [<ffff800081169f0c>] __kvm_nvhe_handle_trap+0x8c/0x1a8 [ 1406.522333] kvm [141]: [<ffff8000811680fc>] __kvm_nvhe___skip_pauth_save+0x4/0x4 [ 1406.522454] kvm [141]: ---[ end nVHE call trace ]--- [ 1406.522477] kvm [141]: Hyp Offset: 0xfffece8013600000 [ 1406.522554] Kernel panic - not syncing: HYP panic: [ 1406.522554] PS:834003c9 PC:0000b1806db6d170 ESR:00000000f2000800 [ 1406.522554] FAR:ffff8000804be420 HPFAR:0000000000804be0 PAR:0000000000000000 [ 1406.522554] VCPU:0000000000000000 [ 1406.523337] CPU: 3 UID: 0 PID: 141 Comm: kvm-vcpu-0 Not tainted 6.16.0-rc7 #97 PREEMPT [ 1406.523485] Hardware name: FVP Base RevC (DT) [ 1406.523566] Call trace: [ 1406.523629] show_stack+0x18/0x24 (C) [ 1406.523753] dump_stack_lvl+0xd4/0x108 [ 1406.523899] dump_stack+0x18/0x24 [ 1406.524040] panic+0x3d8/0x448 [ 1406.524184] nvhe_hyp_panic_handler+0x10c/0x23c [ 1406.524325] kvm_handle_guest_abort+0x68c/0x109c [ 1406.524500] handle_exit+0x60/0x17c [ 1406.524630] kvm_arch_vcpu_ioctl_run+0x2e0/0x8c0 [ 1406.524794] kvm_vcpu_ioctl+0x1a8/0x9cc [ 1406.524919] __arm64_sys_ioctl+0xac/0x104 [ 1406.525067] invoke_syscall+0x48/0x10c [ 1406.525189] el0_svc_common.constprop.0+0x40/0xe0 [ 1406.525322] do_el0_svc+0x1c/0x28 [ 1406.525441] el0_svc+0x38/0x120 [ 1406.525588] el0t_64_sync_handler+0x10c/0x138 [ 1406.525750] el0t_64_sync+0x1ac/0x1b0 [ 1406.525876] SMP: stopping secondary CPUs [ 1406.525965] Kernel Offset: disabled [ 1406.526032] CPU features: 0x0000,00000080,8e134ca1,9446773f [ 1406.526130] Memory Limit: none [ 1406.959099] ---[ end Kernel panic - not syncing: HYP panic: [ 1406.959099] PS:834003c9 PC:0000b1806db6d170 ESR:00000000f2000800 [ 1406.959099] FAR:ffff8000804be420 HPFAR:0000000000804be0 PAR:0000000000000000 [ 1406.959099] VCPU:0000000000000000 ] Signed-off-by: Ben Horgan <ben.horgan@arm.com> Fixes: f28f1d02f4eaa ("KVM: arm64: Add a range to __pkvm_host_unshare_guest()") Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Quentin Perret <qperret@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: stable@vger.kernel.org Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19KVM: s390: Fix to clear PTE when discarding a swapped pageGautam Gala3-23/+34
commit 5deafa27d9ae040b75d392f60b12e300b42b4792 upstream. KVM run fails when guests with 'cmm' cpu feature and host are under memory pressure and use swap heavily. This is because npages becomes ENOMEN (out of memory) in hva_to_pfn_slow() which inturn propagates as EFAULT to qemu. Clearing the page table entry when discarding an address that maps to a swap entry resolves the issue. Fixes: 200197908dc4 ("KVM: s390: Refactor and split some gmap helpers") Cc: stable@vger.kernel.org Suggested-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Gautam Gala <ggala@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>