linux.git/arch, branch v6.9.4

genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline

2024-06-12T09:39:58+00:00

commit a6c11c0a5235fb144a65e0cb2ffd360ddc1f6c32 upstream.

The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
interrupt affinity reconfiguration via procfs. Instead, the change is
deferred until the next instance of the interrupt being triggered on the
original CPU.

When the interrupt next triggers on the original CPU, the new affinity is
enforced within __irq_move_irq(). A vector is allocated from the new CPU,
but the old vector on the original CPU remains and is not immediately
reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
process is delayed until the next trigger of the interrupt on the new CPU.

Upon the subsequent triggering of the interrupt on the new CPU,
irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
remains online. Subsequently, the timer on the old CPU iterates over its
vector_cleanup list, reclaiming old vectors.

However, a rare scenario arises if the old CPU is outgoing before the
interrupt triggers again on the new CPU.

In that case irq_force_complete_move() is not invoked on the outgoing CPU
to reclaim the old apicd->prev_vector because the interrupt isn't currently
affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
though __vector_schedule_cleanup() is later called on the new CPU, it
doesn't reclaim apicd->prev_vector; instead, it simply resets both
apicd->move_in_progress and apicd->prev_vector to 0.

As a result, the vector remains unreclaimed in vector_matrix, leading to a
CPU vector leak.

To address this issue, move the invocation of irq_force_complete_move()
before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
interrupt is currently or used to be affine to the outgoing CPU.

Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
following a warning message, although theoretically it should never see
apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.

Fixes: f0383c24b485 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
Signed-off-by: Dongli Zhang 
Signed-off-by: Thomas Gleixner 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
Signed-off-by: Greg Kroah-Hartman

x86/topology/intel: Unlock CPUID before evaluating anything

2024-06-12T09:39:58+00:00

commit 0c2f6d04619ec2b53ad4b0b591eafc9389786e86 upstream.

Intel CPUs have a MSR bit to limit CPUID enumeration to leaf two. If
this bit is set by the BIOS then CPUID evaluation including topology
enumeration does not work correctly as the evaluation code does not try
to analyze any leaf greater than two.

This went unnoticed before because the original topology code just
repeated evaluation several times and managed to overwrite the initial
limited information with the correct one later. The new evaluation code
does it once and therefore ends up with the limited and wrong
information.

Cure this by unlocking CPUID right before evaluating anything which
depends on the maximum CPUID leaf being greater than two instead of
rereading stuff after unlock.

Fixes: 22d63660c35e ("x86/cpu: Use common topology code for Intel")
Reported-by: Peter Schneider 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Peter Schneider 
Cc: 
Link: https://lore.kernel.org/r/fd3f73dc-a86f-4bcf-9c60-43556a21eb42@googlemail.com
Signed-off-by: Greg Kroah-Hartman

KVM: x86: Don't advertise guest.MAXPHYADDR as host.MAXPHYADDR in CPUID

2024-06-12T09:39:58+00:00

commit 6f5c9600621b4efb5c61b482d767432eb1ad3a9c upstream.

Drop KVM's propagation of GuestPhysBits (CPUID leaf 80000008, EAX[23:16])
to HostPhysBits (same leaf, EAX[7:0]) when advertising the address widths
to userspace via KVM_GET_SUPPORTED_CPUID.

Per AMD, GuestPhysBits is intended for software use, and physical CPUs do
not set that field.  I.e. GuestPhysBits will be non-zero if and only if
KVM is running as a nested hypervisor, and in that case, GuestPhysBits is
NOT guaranteed to capture the CPU's effective MAXPHYADDR when running with
TDP enabled.

E.g. KVM will soon use GuestPhysBits to communicate the CPU's maximum
*addressable* guest physical address, which would result in KVM under-
reporting PhysBits when running as an L1 on a CPU with MAXPHYADDR=52,
but without 5-level paging.

Signed-off-by: Gerd Hoffmann 
Cc: stable@vger.kernel.org
Reviewed-by: Xiaoyao Li 
Link: https://lore.kernel.org/r/20240313125844.912415-2-kraxel@redhat.com
[sean: rewrite changelog with --verbose, Cc stable@]
Signed-off-by: Sean Christopherson 
Signed-off-by: Greg Kroah-Hartman

x86/pci: Skip early E820 check for ECAM region

2024-06-12T09:39:58+00:00

commit 199f968f1484a14024d0d467211ffc2faf193eb4 upstream.

Arul, Mateusz, Imcarneiro91, and Aman reported a regression caused by
07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map").  On the
Lenovo Legion 9i laptop, that commit removes the ECAM area from E820, which
means the early E820 validation fails, which means we don't enable ECAM in
the "early MCFG" path.

The static MCFG table describes ECAM without depending on the ACPI
interpreter.  Many Legion 9i ACPI methods rely on that, so they fail when
PCI config access isn't available, resulting in the embedded controller,
PS/2, audio, trackpad, and battery devices not being detected.  The _OSC
method also fails, so Linux can't take control of the PCIe hotplug, PME,
and AER features:

  # pci_mmcfg_early_init()

  PCI: ECAM [mem 0xc0000000-0xce0fffff] (base 0xc0000000) for domain 0000 [bus 00-e0]
  PCI: not using ECAM ([mem 0xc0000000-0xce0fffff] not reserved)

  ACPI Error: AE_ERROR, Returned by Handler for [PCI_Config] (20230628/evregion-300)
  ACPI: Interpreter enabled
  ACPI: Ignoring error and continuing table load
  ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PC00.RP01._SB.PC00], AE_NOT_FOUND (20230628/dswload2-162)
  ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20230628/psobject-220)
  ACPI: Skipping parse of AML opcode: OpcodeName unavailable (0x0010)
  ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PC00.RP01._SB.PC00], AE_NOT_FOUND (20230628/dswload2-162)
  ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20230628/psobject-220)
  ...
  ACPI Error: Aborting method \_SB.PC00._OSC due to previous error (AE_NOT_FOUND) (20230628/psparse-529)
  acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_NOT_FOUND)

  # pci_mmcfg_late_init()

  PCI: ECAM [mem 0xc0000000-0xce0fffff] (base 0xc0000000) for domain 0000 [bus 00-e0]
  PCI: [Firmware Info]: ECAM [mem 0xc0000000-0xce0fffff] not reserved in ACPI motherboard resources
  PCI: ECAM [mem 0xc0000000-0xce0fffff] is EfiMemoryMappedIO; assuming valid
  PCI: ECAM [mem 0xc0000000-0xce0fffff] reserved to work around lack of ACPI motherboard _CRS

Per PCI Firmware r3.3, sec 4.1.2, ECAM space must be reserved by a PNP0C02
resource, but there's no requirement to mention it in E820, so we shouldn't
look at E820 to validate the ECAM space described by MCFG.

In 2006, 946f2ee5c731 ("[PATCH] i386/x86-64: Check that MCFG points to an
e820 reserved area") added a sanity check of E820 to work around buggy MCFG
tables, but that over-aggressive validation causes failures like this one.

Keep the E820 validation check for machines older than 2016, an arbitrary
ten years after 946f2ee5c731, so machines that depend on it don't break.

Skip the early E820 check for 2016 and newer BIOSes since there's no
requirement to describe ECAM in E820.

Link: https://lore.kernel.org/r/20240417204012.215030-2-helgaas@kernel.org
Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map")
Reported-by: Mateusz Kaduk 
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218444
Signed-off-by: Bjorn Helgaas 
Tested-by: Mateusz Kaduk 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Hans de Goede 
Reviewed-by: Kuppuswamy Sathyanarayanan 
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

x86/topology: Handle bogus ACPI tables correctly

2024-06-12T09:39:57+00:00

commit 9d22c96316ac59ed38e80920c698fed38717b91b upstream.

The ACPI specification clearly states how the processors should be
enumerated in the MADT:

 "To ensure that the boot processor is supported post initialization,
  two guidelines should be followed. The first is that OSPM should
  initialize processors in the order that they appear in the MADT. The
  second is that platform firmware should list the boot processor as the
  first processor entry in the MADT.
  ...
  Failure of OSPM implementations and platform firmware to abide by
  these guidelines can result in both unpredictable and non optimal
  platform operation."

The kernel relies on that ordering to detect the real BSP on crash kernels
which is important to avoid sending a INIT IPI to it as that would cause a
full machine reset.

On a Dell XPS 16 9640 the BIOS ignores this rule and enumerates the CPUs in
the wrong order. As a consequence the kernel falsely detects a crash kernel
and disables the corresponding CPU.

Prevent this by checking the IA32_APICBASE MSR for the BSP bit on the boot
CPU. If that bit is set, then the MADT based BSP detection can be safely
ignored. If the kernel detects a mismatch between the BSP bit and the first
enumerated MADT entry then emit a firmware bug message.

This obviously also has to be taken into account when the boot APIC ID and
the first enumerated APIC ID match. If the boot CPU does not have the BSP
bit set in the APICBASE MSR then there is no way for the boot CPU to
determine which of the CPUs is the real BSP. Sending an INIT to the real
BSP would reset the machine so the only sane way to deal with that is to
limit the number of CPUs to one and emit a corresponding warning message.

Fixes: 5c5682b9f87a ("x86/cpu: Detect real BSP on crash kernels")
Reported-by: Carsten Tolkmit 
Signed-off-by: Thomas Gleixner 
Tested-by: Carsten Tolkmit 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87le48jycb.ffs@tglx
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218837
Signed-off-by: Greg Kroah-Hartman

riscv: prevent pt_regs corruption for secondary idle threads

2024-06-12T09:39:57+00:00

[ Upstream commit a638b0461b58aa3205cd9d5f14d6f703d795b4af ]

Top of the kernel thread stack should be reserved for pt_regs. However
this is not the case for the idle threads of the secondary boot harts.
Their stacks overlap with their pt_regs, so both may get corrupted.

Similar issue has been fixed for the primary hart, see c7cdd96eca28
("riscv: prevent stack corruption by reserving task_pt_regs(p) early").
However that fix was not propagated to the secondary harts. The problem
has been noticed in some CPU hotplug tests with V enabled. The function
smp_callin stored several registers on stack, corrupting top of pt_regs
structure including status field. As a result, kernel attempted to save
or restore inexistent V context.

Fixes: 9a2451f18663 ("RISC-V: Avoid using per cpu array for ordered booting")
Fixes: 2875fe056156 ("RISC-V: Add cpu_ops and modify default booting method")
Signed-off-by: Sergey Matyukevich 
Reviewed-by: Alexandre Ghiti 
Link: https://lore.kernel.org/r/20240523084327.2013211-1-geomatsi@gmail.com
Signed-off-by: Palmer Dabbelt 
Signed-off-by: Sasha Levin

powerpc/pseries/lparcfg: drop error message from guest name lookup

2024-06-12T09:39:56+00:00

[ Upstream commit 12870ae3818e39ea65bf710f645972277b634f72 ]

It's not an error or exceptional situation when the hosting
environment does not expose a name for the LP/guest via RTAS or the
device tree. This happens with qemu when run without the '-name'
option. The message also lacks a newline. Remove it.

Signed-off-by: Nathan Lynch 
Fixes: eddaa9a40275 ("powerpc/pseries: read the lpar name from the firmware")
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/20240524-lparcfg-updates-v2-1-62e2e9d28724@linux.ibm.com
Signed-off-by: Sasha Levin

riscv: stacktrace: fixed walk_stackframe()

2024-06-12T09:39:42+00:00

[ Upstream commit a2a4d4a6a0bf5eba66f8b0b32502cc20d82715a0 ]

If the load access fault occures in a leaf function (with
CONFIG_FRAME_POINTER=y), when wrong stack trace will be displayed:

[] regmap_mmio_read32le+0xe/0x1c
---[ end trace 0000000000000000 ]---

Registers dump:
    ra     0xffffffff80485758 
    sp     0xffffffc80200b9a0
    fp     0xffffffc80200b9b0
    pc     0xffffffff804853ba 

Stack dump:
    0xffffffc80200b9a0:  0xffffffc80200b9e0  0xffffffc80200b9e0
    0xffffffc80200b9b0:  0xffffffff8116d7e8  0x0000000000000100
    0xffffffc80200b9c0:  0xffffffd8055b9400  0xffffffd8055b9400
    0xffffffc80200b9d0:  0xffffffc80200b9f0  0xffffffff8047c526
    0xffffffc80200b9e0:  0xffffffc80200ba30  0xffffffff8047fe9a

The assembler dump of the function preambula:
    add     sp,sp,-16
    sd      s0,8(sp)
    add     s0,sp,16

In the fist stack frame, where ra is not stored on the stack we can
observe:

        0(sp)                  8(sp)
        .---------------------------------------------.
    sp->|       frame->fp      | frame->ra (saved fp) |
        |---------------------------------------------|
    fp->|         ....         |         ....         |
        |---------------------------------------------|
        |                      |                      |

and in the code check is performed:
	if (regs && (regs->epc == pc) && (frame->fp & 0x7))

I see no reason to check frame->fp value at all, because it is can be
uninitialized value on the stack. A better way is to check frame->ra to
be an address on the stack. After the stacktrace shows as expect:

[] regmap_mmio_read32le+0xe/0x1c
[] regmap_mmio_read+0x24/0x52
[] _regmap_bus_reg_read+0x1a/0x22
[] _regmap_read+0x5c/0xea
[] _regmap_update_bits+0x76/0xc0
...
---[ end trace 0000000000000000 ]---
As pointed by Samuel Holland it is incorrect to remove check of the stackframe
entirely.

Changes since v2 [2]:
 - Add accidentally forgotten curly brace

Changes since v1 [1]:
 - Instead of just dropping frame->fp check, replace it with validation of
   frame->ra, which should be a stack address.
 - Move frame pointer validation into the separate function.

[1] https://lore.kernel.org/linux-riscv/20240426072701.6463-1-dev.mbstr@gmail.com/
[2] https://lore.kernel.org/linux-riscv/20240521131314.48895-1-dev.mbstr@gmail.com/

Fixes: f766f77a74f5 ("riscv/stacktrace: Fix stack output without ra on the stack top")
Signed-off-by: Matthew Bystrin 
Reviewed-by: Samuel Holland 
Link: https://lore.kernel.org/r/20240521191727.62012-1-dev.mbstr@gmail.com
Signed-off-by: Palmer Dabbelt 
Signed-off-by: Sasha Levin

riscv: cpufeature: Fix extension subset checking

2024-06-12T09:39:42+00:00

[ Upstream commit e67e98ee8952c7d5ce986d1dc6f8221ab8674afa ]

This loop is supposed to check if ext->subset_ext_ids[j] is valid, rather
than if ext->subset_ext_ids[i] is valid, before setting the extension
id ext->subset_ext_ids[j] in isainfo->isa.

Signed-off-by: Charlie Jenkins 
Reviewed-by: Conor Dooley 
Reviewed-by: Alexandre Ghiti 
Reviewed-by: Andrew Jones 
Fixes: 0d8295ed975b ("riscv: add ISA extension parsing for scalar crypto")
Link: https://lore.kernel.org/r/20240502-cpufeature_fixes-v4-2-b3d1a088722d@rivosinc.com
Signed-off-by: Palmer Dabbelt 
Signed-off-by: Sasha Levin

riscv: cpufeature: Fix thead vector hwcap removal

2024-06-12T09:39:42+00:00

[ Upstream commit e482eab4d1eb31031eff2b6afb71776483101979 ]

The riscv_cpuinfo struct that contains mvendorid and marchid is not
populated until all harts are booted which happens after the DT parsing.
Use the mvendorid/marchid from the boot hart to determine if the DT
contains an invalid V.

Fixes: d82f32202e0d ("RISC-V: Ignore V from the riscv,isa DT property on older T-Head CPUs")
Signed-off-by: Charlie Jenkins 
Reviewed-by: Conor Dooley 
Reviewed-by: Guo Ren 
Link: https://lore.kernel.org/r/20240502-cpufeature_fixes-v4-1-b3d1a088722d@rivosinc.com
Signed-off-by: Palmer Dabbelt 
Signed-off-by: Sasha Levin