summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2024-02-23x86/barrier: Do not serialize MSR accesses on AMDBorislav Petkov (AMD)6-19/+32
commit 04c3024560d3a14acd18d0a51a1d0a89d29b7eb5 upstream. AMD does not have the requirement for a synchronization barrier when acccessing a certain group of MSRs. Do not incur that unnecessary penalty there. There will be a CPUID bit which explicitly states that a MFENCE is not needed. Once that bit is added to the APM, this will be extended with it. While at it, move to processor.h to avoid include hell. Untangling that file properly is a matter for another day. Some notes on the performance aspect of why this is relevant, courtesy of Kishon VijayAbraham <Kishon.VijayAbraham@amd.com>: On a AMD Zen4 system with 96 cores, a modified ipi-bench[1] on a VM shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The ipi-bench is modified so that the IPIs are sent between two vCPUs in the same CCX. This also requires to pin the vCPU to a physical core to prevent any latencies. This simulates the use case of pinning vCPUs to the thread of a single CCX to avoid interrupt IPI latency. In order to avoid run-to-run variance (for both x2AVIC and AVIC), the below configurations are done: 1) Disable Power States in BIOS (to prevent the system from going to lower power state) 2) Run the system at fixed frequency 2500MHz (to prevent the system from increasing the frequency when the load is more) With the above configuration: *) Performance measured using ipi-bench for AVIC: Average Latency: 1124.98ns [Time to send IPI from one vCPU to another vCPU] Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously] *) Performance measured using ipi-bench for x2AVIC: Average Latency: 1172.42ns [Time to send IPI from one vCPU to another vCPU] Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously] From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is x2AVIC performance to be better or equivalent to AVIC. Upon analyzing the perf captures, it is observed significant time is spent in weak_wrmsr_fence() invoked by x2apic_send_IPI(). With the fix to skip weak_wrmsr_fence() *) Performance measured using ipi-bench for x2AVIC: Average Latency: 1117.44ns [Time to send IPI from one vCPU to another vCPU] Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously] Comparing the performance of x2AVIC with and without the fix, it can be seen the performance improves by ~4%. Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option with and without weak_wrmsr_fence() on a Zen4 system also showed significant performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores CCX or CCD and just picks random vCPU. Average throughput (10 iterations) with weak_wrmsr_fence(), Cumulative throughput: 4933374 IPI/s Average throughput (10 iterations) without weak_wrmsr_fence(), Cumulative throughput: 6355156 IPI/s [1] https://github.com/bytedance/kvm-utils/tree/master/microbenchmark/ipi-bench Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230622095212.20940-1-bp@alien8.de Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/efistub: Use 1:1 file:memory mapping for PE/COFF .compat sectionArd Biesheuvel2-11/+9
commit 1ad55cecf22f05f1c884adf63cc09d3c3e609ebf upstream. The .compat section is a dummy PE section that contains the address of the 32-bit entrypoint of the 64-bit kernel image if it is bootable from 32-bit firmware (i.e., CONFIG_EFI_MIXED=y) This section is only 8 bytes in size and is only referenced from the loader, and so it is placed at the end of the memory view of the image, to avoid the need for padding it to 4k, which is required for sections appearing in the middle of the image. Unfortunately, this violates the PE/COFF spec, and even if most EFI loaders will work correctly (including the Tianocore reference implementation), PE loaders do exist that reject such images, on the basis that both the file and memory views of the file contents should be described by the section headers in a monotonically increasing manner without leaving any gaps. So reorganize the sections to avoid this issue. This results in a slight padding overhead (< 4k) which can be avoided if desired by disabling CONFIG_EFI_MIXED (which is only needed in rare cases these days) Fixes: 3e3eabe26dc8 ("x86/boot: Increase section and file alignment to 4k/512") Reported-by: Mike Beaton <mjsbeaton@gmail.com> Link: https://lkml.kernel.org/r/CAHzAAWQ6srV6LVNdmfbJhOwhBw5ZzxxZZ07aHt9oKkfYAdvuQQ%40mail.gmail.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Increase section and file alignment to 4k/512Ard Biesheuvel4-125/+51
commit 3e3eabe26dc88692d34cf76ca0e0dd331481cc15 upstream. Align x86 with other EFI architectures, and increase the section alignment to the EFI page size (4k), so that firmware is able to honour the section permission attributes and map code read-only and data non-executable. There are a number of requirements that have to be taken into account: - the sign tools get cranky when there are gaps between sections in the file view of the image - the virtual offset of each section must be aligned to the image's section alignment - the file offset *and size* of each section must be aligned to the image's file alignment - the image size must be aligned to the section alignment - each section's virtual offset must be greater than or equal to the size of the headers. In order to meet all these requirements, while avoiding the need for lots of padding to accommodate the .compat section, the latter is placed at an arbitrary offset towards the end of the image, but aligned to the minimum file alignment (512 bytes). The space before the .text section is therefore distributed between the PE header, the .setup section and the .compat section, leaving no gaps in the file coverage, making the signing tools happy. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-18-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Split off PE/COFF .data sectionArd Biesheuvel2-5/+16
commit 34951f3c28bdf6481d949a20413b2ce7693687b2 upstream. Describe the code and data of the decompressor binary using separate .text and .data PE/COFF sections, so that we will be able to map them using restricted permissions once we increase the section and file alignment sufficiently. This avoids the need for memory mappings that are writable and executable at the same time, which is something that is best avoided for security reasons. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-17-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Drop PE/COFF .reloc sectionArd Biesheuvel3-51/+7
commit fa5750521e0a4efbc1af05223da9c4bbd6c21c83 upstream. Ancient buggy EFI loaders may have required a .reloc section to be present at some point in time, but this has not been true for a long time so the .reloc section can just be dropped. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-16-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Construct PE/COFF .text section from assemblerArd Biesheuvel2-62/+7
commit efa089e63b56bdc5eca754b995cb039dd7a5457e upstream. Now that the size of the setup block is visible to the assembler, it is possible to populate the PE/COFF header fields from the asm code directly, instead of poking the values into the binary using the build tool. This will make it easier to reorganize the section layout without having to tweak the build tool in lockstep. This change has no impact on the resulting bzImage binary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-15-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Derive file size from _edata symbolArd Biesheuvel4-25/+12
commit aeb92067f6ae994b541d7f9752fe54ed3d108bcc upstream. Tweak the linker script so that the value of _edata represents the decompressor binary's file size rounded up to the appropriate alignment. This removes the need to calculate it in the build tool, and will make it easier to refer to the file size from the header directly in subsequent changes to the PE header layout. While adding _edata to the sed regex that parses the compressed vmlinux's symbol list, tweak the regex a bit for conciseness. This change has no impact on the resulting bzImage binary when configured with CONFIG_EFI_STUB=y. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-14-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Define setup size in linker scriptArd Biesheuvel3-7/+5
commit 093ab258e3fb1d1d3afdfd4a69403d44ce90e360 upstream. The setup block contains the real mode startup code that is used when booting from a legacy BIOS, along with the boot_params/setup_data that is used by legacy x86 bootloaders to pass the command line and initial ramdisk parameters, among other things. The setup block also contains the PE/COFF header of the entire combined image, which includes the compressed kernel image, the decompressor and the EFI stub. This PE header describes the layout of the executable image in memory, and currently, the fact that the setup block precedes it makes it rather fiddly to get the right values into the right place in the final image. Let's make things a bit easier by defining the setup_size in the linker script so it can be referenced from the asm code directly, rather than having to rely on the build tool to calculate it. For the time being, add 64 bytes of fixed padding for the .reloc and .compat sections - this will be removed in a subsequent patch after the PE/COFF header has been reorganized. This change has no impact on the resulting bzImage binary when configured with CONFIG_EFI_MIXED=y. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-13-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Set EFI handover offset directly in header asmArd Biesheuvel2-25/+17
commit eac956345f99dda3d68f4ae6cf7b494105e54780 upstream. The offsets of the EFI handover entrypoints are available to the assembler when constructing the header, so there is no need to set them from the build tool afterwards. This change has no impact on the resulting bzImage binary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-12-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Grab kernel_info offset from zoffset header directlyArd Biesheuvel2-5/+1
commit 2e765c02dcbfc2a8a4527c621a84b9502f6b9bd2 upstream. Instead of parsing zoffset.h and poking the kernel_info offset value into the header from the build tool, just grab the value directly in the asm file that describes this header. This change has no impact on the resulting bzImage binary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-11-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Drop references to startup_64Ard Biesheuvel2-4/+1
commit b618d31f112bea3d2daea19190d63e567f32a4db upstream. The x86 boot image generation tool assign a default value to startup_64 and subsequently parses the actual value from zoffset.h but it never actually uses the value anywhere. So remove this code. This change has no impact on the resulting bzImage binary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230912090051.4014114-25-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Drop redundant code setting the root deviceArd Biesheuvel2-8/+1
commit 7448e8e5d15a3c4df649bf6d6d460f78396f7e1e upstream. The root device defaults to 0,0 and is no longer configurable at build time [0], so there is no need for the build tool to ever write to this field. [0] 079f85e624189292 ("x86, build: Do not set the root_dev field in bzImage") This change has no impact on the resulting bzImage binary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230912090051.4014114-23-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Omit compression buffer from PE/COFF image memory footprintArd Biesheuvel2-48/+8
commit 8eace5b3555606e684739bef5bcdfcfe68235257 upstream. Now that the EFI stub decompresses the kernel and hands over to the decompressed image directly, there is no longer a need to provide a decompression buffer as part of the .BSS allocation of the PE/COFF image. It also means the PE/COFF image can be loaded anywhere in memory, and setting the preferred image base is unnecessary. So drop the handling of this from the header and from the build tool. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230912090051.4014114-22-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/boot: Remove the 'bugger off' messageArd Biesheuvel2-52/+4
commit 768171d7ebbce005210e1cf8456f043304805c15 upstream. Ancient (pre-2003) x86 kernels could boot from a floppy disk straight from the BIOS, using a small real mode boot stub at the start of the image where the BIOS would expect the boot record (or boot block) to appear. Due to its limitations (kernel size < 1 MiB, no support for IDE, USB or El Torito floppy emulation), this support was dropped, and a Linux aware bootloader is now always required to boot the kernel from a legacy BIOS. To smoothen this transition, the boot stub was not removed entirely, but replaced with one that just prints an error message telling the user to install a bootloader. As it is unlikely that anyone doing direct floppy boot with such an ancient kernel is going to upgrade to v6.5+ and expect that this boot method still works, printing this message is kind of pointless, and so it should be possible to remove the logic that emits it. Let's free up this space so it can be used to expand the PE header in a subsequent patch. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com> Link: https://lore.kernel.org/r/20230912090051.4014114-21-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/efi: Drop alignment flags from PE section headersArd Biesheuvel1-8/+4
commit bfab35f552ab3dd6d017165bf9de1d1d20f198cc upstream. The section header flags for alignment are documented in the PE/COFF spec as being applicable to PE object files only, not to PE executables such as the Linux bzImage, so let's drop them from the PE header. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230912090051.4014114-20-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/efi: Drop EFI stub .bss from .data sectionArd Biesheuvel1-1/+0
commit 5f51c5d0e905608ba7be126737f7c84a793ae1aa upstream. Now that the EFI stub always zero inits its BSS section upon entry, there is no longer a need to place the BSS symbols carried by the stub into the .data section. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230912090051.4014114-18-ardb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23parisc: Fix random data corruption from exception handlerHelge Deller8-71/+108
commit 8b1d72395635af45410b66cc4c4ab37a12c4a831 upstream. The current exception handler implementation, which assists when accessing user space memory, may exhibit random data corruption if the compiler decides to use a different register than the specified register %r29 (defined in ASM_EXCEPTIONTABLE_REG) for the error code. If the compiler choose another register, the fault handler will nevertheless store -EFAULT into %r29 and thus trash whatever this register is used for. Looking at the assembly I found that this happens sometimes in emulate_ldd(). To solve the issue, the easiest solution would be if it somehow is possible to tell the fault handler which register is used to hold the error code. Using %0 or %1 in the inline assembly is not posssible as it will show up as e.g. %r29 (with the "%r" prefix), which the GNU assembler can not convert to an integer. This patch takes another, better and more flexible approach: We extend the __ex_table (which is out of the execution path) by one 32-word. In this word we tell the compiler to insert the assembler instruction "or %r0,%r0,%reg", where %reg references the register which the compiler choosed for the error return code. In case of an access failure, the fault handler finds the __ex_table entry and can examine the opcode. The used register is encoded in the lowest 5 bits, and the fault handler can then store -EFAULT into this register. Since we extend the __ex_table to 3 words we can't use the BUILDTIME_TABLE_SORT config option any longer. Signed-off-by: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # v6.0+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23LoongArch: Fix earlycon parameter if KASAN enabledHuacai Chen1-0/+3
commit 639420e9f6cd9ca074732b17ac450d2518d5937f upstream. The earlycon parameter is based on fixmap, and fixmap addresses are not supposed to be shadowed by KASAN. So return the kasan_early_shadow_page in kasan_mem_to_shadow() if the input address is above FIXADDR_START. Otherwise earlycon cannot work after kasan_init(). Cc: stable@vger.kernel.org Fixes: 5aa4ac64e6add3e ("LoongArch: Add KASAN (Kernel Address Sanitizer) support") Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23arm64: Subscribe Microsoft Azure Cobalt 100 to ARM Neoverse N2 errataEaswar Hariharan2-0/+7
commit fb091ff394792c018527b3211bbdfae93ea4ac02 upstream. Add the MIDR value of Microsoft Azure Cobalt 100, which is a Microsoft implemented CPU based on r0p0 of the ARM Neoverse N2 CPU, and therefore suffers from all the same errata. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240214175522.2457857-1-eahariha@linux.microsoft.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23arm64/signal: Don't assume that TIF_SVE means we saved SVE stateMark Brown2-3/+3
commit 61da7c8e2a602f66be578cbbcebe8638c10e0f48 upstream. When we are in a syscall we will only save the FPSIMD subset even though the task still has access to the full register set, and on context switch we will only remove TIF_SVE when loading the register state. This means that the signal handling code should not assume that TIF_SVE means that the register state is stored in SVE format, it should instead check the format that was recorded during save. Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240130-arm64-sve-signal-regs-v2-1-9fc6f9502782@kernel.org Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23KVM: arm64: Fix circular locking dependencySebastian Ene1-10/+17
commit 10c02aad111df02088d1a81792a709f6a7eca6cc upstream. The rule inside kvm enforces that the vcpu->mutex is taken *inside* kvm->lock. The rule is violated by the pkvm_create_hyp_vm() which acquires the kvm->lock while already holding the vcpu->mutex lock from kvm_vcpu_ioctl(). Avoid the circular locking dependency altogether by protecting the hyp vm handle with the config_lock, much like we already do for other forms of VM-scoped data. Signed-off-by: Sebastian Ene <sebastianene@google.com> Cc: stable@vger.kernel.org Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240124091027.1477174-2-sebastianene@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/mm/ident_map: Use gbpages only where full GB page should be mapped.Steve Wahl1-5/+18
commit d794734c9bbfe22f86686dc2909c25f5ffe1a572 upstream. When ident_pud_init() uses only gbpages to create identity maps, large ranges of addresses not actually requested can be included in the resulting table; a 4K request will map a full GB. On UV systems, this ends up including regions that will cause hardware to halt the system if accessed (these are marked "reserved" by BIOS). Even processor speculation into these regions is enough to trigger the system halt. Only use gbpages when map creation requests include the full GB page of space. Fall back to using smaller 2M pages when only portions of a GB page are included in the request. No attempt is made to coalesce mapping requests. If a request requires a map entry at the 2M (pmd) level, subsequent mapping requests within the same 1G region will also be at the pmd level, even if adjacent or overlapping such requests could have been combined to map a full gbpage. Existing usage starts with larger regions and then adds smaller regions, so this should not have any great consequence. [ dhansen: fix up comment formatting, simplifty changelog ] Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrlMingwei Zhang1-1/+1
commit 05519c86d6997cfb9bb6c82ce1595d1015b718dc upstream. Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl when reprogramming fixed counters, as truncating the value results in KVM thinking fixed counter 2 is already disabled (the bug also affects fixed counters 3+, but KVM doesn't yet support those). As a result, if the guest disables fixed counter 2, KVM will get a false negative and fail to reprogram/disable emulation of the counter, which can leads to incorrect counts and spurious PMIs in the guest. Fixes: 76d287b2342e ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()") Cc: stable@vger.kernel.org Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com [sean: rewrite changelog to call out the effects of the bug] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpuPrasad Pandit1-1/+2
commit 6231c9e1a9f35b535c66709aa8a6eda40dbc4132 upstream. kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI' request for a vcpu even when its 'events->nmi.pending' is zero. Ex: qemu_thread_start kvm_vcpu_thread_fn qemu_wait_io_event qemu_wait_io_event_common process_queued_cpu_work do_kvm_cpu_synchronize_post_init/_reset kvm_arch_put_registers kvm_put_vcpu_events (cpu, level=[2|3]) This leads vCPU threads in QEMU to constantly acquire & release the global mutex lock, delaying the guest boot due to lock contention. Add check to make KVM_REQ_NMI request only if vcpu has NMI pending. Fixes: bdedff263132 ("KVM: x86: Route pending NMIs from userspace through process_nmi()") Cc: stable@vger.kernel.org Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Link: https://lore.kernel.org/r/20240103075343.549293-1-ppandit@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/fpu: Stop relying on userspace for info to fault in xsave bufferAndrei Vagin1-8/+5
commit d877550eaf2dc9090d782864c96939397a3c6835 upstream. Before this change, the expected size of the user space buffer was taken from fx_sw->xstate_size. fx_sw->xstate_size can be changed from user-space, so it is possible construct a sigreturn frame where: * fx_sw->xstate_size is smaller than the size required by valid bits in fx_sw->xfeatures. * user-space unmaps parts of the sigrame fpu buffer so that not all of the buffer required by xrstor is accessible. In this case, xrstor tries to restore and accesses the unmapped area which results in a fault. But fault_in_readable succeeds because buf + fx_sw->xstate_size is within the still mapped area, so it goes back and tries xrstor again. It will spin in this loop forever. Instead, fault in the maximum size which can be touched by XRSTOR (taken from fpstate->user_size). [ dhansen: tweak subject / changelog ] Fixes: fcb3635f5018 ("x86/fpu/signal: Handle #PF in the direct restore path") Reported-by: Konstantin Bogomolov <bogomolov@google.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrei Vagin <avagin@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20240130063603.3392627-1-avagin%40google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6Aleksander Mazur1-1/+1
commit f6a1892585cd19e63c4ef2334e26cd536d5b678d upstream. The kernel built with MCRUSOE is unbootable on Transmeta Crusoe. It shows the following error message: This kernel requires an i686 CPU, but only detected an i586 CPU. Unable to boot - please use a kernel appropriate for your CPU. Remove MCRUSOE from the condition introduced in commit in Fixes, effectively changing X86_MINIMUM_CPU_FAMILY back to 5 on that machine, which matches the CPU family given by CPUID. [ bp: Massage commit message. ] Fixes: 25d76ac88821 ("x86/Kconfig: Explicitly enumerate i686-class CPUs in Kconfig") Signed-off-by: Aleksander Mazur <deweloper@wp.pl> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20240123134309.1117782-1-deweloper@wp.pl Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23powerpc/pseries: fix accuracy of stolen timeShrikanth Hegde1-2/+6
commit cbecc9fcbbec60136b0180ba0609c829afed5c81 upstream. powerVM hypervisor updates the VPA fields with stolen time data. It currently reports enqueue_dispatch_tb and ready_enqueue_tb for this purpose. In linux these two fields are used to report the stolen time. The VPA fields are updated at the TB frequency. On powerPC its mostly set at 512Mhz. Hence this needs a conversion to ns when reporting it back as rest of the kernel timings are in ns. This conversion is already handled in tb_to_ns function. So use that function to report accurate stolen time. Observed this issue and used an Capped Shared Processor LPAR(SPLPAR) to simplify the experiments. In all these cases, 100% VP Load is run using stress-ng workload. Values of stolen time is in percentages as reported by mpstat. With the patch values are close to expected. 6.8.rc1 +Patch 12EC/12VP 0.0 0.0 12EC/24VP 25.7 50.2 12EC/36VP 37.3 69.2 12EC/48VP 38.5 78.3 Fixes: 0e8a63132800 ("powerpc/pseries: Implement CONFIG_PARAVIRT_TIME_ACCOUNTING") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240213052635.231597-1-sshegde@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-EDavid Engraf1-1/+2
commit eb6d871f4ba49ac8d0537e051fe983a3a4027f61 upstream. Commit e320a76db4b0 ("powerpc/cputable: Split cpu_specs[] out of cputable.h") moved the cpu_specs to separate header files. Previously PPC_FEATURE_BOOKE was enabled by CONFIG_PPC_BOOK3E_64. The definition in cpu_specs_e500mc.h for PPC64 no longer enables PPC_FEATURE_BOOKE. This breaks user space reading the ELF hwcaps and expect PPC_FEATURE_BOOKE. Debugging an application with gdb is no longer working on e5500/e6500 because the 64-bit detection relies on PPC_FEATURE_BOOKE for Book-E. Fixes: e320a76db4b0 ("powerpc/cputable: Split cpu_specs[] out of cputable.h") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: David Engraf <david.engraf@sysgo.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240207092758.1058893-1-david.engraf@sysgo.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23powerpc/64: Set task pt_regs->link to the LR value on scv entryNaveen N Rao1-2/+2
commit aad98efd0b121f63a2e1c221dcb4d4850128c697 upstream. Nysal reported that userspace backtraces are missing in offcputime bcc tool. As an example: $ sudo ./bcc/tools/offcputime.py -uU Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to end. ^C write - python (9107) 8 write - sudo (9105) 9 mmap - python (9107) 16 clock_nanosleep - multipathd (697) 3001604 The offcputime bcc tool attaches a bpf program to a kprobe on finish_task_switch(), which is usually hit on a syscall from userspace. With the switch to system call vectored, we started setting pt_regs->link to zero. This is because system call vectored behaves like a function call with LR pointing to the system call return address, and with no modification to SRR0/SRR1. The LR value does indicate our next instruction, so it is being saved as pt_regs->nip, and pt_regs->link is being set to zero. This is not a problem by itself, but BPF uses perf callchain infrastructure for capturing stack traces, and that stores LR as the second entry in the stack trace. perf has code to cope with the second entry being zero, and skips over it. However, generic userspace unwinders assume that a zero entry indicates end of the stack trace, resulting in a truncated userspace stack trace. Rather than fixing all userspace unwinders to ignore/skip past the second entry, store the real LR value in pt_regs->link so that there continues to be a valid, though duplicate entry in the stack trace. With this change: $ sudo ./bcc/tools/offcputime.py -uU Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to end. ^C write write [unknown] [unknown] [unknown] [unknown] [unknown] PyObject_VectorcallMethod [unknown] [unknown] PyObject_CallOneArg PyFile_WriteObject PyFile_WriteString [unknown] [unknown] PyObject_Vectorcall _PyEval_EvalFrameDefault PyEval_EvalCode [unknown] [unknown] [unknown] _PyRun_SimpleFileObject _PyRun_AnyFileObject Py_RunMain [unknown] Py_BytesMain [unknown] __libc_start_main - python (1293) 7 write write [unknown] sudo_ev_loop_v1 sudo_ev_dispatch_v1 [unknown] [unknown] [unknown] [unknown] __libc_start_main - sudo (1291) 7 syscall syscall bpf_open_perf_buffer_opts [unknown] [unknown] [unknown] [unknown] _PyObject_MakeTpCall PyObject_Vectorcall _PyEval_EvalFrameDefault PyEval_EvalCode [unknown] [unknown] [unknown] _PyRun_SimpleFileObject _PyRun_AnyFileObject Py_RunMain [unknown] Py_BytesMain [unknown] __libc_start_main - python (1293) 11 clock_nanosleep clock_nanosleep nanosleep sleep [unknown] [unknown] __clone - multipathd (698) 3001661 Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv instructions") Cc: stable@vger.kernel.org Reported-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240202154316.395276-1-naveen@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23um: Fix adding '-no-pie' for clangNathan Chancellor1-1/+3
commit 846cfbeed09b45d985079a9173cf390cc053715b upstream. The kernel builds with -fno-PIE, so commit 883354afbc10 ("um: link vmlinux with -no-pie") added the compiler linker flag '-no-pie' via cc-option because '-no-pie' was only supported in GCC 6.1.0 and newer. While this works for GCC, this does not work for clang because cc-option uses '-c', which stops the pipeline right before linking, so '-no-pie' is unconsumed and clang warns, causing cc-option to fail just as it would if the option was entirely unsupported: $ clang -Werror -no-pie -c -o /dev/null -x c /dev/null clang-16: error: argument unused during compilation: '-no-pie' [-Werror,-Wunused-command-line-argument] A recent version of clang exposes this because it generates a relocation under '-mcmodel=large' that is not supported in PIE mode: /usr/sbin/ld: init/main.o: relocation R_X86_64_32 against symbol `saved_command_line' can not be used when making a PIE object; recompile with -fPIE /usr/sbin/ld: failed to set dynamic section sizes: bad value clang: error: linker command failed with exit code 1 (use -v to see invocation) Remove the cc-option check altogether. It is wasteful to invoke the compiler to check for '-no-pie' because only one supported compiler version does not support it, GCC 5.x (as it is supported with the minimum version of clang and GCC 6.1.0+). Use a combination of the gcc-min-version macro and CONFIG_CC_IS_CLANG to unconditionally add '-no-pie' with CONFIG_LD_SCRIPT_DYN=y, so that it is enabled with all compilers that support this. Furthermore, using gcc-min-version can help turn this back into LINK-$(CONFIG_LD_SCRIPT_DYN) += -no-pie when the minimum version of GCC is bumped past 6.1.0. Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/1982 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23parisc: BTLB: Fix crash when setting up BTLB at CPU bringupHelge Deller1-1/+1
commit 913b9d443a0180cf0de3548f1ab3149378998486 upstream. When using hotplug and bringing up a 32-bit CPU, ask the firmware about the BTLB information to set up the static (block) TLB entries. For that write access to the static btlb_info struct is needed, but since it is marked __ro_after_init the kernel segfaults with missing write permissions. Fix the crash by dropping the __ro_after_init annotation. Fixes: e5ef93d02d6c ("parisc: BTLB: Initialize BTLB tables at CPU startup") Signed-off-by: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # v6.6+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add"Michael Ellerman3-23/+5
commit 1fba2bf8e9d5a27b7394856181b6200de7260b79 upstream. This reverts commit ed8b94f6e0acd652ce69bd69d678a0c769172df8. Gaurav reported that there are still problems with the patch and it should be reverted pending a fuller fix. Link: https://lore.kernel.org/all/4f6fc1ac-7a76-4447-9d0e-f55c0be373f8@linux.ibm.com/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23powerpc/kasan: Limit KASAN thread size increase to 32KBMichael Ellerman1-1/+1
[ Upstream commit f1acb109505d983779bbb7e20a1ee6244d2b5736 ] KASAN is seen to increase stack usage, to the point that it was reported to lead to stack overflow on some 32-bit machines (see link). To avoid overflows the stack size was doubled for KASAN builds in commit 3e8635fb2e07 ("powerpc/kasan: Force thread size increase with KASAN"). However with a 32KB stack size to begin with, the doubling leads to a 64KB stack, which causes build errors: arch/powerpc/kernel/switch.S:249: Error: operand out of range (0x000000000000fe50 is not between 0xffffffffffff8000 and 0x0000000000007fff) Although the asm could be reworked, in practice a 32KB stack seems sufficient even for KASAN builds - the additional usage seems to be in the 2-3KB range for a 64-bit KASAN build. So only increase the stack for KASAN if the stack size is < 32KB. Fixes: 18f14afe2816 ("powerpc/64s: Increase default stack size to 32KB") Reported-by: Spoorthy <spoorthy@linux.ibm.com> Reported-by: Benjamin Gray <bgray@linux.ibm.com> Reviewed-by: Benjamin Gray <bgray@linux.ibm.com> Link: https://lore.kernel.org/linuxppc-dev/bug-207129-206035@https.bugzilla.kernel.org%2F/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240212064244.3924505-1-mpe@ellerman.id.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23powerpc/kasan: Fix addr error caused by page alignmentJiangfeng Xiao1-0/+1
[ Upstream commit 4a7aee96200ad281a5cc4cf5c7a2e2a49d2b97b0 ] In kasan_init_region, when k_start is not page aligned, at the begin of for loop, k_cur = k_start & PAGE_MASK is less than k_start, and then `va = block + k_cur - k_start` is less than block, the addr va is invalid, because the memory address space from va to block is not alloced by memblock_alloc, which will not be reserved by memblock_reserve later, it will be used by other places. As a result, memory overwriting occurs. for example: int __init __weak kasan_init_region(void *start, size_t size) { [...] /* if say block(dcd97000) k_start(feef7400) k_end(feeff3fe) */ block = memblock_alloc(k_end - k_start, PAGE_SIZE); [...] for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { /* at the begin of for loop * block(dcd97000) va(dcd96c00) k_cur(feef7000) k_start(feef7400) * va(dcd96c00) is less than block(dcd97000), va is invalid */ void *va = block + k_cur - k_start; [...] } [...] } Therefore, page alignment is performed on k_start before memblock_alloc() to ensure the validity of the VA address. Fixes: 663c0c9496a6 ("powerpc/kasan: Fix shadow area set up for modules.") Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/1705974359-43790-1-git-send-email-xiaojiangfeng@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23powerpc/6xx: set High BAT Enable flag on G2_LE coresMatthias Schiffer2-1/+21
[ Upstream commit a038a3ff8c6582404834852c043dadc73a5b68b4 ] MMU_FTR_USE_HIGH_BATS is set for G2_LE cores and derivatives like e300cX, but the high BATs need to be enabled in HID2 to work. Add register definitions and add the needed setup to __setup_cpu_603. This fixes boot on CPUs like the MPC5200B with STRICT_KERNEL_RWX enabled on systems where the flag has not been set by the bootloader already. Fixes: e4d6654ebe6e ("powerpc/mm/32s: rework mmu_mapin_ram()") Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240124103838.43675-1-matthias.schiffer@ew.tq-group.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23powerpc/pseries/iommu: Fix iommu initialisation during DLPAR addGaurav Batra3-5/+23
[ Upstream commit ed8b94f6e0acd652ce69bd69d678a0c769172df8 ] When a PCI device is dynamically added, the kernel oopses with a NULL pointer dereference: BUG: Kernel NULL pointer dereference on read at 0x00000030 Faulting instruction address: 0xc0000000006bbe5c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries NIP: c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8 REGS: c00000009924f240 TRAP: 0300 Not tainted (6.7.0-203405+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040006 CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0 ... NIP sysfs_add_link_to_group+0x34/0x94 LR iommu_device_link+0x5c/0x118 Call Trace: iommu_init_device+0x26c/0x318 (unreliable) iommu_device_link+0x5c/0x118 iommu_init_device+0xa8/0x318 iommu_probe_device+0xc0/0x134 iommu_bus_notifier+0x44/0x104 notifier_call_chain+0xb8/0x19c blocking_notifier_call_chain+0x64/0x98 bus_notify+0x50/0x7c device_add+0x640/0x918 pci_device_add+0x23c/0x298 of_create_pci_dev+0x400/0x884 of_scan_pci_dev+0x124/0x1b0 __of_scan_bus+0x78/0x18c pcibios_scan_phb+0x2a4/0x3b0 init_phb_dynamic+0xb8/0x110 dlpar_add_slot+0x170/0x3b8 [rpadlpar_io] add_slot_store.part.0+0xb4/0x130 [rpadlpar_io] kobj_attr_store+0x2c/0x48 sysfs_kf_write+0x64/0x78 kernfs_fop_write_iter+0x1b0/0x290 vfs_write+0x350/0x4a0 ksys_write+0x84/0x140 system_call_exception+0x124/0x330 system_call_vectored_common+0x15c/0x2ec Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains") broke DLPAR add of PCI devices. The above added iommu_device structure to pci_controller. During system boot, PCI devices are discovered and this newly added iommu_device structure is initialized by a call to iommu_device_register(). During DLPAR add of a PCI device, a new pci_controller structure is allocated but there are no calls made to iommu_device_register() interface. Fix is to register the iommu device during DLPAR add as well. Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains") Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> [mpe: Trim oops and tweak some change log wording] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240122222407.39603-1-gbatra@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23parisc: Prevent hung tasks when printing inventory on serial consoleHelge Deller1-0/+3
commit c8708d758e715c3824a73bf0cda97292b52be44d upstream. Printing the inventory on a serial console can be quite slow and thus may trigger the hung task detector (CONFIG_DETECT_HUNG_TASK=y) and possibly reboot the machine. Adding a cond_resched() prevents this. Signed-off-by: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # v6.0+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23scs: add CONFIG_MMU dependency for vfree_atomic()Samuel Holland1-0/+1
commit 6f9dc684cae638dda0570154509884ee78d0f75c upstream. The shadow call stack implementation fails to build without CONFIG_MMU: ld.lld: error: undefined symbol: vfree_atomic >>> referenced by scs.c >>> kernel/scs.o:(scs_free) in archive vmlinux.a Link: https://lkml.kernel.org/r/20240122175204.2371009-1-samuel.holland@sifive.com Fixes: a2abe7cbd8fe ("scs: switch to vmapped shadow stacks") Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23ptrace: Introduce exception_ip arch hookJiaxun Yang2-0/+9
[ Upstream commit 11ba1728be3edb6928791f4c622f154ebe228ae6 ] On architectures with de