path: root/arch/arm64/include
Age | Commit message | Author | Files | Lines
2024-06-20 | KVM: arm64: nv: Load guest hyp's ZCR into EL1 state | Oliver Upton | 1 | -0/+3
Load the guest hypervisor's ZCR_EL2 into the corresponding EL1 register when restoring SVE state, as ZCR_EL2 affects the VL in the hypervisor context.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

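A minimal sketch of the restore-path decision described above; the function name is hypothetical, while vcpu accessors such as is_hyp_ctxt(), __vcpu_sys_reg() and write_sysreg_el1() are the usual KVM/arm64 helpers. Illustrative, not the committed hunk:

	/* When the vCPU runs its hyp context, the guest hypervisor's ZCR_EL2
	 * determines the effective VL, so that value lands in the hardware
	 * EL1 register (EL2 state is folded onto EL1 under VHE/NV). */
	static void __sve_restore_guest_vl(struct kvm_vcpu *vcpu)
	{
		u64 zcr = is_hyp_ctxt(vcpu) ? __vcpu_sys_reg(vcpu, ZCR_EL2) :
					      __vcpu_sys_reg(vcpu, ZCR_EL1);

		write_sysreg_el1(zcr, SYS_ZCR);
	}
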
2024-06-20 | KVM: arm64: nv: Handle ZCR_EL2 traps | Oliver Upton | 2 | -0/+11
Unlike other SVE-related registers, ZCR_EL2 takes a sysreg trap to EL2 when HCR_EL2.NV = 1. KVM still needs to honor the guest hypervisor's trap configuration, which expects an SVE trap (i.e. ESR_EL2.EC = 0x19) when CPTR traps are enabled for the vCPU's current context. Otherwise, if the guest hypervisor has traps disabled, emulate the access by mapping the requested VL into ZCR_EL1.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

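A hedged sketch of the handler shape this implies; guest_hyp_sve_traps_enabled() follows the naming of this series, and kvm_inject_nested_sve_trap() is a placeholder for building and injecting a syndrome with EC = 0x19:

	static bool handle_zcr_el2(struct kvm_vcpu *vcpu)
	{
		if (guest_hyp_sve_traps_enabled(vcpu)) {
			/* Honor the guest hypervisor's CPTR configuration */
			kvm_inject_nested_sve_trap(vcpu);	/* placeholder */
			return false;
		}

		/* Traps disabled at vEL2: emulate the write; the requested VL
		 * is mapped onto the EL1 view of the register. */
		vcpu_write_sys_reg(vcpu,
				   vcpu_get_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu)),
				   ZCR_EL2);
		return true;
	}
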
2024-06-20 | KVM: arm64: nv: Forward SVE traps to guest hypervisor | Oliver Upton | 1 | -0/+4
Similar to FPSIMD traps, don't load SVE state if the guest hypervisor has SVE traps enabled and forward the trap instead. Note that ZCR_EL2 will require some special handling, as it takes a sysreg trap to EL2 when HCR_EL2.NV = 1.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-20 | KVM: arm64: nv: Forward FP/ASIMD traps to guest hypervisor | Jintack Lim | 1 | -0/+43
Give precedence to the guest hypervisor's trap configuration when routing an FP/ASIMD trap taken to EL2. Take advantage of the infrastructure for translating CPTR_EL2 into the VHE (i.e. EL1) format and base the trap decision solely on the VHE view of the register. The in-memory value of CPTR_EL2 will always be up to date for the guest hypervisor (more on that later), so just read it directly from memory.

Bury all of this behind a macro keyed off of the CPTR bitfield in anticipation of supporting other traps (e.g. SVE).

[maz: account for HCR_EL2.E2H when testing for TFP/FPEN, with all the hard work actually being done by Chase Conklin]
[oliver: translate nVHE->VHE format for testing traps; macro for reuse in other CPTR_EL2.xEN fields]

Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-20 | arm64: Introduce esr_brk_comment, esr_is_cfi_brk | Pierre-Clément Tosi | 1 | -0/+11
As it is already used in two places, move esr_comment() to a header for re-use, with a clearer name.

Introduce esr_is_cfi_brk() to detect kCFI BRK syndromes, currently used by early_brk64() but soon to also be used by hypervisor code.

Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-7-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

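A hedged sketch of the two helpers as described; the exact mask and immediate names in <asm/esr.h> and <asm/brk-imm.h> may differ:

	static inline unsigned long esr_brk_comment(unsigned long esr)
	{
		/* The BRK #imm16 comment sits in the low ISS bits */
		return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
	}

	static inline bool esr_is_cfi_brk(unsigned long esr)
	{
		/* kCFI failures are BRKs within a reserved immediate range */
		return ESR_ELx_EC(esr) == ESR_ELx_EC_BRK64 &&
		       (esr_brk_comment(esr) & ~CFI_BRK_IMM_MASK) == CFI_BRK_IMM_BASE;
	}
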
2024-06-20 | KVM: arm64: Fix __pkvm_init_switch_pgd call ABI | Pierre-Clément Tosi | 1 | -2/+2
Fix the mismatch between the (incorrect) C signature, C call site, and asm implementation by aligning all three on an API passing the parameters (pgd and SP) separately, instead of as a bundled struct.

Remove the now unnecessary memory accesses while the MMU is off from the asm, which simplifies the C caller (as it does not need to convert a VA struct pointer to PA) and makes the code slightly more robust by offsetting the struct fields from C and properly expressing the call to the C compiler (e.g. type checker and kCFI).

Fixes: f320bc742bc2 ("KVM: arm64: Prepare the creation of s1 mappings at EL2")
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-3-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-20 | KVM: arm64: Treat CTR_EL0 as a VM feature ID register | Sebastian Ott | 1 | -0/+4
CTR_EL0 is currently handled as an invariant register, thus guests will be presented with the host value of that register.

Add emulation for CTR_EL0 based on a per VM value. Userspace can switch off DIC and IDC bits and reduce DminLine and IminLine sizes. Naturally, ensure CTR_EL0 is trapped (HCR_EL2.TID2=1) any time that a VM's CTR_EL0 differs from hardware.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

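A sketch of the trap rule stated above; the function name is hypothetical, kvm_read_vm_id_reg() is the accessor introduced later in this series, and read_sanitised_ftr_reg() is the existing cpufeature helper:

	static void kvm_setup_ctr_el0_trap(struct kvm_vcpu *vcpu)
	{
		/* Trap CTR_EL0 reads whenever the VM's value diverges from hw */
		if (kvm_read_vm_id_reg(vcpu->kvm, SYS_CTR_EL0) !=
		    read_sanitised_ftr_reg(SYS_CTR_EL0))
			vcpu->arch.hcr_el2 |= HCR_TID2;
	}
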
2024-06-20 | KVM: arm64: unify code to prepare traps | Sebastian Ott | 2 | -32/+10
There are 2 functions to calculate traps via HCR_EL2:
 * kvm_init_sysreg() called via KVM_RUN (before the 1st run or when the pid changes)
 * vcpu_reset_hcr() called via KVM_ARM_VCPU_INIT

To unify these 2 and to support traps that are dependent on the ID register configuration, move the code from vcpu_reset_hcr() to sys_regs.c and call it via kvm_init_sysreg().

We still have to keep the non-FWB handling stuff in vcpu_reset_hcr(). Also the initialization with HCR_GUEST_FLAGS is kept there, but guarded by !vcpu_has_run_once() to ensure that previously calculated values don't get overwritten.

While at it, rename kvm_init_sysreg() to kvm_calculate_traps() to better reflect what it's doing.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-20 | KVM: arm64: nv: Use accessors for modifying ID registers | Oliver Upton | 1 | -1/+0
In the interest of abstracting away the underlying storage of feature ID registers, rework the nested code to go through the accessors instead of directly iterating the id_regs array. This means we now lose the property that ID registers unknown to the nested code get zeroed, but we really ought to be handling those explicitly going forward.

Link: https://lore.kernel.org/r/20240619174036.483943-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-20 | KVM: arm64: Add helper for writing ID regs | Oliver Upton | 1 | -1/+2
Replace the remaining usage of IDREG() with a new helper for setting the value of a feature ID register, with the benefit of cramming in some extra sanity checks.

Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-20 | KVM: arm64: Use read-only helper for reading VM ID registers | Oliver Upton | 1 | -1/+15
IDREG() expands to the storage of a particular ID reg, which can be useful for handling both reads and writes. However, outside of a select few situations, the ID registers should be considered read only.

Replace current readers with a new macro that expands to the value of the field rather than the field itself.

Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

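The gist of the change, as an illustrative sketch (names and storage layout approximate the series, not verbatim): IDREG() yields an lvalue into the VM's storage, so reads move to a value-only helper.

	#define IDREG(kvm, id)	((kvm)->arch.id_regs[IDREG_IDX(id)])	/* read/write lvalue */

	static inline u64 kvm_read_vm_id_reg(struct kvm *kvm, u32 id)
	{
		return IDREG(kvm, id);	/* value only: callers can't scribble on it */
	}
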
2024-06-19 | arm64: word-at-a-time: improve byte count calculations for LE | Linus Torvalds | 1 | -8/+3
Do the same optimization as x86-64: do __ffs() on the intermediate value that found whether there is a zero byte, before we've actually computed the final byte mask.

The logic is:

 has_zero():
	Check if the word has a zero byte in it, which indicates the end of the loop, and prepare a value to be used for the rest of the sequence. The standard LE implementation just creates a word that has the high bit set in each byte of the word that was zero.
	Example: 0xaa00bbccdd00eeff -> 0x0080000000800000

 prep_zero_mask():
	Possibly do more prep to then clean up the initial fast result from has_zero, so that it can be combined with another zero mask with a simple logical "or" to create a final mask. This is only used on big-endian machines that use a different algorithm, and is a no-op here.

 create_zero_mask():
	This is "step 1" of creating the count and the mask, and is meant for any common operations between the two. In the old implementation, this actually created the zero mask, that was then used for masking and for counting the number of bits in the mask. In the new implementation, this is a no-op.

 count_zero():
	This takes the mask bits, and counts the number of bytes before the first zero byte. In the old implementation, it counted the number of bits in the final byte mask (which was the same as the C standard "find last set bit" that uses the silly "starts at one" counting) and shifted the value down by three. In the new implementation, we know the intermediate mask isn't zero, and it just does "find first set" with the sane semantics without any off-by-one issues, and again shifts by three (which also masks off the bit offset in the zero byte itself).
	Example: 0x0080000000800000 -> 2

 zero_bytemask():
	This takes the mask bits, and turns it into an actual byte mask of the bytes preceding the first zero byte. In the old implementation, this was a no-op, because the work had already been done by create_zero_mask(). In the new implementation, this does what create_zero_mask() used to do.
	Example: 0x0080000000800000 -> 0x000000000000ffff

The difference between the old and the new implementation is that "count_zero()" ends up scheduling better because it is being done on a value that is available earlier (before the final mask). But more importantly, it can be implemented without the insane semantics of the standard bit finding helpers that have the off-by-one issue and have to special-case the zero mask situation.

On arm64, the new "count_zero()" ends up just "rbit + clz" plus the shift right that then ends up being subsumed by the "add to final length".

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

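A self-contained sketch of the LE flow matching the worked examples above; it is not the kernel's exact interface (the real helpers thread a struct of constants through, and the bits above the first zero byte can carry borrow noise, which these users never look at):

	#define ONEBYTES	0x0101010101010101ul
	#define HIGHBITS	0x8080808080808080ul

	/* High bit set in the first zero byte:
	 * 0xaa00bbccdd00eeff -> 0x0080000000800000 */
	static inline unsigned long has_zero(unsigned long w)
	{
		return (w - ONEBYTES) & ~w & HIGHBITS;
	}

	/* Bytes before the first zero byte: find-first-set, then >> 3.
	 * 0x0080000000800000 -> ctz = 23 -> 23 >> 3 = 2 */
	static inline unsigned long count_zero(unsigned long bits)
	{
		return __builtin_ctzl(bits) >> 3;
	}

	/* Mask of the bytes preceding the first zero byte:
	 * 0x0080000000800000 -> (1 << 16) - 1 = 0x000000000000ffff */
	static inline unsigned long zero_bytemask(unsigned long bits)
	{
		return (1ul << (__builtin_ctzl(bits) & ~7ul)) - 1;
	}
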
2024-06-19 | arm64: start using 'asm goto' for put_user() | Linus Torvalds | 2 | -34/+39
This generates noticeably better code since we don't need to test the error register etc, the exception just jumps to the error handling directly.

Unlike get_user(), there's no need to worry about old compilers. All supported compilers support the regular non-output 'asm goto', as pointed out by Nathan Chancellor.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

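The shape of the trick, as an illustrative single-size sketch (the real macros dispatch on access size and use the kernel's extable variants; the macro name here is made up):

	/* On fault, the exception fixup jumps straight to the label, so the
	 * fast path has no error register to test. */
	#define __put_user_u32_goto(x, ptr, err_label)			\
		asm goto("1:	str	%w0, [%1]\n"			\
			 _ASM_EXTABLE(1b, %l2)				\
			 : /* asm goto: no outputs */			\
			 : "r" (x), "r" (ptr)				\
			 : : err_label)

Used roughly as __put_user_u32_goto(val, uaddr, efault); with an efault: label that returns -EFAULT.
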
2024-06-19 | arm64: start using 'asm goto' for get_user() when available | Linus Torvalds | 1 | -23/+98
This generates noticeably better code with compilers that support it, since we don't need to test the error register etc, the exception just jumps to the error handling directly.

Note that this also marks SW_TTBR0_PAN incompatible with KCSAN support, since KCSAN wants to save and restore the user access state.

KCSAN and SW_TTBR0_PAN were probably always incompatible, but it became obvious only when implementing the unsafe user access functions. At that point the default empty user_access_save/restore() functions weren't provided by the default fallback functions.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2024-06-19 | arm64: mm: Permit PTE SW bits to change in live mappings | Ryan Roberts | 1 | -0/+1
Previously pgattr_change_is_safe() was overly-strict and complained, e.g.:

  [  116.262743] __check_safe_pte_update: unsafe attribute change: 0x0560000043768fc3 -> 0x0160000043768fc3

if it saw any SW bits change in a live PTE. There is no such restriction on SW bits in the Arm ARM.

Until now, no SW bits have been updated in live mappings via the set_ptes() route. PTE_DIRTY would be updated live, but this is handled by ptep_set_access_flags() which does not call pgattr_change_is_safe(). However, with the introduction of uffd-wp for arm64, there is core-mm code that does ptep_get(); pte_clear_uffd_wp(); set_ptes(); which triggers this false warning.

Silence this warning by masking out the SW bits during checks.

The bug isn't technically in the highlighted commit below, but that's where bisecting would likely lead as it's what made the bug user-visible.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Fixes: 5b32510af77b ("arm64/mm: Add uffd write-protect support")
Link: https://lore.kernel.org/r/20240619121859.4153966-1-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>

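A sketch of the shape of the fix, assuming the PTE software bits occupy [58:55]; hw_attr_change_is_safe() is a hypothetical stand-in for the existing attribute checks:

	static bool pgattr_change_is_safe(u64 old, u64 new)
	{
		/* SW bits are free to change in live mappings per the Arm ARM */
		const u64 sw_bits = GENMASK_ULL(58, 55);

		if (((old ^ new) & ~sw_bits) == 0)
			return true;	/* only SW bits changed: always safe */

		return hw_attr_change_is_safe(old & ~sw_bits, new & ~sw_bits);
	}
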
2024-06-19 | KVM: arm64: nv: Tag shadow S2 entries with guest's leaf S2 level | Marc Zyngier | 1 | -0/+8
Populate bits [56:55] of the leaf entry with the level provided by the guest's S2 translation. This will allow us to better scope the invalidation by remembering the mapping size.

Of course, this assumes that the guest will issue an invalidation with an address that falls into the same leaf. If the guest doesn't, we'll over-invalidate.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

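An illustrative encode/decode of the guest's leaf level in the shadow PTE's software bits [56:55]; the mask and helper names are placeholders, not the series' exact ones:

	#define KVM_NV_GUEST_MAP_SZ	GENMASK_ULL(56, 55)

	static inline u64 kvm_encode_nested_level(u64 pte, u64 level)
	{
		return pte | FIELD_PREP(KVM_NV_GUEST_MAP_SZ, level);
	}

	static inline u64 kvm_decode_nested_level(u64 pte)
	{
		return FIELD_GET(KVM_NV_GUEST_MAP_SZ, pte);
	}
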
2024-06-19 | KVM: arm64: nv: Handle FEAT_TTL hinted TLB operations | Marc Zyngier | 1 | -0/+2
Support guest-provided information to size the range of required invalidation. This helps with reducing over-invalidation, provided that the guest actually provides accurate information.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-19 | KVM: arm64: nv: Handle TLB invalidation targeting L2 stage-1 | Marc Zyngier | 1 | -0/+7
While dealing with TLB invalidation targeting the guest hypervisor's own stage-1 was easy, doing the same thing for its own guests is a bit more involved.

Since such an invalidation is scoped by VMID, it needs to apply to all s2_mmu contexts that have been tagged by that VMID, irrespective of the value of VTTBR_EL2.BADDR.

So for each s2_mmu context matching that VMID, we invalidate the corresponding TLBs, each context having its own "physical" VMID.

Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-8-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

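A sketch of the VMID fan-out described above; the nested_mmus array and kvm_s2_mmu_valid() follow the series, while get_vmid() and kvm_tlb_flush_vmid() are placeholder names for "extract VTTBR.VMID" and "invalidate this context's TLBs":

	static void kvm_invalidate_s2_mmus_by_vmid(struct kvm *kvm, u16 vmid)
	{
		int i;

		for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
			struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];

			/* Match on the guest's VMID, ignoring VTTBR_EL2.BADDR */
			if (kvm_s2_mmu_valid(mmu) &&
			    vmid == get_vmid(mmu->tlb_vttbr))
				kvm_tlb_flush_vmid(mmu);
		}
	}
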
2024-06-19 | KVM: arm64: nv: Handle EL2 Stage-1 TLB invalidation | Marc Zyngier | 2 | -0/+72
Due to the way FEAT_NV2 suppresses traps when accessing EL2 system registers, we can't track when the guest changes its HCR_EL2.TGE setting. This means we always trap EL1 TLBIs, even if they don't affect any L2 guest.

Given that invalidating the EL2 TLBs doesn't require any messing with the shadow stage-2 page-tables, we can simply emulate the instructions early and return directly to the guest.

This is conditioned on the instruction being an EL1 one and the guest's HCR_EL2.{E2H,TGE} being {1,1} (indicating that the instruction targets the EL2 S1 TLBs), or the instruction being one of the EL2 ones (which are not ambiguous).

EL1 TLBIs issued with HCR_EL2.{E2H,TGE}={1,0} are not handled here, and cause a full exit so that they can be handled in the context of a VMID.

Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-19 | KVM: arm64: nv: Add Stage-1 EL2 invalidation primitives | Marc Zyngier | 1 | -0/+2
Provide the primitives required to handle TLB invalidation for Stage-1 EL2 TLBs, which by definition do not require messing with the Stage-2 page tables.

Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-19 | KVM: arm64: nv: Unmap/flush shadow stage 2 page tables | Christoffer Dall | 2 | -0/+5
Unmap/flush shadow stage 2 page tables for the nested VMs as well as the stage 2 page table for the guest hypervisor.

Note: A bunch of the code in mmu.c relating to MMU notifiers is currently dealt with in an extremely abrupt way, for example by clearing out an entire shadow stage-2 table. This will be handled in a more efficient way using the reverse mapping feature in a later version of the patch series.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-19 | KVM: arm64: nv: Handle shadow stage 2 page faults | Marc Zyngier | 1 | -0/+33
If we are faulting on a shadow stage 2 translation, we first walk the guest hypervisor's stage 2 page table to see if it has a mapping. If not, we inject a stage 2 page fault to the virtual EL2. Otherwise, we create a mapping in the shadow stage 2 page table.

Note that we have to deal with two IPAs when we get a shadow stage 2 page fault. One is the address we faulted on, and is in the L2 guest phys space. The other is from the guest stage-2 page table walk, and is in the L1 guest phys space. To differentiate them, we rename variables so that fault_ipa is used for the former and ipa is used for the latter.

When mapping a page in a shadow stage-2, special care must be taken not to be more permissive than the guest is.

Co-developed-by: Christoffer Dall <christoffer.dall@linaro.org>
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-19 | KVM: arm64: nv: Implement nested Stage-2 page table walk logic | Christoffer Dall | 2 | -0/+14
Based on the pseudo-code in the ARM ARM, implement a stage 2 software page table walker.

Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-19 | KVM: arm64: nv: Support multiple nested Stage-2 mmu structures | Marc Zyngier | 3 | -0/+66
Add Stage-2 mmu data structures for virtual EL2 and for nested guests. We don't yet populate shadow Stage-2 page tables, but we now have a framework for getting to a shadow Stage-2 pgd.

We allocate twice the number of vcpus as Stage-2 mmu structures because that's sufficient for each vcpu running two translation regimes without having to flush the Stage-2 page tables.

Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

2024-06-12 | arm64: errata: Unify speculative SSBS errata logic | Mark Rutland | 1 | -1/+1
Cortex-X4 erratum 3194386 and Neoverse-V3 erratum 3312417 are identical, with duplicate Kconfig text and some unsightly ifdeffery. While we try to share code behind CONFIG_ARM64_WORKAROUND_SPECULATIVE_SSBS, having separate options results in a fair amount of boilerplate code, and this will only get worse as we expand the set of affected CPUs.

To reduce this boilerplate, unify the two behind a common Kconfig option. This removes the duplicate text and Kconfig logic, and removes the need for the intermediate ARM64_WORKAROUND_SPECULATIVE_SSBS option. The set of affected CPUs is described as a list so that this can easily be extended.

I've used ARM64_ERRATUM_3194386 (matching the Cortex-X4 erratum ID) as the common option, matching the way we use ARM64_ERRATUM_1319367 to cover Cortex-A57 erratum 1319537 and Cortex-A72 erratum 1319367.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240603111812.1514101-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-12 | arm64: cputype: Add Cortex-X925 definitions | Mark Rutland | 1 | -0/+2
Add cputype definitions for Cortex-X925. These will be used for errata detection in subsequent patches.

These values can be found in Table A-285 ("MIDR_EL1 bit descriptions") in issue 0001-05 of the Cortex-X925 TRM, which can be found at:

  https://developer.arm.com/documentation/102807/0001/?lang=en

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240603111812.1514101-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-12 | arm64: cputype: Add Cortex-A720 definitions | Mark Rutland | 1 | -0/+2
Add cputype definitions for Cortex-A720. These will be used for errata detection in subsequent patches.

These values can be found in Table A-186 ("MIDR_EL1 bit descriptions") in issue 0002-05 of the Cortex-A720 TRM, which can be found at:

  https://developer.arm.com/documentation/102530/0002/?lang=en

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240603111812.1514101-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-12 | arm64: cputype: Add Cortex-X3 definitions | Mark Rutland | 1 | -0/+2
Add cputype definitions for Cortex-X3. These will be used for errata detection in subsequent patches.

These values can be found in Table A-263 ("MIDR_EL1 bit descriptions") in issue 07 of the Cortex-X3 TRM, which can be found at:

  https://developer.arm.com/documentation/101593/0102/?lang=en

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240603111812.1514101-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-12 | arm64: mte: Make mte_check_tfsr_*() conditional on KASAN instead of MTE | Peter Collingbourne | 1 | -2/+2
The check in mte_check_tfsr_el1() is only necessary if HW tag based KASAN is enabled. However, we were also executing the check if MTE is enabled and KASAN is enabled at build time but disabled at runtime.

This turned out to cause a measurable increase in power consumption on a specific microarchitecture after enabling MTE. Moreover, on the same system, an increase in invalid syscall latency (as measured by [1]) of around 20-30% (depending on the cluster) was observed after enabling MTE; this almost entirely goes away after removing this check.

Therefore, make the check conditional on whether KASAN is enabled rather than on whether MTE is enabled.

[1] https://lore.kernel.org/all/CAMn1gO4MwRV8bmFJ_SeY5tsYNPn2ZP56LjAhafygjFaKuu5ouw@mail.gmail.com/

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I22d98d1483dd400a95595946552b769a5a1ad7bd
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20240528225131.3577704-1-pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-12 | arm64: implement raw_smp_processor_id() using thread_info | Puranjay Mohan | 1 | -12/+1
Historically, arm64 implemented raw_smp_processor_id() as a read of current_thread_info()->cpu. This changed when arm64 moved thread_info into task struct, as at the time CONFIG_THREAD_INFO_IN_TASK made core code use thread_struct::cpu for the cpu number, and header dependencies prevented using this in raw_smp_processor_id(). As a workaround, we moved to using a percpu variable in commit:

  57c82954e77fa12c ("arm64: make cpu number a percpu variable")

Since then, thread_info::cpu was reintroduced, and core code was made to use this in commits:

  001430c1910df65a ("arm64: add CPU field to struct thread_info")
  bcf9033e5449bdca ("sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y")

Consequently it is possible to use current_thread_info()->cpu again. This decreases the number of emitted instructions like in the following example:

Dump of assembler code for function bpf_get_smp_processor_id:
   0xffff8000802cd608 <+0>:	nop
   0xffff8000802cd60c <+4>:	nop
   0xffff8000802cd610 <+8>:	adrp	x0, 0xffff800082138000
   0xffff8000802cd614 <+12>:	mrs	x1, tpidr_el1
   0xffff8000802cd618 <+16>:	add	x0, x0, #0x8
   0xffff8000802cd61c <+20>:	ldrsw	x0, [x0, x1]
   0xffff8000802cd620 <+24>:	ret

After this patch:

Dump of assembler code for function bpf_get_smp_processor_id:
   0xffff8000802c9130 <+0>:	nop
   0xffff8000802c9134 <+4>:	nop
   0xffff8000802c9138 <+8>:	mrs	x0, sp_el0
   0xffff8000802c913c <+12>:	ldr	w0, [x0, #24]
   0xffff8000802c9140 <+16>:	ret

A microbenchmark[1] was built to measure the performance improvement provided by this change. It calls the following function given number of times and finds the runtime overhead:

  static noinline int get_cpu_id(void)
  {
  	return smp_processor_id();
  }

Run the benchmark like:

  modprobe smp_processor_id nr_function_calls=1000000000

  +--------+---------------------+----------------------+
  |        | Number of Calls     | Time taken           |
  +--------+---------------------+----------------------+
  | Before | 1000000000          | 1602888401ns         |
  +--------+---------------------+----------------------+
  | After  | 1000000000          | 1206212658ns         |
  +--------+---------------------+----------------------+
  | Difference (decrease)        | 396675743ns (24.74%) |
  +------------------------------+----------------------+

Remove the percpu variable cpu_number as it is used only in set_smp_ipi_range() as a dummy variable to be passed to ipi_handler(). Use irq_stat in place of cpu_number here like arm32.

[1] https://github.com/puranjaymohan/linux/commit/77d3fdd

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20240503171847.68267-2-puranjay@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-12 | arm64/arch_timer: include <linux/percpu.h> | Puranjay Mohan | 1 | -1/+1
arch_timer.h includes linux/smp.h since the commit:

  6acc71ccac7187fc ("arm64: arch_timer: Allows a CPU-specific erratum to only affect a subset of CPUs")

It was included to use DEFINE_PER_CPU(), etc. But it should have included <linux/percpu.h> rather than <linux/smp.h>. It worked because smp.h includes percpu.h. The next commit will remove percpu.h from smp.h and it will break this usage.

Explicitly include percpu.h and remove smp.h.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240503171847.68267-1-puranjay@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

2024-06-11 | KVM: Delete the now unused kvm_arch_sched_in() | Sean Christopherson | 1 | -1/+0
Delete kvm_arch_sched_in() now that all implementations are nops.

Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2024-06-11 | function_graph: Everyone uses HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, remove it | Steven Rostedt (Google) | 1 | -11/+0
All architectures that implement function graph also implements HAVE_FUNCTION_GRAPH_RET_ADDR_PTR. Remove it, as it is no longer a differentiator.

Link: https://lore.kernel.org/linux-trace-kernel/20240611031737.982047614@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

2024-06-07 | Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux | Linus Torvalds | 1 | -20/+16
Pull arm64 fixes from Will Deacon:

 - Fix spurious CPU hotplug warning message from SETEND emulation code

 - Fix the build when GCC wasn't inlining our I/O accessor internals

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64/io: add constant-argument check
  arm64: armv8_deprecated: Fix warning in isndep cpuhp starting process

2024-06-05 | arm64/io: add constant-argument check | Arnd Bergmann | 1 | -20/+16
In some configurations __const_iowrite32_copy() does not get inlined and gcc runs into the BUILD_BUG():

  In file included from <command-line>:
  In function '__const_memcpy_toio_aligned32',
      inlined from '__const_iowrite32_copy' at arch/arm64/include/asm/io.h:203:3,
      inlined from '__const_iowrite32_copy' at arch/arm64/include/asm/io.h:199:20:
  include/linux/compiler_types.h:487:45: error: call to '__compiletime_assert_538' declared with attribute error: BUILD_BUG failed
    487 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
        |                                     ^
  include/linux/compiler_types.h:468:25: note: in definition of macro '__compiletime_assert'
    468 | prefix ## suffix();
        | ^~~~~~
  include/linux/compiler_types.h:487:9: note: in expansion of macro '_compiletime_assert'
    487 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
        | ^~~~~~~~~~~~~~~~~~~
  include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
     39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
        |                                     ^~~~~~~~~~~~~~~~~~
  include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
     59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
        |                     ^~~~~~~~~~~~~~~~
  arch/arm64/include/asm/io.h:193:17: note: in expansion of macro 'BUILD_BUG'
    193 | BUILD_BUG();
        | ^~~~~~~~~

Move the check for constant arguments into the inline function to ensure it is still constant if the compiler decides against inlining it, and mark them as __always_inline to override the logic that sometimes leads to the compiler not producing the simplified output.

Note that either the __always_inline annotation or the check for a constant value are sufficient here, but combining the two looks cleaner as it also avoids the macro. With clang-8 and older, the macro was still needed, but all versions of gcc and clang can reliably perform constant folding here.

Fixes: ead79118dae6 ("arm64/io: Provide a WC friendly __iowriteXX_copy()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240604210006.668912-1-arnd@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>

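Roughly the pattern the fix relies on (names taken from the error output above; the exact bounds and bodies are not reproduced here): the constant-count check lives inside an __always_inline wrapper, so it still folds to the simple form even when a plain inline helper would not have been inlined.

	static __always_inline void __iowrite32_copy(void __iomem *to,
						     const void *from,
						     size_t count)
	{
		if (__builtin_constant_p(count) && count <= 8)
			__const_memcpy_toio_aligned32(to, from, count);
		else
			__iowrite32_copy_full(to, from, count);
	}
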
2024-06-04 | KVM: arm64: Refactor CPACR trap bit setting/clearing to use ELx format | Fuad Tabba | 2 | -8/+7
When setting/clearing CPACR bits for EL0 and EL1, use the ELx format of the bits, which covers both. This makes the code clearer, and reduces the chances of accidentally missing a bit.

No functional change intended.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-9-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

2024-06-04 | KVM: arm64: Consolidate initializing the host data's fpsimd_state/sve in pKVM | Fuad Tabba | 1 | -2/+8
Now that we have introduced finalize_init_hyp_mode(), let's consolidate the initialization of the host_data fpsimd_state and sve state.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240603122852.3923848-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

2024-06-04 | KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM | Fuad Tabba | 3 | -0/+27
Protected mode needs to maintain (save/restore) the host's sve state, rather than relying on the host kernel to do that. This is to avoid leaking information to the host about guests and the type of operations they are performing.

As a first step towards that, allocate memory mapped at hyp, per cpu, for the host sve state. The following patch will use this memory to save/restore the host state.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

2024-06-04 | KVM: arm64: Abstract set/clear of CPTR_EL2 bits behind helper | Fuad Tabba | 2 | -0/+68
The same traps controlled by CPTR_EL2 or CPACR_EL1 need to be toggled in different parts of the code, but the exact bits and their polarity differ between these two formats and the mode (vhe/nvhe/hvhe).

To reduce the amount of duplicated code and the chance of getting the wrong bit/polarity or missing a field, abstract the set/clear of CPTR_EL2 bits behind a helper.

Since (h)VHE is the way of the future, use the CPACR_EL1 format, which is a subset of the VHE CPTR_EL2, as a reference.

No functional change intended.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

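An illustrative shape for such a helper: callers always pass CPACR_EL1-format bits, and the single translation to the running mode's CPTR_EL2 layout lives here. The cpacr_to_cptr_*() converters are placeholders for the nVHE polarity/field fixup, not real kernel functions:

	static __always_inline void cpacr_clear_set(u64 clr, u64 set)
	{
		if (has_vhe() || has_hvhe())
			sysreg_clear_set(cpacr_el1, clr, set);
		else
			sysreg_clear_set(cptr_el2,
					 cpacr_to_cptr_clr(clr),	/* placeholder */
					 cpacr_to_cptr_set(set));	/* placeholder */
	}
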
2024-06-04 | KVM: arm64: Fix prototype for __sve_save_state/__sve_restore_state | Fuad Tabba | 1 | -2/+2
Since the prototypes for __sve_save_state/__sve_restore_state at hyp were added, the underlying macro has acquired a third parameter for saving/restoring ffr.

Fix the prototypes to account for the third parameter, and restore the ffr for the guest since it is saved.

Suggested-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240603122852.3923848-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

2024-06-04 | KVM: arm64: Reintroduce __sve_save_state | Fuad Tabba | 1 | -0/+1
Now that the hypervisor is handling the host sve state in protected mode, it needs to be able to save it.

This reverts commit e66425fc9ba3 ("KVM: arm64: Remove unused __sve_save_state").

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

2024-05-24 | Merge tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 2 | -1/+3
Pull more mm updates from Andrew Morton:
 "Jeff Xu's implementation of the mseal() syscall"

* tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  selftest mm/mseal read-only elf memory segment
  mseal: add documentation
  selftest mm/mseal memory sealing
  mseal: add mseal syscall
  mseal: wire up mseal syscall

2024-05-23 | mseal: wire up mseal syscall | Jeff Xu | 2 | -1/+3
Patch series "Introduce mseal", v10.

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime.

A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.

Two system calls are involved in sealing the map: mmap() and mseal().

The new mseal() is a syscall on 64-bit CPUs, with the following signature:

  int mseal(void *addr, size_t len, unsigned long flags)

  addr/len: memory range.
  flags: reserved.

mseal() blocks the following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location, via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.

The idea that inspired this patch comes from Stephen Röttger's work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API.

Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable; the lifetime of those mappings is tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings is not tied to the lifetime of the process; therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity.

Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections.

In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch:

  Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations.
  Liam R. Howlett: perf optimization.
  Linus Torvalds: assisting in defining system call signature and scope.
  Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD.

MM perf benchmarks
==================

This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs' sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed.

To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results.

The tests have roughly below sequence:

  for (i = 0; i < 1000, i++)
      create 1000 mappings (1 page per VMA)
      start the sampling
      for (j = 0; j < 1000, j++)
          mprotect one mapping
      stop and save the sample
      delete 1000 mappings
  calculates all samples.

Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook.

Based on the latest upstream code:
The first test (measuring time)

  syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
  munmap__	1	909	944	35	35	104%
  munmap__	2	1398	1502	104	52	107%
  munmap__	4	2444	2594	149	37	106%
  munmap__	8	4029	4323	293	37	107%
  munmap__	16	6647	6935	288	18	104%
  munmap__

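For reference, a hedged user-space sketch of calling the new syscall (assuming a kernel with mseal() wired up and headers providing __NR_mseal; there is no libc wrapper, so it goes via syscall(2)):

	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 4096;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (syscall(__NR_mseal, p, len, 0))	/* flags is reserved: pass 0 */
			perror("mseal");

		/* The VMA is now sealed: mprotect() on it must fail with EPERM
		 * (blocked operation 5 in the list above). */
		if (mprotect(p, len, PROT_READ) == 0)
			fprintf(stderr, "unexpected: sealed VMA was modified\n");

		return 0;
	}
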