summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2021-08-04bpf: Introduce BPF nospec instruction for mitigating Spectre v4Daniel Borkmann2-0/+13
[ Upstream commit f5e81d1117501546b7be050c5fbafa6efd2c722c ] In case of JITs, each of the JIT backends compiles the BPF nospec instruction /either/ to a machine instruction which emits a speculation barrier /or/ to /no/ machine instruction in case the underlying architecture is not affected by Speculative Store Bypass or has different mitigations in place already. This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence' instruction for mitigation. In case of arm64, we rely on the firmware mitigation as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled, it works for all of the kernel code with no need to provide any additional instructions here (hence only comment in arm64 JIT). Other archs can follow as needed. The BPF nospec instruction is specifically targeting Spectre v4 since i) we don't use a serialization barrier for the Spectre v1 case, and ii) mitigation instructions for v1 and v4 might be different on some archs. The BPF nospec is required for a future commit, where the BPF verifier does annotate intermediate BPF programs with speculation barriers. Co-developed-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-04KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK accessVitaly Kuznetsov1-2/+2
[ Upstream commit 0a31df6823232516f61f174907e444f710941dfe ] MSR_KVM_ASYNC_PF_ACK MSR is part of interrupt based asynchronous page fault interface and not the original (deprecated) KVM_FEATURE_ASYNC_PF. This is stated in Documentation/virt/kvm/msr.rst. Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210722123018.260035-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-04x86/kvm: fix vcpu-id indexed array sizesJuergen Gross2-3/+3
commit 76b4f357d0e7d8f6f0013c733e6cba1773c266d3 upstream. KVM_MAX_VCPU_ID is the maximum vcpu-id of a guest, and not the number of vcpu-ids. Fix array indexed by vcpu-id to have KVM_MAX_VCPU_ID+1 elements. Note that this is currently no real problem, as KVM_MAX_VCPU_ID is an odd number, resulting in always enough padding being available at the end of those arrays. Nevertheless this should be fixed in order to avoid rare problems in case someone is using an even number for KVM_MAX_VCPU_ID. Signed-off-by: Juergen Gross <jgross@suse.com> Message-Id: <20210701154105.23215-2-jgross@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04x86/asm: Ensure asm/proto.h can be included stand-aloneJan Kiszka1-0/+2
[ Upstream commit f7b21a0e41171d22296b897dac6e4c41d2a3643c ] Fix: ../arch/x86/include/asm/proto.h:14:30: warning: ‘struct task_struct’ declared \ inside parameter list will not be visible outside of this definition or declaration long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2); ^~~~~~~~~~~ .../arch/x86/include/asm/proto.h:40:34: warning: ‘struct task_struct’ declared \ inside parameter list will not be visible outside of this definition or declaration long do_arch_prctl_common(struct task_struct *task, int option, ^~~~~~~~~~~ if linux/sched.h hasn't be included previously. This fixes a build error when this header is used outside of the kernel tree. [ bp: Massage commit message. ] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/b76b4be3-cf66-f6b2-9a6c-3e7ef54f9845@web.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-31KVM: x86: determine if an exception has an error code only when injecting it.Maxim Levitsky1-4/+9
commit b97f074583736c42fb36f2da1164e28c73758912 upstream. A page fault can be queued while vCPU is in real paged mode on AMD, and AMD manual asks the user to always intercept it (otherwise result is undefined). The resulting VM exit, does have an error code. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210225154135.405125-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Zubin Mithra <zsm@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on ↵Like Xu1-1/+2
the SVM [ Upstream commit 7234c362ccb3c2228f06f19f93b132de9cfa7ae4 ] The AMD platform does not support the functions Ah CPUID leaf. The returned results for this entry should all remain zero just like the native does: AMD host: 0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 (uncanny) AMD guest: 0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000 Fixes: cadbaa039b99 ("perf/x86/intel: Make anythread filter support conditional") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20210628074354.33848-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-25bpf: Track subprog poke descriptors correctly and fix use-after-freeJohn Fastabend1-0/+3
commit f263a81451c12da5a342d90572e317e611846f2c upstream. Subprograms are calling map_poke_track(), but on program release there is no hook to call map_poke_untrack(). However, on program release, the aux memory (and poke descriptor table) is freed even though we still have a reference to it in the element list of the map aux data. When we run map_poke_run(), we then end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run(): [...] [ 402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e [ 402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337 [ 402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G I 5.12.0+ #399 [ 402.824715] Call Trace: [ 402.824719] dump_stack+0x93/0xc2 [ 402.824727] print_address_description.constprop.0+0x1a/0x140 [ 402.824736] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824740] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824744] kasan_report.cold+0x7c/0xd8 [ 402.824752] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824757] prog_array_map_poke_run+0xc2/0x34e [ 402.824765] bpf_fd_array_map_update_elem+0x124/0x1a0 [...] The elements concerned are walked as follows: for (i = 0; i < elem->aux->size_poke_tab; i++) { poke = &elem->aux->poke_tab[i]; [...] The access to size_poke_tab is a 4 byte read, verified by checking offsets in the KASAN dump: [ 402.825004] The buggy address belongs to the object at ffff8881905a7800 which belongs to the cache kmalloc-1k of size 1024 [ 402.825008] The buggy address is located 320 bytes inside of 1024-byte region [ffff8881905a7800, ffff8881905a7c00) The pahole output of bpf_prog_aux: struct bpf_prog_aux { [...] /* --- cacheline 5 boundary (320 bytes) --- */ u32 size_poke_tab; /* 320 4 */ [...] In general, subprograms do not necessarily manage their own data structures. For example, BTF func_info and linfo are just pointers to the main program structure. This allows reference counting and cleanup to be done on the latter which simplifies their management a bit. The aux->poke_tab struct, however, did not follow this logic. The initial proposed fix for this use-after-free bug further embedded poke data tracking into the subprogram with proper reference counting. However, Daniel and Alexei questioned why we were treating these objects special; I agree, its unnecessary. The fix here removes the per subprogram poke table allocation and map tracking and instead simply points the aux->poke_tab pointer at the main programs poke table. This way, map tracking is simplified to the main program and we do not need to manage them per subprogram. This also means, bpf_prog_free_deferred(), which unwinds the program reference counting and kfrees objects, needs to ensure that we don't try to double free the poke_tab when free'ing the subprog structures. This is easily solved by NULL'ing the poke_tab pointer. The second detail is to ensure that per subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do this, we add a pointer in the poke structure to point at the subprogram value so JITs can easily check while walking the poke_tab structure if the current entry belongs to the current program. The aux pointer is stable and therefore suitable for such comparison. On the jit_subprogs() error path, we omit cleaning up the poke->aux field because these are only ever referenced from the JIT side, but on error we will never make it to the JIT, so its fine to leave them dangling. Removing these pointers would complicate the error path for no reason. However, we do need to untrack all poke descriptors from the main program as otherwise they could race with the freeing of JIT memory from the subprograms. Lastly, a748c6975dea3 ("bpf: propagate poke descriptors to subprograms") had an off-by-one on the subprogram instruction index range check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'. However, subprog_end is the next subprogram's start instruction. Fixes: a748c6975dea3 ("bpf: propagate poke descriptors to subprograms") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25perf/x86/intel/uncore: Clean up error handling path of iio mappingKan Liang1-2/+4
[ Upstream commit d4ba0b06306a70c99a43f9d452886a86e2d3bd26 ] The error handling path of iio mapping looks fragile. We already fixed one issue caused by it, commit f797f05d917f ("perf/x86/intel/uncore: Fix for iio mapping on Skylake Server"). Clean up the error handling path and make the code robust. Reported-by: gushengxian <gushengxian@yulong.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/40e66cf9-398b-20d7-ce4d-433be6e08921@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20x86/fpu: Limit xstate copy size in xstateregs_set()Thomas Gleixner1-1/+1
[ Upstream commit 07d6688b22e09be465652cf2da0da6bf86154df6 ] If the count argument is larger than the xstate size, this will happily copy beyond the end of xstate. Fixes: 91c3dba7dbc1 ("x86/fpu/xstate: Fix PTRACE frames for XSAVES") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121452.120741557@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20x86/fpu: Fix copy_xstate_to_kernel() gap handlingThomas Gleixner1-44/+61
[ Upstream commit 9625895011d130033d1bc7aac0d77a9bf68ff8a6 ] The gap handling in copy_xstate_to_kernel() is wrong when XSAVES is in use. Using init_fpstate for copying the init state of features which are not set in the xstate header is only correct for the legacy area, but not for the extended features area because when XSAVES is in use then init_fpstate is in compacted form which means the xstate offsets which are used to copy from init_fpstate are not valid. Fortunately, this is not a real problem today because all extended features in use have an all-zeros init state, but it is wrong nevertheless and with a potentially dynamically sized init_fpstate this would result in an access outside of the init_fpstate. Fix this by keeping track of the last copied state in the target buffer and explicitly zero it when there is a feature or alignment gap. Use the compacted offset when accessing the extended feature space in init_fpstate. As this is not a functional issue on older kernels this is intentionally not tagged for stable. Fixes: b8be15d58806 ("x86/fpu/xstate: Re-enable XSAVES") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121451.294282032@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20x86/signal: Detect and prevent an alternate signal stack overflowChang S. Bae1-4/+20
[ Upstream commit 2beb4a53fc3f1081cedc1c1a198c7f56cc4fc60c ] The kernel pushes context on to the userspace stack to prepare for the user's signal handler. When the user has supplied an alternate signal stack, via sigaltstack(2), it is easy for the kernel to verify that the stack size is sufficient for the current hardware context. Check if writing the hardware context to the alternate stack will exceed it's size. If yes, then instead of corrupting user-data and proceeding with the original signal handler, an immediate SIGSEGV signal is delivered. Refactor the stack pointer check code from on_sig_stack() and use the new helper. While the kernel allows new source code to discover and use a sufficient alternate signal stack size, this check is still necessary to protect binaries with insufficient alternate signal stack size from data corruption. Fixes: c2bc11f10a39 ("x86, AVX-512: Enable AVX-512 States Context Switch") Reported-by: Florian Weimer <fweimer@redhat.com> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20210518200320.17239-6-chang.seok.bae@intel.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=153531 Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20x86/fpu: Return proper error codes from user access functionsThomas Gleixner1-7/+12
[ Upstream commit aee8c67a4faa40a8df4e79316dbfc92d123989c1 ] When *RSTOR from user memory raises an exception, there is no way to differentiate them. That's bad because it forces the slow path even when the failure was not a fault. If the operation raised eg. #GP then going through the slow path is pointless. Use _ASM_EXTABLE_FAULT() which stores the trap number and let the exception fixup return the negated trap number as error. This allows to separate the fast path and let it handle faults directly and avoid the slow path for all other exceptions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121457.601480369@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()Lai Jiangshan1-0/+2
commit f85d40160691881a17a397c448d799dfc90987ba upstream. When the host is using debug registers but the guest is not using them nor is the guest in guest-debug state, the kvm code does not reset the host debug registers before kvm_x86->run(). Rather, it relies on the hardware vmentry instruction to automatically reset the dr7 registers which ensures that the host breakpoints do not affect the guest. This however violates the non-instrumentable nature around VM entry and exit; for example, when a host breakpoint is set on vcpu->arch.cr2, Another issue is consistency. When the guest debug registers are active, the host breakpoints are reset before kvm_x86->run(). But when the guest debug registers are inactive, the host breakpoints are delayed to be disabled. The host tracing tools may see different results depending on what the guest is doing. To fix the problems, we clear %db7 unconditionally before kvm_x86->run() if the host has set any breakpoints, no matter if the guest is using them or not. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org [Only clear %db7 instead of reloading all debug registers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20KVM: nSVM: Check the value written to MSR_VM_HSAVE_PAVitaly Kuznetsov1-1/+10
commit fce7e152ffc8f89d02a80617b16c7aa1527847c8 upstream. APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when the supplied address is not page-aligned or is outside of "maximum supported physical address for this implementation". page_address_valid() check seems suitable. Also, forcefully page-align the address when it's written from VMM. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210628104425.391276-2-vkuznets@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> [Add comment about behavior for host-provided values. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAsSean Christopherson4-8/+18
commit fc9bf2e087efcd81bda2e52d09616d2a1bf982a8 upstream. Ignore "dynamic" host adjustments to the physical address mask when generating the masks for guest PTEs, i.e. the guest PA masks. The host physical address space and guest physical address space are two different beasts, e.g. even though SEV's C-bit is the same bit location for both host and guest, disabling SME in the host (which clears shadow_me_mask) does not affect the guest PTE->GPA "translation". For non-SEV guests, not dropping bits is the correct behavior. Assuming KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits that are lost as collateral damage from memory encryption are treated as reserved bits, i.e. KVM will never get to the point where it attempts to generate a gfn using the affected bits. And if userspace wants to create a bogus vCPU, then userspace gets to deal with the fallout of hardware doing odd things with bad GPAs. For SEV guests, not dropping the C-bit is technically wrong, but it's a moot point because KVM can't read SEV guest's page tables in any case since they're always encrypted. Not to mention that the current KVM code is also broken since sme_me_mask does not have to be non-zero for SEV to be supported by KVM. The proper fix would be to teach all of KVM to correctly handle guest private memory, but that's a task for the future. Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-5-seanjc@google.com> [Use a new header instead of adding header guards to paging_tmpl.h. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabledSean Christopherson1-1/+7
commit 4bf48e3c0aafd32b960d341c4925b48f416f14a5 upstream. Ignore the guest MAXPHYADDR reported by CPUID.0x8000_0008 if TDP, i.e. NPT, is disabled, and instead use the host's MAXPHYADDR. Per AMD'S APM: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14x86/sev: Split up runtime #VC handler for correct state trackingJoerg Roedel3-90/+91
[ Upstream commit be1a5408868af341f61f93c191b5e346ee88c82a ] Split up the #VC handler code into a from-user and a from-kernel part. This allows clean and correct state tracking, as the #VC handler needs to enter NMI-state when raised from kernel mode and plain IRQ state when raised from user-mode. Fixes: 62441a1fb532 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210618115409.22735-3-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14x86/sev: Make sure IRQs are disabled while GHCB is activeJoerg Roedel1-12/+22
[ Upstream commit d187f217335dba2b49fc9002aab2004e04acddee ] The #VC handler only cares about IRQs being disabled while the GHCB is active, as it must not be interrupted by something which could cause another #VC while it holds the GHCB (NMI is the exception for which the backup GHCB exits). Make sure nothing interrupts the code path while the GHCB is active by making sure that callers of __sev_{get,put}_ghcb() have disabled interrupts upfront. [ bp: Massage commit message. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210618115409.22735-2-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14KVM: x86/mmu: Fix return value in tdp_mmu_map_handle_target_level()Kai Huang1-1/+1
[ Upstream commit 57a3e96d6d17ae5ac9861ef34af024a627f1c3bb ] Currently tdp_mmu_map_handle_target_level() returns 0, which is RET_PF_RETRY, when page fault is actually fixed. This makes kvm_tdp_mmu_map() also return RET_PF_RETRY in this case, instead of RET_PF_FIXED. Fix by initializing ret to RET_PF_FIXED. Note that kvm_mmu_page_fault() resumes guest on both RET_PF_RETRY and RET_PF_FIXED, which means in practice returning the two won't make difference, so this fix alone won't be necessary for stable tree. Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler") Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f9e8956223a586cd28c090879a8ff40f5eb6d609.1623717884.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14KVM: nVMX: Don't clobber nested MMU's A/D status on EPTP switchSean Christopherson1-7/+0
[ Upstream commit 272b0a998d084e7667284bdd2d0c675c6a2d11de ] Drop bogus logic that incorrectly clobbers the accessed/dirty enabling status of the nested MMU on an EPTP switch. When nested EPT is enabled, walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2. This is likely a benign bug, as mmu->ept_ad is never consumed (since the MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will never propagate into future shadow pages since the nested MMU isn't used to map anything, just to walk L2's page tables. Note, KVM also does a full MMU reload, i.e. the guest_mmu will be recreated using the new EPTP, and thus any change in A/D enabling will be properly recognized in the relevant MMU. Fixes: 41ab93727467 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmapSean Christopherson1-1/+1
[ Upstream commit 0e75225dfa4c5d5d51291f54a3d2d5895bad38da ] Use BIT_ULL() instead of an open-coded shift to check whether or not a function is enabled in L1's VMFUNC bitmap. This is a benign bug as KVM supports only bit 0, and will fail VM-Enter if any other bits are set, i.e. bits 63:32 are guaranteed to be zero. Note, "function" is bounded by hardware as VMFUNC will #UD before taking a VM-Exit if the function is greater than 63. Before: if ((vmcs12->vm_function_control & (1 << function)) == 0) 0x000000000001a916 <+118>: mov $0x1,%eax 0x000000000001a91b <+123>: shl %cl,%eax 0x000000000001a91d <+125>: cltq 0x000000000001a91f <+127>: and 0x128(%rbx),%rax After: if (!(vmcs12->vm_function_control & BIT_ULL(function & 63))) 0x000000000001a955 <+117>: mov 0x128(%rbx),%rdx 0x000000000001a95c <+124>: bt %rax,%rdx Fixes: 27c42a1bb867 ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14KVM: nVMX: Sync all PGDs on nested transition with shadow pagingSean Christopherson4-8/+15
[ Upstream commit 07ffaf343e34b555c9e7ea39a9c81c439a706f13 ] Trigger a full TLB flush on behalf of the guest on nested VM-Enter and VM-Exit when VPID is disabled for L2. kvm_mmu_new_pgd() syncs only the current PGD, which can theoretically leave stale, unsync'd entries in a previous guest PGD, which could be consumed if L2 is allowed to load CR3 with PCID_NOFLUSH=1. Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can be utilized for its obvious purpose of emulating a guest TLB flush. Note, there is no change the actual TLB flush executed by KVM, even though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT. When VPID is disabled for L2, vpid02 is guaranteed to be '0', and thus nested_get_vpid02() will return the VPID that is shared by L1 and L2. Generate the request outside of kvm_mmu_new_pgd(), as getting the common helper to correctly identify which requested is needed is quite painful. E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as a TLB flush from the L1 kernel's perspective does not invalidate EPT mappings. And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future simplification by moving the logic into nested_vmx_transition_tlb_flush(). Fixes: 41fab65e7c44 ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14crypto: x86/curve25519 - fix cpu feature checking logic in mod_exitHangbin Liu1-1/+1
[ Upstream commit 1b82435d17774f3eaab35dce239d354548aa9da2 ] In curve25519_mod_init() the curve25519_alg will be registered only when (X86_FEATURE_BMI2 && X86_FEATURE_ADX). But in curve25519_mod_exit() it still checks (X86_FEATURE_BMI2 || X86_FEATURE_ADX) when do crypto unregister. This will trigger a BUG_ON in crypto_unregister_alg() as alg->cra_refcnt is 0 if the cpu only supports one of X86_FEATURE_BMI2 and X86_FEATURE_ADX. Fixes: 07b586fe0662 ("crypto: x86/curve25519 - replace with formally verified implementation") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14x86/elf: Use _BITUL() macro in UAPI headersJoe Richey1-2/+4
[ Upstream commit d06aca989c243dd9e5d3e20aa4e5c2ecfdd07050 ] Replace BIT() in x86's UAPI header with _BITUL(). BIT() is not defined in the UAPI headers and its usage may cause userspace build errors. Fixes: 742c45c3ecc9 ("x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2") Signed-off-by: Joe Richey <joerichey@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210521085849.37676-2-joerichey94@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14clocksource: Check per-CPU clock synchronization when marked unstablePaul E. McKenney1-1/+2
[ Upstream commit 7560c02bdffb7c52d1457fa551b9e745d4b9e754 ] Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? Therefore apply CPU-to-CPU synchronization checking to newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-2-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14sched/core: Initialize the idle task with preemption disabledValentin Schneider2-2/+1
[ Upstream commit f1a0a376ca0c4ef1fc3d24e3e502acbb5b795674 ] As pointed out by commit de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walkSean Christopherson1-2/+1
commit ef318b9edf66a082f23d00d79b70c17b4c055a26 upstream. Use the MMU's role to get its effective SMEP value when injecting a fault into the guest. When walking L1's (nested) NPT while L2 is active, vCPU state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0, CR4, EFER, etc... If L1 and L2 have different settings for SMEP and L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH when injecting #NPF. Fixes: e57d4a356ad3 ("KVM: Add instruction fetch checking when walking guest page table") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUsSean Christopherson1-1/+9
commit 112022bdb5bc372e00e6e43cb88ee38ea67b97bd upstream. Mark NX as being used for all non-nested shadow MMUs, as KVM will set the NX bit for huge SPTEs if the iTLB mutli-hit mitigation is enabled. Checking the mitigation itself is not sufficient as it can be toggled on at any time and KVM doesn't reset MMU contexts when that happens. KVM could reset the contexts, but that would require purging all SPTEs in all MMUs, for no real benefit. And, KVM already forces EFER.NX=1 when TDP is disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved for shadow MMUs. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14KVM: nVMX: Handle split-lock #AC exceptions that happen in L2Sean Christopherson4-2/+11
commit b33bb78a1fada6445c265c585ee0dd0fc6279102 upstream. Mark #ACs that won't be reinjected to the guest as wanted by L0 so that KVM handles split-lock #AC from L2 instead of forwarding the exception to L1. Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a regular #AC and do the wrong thing, e.g. reinject it into L2. Fixes: e6f8b6c12f03 ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest") Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622172244.3561540-1-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-07Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"Sean Christopherson2-0/+2
commit f71a53d1180d5ecc346f0c6a23191d837fe2871b upstream. Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in mmu_role.base.level because that tracks the shadow root level, i.e. TDP level. Normally, this is not an issue because LA57 can't be toggled while long mode is active, i.e. the guest has to first disable paging, then toggle LA57, then re-enable paging, thus ensuring an MMU reinitialization. But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57 without having to bounce through an unpaged section. L1 can also load a new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a single entry PML5 is sufficient. Such shenanigans are only problematic if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode. Note, in the L2 case with nested TDP, even though L1 can switch between L2s with different LA57 settings, thus bypassing the paging requirement, in that case KVM's nested_mmu will track LA57 in base.level. This reverts commit 8053f924cad30bf9f9a24e02b6c8ddfabf5202ea. Fixes: 8053f924cad3 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30KVM: SVM: Call SEV Guest Decommission if ASID binding failsAlper Gun1-11/+21
commit 934002cd660b035b926438244b4294e647507e13 upstream. Send SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding fails. If a failure happens after a successful LAUNCH_START command, a decommission command should be executed. Otherwise, guest context will be unfreed inside the AMD SP. After the firmware will not have memory to allocate more SEV guest context, LAUNCH_START command will begin to fail with SEV_RET_RESOURCE_LIMIT error. The existing code calls decommission inside sev_unbind_asid, but it is not called if a failure happens before guest activation succeeds. If sev_bind_asid fails, decommission is never called. PSP firmware has a limit for the number of guests. If sev_asid_binding fails many times, PSP firmware will not have resources to create another guest context. Cc: stable@vger.kernel.org Fixes: 59414c989220 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command") Reported-by: Peter Gonda <pgonda@google.com> Signed-off-by: Alper Gun <alpergun@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210610174604.2554090-1-alpergun@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30x86/fpu: Make init_fpstate correct with optimized XSAVEThomas Gleixner2-25/+46
commit f9dfb5e390fab2df9f7944bb91e7705aba14cd26 upstream. The XSAVE init code initializes all enabled and supported components with XRSTOR(S) to init state. Then it XSAVEs the state of the components back into init_fpstate which is used in several places to fill in the init state of components. This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because those use the init optimization and skip writing state of components which are in init state. So init_fpstate.xsave still contains all zeroes after this operation. There are two ways to solve that: 1) Use XSAVE unconditionally, but that requires to reshuffle the buffer when XSAVES is enabled because XSAVES uses compacted format. 2) Save the components which are known to have a non-zero init state by other means. Looking deeper, #2 is the right thing to do because all components the kernel supports have all-zeroes init state except the legacy features (FP, SSE). Those cannot be hard coded because the states are not identical on all CPUs, but they can be saved with FXSAVE which avoids all conditionals. Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with a BUILD_BUG_ON() which reminds developers to validate that a newly added component has all zeroes init state. As a bonus remove the now unused copy_xregs_to_kernel_booting() crutch. The XSAVE and reshuffle method can still be implemented in the unlikely case that components are added which have a non-zero init state and no other means to save them. For now, FXSAVE is just simple and good enough. [ bp: Fix a typo or two in the text. ] Fixes: 6bad06b76892 ("x86, xsave: Use xsaveopt in context-switch path when supported") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()Thomas Gleixner1-18/+8
commit 9301982c424a003c0095bf157154a85bf5322bd0 upstream. sanitize_restored_user_xstate() preserves the supervisor states only when the fx_only argument is zero, which allows unprivileged user space to put supervisor states back into init state. Preserve them unconditionally. [ bp: Fix a typo or two in the text. ] Fixes: 5d6b6a6f9b5c ("x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210618143444.438635017@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30perf/x86: Track pmu in per-CPU cpu_hw_eventsKan Liang5-12/+24
[ Upstream commit 61e76d53c39bb768ad264d379837cfc56b9e35b4 ] Some platforms, e.g. Alder Lake, have hybrid architecture. In the same package, there may be more than one type of CPU. The PMU capabilities are different among different types of CPU. Perf will register a dedicated PMU for each type of CPU. Add a 'pmu' variable in the struct cpu_hw_events to track the dedicated PMU of the current CPU. Current x86_get_pmu() use the global 'pmu', which will be broken on a hybrid platform. Modify it to apply the 'pmu' of the specific CPU. Initialize the per-CPU 'pmu' variable with the global 'pmu'. There is nothing changed for the non-hybrid platforms. The is_x86_event() will be updated in the later patch ("perf/x86: Register hybrid PMUs") for hybrid platforms. For the non-hybrid platforms, nothing is changed here. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-4-git-send-email-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30PCI: Add AMD RS690 quirk to enable 64-bit DMAMikel Rychliski1-0/+44
[ Upstream commit cacf994a91d3a55c0c2f853d6429cd7b86113915 ] Although the AMD RS690 chipset has 64-bit DMA support, BIOS implementations sometimes fail to configure the memory limit registers correctly. The Acer F690GVM mainboard uses this chipset and a Marvell 88E8056 NIC. The sky2 driver programs the NIC to use 64-bit DMA, which will not work: sky2 0000:02:00.0: error interrupt status=0x8 sky2 0000:02:00.0 eth0: tx timeout sky2 0000:02:00.0 eth0: transmit ring 0 .. 22 report=0 done=0 Other drivers required by this mainboard either don't support 64-bit DMA, or have it disabled using driver specific quirks. For example, the ahci driver has quirks to enable or disable 64-bit DMA depending on the BIOS version (see ahci_sb600_enable_64bit() in ahci.c). This ahci quirk matches against the SB600 SATA controller, but the real issue is almost certainly with the RS690 PCI host that it was commonly attached to. To avoid this issue in all drivers with 64-bit DMA support, fix the configuration of the PCI host. If the kernel is aware of physical memory above 4GB, but the BIOS never configured the PCI host with this information, update the registers with our values. [bhelgaas: drop PCI_DEVICE_ID_ATI_RS690 definition] Link: https://lore.kernel.org/r/20210611214823.4898-1-mikel@mikelr.com Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30perf/x86/intel/lbr: Zero the xstate buffer on allocationThomas Gleixner1-1/+2
[ Upstream commit 7f049fbdd57f6ea71dc741d903c19c73b2f70950 ] XRSTORS requires a valid xstate buffer to work correctly. XSAVES does not guarantee to write a fully valid buffer according to the SDM: "XSAVES does not write to any parts of the XSAVE header other than the XSTATE_BV and XCOMP_BV fields." XRSTORS triggers a #GP: "If bytes 63:16 of the XSAVE header are not all zero." It's dubious at best how this can work at all when the buffer is not zeroed before use. Allocate the buffers with __GFP_ZERO to prevent XRSTORS failure. Fixes: ce711ea3cab9 ("perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/87wnr0wo2z.ffs@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic contextLike Xu3-8/+30
[ Upstream commit 488e13a489e9707a7e81e1991fdd1f20c0f04689 ] If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional might_sleep_if() deep in kmem_cache_alloc() will generate the following trace, and potentially cause a deadlock when another LBR event is added: [] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 [] Call Trace: [] kmem_cache_alloc+0x36/0x250 [] intel_pmu_lbr_add+0x152/0x170 [] x86_pmu_add+0x83/0xd0 Make it symmetric with the release_lbr_buffers() call and mirror the existing DS buffers. Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read") Signed-off-by: Like Xu <like.xu@linux.intel.com> [peterz: simplified] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lkml.kernel.org/r/20210430052247.3079672-2-like.xu@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30x86/xen: Fix noinstr fail in exc_xen_unknown_trap()Peter Zijlstra1-0/+2
[ Upstream commit 4c9c26f1e67648f41f28f8c997c5c9467a3dbbe4 ] Fix: vmlinux.o: warning: objtool: exc_xen_unknown_trap()+0x7: call to printk() leaves .noinstr.text section Fixes: 2e92493637a0 ("x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210621120120.606560778@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30x86/entry: Fix noinstr fail in __do_fast_syscall_32()Peter Zijlstra1-1/+1