summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/svm
AgeCommit message (Collapse)AuthorFilesLines
2024-10-20KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memorySean Christopherson1-1/+5
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits 4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't enforce 32-byte alignment of nCR3. In the absolute worst case scenario, failure to ignore bits 4:0 can result in an out-of-bounds read, e.g. if the target page is at the end of a memslot, and the VMM isn't using guard pages. Per the APM: The CR3 register points to the base address of the page-directory-pointer table. The page-directory-pointer table is aligned on a 32-byte boundary, with the low 5 address bits 4:0 assumed to be 0. And the SDM's much more explicit: 4:0 Ignored Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow that is broken. Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory") Reported-by: Kirk Swidowski <swidowski@google.com> Cc: Andy Nguyen <theflow@google.com> Cc: 3pvd <3pvd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009140838.1036226-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-48/+69
Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
2024-09-23Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds1-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
2024-09-17Merge tag 'kvm-x86-svm-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-22/+32
KVM SVM changes for 6.12: - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB. - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area. - Remove unnecessary accounting of temporary nested VMCB related allocations.
2024-09-17Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-15/+29
KVM x86 misc changes for 6.12 - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon). - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs. This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work). - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value a the ICR offset. - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler. - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit. - Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is supposed to hit L1, not L2).
2024-09-17Merge branch 'kvm-redo-enable-virt' into HEADPaolo Bonzini1-11/+8
Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exist in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-10KVM: SVM: let alternatives handle the cases when RSB filling is requiredAmit Shah1-6/+2
Remove superfluous RSB filling after a VMEXIT when the CPU already has flushed the RSB after a VMEXIT when AutoIBRS is enabled. The initial implementation for adding RETPOLINES added an ALTERNATIVES implementation for filling the RSB after a VMEXIT in commit 117cc7a908c8 ("x86/retpoline: Fill return stack buffer on vmexit"). Later, X86_FEATURE_RSB_VMEXIT was added in commit 9756bba28470 ("x86/speculation: Fill RSB on vmexit for IBRS") to handle stuffing the RSB if RETPOLINE=y *or* KERNEL_IBRS=y, i.e. to also stuff the RSB if the kernel is configured to do IBRS mitigations on entry/exit. The AutoIBRS (on AMD) feature implementation added in commit e7862eda309e ("x86/cpu: Support AMD Automatic IBRS") used the already-implemented logic for EIBRS in spectre_v2_determine_rsb_fill_type_on_vmexit() -- but did not update the code at VMEXIT to act on the mode selected in that function -- resulting in VMEXITs continuing to clear the RSB when RETPOLINES are enabled, despite the presence of AutoIBRS. Signed-off-by: Amit Shah <amit.shah@amd.com> Link: https://lore.kernel.org/r/20240807123531.69677-1-amit@kernel.org [sean: massage changeloge, drop comment about AMD not needing RSB_VMEXIT_LITE] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-04KVM: x86: Register "emergency disable" callbacks when virt is enabledSean Christopherson1-4/+1
Register the "disable virtualization in an emergency" callback just before KVM enables virtualization in hardware, as there is no functional need to keep the callbacks registered while KVM happens to be loaded, but is inactive, i.e. if KVM hasn't enabled virtualization. Note, unregistering the callback every time the last VM is destroyed could have measurable latency due to the synchronize_rcu() needed to ensure all references to the callback are dropped before KVM is unloaded. But the latency should be a small fraction of the total latency of disabling virtualization across all CPUs, and userspace can set enable_virt_at_load to completely eliminate the runtime overhead. Add a pointer in kvm_x86_ops to allow vendor code to provide its callback. There is no reason to force vendor code to do the registration, and either way KVM would need a new kvm_x86_ops hook. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVMSean Christopherson1-9/+9
Rename x86's the per-CPU vendor hooks used to enable virtualization in hardware to align with the recently renamed arch hooks. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-29KVM: x86: Add fastpath handling of HLT VM-ExitsSean Christopherson1-2/+11
Add a fastpath for HLT VM-Exits by immediately re-entering the guest if it has a pending wake event. When virtual interrupt delivery is enabled, i.e. when KVM doesn't need to manually inject interrupts, this allows KVM to stay in the fastpath run loop when a vIRQ arrives between the guest doing CLI and STI;HLT. Without AMD's Idle HLT-intercept support, the CPU generates a HLT VM-Exit even though KVM will immediately resume the guest. Note, on bare metal, it's relatively uncommon for a modern guest kernel to actually trigger this scenario, as the window between the guest checking for a wake event and committing to HLT is quite small. But in a nested environment, the timings change significantly, e.g. rudimentary testing showed that ~50% of HLT exits where HLT-polling was successful would be serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets a wake event before KVM schedules out the vCPU, the wake event was pending even before the VM-Exit. Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29KVM: SVM: Track the per-CPU host save area as a VMCB pointerSean Christopherson2-8/+9
The host save area is a VMCB, track it as such to help readers follow along, but mostly to cleanup/simplify the retrieval of the SEV-ES host save area. Note, the compile-time assertion that offsetof(struct vmcb, save) == EXPECTED_VMCB_CONTROL_AREA_SIZE ensures that the SEV-ES save area is indeed at offset 0x400 (whoever added the expected/architectural VMCB offsets apparently likes decimal). No functional change intended. Link: https://lore.kernel.org/r/20240802204511.352017-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29KVM: SVM: Add a helper to convert a SME-aware PA back to a struct pageSean Christopherson2-6/+19
Add __sme_pa_to_page() to pair with __sme_page_pa() and use it to replace open coded equivalents, including for "iopm_base", which previously avoided having to do __sme_clr() by storing the raw PA in the global variable. Opportunistically convert __sme_page_pa() to a helper to provide type safety. No functional change intended. Link: https://lore.kernel.org/r/20240802204511.352017-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)Sean Christopherson1-0/+2
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. While not stated anywhere in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR ICR writes will be handled by the CPU, not by KVM. Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging. Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missingRavi Bangoria1-0/+3
If host supports Bus Lock Detect, KVM advertises it to guests even if SVM support is absent. Additionally, guest wouldn't be able to use it despite guest CPUID bit being set. Fix it by unconditionally clearing the feature bit in KVM cpu capability. Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com Fixes: 76ea438b4afc ("KVM: X86: Expose bus lock debug exception to guest") Cc: stable@vger.kernel.org Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240808062937.1149-4-ravi.bangoria@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()Sean Christopherson1-3/+3
Rename all APIs related to feature MSRs from get_msr_feature() to get_feature_msr(). The APIs get "feature MSRs", not "MSR features". And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't describe the helper itself. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entrySean Christopherson1-9/+7
Refactor get_msr_feature() to take the index and data pointer as distinct parameters in anticipation of eliminating "struct kvm_msr_entry" usage further up the primary callchain. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTEDSean Christopherson1-1/+1
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to try and make it more clear that access was denied because the MSR itself is unsupported/unknown. "INVALID" is too ambiguous, as it could just as easily mean the value for WRMSR as invalid. Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that _are_ actually implemented by KVM, e.g. if the MSR is unsupported because an associated feature flag is not present in guest CPUID. Opportunistically beef up the comments for the internal MSR error codes. Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG valueSean Christopherson1-2/+7
Inject a #GP if the guest attempts to change MSR_AMD64_DE_CFG from its *current* value, not if the guest attempts to write a value other than KVM's set of supported bits. As per the comment and the changelog of the original code, the intent is to effectively make MSR_AMD64_DE_CFG read- only for the guest. Opportunistically use a more conventional equality check instead of an exclusive-OR check to detect attempts to change bits. Fixes: d1d93fa90f1a ("KVM: SVM: Add MSR-based feature support for serializing LFENCE") Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240802181935.292540-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: SVM: Remove unnecessary GFP_KERNEL_ACCOUNT in svm_set_nested_state()Yongqiang Liu1-2/+2
The fixed size temporary variables vmcb_control_area and vmcb_save_area allocated in svm_set_nested_state() are released when the function exits. Meanwhile, svm_set_nested_state() also have vcpu mutex held to avoid massive concurrency allocation, so we don't need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com> Link: https://lore.kernel.org/r/20240821112737.3649937-1-liuyongqiang13@huawei.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASEMaxim Levitsky1-0/+12
If these msrs are read by the emulator (e.g due to 'force emulation' prefix), SVM code currently fails to extract the corresponding segment bases, and return them to the emulator. Fix that. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240802151608.72896-3-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-13KVM: SVM: Fix an error code in sev_gmem_post_populate()Dan Carpenter1-2/+3
The copy_from_user() function returns the number of bytes which it was not able to copy. Return -EFAULT instead. Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Message-ID: <20240612115040.2423290-4-dan.carpenter@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13KVM: SVM: Fix uninitialized variable bugDan Carpenter1-1/+1
If snp_lookup_rmpentry() fails then "assigned" is printed in the error message but it was never initialized. Initialize it to false. Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Message-ID: <20240612115040.2423290-3-dan.carpenter@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro1-8/+8
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-07-26KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfnsPaolo Bonzini1-7/+0
This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them). Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26KVM: guest_memfd: move check for already-populated page to common codePaolo Bonzini1-1/+1
Do not allow populating the same page twice with startup data. In the case of SEV-SNP, for example, the firmware does not allow it anyway, since the launch-update operation is only possible on pages that are still shared in the RMP. Even if it worked, kvm_gmem_populate()'s callback is meant to have side effects such as updating launch measurements, and updating the same page twice is unlikely to have the desired results. Races between calls to the ioctl are not possible because kvm_gmem_populate() holds slots_lock and the VM should not be running. But again, even if this worked on other confidential computing technology, it doesn't matter to guest_memfd.c whether this is something fishy such as missing synchronization in userspace, or rather something intentional. One of the racers wins, and the page is initialized by either kvm_gmem_prepare_folio() or kvm_gmem_populate(). Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use the same errno that kvm_gmem_populate() is using. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26KVM: x86: disallow pre-fault for SNP VMs before initializationPaolo Bonzini2-0/+9
KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16Merge branch 'kvm-6.11-sev-attestation' into HEADPaolo Bonzini2-0/+193
The GHCB 2.0 specification defines 2 GHCB request types to allow SNP guests to send encrypted messages/requests to firmware: SNP Guest Requests and SNP Extended Guest Requests. These encrypted messages are used for things like servicing attestation requests issued by the guest. Implementing support for these is required to be fully GHCB-compliant. For the most part, KVM only needs to handle forwarding these requests to firmware (to be issued via the SNP_GUEST_REQUEST firmware command defined in the SEV-SNP Firmware ABI), and then forwarding the encrypted response to the guest. However, in the case of SNP Extended Guest Requests, the host is also able to provide the certificate data corresponding to the endorsement key used by firmware to sign attestation report requests. This certificate data is provided by userspace because: 1) It allows for different keys/key types to be used for each particular guest with requiring any sort of KVM API to configure the certificate table in advance on a per-guest basis. 2) It provides additional flexibility with how attestation requests might be handled during live migration where the certificate data for source/dest might be different. 3) It allows all synchronization between certificates and firmware/signing key updates to be handled purely by userspace rather than requiring some in-kernel mechanism to facilitate it. [1] To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data, but is still enough to provide compliance with the GHCB 2.0 spec.
2024-07-16KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE eventMichael Roth1-0/+56
Version 2 of GHCB specification added support for the SNP Extended Guest Request Message NAE event. This event serves a nearly identical purpose to the previously-added SNP_GUEST_REQUEST event, but for certain message types it allows the guest to supply a buffer to be used for additional information in some cases. Currently the GHCB spec only defines extended handling of this sort in the case of attestation requests, where the additional buffer is used to supply a table of certificate data corresponding to the attestion report's signing key. Support for this extended handling will require additional KVM APIs to handle coordinating with userspace. Whether or not the hypervisor opts to provide this certificate data is optional. However, support for processing SNP_EXTENDED_GUEST_REQUEST GHCB requests is required by the GHCB 2.0 specification for SNP guests, so for now implement a stub implementation that provides an empty certificate table to the guest if it supplies an additional buffer, but otherwise behaves identically to SNP_GUEST_REQUEST. Reviewed-by: Carlos Bilbao <carlos.bilbao.osdev@gmail.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240701223148.3798365-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE eventBrijesh Singh2-0/+137
Version 2 of GHCB specification added support for the SNP Guest Request Message NAE event. The event allows for an SEV-SNP guest to make requests to the SEV-SNP firmware through the hypervisor using the SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification. This is used by guests primarily to request attestation reports from firmware. There are other request types are available as well, but the specifics of what guest requests are being made generally does not affect how they are handled by the hypervisor, which only serves as a proxy for the guest requests and firmware responses. Implement handling for these events. When an SNP Guest Request is issued, the guest will provide its own request/response pages, which could in theory be passed along directly to firmware. However, these pages would need special care: - Both pages are from shared guest memory, so they need to be protected from migration/etc. occurring while firmware reads/writes to them. At a minimum, this requires elevating the ref counts and potentially needing an explicit pinning of the memory. This places additional restrictions on what type of memory backends userspace can use for shared guest memory since there would be some reliance on using refcounted pages. - The response page needs to be switched to Firmware-owned state before the firmware can write to it, which can lead to potential host RMP #PFs if the guest is misbehaved and hands the host a guest page that KVM is writing to for other reasons (e.g. virtio buffers). Both of these issues can be avoided completely by using separately-allocated bounce pages for both the request/response pages and passing those to firmware instead. So that's the approach taken here. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> [mdr: ensure FW command failures are indicated to guest, drop extended request handling to be re-written as separate patch, massage commit] Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240701223148.3798365-2-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16Merge tag 'kvm-x86-svm-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-20/+29
KVM SVM changes for 6.11 - Make per-CPU save_area allocations NUMA-aware. - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code.
2024-07-16Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-6/+5
KVM x86/pmu changes for 6.11 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. - Update to the newfangled Intel CPU FMS infrastructure. - Use macros instead of open-coded literals to clean up KVM's manipulation of FIXED_CTR_CTRL MSRs.
2024-07-16Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-10/+11
KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups
2024-07-16Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-8/+3
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
2024-07-12Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
2024-06-28KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max numberDapeng Mi1-4/+3
Refine the macros which define maximum General Purpose (GP) and fixed counter numbers. Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the maximum supported General Purpose (GP) counter number ambiguously across Intel and AMD platforms. This would cause issues if AMD begins to support more GP counters than Intel. Thus a bunch of new macros including vendor specific and vendor independent are introduced to replace the old macros. The vendor independent macros are used in x86 common code to hide vendor difference and eliminate the ambiguity. No logic changes are introduced in this patch. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: SVM: Use sev_es_host_save_area() helper when initializing tsc_auxSean Christopherson1-9/+6
Use sev_es_host_save_area() instead of open coding an equivalent when setting the MSR_TSC_AUX field during setup. No functional change intended. Link: https://lore.kernel.org/r/20240617210432.1642542-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: SVM: Force sev_es_host_save_area() to be inlined (for noinstr usage)Sean Christopherson1-1/+1
Force sev_es_host_save_area() to be always inlined, as it's used in the low level VM-Enter/VM-Exit path, which is non-instrumentable. vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xb0: call to sev_es_host_save_area() leaves .noinstr.text section vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xbf: call to sev_es_host_save_area.isra.0() leaves .noinstr.text section Fixes: c92be2fd8edf ("KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area") Reported-by: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240617210432.1642542-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: x86: Add missing MODULE_DESCRIPTION() macrosJeff Johnson1-0/+1
Add module descriptions for the vendor modules to fix allmodconfig 'make W=1' warnings: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-intel.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-amd.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com [sean: split kvm.ko change to separate commit] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-21KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guestsMichael Roth1-2/+2
With commit 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption"), older VMMs like QEMU 9.0 and older will fail when booting SEV-ES guests with something like the following error: qemu-system-x86_64: error: failed to get MSR 0x174 qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. This is because older VMMs that might still call svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if those interfaces were essentially just noops because of the vCPU state being encrypted and stored separately in the VMSA. Now those VMMs will get an -EINVAL and generally crash. Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of the stricter limitations of what vCPU state can be sync'd during guest run-time, so newer QEMU for instance will work both for legacy KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2. So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with -EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing with either an older VMM and so it needs to assume that returning -EINVAL might break the VMM. Address this by only returning -EINVAL if the guest was started with KVM_SEV_INIT2. Otherwise, just silently return. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Nikunj A Dadhania <nikunj@amd.com> Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/ Fixes: 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption") Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240604233510.764949-1-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-11KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()Sean Christopherson1-8/+3
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10KVM: SVM: Emulate SYSENTER RIP/RSP behavior for all Intel compat vCPUsSean Christopherson1-7/+7
Emulate bits 63:32 of the SYSENTER_R{I,S}P MSRs for all vCPUs that are compatible with Intel's architecture, not just strictly vCPUs that have vendor==Intel. The behavior of bits 63:32 is architecturally defined in the SDM, i.e. not some uarch specific quirk of Intel CPUs. Link: https://lore.kernel.org/r/20240405235603.1173076-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasonsAlejandro Jimenez1-1/+1
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to APICV_INHIBIT_REASON_DISABLED. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: SNP: Fix LBR Virtualization for SNP guestRavi Bangoria1-0/+8
SEV-ES and thus SNP guest mandates LBR Virtualization to be _always_ ON. Although commit b7e4be0a224f ("KVM: SEV-ES: Delegate LBR virtualization to the processor") did the correct change for SEV-ES guests, it missed the SNP. Fix it. Reported-by: Srikanth Aithal <sraithal@amd.com> Fixes: b7e4be0a224f ("KVM: SEV-ES: Delegate LBR virtualization to the processor") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240605114810.1304-1-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SVM: Consider NUMA affinity when allocating per-CPU save_areaLi RongQing3-9/+9
save_area of per-CPU svm_data are dominantly accessed from their own local CPUs, so allocate them node-local for performance reason so rename __snp_safe_alloc_page as snp_safe_alloc_page_node which accepts numa node id as input parameter, svm_cpu_init call it with node id switched from cpu id Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240520120858.13117-4-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: SVM: not account memory allocation for per-CPU svm_dataLi RongQing3-6/+17
The allocation for the per-CPU save area in svm_cpu_init shouldn't be accounted, So introduce __snp_safe_alloc_page helper, which has gfp flag as input, svm_cpu_init calls __snp_safe_alloc_page with GFP_KERNEL, snp_safe_alloc_page calls __snp_safe_alloc_page with GFP_KERNEL_ACCOUNT as input Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240520120858.13117-3-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: SVM: remove useless input parameter in snp_safe_alloc_pageLi RongQing4</