commit     c8c655c34e33544aec9d64b660872ab33c29b5f1
tree       4aad88f698f04cef9e5d9d573a6df6283085dadd  /arch/x86/kvm
parent     d75439d64a1e2b35e0f08906205b00279753cbed
parent     b3c98052d46948a8d65d2778c7f306ff38366aac
author     Linus Torvalds <torvalds@linux-foundation.org>  2023-05-01 12:06:20 -0700
committer  Linus Torvalds <torvalds@linux-foundation.org>  2023-05-01 12:06:20 -0700
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool (a stand-alone sketch of the new helpers follows
the commit list below)
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry (a small C11-atomics illustration follows the diff at
the end)
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork() (a user-space toy model of the new bookkeeping also follows
the diff)
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole (a toy before/after layout follows the diffstat below)
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
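The CR0/CR4 helper item in the x86 notes above (and the kvm_cache_regs.h hunk in the diff below) boils down to a narrow contract: single-bit queries take a power-of-two mask and return a real bool, instead of callers reinterpreting a masked "unsigned long". A minimal stand-alone sketch of that pattern, assuming a simplified vcpu struct and using assert() in place of BUILD_BUG_ON(); this is illustrative scaffolding, not KVM code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define X86_CR4_OSXSAVE (1ul << 18)
#define X86_CR4_PKE     (1ul << 22)

struct vcpu { unsigned long cr4; };   /* stand-in for the cached CR4 value */

/* Multi-bit read: returns the masked value, which must not be treated as bool. */
static unsigned long read_cr4_bits(const struct vcpu *v, unsigned long mask)
{
	return v->cr4 & mask;
}

/* Single-bit query: enforces a power-of-two mask and normalizes to bool. */
static bool is_cr4_bit_set(const struct vcpu *v, unsigned long cr4_bit)
{
	assert(cr4_bit && !(cr4_bit & (cr4_bit - 1)));
	return !!read_cr4_bits(v, cr4_bit);
}

int main(void)
{
	struct vcpu v = { .cr4 = X86_CR4_OSXSAVE };

	printf("OSXSAVE=%d PKE=%d\n",
	       is_cr4_bit_set(&v, X86_CR4_OSXSAVE),
	       is_cr4_bit_set(&v, X86_CR4_PKE));
	return 0;
}
```

The point of the bool-returning wrapper is that a high CR4 bit masked into an "unsigned long" can silently truncate or read as "true" in surprising ways; funneling single-bit checks through one helper removes that class of bug.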
Diffstat (limited to 'arch/x86/kvm')
29 files changed, 1430 insertions, 1066 deletions
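On the generic side, the "struct kvm_mmu_memory_cache" item in the changelog is an instance of a general technique: move a small field into an existing alignment hole so the struct drops a whole 8-byte slot on 64-bit builds. A hypothetical before/after, with field names invented for the example rather than the real kvm_mmu_memory_cache layout:

```c
#include <stdio.h>

struct before {
	int capacity;    /* 4 bytes + 4 bytes of padding before the pointer */
	void *objects;   /* 8 bytes */
	int nobjs;       /* 4 bytes + 4 bytes of tail padding */
};                       /* 24 bytes on LP64 */

struct after {
	int capacity;
	int nobjs;       /* moved into the former padding hole */
	void *objects;
};                       /* 16 bytes on LP64 */

int main(void)
{
	printf("before=%zu after=%zu\n",
	       sizeof(struct before), sizeof(struct after));
	return 0;
}
```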
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 599aebec2d52..123bf8b97a4b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -60,12 +60,6 @@ u32 xstate_required_size(u64 xstate_bv, bool compacted)
 	return ret;
 }
 
-/*
- * This one is tied to SSB in the user API, and not
- * visible in /proc/cpuinfo.
- */
-#define KVM_X86_FEATURE_AMD_PSFD (13*32+28) /* Predictive Store Forwarding Disable */
-
 #define F feature_bit
 
 /* Scattered Flag - For features that are scattered by cpufeatures.h. */
@@ -266,7 +260,7 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
 	/* Update OSXSAVE bit */
 	if (boot_cpu_has(X86_FEATURE_XSAVE))
 		cpuid_entry_change(best, X86_FEATURE_OSXSAVE,
-				   kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE));
+				   kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE));
 
 	cpuid_entry_change(best, X86_FEATURE_APIC,
 			   vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE);
@@ -275,7 +269,7 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
 	best = cpuid_entry2_find(entries, nent, 7, 0);
 	if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7)
 		cpuid_entry_change(best, X86_FEATURE_OSPKE,
-				   kvm_read_cr4_bits(vcpu, X86_CR4_PKE));
+				   kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE));
 
 	best = cpuid_entry2_find(entries, nent, 0xD, 0);
 	if (best)
@@ -420,7 +414,7 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
 	 * KVM_SET_CPUID{,2} again. To support this legacy behavior, check
 	 * whether the supplied CPUID data is equal to what's already set.
 	 */
-	if (vcpu->arch.last_vmentry_cpu != -1) {
+	if (kvm_vcpu_has_run(vcpu)) {
 		r = kvm_cpuid_check_equal(vcpu, e2, nent);
 		if (r)
 			return r;
@@ -653,7 +647,7 @@ void kvm_set_cpu_caps(void)
 		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
 		F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
 		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
-		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16)
+		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
 	);
 
 	/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -715,7 +709,7 @@ void kvm_set_cpu_caps(void)
 		F(CLZERO) | F(XSAVEERPTR) |
 		F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) |
 		F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) |
-		__feature_bit(KVM_X86_FEATURE_AMD_PSFD)
+		F(AMD_PSFD)
 	);
 
 	/*
@@ -1002,7 +996,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = 0;
 		break;
 	case 0xd: {
-		u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
+		u64 permitted_xcr0 = kvm_get_filtered_xcr0();
 		u64 permitted_xss = kvm_caps.supported_xss;
 
 		entry->eax &= permitted_xcr0;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a20bec931764..936a397a08cd 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1640,6 +1640,14 @@ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
 			goto exception;
 		break;
 	case VCPU_SREG_CS:
+		/*
+		 * KVM uses "none" when loading CS as part of emulating Real
+		 * Mode exceptions and IRET (handled above). In all other
+		 * cases, loading CS without a control transfer is a KVM bug.
+		 */
+		if (WARN_ON_ONCE(transfer == X86_TRANSFER_NONE))
+			goto exception;
+
 		if (!(seg_desc.type & 8))
 			goto exception;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 4c91f626c058..75eae9c4998a 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,7 @@
 
 #include <linux/kvm_host.h>
 
-#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
+#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP)
 #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \
 	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
@@ -157,6 +157,14 @@ static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
 	return vcpu->arch.cr0 & mask;
 }
 
+static __always_inline bool kvm_is_cr0_bit_set(struct kvm_vcpu *vcpu,
+					       unsigned long cr0_bit)
+{
+	BUILD_BUG_ON(!is_power_of_2(cr0_bit));
+
+	return !!kvm_read_cr0_bits(vcpu, cr0_bit);
+}
+
 static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
 {
 	return kvm_read_cr0_bits(vcpu, ~0UL);
@@ -171,6 +179,14 @@ static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
 	return vcpu->arch.cr4 & mask;
 }
 
+static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
+					       unsigned long cr4_bit)
+{
+	BUILD_BUG_ON(!is_power_of_2(cr4_bit));
+
+	return !!kvm_read_cr4_bits(vcpu, cr4_bit);
+}
+
 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
 {
 	if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c
index 482d6639ef88..ded0bd688c65 100644
--- a/arch/x86/kvm/kvm_onhyperv.c
+++ b/arch/x86/kvm/kvm_onhyperv.c
@@ -10,17 +10,22 @@
 #include "hyperv.h"
 #include "kvm_onhyperv.h"
 
+struct kvm_hv_tlb_range {
+	u64 start_gfn;
+	u64 pages;
+};
+
 static int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush,
 		void *data)
 {
-	struct kvm_tlb_range *range = data;
+	struct kvm_hv_tlb_range *range = data;
 
 	return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn,
 			range->pages);
 }
 
 static inline int hv_remote_flush_root_tdp(hpa_t root_tdp,
-					   struct kvm_tlb_range *range)
+					   struct kvm_hv_tlb_range *range)
 {
 	if (range)
 		return hyperv_flush_guest_mapping_range(root_tdp,
@@ -29,8 +34,8 @@ static inline int hv_remote_flush_root_tdp(hpa_t root_tdp,
 	return hyperv_flush_guest_mapping(root_tdp);
 }
 
-int hv_remote_flush_tlb_with_range(struct kvm *kvm,
-		struct kvm_tlb_range *range)
+static int __hv_flush_remote_tlbs_range(struct kvm *kvm,
+					struct kvm_hv_tlb_range *range)
 {
 	struct kvm_arch *kvm_arch = &kvm->arch;
 	struct kvm_vcpu *vcpu;
@@ -86,19 +91,29 @@ int hv_remote_flush_tlb_with_range(struct kvm *kvm,
 	spin_unlock(&kvm_arch->hv_root_tdp_lock);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(hv_remote_flush_tlb_with_range);
 
-int hv_remote_flush_tlb(struct kvm *kvm)
+int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, gfn_t nr_pages)
+{
+	struct kvm_hv_tlb_range range = {
+		.start_gfn = start_gfn,
+		.pages = nr_pages,
+	};
+
+	return __hv_flush_remote_tlbs_range(kvm, &range);
+}
+EXPORT_SYMBOL_GPL(hv_flush_remote_tlbs_range);
+
+int hv_flush_remote_tlbs(struct kvm *kvm)
 {
-	return hv_remote_flush_tlb_with_range(kvm, NULL);
+	return __hv_flush_remote_tlbs_range(kvm, NULL);
 }
-EXPORT_SYMBOL_GPL(hv_remote_flush_tlb);
+EXPORT_SYMBOL_GPL(hv_flush_remote_tlbs);
 
 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
 {
 	struct kvm_arch *kvm_arch = &vcpu->kvm->arch;
 
-	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb) {
+	if (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs) {
 		spin_lock(&kvm_arch->hv_root_tdp_lock);
 		vcpu->arch.hv_root_tdp = root_tdp;
 
 		if (root_tdp != kvm_arch->hv_root_tdp)
diff --git a/arch/x86/kvm/kvm_onhyperv.h b/arch/x86/kvm/kvm_onhyperv.h
index 6272dabec02d..f9ca3e7432b2 100644
--- a/arch/x86/kvm/kvm_onhyperv.h
+++ b/arch/x86/kvm/kvm_onhyperv.h
@@ -7,12 +7,11 @@
 #define __ARCH_X86_KVM_KVM_ONHYPERV_H__
 
 #if IS_ENABLED(CONFIG_HYPERV)
-int hv_remote_flush_tlb_with_range(struct kvm *kvm,
-		struct kvm_tlb_range *range);
-int hv_remote_flush_tlb(struct kvm *kvm);
+int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages);
+int hv_flush_remote_tlbs(struct kvm *kvm);
 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp);
 #else /* !CONFIG_HYPERV */
-static inline int hv_remote_flush_tlb(struct kvm *kvm)
+static inline int hv_flush_remote_tlbs(struct kvm *kvm)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 168c46fd8dd1..92d5a1924fc1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -113,6 +113,8 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len);
+void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+					struct kvm_mmu *mmu);
 
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
 void kvm_mmu_unload(struct kvm_vcpu *vcpu);
@@ -132,7 +134,7 @@ static inline unsigned long kvm_get_pcid(struct kvm_vcpu *vcpu, gpa_t cr3)
 {
 	BUILD_BUG_ON((X86_CR3_PCID_MASK & PAGE_MASK) != 0);
 
-	return kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE)
+	return kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)
 	       ? cr3 & X86_CR3_PCID_MASK : 0;
 }
 
@@ -153,6 +155,24 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 					  vcpu->arch.mmu->root_role.level);
 }
 
+static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+						    struct kvm_mmu *mmu)
+{
+	/*
+	 * When EPT is enabled, KVM may passthrough CR0.WP to the guest, i.e.
+	 * @mmu's snapshot of CR0.WP and thus all related paging metadata may
+	 * be stale. Refresh CR0.WP and the metadata on-demand when checking
+	 * for permission faults. Exempt nested MMUs, i.e. MMUs for shadowing
+	 * nEPT and nNPT, as CR0.WP is ignored in both cases. Note, KVM does
+	 * need to refresh nested_mmu, a.k.a. the walker used to translate L2
+	 * GVAs to GPAs, as that "MMU" needs to honor L2's CR0.WP.
+	 */
+	if (!tdp_enabled || mmu == &vcpu->arch.guest_mmu)
+		return;
+
+	__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+}
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
@@ -184,8 +204,12 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
 	bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
 	int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
-	bool fault = (mmu->permissions[index] >> pte_access) & 1;
 	u32 errcode = PFERR_PRESENT_MASK;
+	bool fault;
+
+	kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+
+	fault = (mmu->permissions[index] >> pte_access) & 1;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkru_mask)) {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..c8961f45e3b1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -125,17 +125,31 @@ module_param(dbg, bool, 0644);
 #define PTE_LIST_EXT 14
 
 /*
- * Slight optimization of cacheline layout, by putting `more' and `spte_count'
- * at the start; then accessing it will only use one single cacheline for
- * either full (entries==PTE_LIST_EXT) case or entries<=6.
+ * struct pte_list_desc is the core data structure used to implement a custom
+ * list for tracking a set of related SPTEs, e.g. all the SPTEs that map a
+ * given GFN when used in the context of rmaps. Using a custom list allows KVM
+ * to optimize for the common case where many GFNs will have at most a handful
+ * of SPTEs pointing at them, i.e. allows packing multiple SPTEs into a small
+ * memory footprint, which in turn improves runtime performance by exploiting
+ * cache locality.
+ *
+ * A list is comprised of one or more pte_list_desc objects (descriptors).
+ * Each individual descriptor stores up to PTE_LIST_EXT SPTEs. If a descriptor
+ * is full and a new SPTEs needs to be added, a new descriptor is allocated and
+ * becomes the head of the list. This means that by definitions, all tail
+ * descriptors are full.
+ *
+ * Note, the meta data fields are deliberately placed at the start of the
+ * structure to optimize the cacheline layout; accessing the descriptor will
+ * touch only a single cacheline so long as @spte_count<=6 (or if only the
+ * descriptors metadata is accessed).
  */
 struct pte_list_desc {
 	struct pte_list_desc *more;
-	/*
-	 * Stores number of entries stored in the pte_list_desc. No need to be
-	 * u64 but just for easier alignment. When PTE_LIST_EXT, means full.
-	 */
-	u64 spte_count;
+	/* The number of PTEs stored in _this_ descriptor. */
+	u32 spte_count;
+	/* The number of PTEs stored in all tails of this descriptor. */
+	u32 tail_count;
 	u64 *sptes[PTE_LIST_EXT];
 };
 
@@ -242,32 +256,35 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
 	return regs;
 }
 
-static inline bool kvm_available_flush_tlb_with_range(void)
+static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
 {
-	return kvm_x86_ops.tlb_remote_flush_with_range;
+	return kvm_read_cr3(vcpu);
 }
 
-static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
-		struct kvm_tlb_range *range)
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+						  struct kvm_mmu *mmu)
 {
-	int ret = -ENOTSUPP;
-
-	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
-		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
+	if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+		return kvm_read_cr3(vcpu);
 
-	if (ret)
-		kvm_flush_remote_tlbs(kvm);
+	return mmu->get_guest_pgd(vcpu);
 }
 
-void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
-		u64 start_gfn, u64 pages)
+static inline bool kvm_available_flush_remote_tlbs_range(void)
 {
-	struct kvm_tlb_range range;
+	return kvm_x86_ops.flush_remote_tlbs_range;
+}
 
-	range.start_gfn = start_gfn;
-	range.pages = pages;
+void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn,
				 gfn_t nr_pages)
+{
+	int ret = -EOPNOTSUPP;
 
-	kvm_flush_remote_tlbs_with_range(kvm, &range);
+	if (kvm_x86_ops.flush_remote_tlbs_range)
+		ret = static_call(kvm_x86_flush_remote_tlbs_range)(kvm, start_gfn,
+								   nr_pages);
+	if (ret)
+		kvm_flush_remote_tlbs(kvm);
 }
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index);
@@ -888,9 +905,9 @@ static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	untrack_possible_nx_huge_page(kvm, sp);
 }
 
-static struct kvm_memory_slot *
-gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
-			    bool no_dirty_log)
+static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
+							    gfn_t gfn,
+							    bool no_dirty_log)
 {
 	struct kvm_memory_slot *slot;
 
@@ -929,53 +946,69 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
+		desc->tail_count = 0;
 		rmap_head->val = (unsigned long)desc | 1;
 		++count;
 	} else {
 		rmap_printk("%p %llx many->many\n", spte, *spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		while (desc->spte_count == PTE_LIST_EXT) {
-			count += PTE_LIST_EXT;
-			if (!desc->more) {
-				desc->more = kvm_mmu_memory_cache_alloc(cache);
-				desc = desc->more;
-				desc->spte_count = 0;
-				break;
-			}
-			desc = desc->more;
+		count = desc->tail_count + desc->spte_count;
+
+		/*
+		 * If the previous head is full, allocate a new head descriptor
+		 * as tail descriptors are always kept full.
+		 */
+		if (desc->spte_count == PTE_LIST_EXT) {
+			desc = kvm_mmu_memory_cache_alloc(cache);
+			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+			desc->spte_count = 0;
+			desc->tail_count = count;
+			rmap_head->val = (unsigned long)desc | 1;
 		}
-		count += desc->spte_count;
 		desc->sptes[desc->spte_count++] = spte;
 	}
 	return count;
 }
 
-static void
-pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
-			   struct pte_list_desc *desc, int i,
-			   struct pte_list_desc *prev_desc)
+static void pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
+				       struct pte_list_desc *desc, int i)
 {
-	int j = desc->spte_count - 1;
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	int j = head_desc->spte_count - 1;
+
+	/*
+	 * The head descriptor should never be empty. A new head is added only
+	 * when adding an entry and the previous head is full, and heads are
+	 * removed (this flow) when they become empty.
+	 */
+	BUG_ON(j < 0);
 
-	desc->sptes[i] = desc->sptes[j];
-	desc->sptes[j] = NULL;
-	desc->spte_count--;
-	if (desc->spte_count)
+	/*
+	 * Replace the to-be-freed SPTE with the last valid entry from the head
+	 * descriptor to ensure that tail descriptors are full at all times.
+	 * Note, this also means that tail_count is stable for each descriptor.
+	 */
+	desc->sptes[i] = head_desc->sptes[j];
+	head_desc->sptes[j] = NULL;
+	head_desc->spte_count--;
+	if (head_desc->spte_count)
 		return;
-	if (!prev_desc && !desc->more)
+
+	/*
+	 * The head descriptor is empty. If there are no tail descriptors,
+	 * nullify the rmap head to mark the list as emtpy, else point the rmap
+	 * head at the next descriptor, i.e. the new head.
+	 */
+	if (!head_desc->more)
 		rmap_head->val = 0;
 	else
-		if (prev_desc)
-			prev_desc->more = desc->more;
-		else
-			rmap_head->val = (unsigned long)desc->more | 1;
-	mmu_free_pte_list_desc(desc);
+		rmap_head->val = (unsigned long)head_desc->more | 1;
+	mmu_free_pte_list_desc(head_desc);
 }
 
 static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	struct pte_list_desc *prev_desc;
 	int i;
 
 	if (!rmap_head->val) {
@@ -991,16 +1024,13 @@ static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
 	} else {
 		rmap_printk("%p many->many\n", spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		prev_desc = NULL;
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(rmap_head,
-							desc, i, prev_desc);
+					pte_list_desc_remove_entry(rmap_head, desc, i);
 					return;
 				}
 			}
-			prev_desc = desc;
 			desc = desc->more;
 		}
 		pr_err("%s: %p many->many\n", __func__, spte);
@@ -1047,7 +1077,6 @@ out:
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	unsigned int count = 0;
 
 	if (!rmap_head->val)
 		return 0;
@@ -1055,13 +1084,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 		return 1;
 
 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-
-	while (desc) {
-		count += desc->spte_count;
-		desc = desc->more;
-	}
-
-	return count;
+	return desc->tail_count + desc->spte_count;
 }
 
 static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
@@ -1073,14 +1096,6 @@ static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
 	return &slot->arch.rmap[level - PG_LEVEL_4K][idx];
 }
 
-static bool rmap_can_add(struct kvm_vcpu *vcpu)
-{
-	struct kvm_mmu_memory_cache *mc;
-
-	mc = &vcpu->arch.mmu_pte_list_desc_cache;
-	return kvm_mmu_memory_cache_nr_free_objects(mc);
-}
-
 static void rmap_remove(struct kvm *kvm, u64 *spte)
 {
 	struct kvm_memslots *slots;
@@ -1479,7 +1494,7 @@ restart:
 		}
 	}
 
-	if (need_flush && kvm_available_flush_tlb_with_range()) {
+	if (need_flush && kvm_available_flush_remote_tlbs_range()) {
 		kvm_flush_remote_tlbs_gfn(kvm, gfn, level);
 		return false;
 	}
@@ -1504,8 +1519,8 @@ struct slot_rmap_walk_iterator {
 	struct kvm_rmap_head *end_rmap;
 };
 
-static void
-rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level)
+static void rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator,
+				 int level)
 {
 	iterator->level = level;
 	iterator->gfn = iterator->start_gfn;
@@ -1513,10 +1528,10 @@ rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level)
 	iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot);
 }
 
-static void
-slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
-		    const struct kvm_memory_slot *slot, int start_level,
-		    int end_level, gfn_t start_gfn, gfn_t end_gfn)
+static void slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
+				const struct kvm_memory_slot *slot,
+				int start_level, int end_level,
+				gfn_t start_gfn, gfn_t end_gfn)
 {
 	iterator->slot = slot;
 	iterator->start_level = start_level;
@@ -1789,12 +1804,6 @@ static void mark_unsync(u64 *spte)
 	kvm_mmu_mark_parents_unsync(sp);
 }
 
-static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
-			       struct kvm_mmu_page *sp)
-{
-	return -1;
-}
-
 #define KVM_PAGE_ARRAY_NR 16
 
 struct kvm_mmu_pages {
@@ -1914,10 +1923,79 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
 		&(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
 		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
 
+static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role;
+
+	/*
+	 * Ignore various flags when verifying that it's safe to sync a shadow
+	 * page using the current MMU context.
+	 *
+	 *  - level: not part of the overall MMU role and will never match as the MMU's
+	 *           level tracks the root level
+	 *  - access: updated based on the new guest PTE
+	 *  - quadrant: not part of the overall MMU role (similar to level)
+	 */
+	const union kvm_mmu_page_role sync_role_ign = {
+		.level = 0xf,
+		.access = 0x7,
+		.quadrant = 0x3,
+		.passthrough = 0x1,
+	};
+
+	/*
+	 * Direct pages can never be unsync, and KVM should never attempt to
+	 * sync a shadow page for a different MMU context, e.g. if the role
+	 * differs then the memslot lookup (SMM vs. non-SMM) will be bogus, the
+	 * reserved bits checks will be wrong, etc...
+	 */
+	if (WARN_ON_ONCE(sp->role.direct || !vcpu->arch.mmu->sync_spte ||
+			 (sp->role.word ^ root_role.word) & ~sync_role_ign.word))
+		return false;
+
+	return true;
+}
+
+static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
+{
+	if (!sp->spt[i])
+		return 0;
+
+	return vcpu->arch.mmu->sync_spte(vcpu, sp, i);
+}
+
+static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	int flush = 0;
+	int i;
+
+	if (!kvm_sync_page_check(vcpu, sp))
+		return -1;
+
+	for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
+		int ret = kvm_sync_spte(vcpu, sp, i);
+
+		if (ret < -1)
+			return -1;
+		flush |= ret;
+	}
+
+	/*
+	 * Note, any flush is purely for KVM's correctness, e.g. when dropping
+	 * an existing SPTE or clearing W/A/D bits to ensure an mmu_notifier
+	 * unmap or dirty logging event doesn't fail to flush. The guest is
+	 * responsible for flushing the TLB to ensure any changes in protection
+	 * bits are recognized, i.e. until the guest flushes or page faults on
+	 * a relevant address, KVM is architecturally allowed to let vCPUs use
+	 * cached translations with the old protection bits.
+	 */
+	return flush;
+}
+
 static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			 struct list_head *invalid_list)
 {
-	int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
+	int ret = __kvm_sync_page(vcpu, sp);
 
 	if (ret < 0)
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
@@ -3304,9 +3382,9 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
  * Returns true if the SPTE was fixed successfully. Otherwise,
  * someone else modified the SPTE from its original value.
  */
-static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
-			u64 *sptep, u64 old_spte, u64 new_spte)
+static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault,
+				    u64 *sptep, u64 old_spte, u64 new_spte)
 {
 	/*
 	 * Theoretically we could also set dirty bit (and flush TLB) here in
@@ -3513,6 +3591,8 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 	LIST_HEAD(invalid_list);
 	bool free_active_root;
 
+	WARN_ON_ONCE(roots_to_free & ~KVM_MMU_ROOTS_ALL);
+
 	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
 
 	/* Before acquiring the MMU lock, see if we need to do any real work. */
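The pte_list_desc rework visible in the mmu.c hunks above keeps every tail descriptor full and caches the tail population in the head descriptor, which is what makes counting O(1) and insertion free of a list walk. A user-space toy model of that bookkeeping, with a much smaller descriptor size, no freeing, and invented names, purely for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

#define ENT_PER_DESC 4

struct desc {
	struct desc *more;       /* older (tail) descriptors, all kept full */
	unsigned int count;      /* entries stored in this descriptor */
	unsigned int tail_count; /* entries stored in all tail descriptors */
	unsigned long sptes[ENT_PER_DESC];
};

static struct desc *list_add(struct desc *head, unsigned long spte)
{
	/* If there is no head, or the head is full, allocate a new head. */
	if (!head || head->count == ENT_PER_DESC) {
		struct desc *d = calloc(1, sizeof(*d));

		d->more = head;
		d->tail_count = head ? head->count + head->tail_count : 0;
		head = d;
	}
	head->sptes[head->count++] = spte;
	return head;
}

static unsigned int list_count(const struct desc *head)
{
	/* O(1): no walk over the tail descriptors. */
	return head ? head->count + head->tail_count : 0;
}

int main(void)
{
	struct desc *head = NULL;

	for (unsigned long i = 0; i < 10; i++)
		head = list_add(head, i);
	printf("entries: %u\n", list_count(head)); /* prints 10 */
	return 0;
}
```

Removal in the real code works the same way in reverse: the victim slot is backfilled from the head descriptor, so tails stay full and tail_count stays stable.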
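The "clear a single A/D bit using a LOCK AND instead of XCHG" item from the x86 notes can be pictured with C11 atomics on a mock SPTE; the bit position and names here are made up for the example and this is not the TDP MMU code itself:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_ACCESSED_BIT (1ull << 5)

/* Atomic AND clears only the Accessed bit; concurrent updates to other bits survive. */
static uint64_t clear_accessed_with_and(_Atomic uint64_t *sptep)
{
	return atomic_fetch_and(sptep, ~SPTE_ACCESSED_BIT);
}

/*
 * Exchange-based variant for contrast: it computes a new value and swaps in
 * the whole entry, so any change made to other bits between the load and the
 * exchange is silently overwritten, and the full "SPTE changed" handling has
 * to run afterwards.
 */
static uint64_t clear_accessed_with_xchg(_Atomic uint64_t *sptep)
{
	uint64_t old = atomic_load(sptep);

	return atomic_exchange(sptep, old & ~SPTE_ACCESSED_BIT);
}

int main(void)
{
	_Atomic uint64_t spte = 0xabcull | SPTE_ACCESSED_BIT;
	uint64_t old = clear_accessed_with_and(&spte);

	printf("old=%#llx new=%#llx\n",
	       (unsigned long long)old,
	       (unsigned long long)atomic_load(&spte));

	(void)clear_accessed_with_xchg; /* shown for comparison only */
	return 0;
}
```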