| Age | Commit message (Collapse) | Author | Files | Lines |
|
Refactor get_msr_feature() to take the index and data pointer as distinct
parameters in anticipation of eliminating "struct kvm_msr_entry" usage
further up the primary callchain.
No functional change intended.
Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to
try and make it more clear that access was denied because the MSR itself
is unsupported/unknown. "INVALID" is too ambiguous, as it could just as
easily mean the value for WRMSR as invalid.
Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that
_are_ actually implemented by KVM, e.g. if the MSR is unsupported because
an associated feature flag is not present in guest CPUID.
Opportunistically beef up the comments for the internal MSR error codes.
Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move VMX's MSR_TYPE_{R,W,RW} #defines to x86.h, as enums, so that they can
be used by common x86 code, e.g. instead of doing "bool write".
Opportunistically tweak the definitions to make it more obvious that the
values are bitmasks, not arbitrary ascending values.
No functional change intended.
Link: https://lore.kernel.org/r/20240802181935.292540-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Inject a #GP if the guest attempts to change MSR_AMD64_DE_CFG from its
*current* value, not if the guest attempts to write a value other than
KVM's set of supported bits. As per the comment and the changelog of the
original code, the intent is to effectively make MSR_AMD64_DE_CFG read-
only for the guest.
Opportunistically use a more conventional equality check instead of an
exclusive-OR check to detect attempts to change bits.
Fixes: d1d93fa90f1a ("KVM: SVM: Add MSR-based feature support for serializing LFENCE")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240802181935.292540-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework the function comment for kvm_arch_mmu_enable_log_dirty_pt_masked()
into the body of the function, as it has gotten a bit stale, is harder to
read without the code context, and is the last source of warnings for W=1
builds in KVM x86 due to using a kernel-doc comment without documenting
all parameters.
Opportunistically subsume the functions comments for
kvm_mmu_write_protect_pt_masked() and kvm_mmu_clear_dirty_pt_masked(), as
there is no value in regurgitating similar information at a higher level,
and capturing the differences between write-protection and PML-based dirty
logging is best done in a common location.
No functional change intended.
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20240802202006.340854-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use this_cpu_ptr() instead of open coding the equivalent in
kvm_user_return_msr_cpu_online.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://lore.kernel.org/r/87zfp96ojk.wl-me@linux.beauty
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
GCC 12.3.0 complains about a potential NULL pointer dereference in
evmcs_load() as hv_get_vp_assist_page() can return NULL. In fact, this
cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a
valid VP assist page and aborts enabling the feature otherwise. CPU
onlining path is also checked in vmx_hardware_enable().
To make the compiler happy and to future proof the code, add a KVM_BUG_ON()
sentinel. It doesn't seem to be possible (and logical) to observe
evmcs_load() happening without an active vCPU so it is presumed that
kvm_get_running_vcpu() can't return NULL.
No functional change intended.
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240816130124.286226-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of
setting segment_cache.bitmask directly. Using the helper minimizes the
chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g.
if KVM ends up doing more than just zero the bitmask when purging the
cache.
No functional change intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240725175232.337266-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter. Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.
More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM. On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.
Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter. And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM
is blindly processing MSRs; the MSR filters are the _only_ way for
userspace to deny access.
This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In handle_encls_ecreate(), a page is allocated to store a copy of SECS
structure used by the ENCLS[ECREATE] leaf from the guest. This page is
only used temporarily and is freed after use in handle_encls_ecreate().
Don't account for the memory allocation of this page per [1].
Link: https://lore.kernel.org/kvm/b999afeb588eb75d990891855bc6d58861968f23.camel@intel.com/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240715101224.90958-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
vmcs_check16 function
According to the SDM, the meaning of field bit 0 is:
Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit,
and natural-width fields. So there is no 32-bit high field here,
it should be a 32-bit field instead.
Signed-off-by: Qiang Liu <liuq131@chinatelecom.cn>
Link: https://lore.kernel.org/r/20240702064609.52487-1-liuq131@chinatelecom.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The fixed size temporary variables vmcb_control_area and vmcb_save_area
allocated in svm_set_nested_state() are released when the function exits.
Meanwhile, svm_set_nested_state() also have vcpu mutex held to avoid
massive concurrency allocation, so we don't need to set GFP_KERNEL_ACCOUNT.
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Link: https://lore.kernel.org/r/20240821112737.3649937-1-liuyongqiang13@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use macros in vmx_restore_vmx_misc() instead of open coding everything
using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits
and reserved bits into separate variables, and add a comment explaining
the subset logic (it's not immediately obvious that the set of feature
bits is NOT the set of _supported_ feature bits).
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog, drop #defines]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use vmx_misc_preemption_timer_rate() to get the rate in hardware_setup(),
and open code the rate's bitmask in vmx_misc_preemption_timer_rate() so
that the function looks like all the helpers that grab values from
VMX_BASIC and VMX_MISC MSR values.
No functional change intended.
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the handful of MSR_IA32_VMX_MISC bit defines that are currently in
msr-indx.h to vmx.h so that all of the VMX_MISC defines and wrappers can
be found in a single location.
Opportunistically use BIT_ULL() instead of open coding hex values, add
defines for feature bits that are architecturally defined, and move the
defines down in the file so that they are colocated with the helpers for
getting fields from VMX_MISC.
No functional change intended.
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to encode the VMCS revision, size, and supported memory types
in MSR_IA32_VMX_BASIC, i.e. when synthesizing KVM's supported BASIC MSR
value, and delete the now unused VMCS size and memtype shift macros.
For a variety of reasons, KVM has shifted (pun intended) to using helpers
to *get* information from the VMX MSRs, as opposed to defined MASK and
SHIFT macros for direct use. Provide a similar helper for the nested VMX
code, which needs to *set* information, so that KVM isn't left with a mix
of SHIFT macros and dedicated helpers.
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use macros in vmx_restore_vmx_basic() instead of open coding everything
using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits
and reserved bits into separate variables, and add a comment explaining
the subset logic (it's not immediately obvious that the set of feature
bits is NOT the set of _supported_ feature bits).
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog, drop #defines]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track the "basic" capabilities VMX MSR as a single u64 in vmcs_config
instead of splitting it across three fields, that obviously don't combine
into a single 64-bit value, so that KVM can use the macros that define MSR
bits using their absolute position. Replace all open coded shifts and
masks, many of which are relative to the "high" half, with the appropriate
macro.
Opportunistically use VMX_BASIC_32BIT_PHYS_ADDR_ONLY instead of an open
coded equivalent, and clean up the related comment to not reference a
specific SDM section (to the surprise of no one, the comment is stale).
No functional change intended (though obviously the code generation will
be quite different).
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the bit defines for MSR_IA32_VMX_BASIC from msr-index.h to vmx.h so
that they are colocated with other VMX MSR bit defines, and with the
helpers that extract specific information from an MSR_IA32_VMX_BASIC value.
Opportunistically use BIT_ULL() instead of open coding hex values.
Opportunistically rename VMX_BASIC_64 to VMX_BASIC_32BIT_PHYS_ADDR_ONLY,
as "VMX_BASIC_64" is widly misleading. The flag enumerates that addresses
are limited to 32 bits, not that 64-bit addresses are allowed.
Last but not least, opportunistically #define DUAL_MONITOR_TREATMENT so
that all known single-bit feature flags are defined (this will allow
replacing open-coded literals in the future).
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the stuffing of the vCPU's PAT to the architectural "default" value
from kvm_arch_vcpu_create() to kvm_vcpu_reset(), guarded by !init_event,
to better capture that the default value is the value "Following Power-up
or Reset". E.g. setting PAT only during creation would break if KVM were
to expose a RESET ioctl() to userspace (which is unlikely, but that's not
a good reason to have unintuitive code).
No functional change.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move pat/memtype.c's PAT() macro to msr-index.h as PAT_VALUE(), and use it
in KVM to define the default (Power-On / RESET) PAT value instead of open
coding an inscrutable magic number.
No functional change intended.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240605231918.2915961-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
etc.)
Add defines for the architectural memory types that can be shoved into
various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs,
etc. While most MSRs/registers support only a subset of all memory types,
the values themselves are architectural and identical across all users.
Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi
header, but add compile-time assertions to connect the dots (and sanity
check that the msr-index.h values didn't get fat-fingered).
Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the
EPTP holds a single memory type in 3 of its 64 bits; those bits just
happen to be 2:0, i.e. don't need to be shifted.
Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in
setup_vmcs_config().
No functional change intended.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
If these msrs are read by the emulator (e.g due to 'force emulation' prefix),
SVM code currently fails to extract the corresponding segment bases,
and return them to the emulator.
Fix that.
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240802151608.72896-3-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly
leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX
reads guest memory.
Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN
via sync_regs(), which already holds SRCU. I.e. trying to precisely use
kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause
problems. Acquiring SRCU isn't all that expensive, so for simplicity,
grab it unconditionally for KVM_SET_VCPU_EVENTS.
=============================
WARNING: suspicious RCU usage
6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted
-----------------------------
include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by repro/1071:
#0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm]
stack backtrace:
CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x13f/0x1a0
kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm]
kvm_vcpu_read_guest+0x3e/0x90 [kvm]
nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel]
load_vmcs12_host_state+0x432/0xb40 [kvm_intel]
vmx_leave_nested+0x30/0x40 [kvm_intel]
kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm]
kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm]
? mark_held_locks+0x49/0x70
? kvm_vcpu_ioctl+0x7d/0x970 [kvm]
? kvm_vcpu_ioctl+0x497/0x970 [kvm]
kvm_vcpu_ioctl+0x497/0x970 [kvm]
? lock_acquire+0xba/0x2d0
? find_held_lock+0x2b/0x80
? do_user_addr_fault+0x40c/0x6f0
? lock_release+0xb7/0x270
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x6c/0x170
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7ff11eb1b539
</TASK>
Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to
fault-in SPTEs will fail miserably due to root.hpa pointing at garbage.
Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO
in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the
WARN isn't user-triggerable and won't run afoul of warn-on-panic because
the kernel would already be panicking.
BUG: unable to handle page fault for address: 000029ffffffffe8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP
CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm]
RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206
RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78
RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080
RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225
R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080
R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000
FS: 00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0
Call Trace:
<TASK>
kvm_tdp_page_fault+0xcc/0x160 [kvm]
kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm]
kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm]
kvm_vcpu_ioctl+0x761/0x8c0 [kvm]
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Modules linked in: kvm_intel kvm
CR2: 000029ffffffffe8
---[ end trace 0000000000000000 ]---
Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Reported-by: syzbot+23786faffb695f17edaa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000002b84dc061dd73544@google.com
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: xingwei lee <xrivendell7@gmail.com>
Tested-by: yuxin wang <wang1315768607@163.com>
Link: https://lore.kernel.org/r/20240723000211.3352304-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Replace "removed" with "frozen" in comments as appropriate to complete the
rename of REMOVED_SPTE to FROZEN_SPTE.
Fixes: 964cea817196 ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20240712233438.518591-1-rick.p.edgecombe@intel.com
[sean: write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise AVX10.1 related CPUIDs, i.e. report AVX10 support bit via
CPUID.(EAX=07H, ECX=01H):EDX[bit 19] and new CPUID leaf 0x24H so that
guest OS and applications can query the AVX10.1 CPUIDs directly. Intel
AVX10 represents the first major new vector ISA since the introduction of
Intel AVX512, which will establish a common, converged vector instruction
set across all Intel architectures[1].
AVX10.1 is an early version of AVX10, that enumerates the Intel AVX512
instruction set at 128, 256, and 512 bits which is enabled on
Granite Rapids. I.e., AVX10.1 is only a new CPUID enumeration with no
new functionality. New features, e.g. Embedded Rounding and Suppress
All Exceptions (SAE) will be introduced in AVX10.2.
Advertising AVX10.1 is safe because there is nothing to enable for AVX10.1,
i.e. it's purely a new way to enumerate support, thus there will never be
anything for the kernel to enable. Note just the CPUID checking is changed
when using AVX512 related instructions, e.g. if using one AVX512
instruction needs to check (AVX512 AND AVX512DQ), it can check
((AVX512 AND AVX512DQ) OR AVX10.1) after checking XCR0[7:5].
The versions of AVX10 are expected to be inclusive, e.g. version N+1 is
a superset of version N. Per the spec, the version can never be 0, just
advertise AVX10.1 if it's supported in hardware. Moreover, advertising
AVX10_{128,256,512} needs to land in the same commit as advertising basic
AVX10.1 support, otherwise KVM would advertise an impossible CPU model.
E.g. a CPU with AVX512 but not AVX10.1/512 is impossible per the SDM.
As more and more AVX related CPUIDs are added (it would have resulted in
around 40-50 CPUID flags when developing AVX10), the versioning approach
is introduced. But incrementing version numbers are bad for virtualization.
E.g. if AVX10.2 has a feature that shouldn't be enumerated to guests for
whatever reason, then KVM can't enumerate any "later" features either,
because the only way to hide the problematic AVX10.2 feature is to set the
version to AVX10.1 or lower[2]. But most AVX features are just passed
through and don't have virtualization controls, so AVX10 should not be
problematic in practice, so long as Intel honors their promise that future
versions will be supersets of past versions.
[1] https://cdrdv2.intel.com/v1/dl/getContent/784267
[2] https://lore.kernel.org/all/Zkz5Ak0PQlAN8DxK@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240819062327.3269720-1-tao1.su@linux.intel.com
[sean: minor changelog tweaks]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Change the data type of the local variable this_tsc_khz to u32 because
virtual_tsc_khz is also declared as u32.
Since do_div() casts the divisor to u32 anyway, changing the data type
of this_tsc_khz to u32 also removes the following Coccinelle/coccicheck
warning reported by do_div.cocci:
WARNING: do_div() does a 64-by-32 division, please consider using div64_ul instead
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/r/20240814203345.2234-2-thorsten.blum@toblux.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
If the early program check handler cannot resolve a program check dump
register contents and a call trace to the console before loading a disabled
wait psw. This makes debugging much easier.
Emit an extra message with early_printk() for cases where regular printk()
via the early console is not yet working so that at least some information
is available.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
__do_early_pgm_check() is a function which is only needed during early
setup code. Mark it __init in order to save a few bytes.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Remove WARN_ON_ONCE statements. These have not triggered in the
past.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Rework debug messages:
- Remove most of the debug_sprintf_event() invocations.
- Do not split string format statements
- Remove colon after function name.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Symbol offsets to the KASLR base do not match symbol address in
the vmlinux image. That is the result of setting the KASLR base
to the beginning of .text section as result of an optimization.
Revert that optimization and allocate virtual memory for the
whole kernel image including __START_KERNEL bytes as per the
linker script. That allows keeping the semantics of the KASLR
base offset in sync with other architectures.
Rename __START_KERNEL to TEXT_OFFSET, since it represents the
offset of the .text section within the kernel image, rather than
a virtual address.
Still skip mapping TEXT_OFFSET bytes to save memory on pgtables
and provoke exceptions in case an attempt to access this area is
made, as no kernel symbol may reside there.
In case CONFIG_KASAN is enabled the location counter might exceed
the value of TEXT_OFFSET, while the decompressor linker script
forcefully resets it to TEXT_OFFSET, which leads to a sections
overlap link failure. Use MAX() expression to avoid that.
Reported-by: Omar Sandoval <osandov@osandov.com>
Closes: https://lore.kernel.org/linux-s390/ZnS8dycxhtXBZVky@telecaster.dhcp.thefacebook.com/
Fixes: 56b1069c40c7 ("s390/boot: Rework deployment of the kernel image")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
When physical memory for the kernel image is allocated it does not
consider extra memory required for offsetting the image start to
match it with the lower 20 bits of KASLR virtual base address. That
might lead to kernel access beyond its memory range.
Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes: 693d41f7c938 ("s390/mm: Restore mapping of kernel image using large pages")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
userspace
KVM exposes the OS double lock feature bit to Guests but returns
RAZ/WI on Guest OSDLR_EL1 access. This breaks Guest migration between
systems where this feature differ. Add support to make this feature
writable from userspace by setting the mask bit. While at it, set the
mask bits for the exposed WRPs(Number of Watchpoints) as well.
Also update the selftest to cover these fields.
However we still can't make BRPs and CTX_CMPs fields writable, because
as per ARM ARM DDI 0487K.a, section D2.8.3 Breakpoint types and
linking of breakpoints, highest numbered breakpoints(BRPs) must be
context aware breakpoints(CTX_CMPs). KVM does not trap + emulate the
breakpoint registers, and as such cannot support a layout that misaligns
with the underlying hardware.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240816132819.34316-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Cross-merge bpf fixes after downstream PR including
important fixes (from bpf-next point of view):
commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map")
No conflicts.
Adjacent changes in:
include/linux/bpf_verifier.h
kernel/bpf/verifier.c
tools/testing/selftests/bpf/Makefile
Link: https://lore.kernel.org/bpf/20240813234307.82773-1-alexei.starovoitov@gmail.com/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Commit 6b0e82791bd0 ("powerpc/e500: switch to 64 bits PGD on 85xx
(32 bits)") switched PGD entries to 64 bits, but pgd_val() returns
an unsigned long which is 32 bits on PPC32. This is not a problem
for regular PMD entries because the upper part is always NULL, but
when PMD entries are leaf they contain 64 bits values, so pgd_val()
must return an unsigned long long instead of an unsigned long.
Also change the condition to CONFIG_PPC_85xx instead of CONFIG_PPC_E500
as the change was meant for 32 bits only. Allthough this should be
harmless on PPC64, it generates a warning with pgd_ERROR print.
Fixes: 6b0e82791bd0 ("powerpc/e500: switch to 64 bits PGD on 85xx (32 bits)")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/45f8fdf298ec3df7573b66d21b03a5cda92e2cb1.1724313510.git.christophe.leroy@csgroup.eu
|
|
After building the VDSO, there is a verification that it contains
no dynamic relocation, see commit aff69273af61 ("vdso: Improve
cmd_vdso_check to check all dynamic relocations").
This verification uses readelf -r and doesn't work if rela sections
are discarded.
Fixes: 8ad57add77d3 ("powerpc/build: vdso linker warning for orphan sections")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/45c3e6fc76cad05ad2cac0f5b5dfb4fae86dc9d6.1724153239.git.christophe.leroy@csgroup.eu
|
|
mmu_pte_psize is only used in the tlb_64e.c, define it static.
Fixes: 25d21ad6e799 ("powerpc: Add TLB management code for 64-bit Book3E")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408011256.1O99IB0s-lkp@intel.com/
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/beb30d280eaa5d857c38a0834b147dffd6b28aa9.1724157750.git.christophe.leroy@csgroup.eu
|
|
On a system with a GICv3, if a guest hasn't been configured with
GICv3 and that the host is not capable of GICv2 emulation,
a write to any of the ICC_*SGI*_EL1 registers is trapped to EL2.
We therefore try to emulate the SGI access, only to hit a NULL
pointer as no private interrupt is allocated (no GIC, remember?).
The obvious fix is to give the guest what it deserves, in the
shape of a UNDEF exception.
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240820100349.3544850-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Zenghui reports that VMs backed by hugetlb pages are no longer booting
after commit fd276e71d1e7 ("KVM: arm64: nv: Handle shadow stage 2 page
faults").
Support for shadow stage-2 MMUs introduced the concept of a fault IPA
and canonical IPA to stage-2 fault handling. These are identical in the
non-nested case, as the hardware stage-2 context is always that of the
canonical IPA space.
Both addresses need to be hugepage-aligned when preparing to install a
hugepage mapping to ensure that KVM uses the correct GFN->PFN translation
and installs that at the correct IPA for the current stage-2.
And now I'm feeling thirsty after all this talk of IPAs...
Fixes: fd276e71d1e7 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240822071710.2291690-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add System Memory Management Unit(SysMMU) for dpum also called iommu.
This sysmmu is version 7.4, which has same functionality as exynos850.
DPUM has 4 dma channel, each channel is mapped to one iommu.
Signed-off-by: Kwanghoon Son <k.son@samsung.com>
Link: https://lore.kernel.org/r/20240819-add_sysmmu-v1-1-799c0f3f607f@samsung.com
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
|
|
Commit 791ab8b2e3db ("arm64: Ignore any DMA offsets in the
max_zone_phys() calculation") made arm64 DMA/DMA32 zones span the entire
RAM when RAM starts above 32-bits. This breaks hardware with DMA area
that start above 32-bits. But the commit log says that "we haven't
noticed any such hardware". It turns out that such hardware does exist.
One such platform has RAM starting at 32GB with an internal bus that has
the following DMA limits:
#address-cells = <2>;
#size-cells = <2>;
dma-ranges = <0x00 0xc0000000 0x08 0x00000000 0x00 0x40000000>;
That is, devices under this bus see 1GB of DMA range between 3GB-4GB in
their address space. This range is mapped to CPU memory at 32GB-33GB.
With current code DMA allocations for devices under this bus are not
limited to DMA area, leading to run-time allocation failure.
This commit reinstates DMA zone at the bottom of RAM. The result is DMA
zone that properly reflects the hardware constraints as follows:
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000800000000-0x000000083fffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000840000000-0x0000000bffffffff]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[baruch: split off the original patch]
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The hardware DMA limit might not be power of 2. When RAM range starts
above 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high
bit can not encode this limit.
Use a plain address for the DMA zone limit instead.
Since the DMA zone can now potentially span beyond 4GB physical limit of
DMA32, make sure to use DMA zone for GFP_DMA32 allocations in that case.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Co-developed-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Remove pca9546 device from i2c bus 9.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.com>
Link: https://lore.kernel.org/r/20240801160136.1281291-12-peteryin.openbmc@gmail.com
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Revise spi-gpio node and property name and remove max-ngpios.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.com>
Link: https://lore.kernel.org/r/20240801160136.1281291-11-peteryin.openbmc@gmail.com
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Enable the mctp controller in i2c9.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.com>
Link: https://lore.kernel.org/r/20240801160136.1281291-10-peteryin.openbmc@gmail.com
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Add temperature device in i2c0 and i2c2.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.com>
Link: https://lore.kernel.org/r/20240801160136.1281291-9-peteryin.openbmc@gmail.com
[andrew: Fixed 'deivce' typo in commit message]
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Add Aegis fru device and gpio expander device in bus 11.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.com>
Link: https://lore.kernel.org/r/20240801160136.1281291-8-peteryin.openbmc@gmail.com
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Harma is single host, so remove multi-host property.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.c |