summaryrefslogtreecommitdiff
path: root/virt
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2024-01-17 13:03:37 -0800
committerLinus Torvalds <torvalds@linux-foundation.org>2024-01-17 13:03:37 -0800
commit09d1c6a80f2cf94c6e70be919203473d4ab8e26c (patch)
tree144604e6cf9f513c45c4d035548ac1760e7dac11 /virt
parent1b1934dbbdcf9aa2d507932ff488cec47999cf3f (diff)
parent1c6d984f523f67ecfad1083bb04c55d91977bb15 (diff)
downloadlinux-09d1c6a80f2cf94c6e70be919203473d4ab8e26c.tar.gz
linux-09d1c6a80f2cf94c6e70be919203473d4ab8e26c.tar.bz2
linux-09d1c6a80f2cf94c6e70be919203473d4ab8e26c.zip
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
Diffstat (limited to 'virt')
-rw-r--r--virt/kvm/Kconfig30
-rw-r--r--virt/kvm/Makefile.kvm1
-rw-r--r--virt/kvm/dirty_ring.c2
-rw-r--r--virt/kvm/eventfd.c28
-rw-r--r--virt/kvm/guest_memfd.c532
-rw-r--r--virt/kvm/kvm_main.c522
-rw-r--r--virt/kvm/kvm_mm.h26
7 files changed, 1032 insertions, 109 deletions
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 484d0873061c..184dab4ee871 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -4,13 +4,16 @@
config HAVE_KVM
bool
-config HAVE_KVM_PFNCACHE
+config KVM_COMMON
bool
+ select EVENTFD
+ select INTERVAL_TREE
+ select PREEMPT_NOTIFIERS
-config HAVE_KVM_IRQCHIP
+config HAVE_KVM_PFNCACHE
bool
-config HAVE_KVM_IRQFD
+config HAVE_KVM_IRQCHIP
bool
config HAVE_KVM_IRQ_ROUTING
@@ -39,10 +42,6 @@ config NEED_KVM_DIRTY_RING_WITH_BITMAP
bool
depends on HAVE_KVM_DIRTY_RING
-config HAVE_KVM_EVENTFD
- bool
- select EVENTFD
-
config KVM_MMIO
bool
@@ -92,3 +91,20 @@ config HAVE_KVM_PM_NOTIFIER
config KVM_GENERIC_HARDWARE_ENABLING
bool
+
+config KVM_GENERIC_MMU_NOTIFIER
+ select MMU_NOTIFIER
+ bool
+
+config KVM_GENERIC_MEMORY_ATTRIBUTES
+ depends on KVM_GENERIC_MMU_NOTIFIER
+ bool
+
+config KVM_PRIVATE_MEM
+ select XARRAY_MULTI
+ bool
+
+config KVM_GENERIC_PRIVATE_MEM
+ select KVM_GENERIC_MEMORY_ATTRIBUTES
+ select KVM_PRIVATE_MEM
+ bool
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 2c27d5d0c367..724c89af78af 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -12,3 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
+kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index c1cd7dfe4a90..86d267db87bb 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -58,7 +58,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
as_id = slot >> 16;
id = (u16)slot;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+ if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
return;
memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c0e230f4c3e9..229570059a1b 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,7 +28,7 @@
#include <kvm/iodev.h>
-#ifdef CONFIG_HAVE_KVM_IRQFD
+#ifdef CONFIG_HAVE_KVM_IRQCHIP
static struct workqueue_struct *irqfd_cleanup_wq;
@@ -526,21 +526,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
synchronize_srcu(&kvm->irq_srcu);
kvm_arch_post_irq_ack_notifier_list_update(kvm);
}
-#endif
-
-void
-kvm_eventfd_init(struct kvm *kvm)
-{
-#ifdef CONFIG_HAVE_KVM_IRQFD
- spin_lock_init(&kvm->irqfds.lock);
- INIT_LIST_HEAD(&kvm->irqfds.items);
- INIT_LIST_HEAD(&kvm->irqfds.resampler_list);
- mutex_init(&kvm->irqfds.resampler_lock);
-#endif
- INIT_LIST_HEAD(&kvm->ioeventfds);
-}
-#ifdef CONFIG_HAVE_KVM_IRQFD
/*
* shutdown any irqfd's that match fd+gsi
*/
@@ -1012,3 +998,15 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
return kvm_assign_ioeventfd(kvm, args);
}
+
+void
+kvm_eventfd_init(struct kvm *kvm)
+{
+#ifdef CONFIG_HAVE_KVM_IRQCHIP
+ spin_lock_init(&kvm->irqfds.lock);
+ INIT_LIST_HEAD(&kvm->irqfds.items);
+ INIT_LIST_HEAD(&kvm->irqfds.resampler_list);
+ mutex_init(&kvm->irqfds.resampler_lock);
+#endif
+ INIT_LIST_HEAD(&kvm->ioeventfds);
+}
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
new file mode 100644
index 000000000000..0f4e0cf4f158
--- /dev/null
+++ b/virt/kvm/guest_memfd.c
@@ -0,0 +1,532 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/backing-dev.h>
+#include <linux/falloc.h>
+#include <linux/kvm_host.h>
+#include <linux/pagemap.h>
+#include <linux/anon_inodes.h>
+
+#include "kvm_mm.h"
+
+struct kvm_gmem {
+ struct kvm *kvm;
+ struct xarray bindings;
+ struct list_head entry;
+};
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+{
+ struct folio *folio;
+
+ /* TODO: Support huge pages. */
+ folio = filemap_grab_folio(inode->i_mapping, index);
+ if (IS_ERR_OR_NULL(folio))
+ return NULL;
+
+ /*
+ * Use the up-to-date flag to track whether or not the memory has been
+ * zeroed before being handed off to the guest. There is no backing
+ * storage for the memory, so the folio will remain up-to-date until
+ * it's removed.
+ *
+ * TODO: Skip clearing pages when trusted firmware will do it when
+ * assigning memory to the guest.
+ */
+ if (!folio_test_uptodate(folio)) {
+ unsigned long nr_pages = folio_nr_pages(folio);
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++)
+ clear_highpage(folio_page(folio, i));
+
+ folio_mark_uptodate(folio);
+ }
+
+ /*
+ * Ignore accessed, referenced, and dirty flags. The memory is
+ * unevictable and there is no storage to write back to.
+ */
+ return folio;
+}
+
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+ pgoff_t end)
+{
+ bool flush = false, found_memslot = false;
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ unsigned long index;
+
+ xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+ pgoff_t pgoff = slot->gmem.pgoff;
+
+ struct kvm_gfn_range gfn_range = {
+ .start = slot->base_gfn + max(pgoff, start) - pgoff,
+ .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+ .slot = slot,
+ .may_block = true,
+ };
+
+ if (!found_memslot) {
+ found_memslot = true;
+
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_begin(kvm);
+ }
+
+ flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+ }
+
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ if (found_memslot)
+ KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+ pgoff_t end)
+{
+ struct kvm *kvm = gmem->kvm;
+
+ if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_end(kvm);
+ KVM_MMU_UNLOCK(kvm);
+ }
+}
+
+static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+ struct list_head *gmem_list = &inode->i_mapping->i_private_list;
+ pgoff_t start = offset >> PAGE_SHIFT;
+ pgoff_t end = (offset + len) >> PAGE_SHIFT;
+ struct kvm_gmem *gmem;
+
+ /*
+ * Bindings must be stable across invalidation to ensure the start+end
+ * are balanced.
+ */
+ filemap_invalidate_lock(inode->i_mapping);
+
+ list_for_each_entry(gmem, gmem_list, entry)
+ kvm_gmem_invalidate_begin(gmem, start, end);
+
+ truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
+
+ list_for_each_entry(gmem, gmem_list, entry)
+ kvm_gmem_invalidate_end(gmem, start, end);
+
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return 0;
+}
+
+static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+{
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t start, index, end;
+ int r;
+
+ /* Dedicated guest is immutable by default. */
+ if (offset + len > i_size_read(inode))
+ return -EINVAL;
+
+ filemap_invalidate_lock_shared(mapping);
+
+ start = offset >> PAGE_SHIFT;
+ end = (offset + len) >> PAGE_SHIFT;
+
+ r = 0;
+ for (index = start; index < end; ) {
+ struct folio *folio;
+
+ if (signal_pending(current)) {
+ r = -EINTR;
+ break;
+ }
+
+ folio = kvm_gmem_get_folio(inode, index);
+ if (!folio) {
+ r = -ENOMEM;
+ break;
+ }
+
+ index = folio_next_index(folio);
+
+ folio_unlock(folio);
+ folio_put(folio);
+
+ /* 64-bit only, wrapping the index should be impossible. */
+ if (WARN_ON_ONCE(!index))
+ break;
+
+ cond_resched();
+ }
+
+ filemap_invalidate_unlock_shared(mapping);
+
+ return r;
+}
+
+static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t len)
+{
+ int ret;
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE))
+ return -EOPNOTSUPP;
+
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ return -EOPNOTSUPP;
+
+ if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+ return -EINVAL;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE)
+ ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
+ else
+ ret = kvm_gmem_allocate(file_inode(file), offset, len);
+
+ if (!ret)
+ file_modified(file);
+ return ret;
+}
+
+static int kvm_gmem_release(struct inode *inode, struct file *file)
+{
+ struct kvm_gmem *gmem = file->private_data;
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ unsigned long index;
+
+ /*
+ * Prevent concurrent attempts to *unbind* a memslot. This is the last
+ * reference to the file and thus no new bindings can be created, but
+ * dereferencing the slot for existing bindings needs to be protected
+ * against memslot updates, specifically so that unbind doesn't race
+ * and free the memslot (kvm_gmem_get_file() will return NULL).
+ */
+ mutex_lock(&kvm->slots_lock);
+
+ filemap_invalidate_lock(inode->i_mapping);
+
+ xa_for_each(&gmem->bindings, index, slot)
+ rcu_assign_pointer(slot->gmem.file, NULL);
+
+ synchronize_rcu();
+
+ /*
+ * All in-flight operations are gone and new bindings can be created.
+ * Zap all SPTEs pointed at by this file. Do not free the backing
+ * memory, as its lifetime is associated with the inode, not the file.
+ */
+ kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_invalidate_end(gmem, 0, -1ul);
+
+ list_del(&gmem->entry);
+
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ mutex_unlock(&kvm->slots_lock);
+
+ xa_destroy(&gmem->bindings);
+ kfree(gmem);
+
+ kvm_put_kvm(kvm);
+
+ return 0;
+}
+
+static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
+{
+ /*
+ * Do not return slot->gmem.file if it has already been closed;
+ * there might be some time between the last fput() and when
+ * kvm_gmem_release() clears slot->gmem.file, and you do not
+ * want to spin in the meanwhile.
+ */
+ return get_file_active(&slot->gmem.file);
+}
+
+static struct file_operations kvm_gmem_fops = {
+ .open = generic_file_open,
+ .release = kvm_gmem_release,
+ .fallocate = kvm_gmem_fallocate,
+};
+
+void kvm_gmem_init(struct module *module)
+{
+ kvm_gmem_fops.owner = module;
+}
+
+static int kvm_gmem_migrate_folio(struct address_space *mapping,
+ struct folio *dst, struct folio *src,
+ enum migrate_mode mode)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+
+static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *folio)
+{
+ struct list_head *gmem_list = &mapping->i_private_list;
+ struct kvm_gmem *gmem;
+ pgoff_t start, end;
+
+ filemap_invalidate_lock_shared(mapping);
+
+ start = folio->index;
+ end = start + folio_nr_pages(folio);
+
+ list_for_each_entry(gmem, gmem_list, entry)
+ kvm_gmem_invalidate_begin(gmem, start, end);
+
+ /*
+ * Do not truncate the range, what action is taken in response to the
+ * error is userspace's decision (assuming the architecture supports
+ * gracefully handling memory errors). If/when the guest attempts to
+ * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
+ * at which point KVM can either terminate the VM or propagate the
+ * error to userspace.
+ */
+
+ list_for_each_entry(gmem, gmem_list, entry)
+ kvm_gmem_invalidate_end(gmem, start, end);
+
+ filemap_invalidate_unlock_shared(mapping);
+
+ return MF_DELAYED;
+}
+
+static const struct address_space_operations kvm_gmem_aops = {
+ .dirty_folio = noop_dirty_folio,
+ .migrate_folio = kvm_gmem_migrate_folio,
+ .error_remove_folio = kvm_gmem_error_folio,
+};
+
+static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
+ struct kstat *stat, u32 request_mask,
+ unsigned int query_flags)
+{
+ struct inode *inode = path->dentry->d_inode;
+
+ generic_fillattr(idmap, request_mask, inode, stat);
+ return 0;
+}
+
+static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct iattr *attr)
+{
+ return -EINVAL;
+}
+static const struct inode_operations kvm_gmem_iops = {
+ .getattr = kvm_gmem_getattr,
+ .setattr = kvm_gmem_setattr,
+};
+
+static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
+{
+ const char *anon_name = "[kvm-gmem]";
+ struct kvm_gmem *gmem;
+ struct inode *inode;
+ struct file *file;
+ int fd, err;
+
+ fd = get_unused_fd_flags(0);
+ if (fd < 0)
+ return fd;
+
+ gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
+ if (!gmem) {
+ err = -ENOMEM;
+ goto err_fd;
+ }
+
+ file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
+ O_RDWR, NULL);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ goto err_gmem;
+ }
+
+ file->f_flags |= O_LARGEFILE;
+
+ inode = file->f_inode;
+ WARN_ON(file->f_mapping != inode->i_mapping);
+
+ inode->i_private = (void *)(unsigned long)flags;
+ inode->i_op = &kvm_gmem_iops;
+ inode->i_mapping->a_ops = &kvm_gmem_aops;
+ inode->i_mode |= S_IFREG;
+ inode->i_size = size;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_unmovable(inode->i_mapping);
+ /* Unmovable mappings are supposed to be marked unevictable as well. */
+ WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+ kvm_get_kvm(kvm);
+ gmem->kvm = kvm;
+ xa_init(&gmem->bindings);
+ list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+
+ fd_install(fd, file);
+ return fd;
+
+err_gmem:
+ kfree(gmem);
+err_fd:
+ put_unused_fd(fd);
+ return err;
+}
+
+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
+{
+ loff_t size = args->size;
+ u64 flags = args->flags;
+ u64 valid_flags = 0;
+
+ if (flags & ~valid_flags)
+ return -EINVAL;
+
+ if (size <= 0 || !PAGE_ALIGNED(size))
+ return -EINVAL;
+
+ return __kvm_gmem_create(kvm, size, flags);
+}
+
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+ unsigned int fd, loff_t offset)
+{
+ loff_t size = slot->npages << PAGE_SHIFT;
+ unsigned long start, end;
+ struct kvm_gmem *gmem;
+ struct inode *inode;
+ struct file *file;
+ int r = -EINVAL;
+
+ BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kvm_gmem_fops)
+ goto err;
+
+ gmem = file->private_data;
+ if (gmem->kvm != kvm)
+ goto err;
+
+ inode = file_inode(file);
+
+ if (offset < 0 || !PAGE_ALIGNED(offset) ||
+ offset + size > i_size_read(inode))
+ goto err;
+
+ filemap_invalidate_lock(inode->i_mapping);
+
+ start = offset >> PAGE_SHIFT;
+ end = start + slot->npages;
+
+ if (!xa_empty(&gmem->bindings) &&
+ xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+ filemap_invalidate_unlock(inode->i_mapping);
+ goto err;
+ }
+
+ /*
+ * No synchronize_rcu() needed, any in-flight readers are guaranteed to
+ * be see either a NULL file or this new file, no need for them to go
+ * away.
+ */
+ rcu_assign_pointer(slot->gmem.file, file);
+ slot->gmem.pgoff = start;
+
+ xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ /*
+ * Drop the reference to the file, even on success. The file pins KVM,
+ * not the other way 'round. Active bindings are invalidated if the
+ * file is closed before memslots are destroyed.
+ */
+ r = 0;
+err:
+ fput(file);
+ return r;
+}
+
+void kvm_gmem_unbind(struct kvm_memory_slot *slot)
+{
+ unsigned long start = slot->gmem.pgoff;
+ unsigned long end = start + slot->npages;
+ struct kvm_gmem *gmem;
+ struct file *file;
+
+ /*
+ * Nothing to do if the underlying file was already closed (or is being
+ * closed right now), kvm_gmem_release() invalidates all bindings.
+ */
+ file = kvm_gmem_get_file(slot);
+ if (!file)
+ return;
+
+ gmem = file->private_data;
+
+ filemap_invalidate_lock(file->f_mapping);
+ xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
+ rcu_assign_pointer(slot->gmem.file, NULL);
+ synchronize_rcu();
+ filemap_invalidate_unlock(file->f_mapping);
+
+ fput(file);
+}
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+ pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+ struct kvm_gmem *gmem;
+ struct folio *folio;
+ struct page *page;
+ struct file *file;
+ int r;
+
+ file = kvm_gmem_get_file(slot);
+ if (!file)
+ return -EFAULT;
+
+ gmem = file->private_data;
+
+ if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
+ r = -EIO;
+ goto out_fput;
+ }
+
+ folio = kvm_gmem_get_folio(file_inode(file), index);
+ if (!folio) {
+ r = -ENOMEM;
+ goto out_fput;
+ }
+
+ if (folio_test_hwpoison(folio)) {
+ r = -EHWPOISON;
+ goto out_unlock;
+ }
+
+ page = folio_file_page(folio, index);
+
+ *pfn = page_to_pfn(page);
+ if (max_order)
+ *max_order = 0;
+
+ r = 0;
+
+out_unlock:
+ folio_unlock(folio);
+out_fput:
+ fput(file);
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7db96875ac46..10bfc88a69f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -533,31 +533,44 @@ void kvm_destroy_vcpus(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
{
return container_of(mn, struct kvm, mmu_notifier);
}
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
- unsigned long end);
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
-typedef void (*on_unlock_fn_t)(struct kvm *kvm);
-
-struct kvm_hva_range {
- unsigned long start;
- unsigned long end;
+struct kvm_mmu_notifier_range {
+ /*
+ * 64-bit addresses, as KVM notifiers can operate on host virtual
+ * addresses (unsigned long) and guest physical addresses (64-bit).
+ */
+ u64 start;
+ u64 end;
union kvm_mmu_notifier_arg arg;
- hva_handler_t handler;
+ gfn_handler_t handler;
on_lock_fn_t on_lock;
- on_unlock_fn_t on_unlock;
bool flush_on_ret;
bool may_block;
};
/*
+ * The inner-most helper returns a tuple containing the return value from the
+ * arch- and action-specific handler, plus a flag indicating whether or not at
+ * least one memslot was found, i.e. if the handler found guest memory.
+ *
+ * Note, most notifiers are averse to booleans, so even though KVM tracks the
+ * return from arch code as a bool, outer helpers will cast it to an int. :-(
+ */
+typedef struct kvm_mmu_notifier_return {
+ bool ret;
+ bool found_memslot;
+} kvm_mn_ret_t;
+
+/*
* Use a dedicated stub instead of NULL to indicate that there is no callback
* function/handler. The compiler technically can't guarantee that a real
* function will have a non-zero address, and so it will generate code to
@@ -578,26 +591,29 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
node; \
node = interval_tree_iter_next(node, start, last)) \
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct kvm_hva_range *range)
+static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
+ const struct kvm_mmu_notifier_range *range)
{
- bool ret = false, locked = false;
+ struct kvm_mmu_notifier_return r = {
+ .ret = false,
+ .found_memslot = false,
+ };
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
int i, idx;
if (WARN_ON_ONCE(range->end <= range->start))
- return 0;
+ return r;
/* A null handler is allowed if and only if on_lock() is provided. */
if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
IS_KVM_NULL_FN(range->handler)))
- return 0;
+ return r;
idx = srcu_read_lock(&kvm->srcu);
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct interval_tree_node *node;
slots = __kvm_memslots(kvm, i);
@@ -606,9 +622,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
- hva_start = max(range->start, slot->userspace_addr);
- hva_end = min(range->end, slot->userspace_addr +
- (slot->npages << PAGE_SHIFT));
+ hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
+ hva_end = min_t(unsigned long, range->end,
+ slot->userspace_addr + (slot->npages << PAGE_SHIFT));
/*
* To optimize for the likely case where the address
@@ -627,71 +643,66 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
- if (!locked) {
- locked = true;
+ if (!r.found_memslot) {
+ r.found_memslot = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm, range->start, range->end);
+ range->on_lock(kvm);
+
if (IS_KVM_NULL_FN(range->handler))
break;
}
- ret |= range->handler(kvm, &gfn_range);
+ r.ret |= range->handler(kvm, &gfn_range);
}
}
- if (range->flush_on_ret && ret)
+ if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);
- if (locked) {
+ if (r.found_memslot)
KVM_MMU_UNLOCK(kvm);
- if (!IS_KVM_NULL_FN(range->on_unlock))
- range->on_unlock(kvm);
- }
srcu_read_unlock(&kvm->srcu, idx);
- /* The notifiers are averse to booleans. :-( */
- return (int)ret;
+ return r;
}
static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
union kvm_mmu_notifier_arg arg,
- hva_handler_t handler)
+ gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
+ const struct kvm_mmu_notifier_range range = {
.start = start,
.end = end,
.arg = arg,
.handler = handler,
.on_lock = (void *)kvm_null_fn,
- .on_unlock = (void *)kvm_null_fn,
.flush_on_ret = true,
.may_block = false,
};
- return __kvm_handle_hva_range(kvm, &range);
+ return __kvm_handle_hva_range(kvm, &range).ret;
}
static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
- hva_handler_t handler)
+ gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
+ const struct kvm_mmu_notifier_range range = {
.start = start,
.end = end,
.handler = handler,
.on_lock = (void *)kvm_null_fn,
- .on_unlock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
};
- return __kvm_handle_hva_range(kvm, &range);
+ return __kvm_handle_hva_range(kvm, &range).ret;
}
static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -736,16 +747,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
}
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
{
+ lockdep_assert_held_write(&kvm->mmu_lock);
/*
* The count increase must become visible at unlock time as no
* spte can be established without taking the mmu_lock and
* count is also read inside the mmu_lock critical section.
*/
kvm->mmu_invalidate_in_progress++;
+
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = INVALID_GPA;
+ kvm->mmu_invalidate_range_end = INVALID_GPA;
+ }
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
+ if (likely(kvm->mmu_invalidate_range_start == INVALID_GPA)) {
kvm->mmu_invalidate_range_start = start;
kvm->mmu_invalidate_range_end = end;
} else {
@@ -765,16 +789,21 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
}
}
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+ return kvm_unmap_gfn_range(kvm, range);
+}
+
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range hva_range = {
+ const struct kvm_mmu_notifier_range hva_range = {
.start = range->start,
.end = range->end,
- .handler = kvm_unmap_gfn_range,
+ .handler = kvm_mmu_unmap_gfn_range,
.on_lock = kvm_mmu_invalidate_begin,
- .on_unlock = kvm_arch_guest_memory_reclaimed,
.flush_on_ret = true,
.may_block = mmu_notifier_range_blockable(range),
};
@@ -806,14 +835,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
hva_range.may_block);
- __kvm_handle_hva_range(kvm, &hva_range);
+ /*
+ * If one or more memslots were found and thus zapped, notify arch code
+ * that guest memory has been reclaimed. This needs to be done *after*
+ * dropping mmu_lock, as x86's reclaim path is slooooow.
+ */
+ if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+ kvm_arch_guest_memory_reclaimed(kvm);
return 0;
}
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
{
+ lockdep_assert_held_write(&kvm->mmu_lock);