path: root/virt/kvm
Age  |  Commit message  |  Author, files changed, lines removed/added
2024-06-27  KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()  [Breno Leitao, 1 file, -2/+3]
commit 49f683b41f28918df3e51ddc0d928cb2e934ccdb upstream.

Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the loads and stores are atomic. In the extremely unlikely scenario the compiler tears the stores, it's theoretically possible for KVM to attempt to get a vCPU using an out-of-bounds index, e.g. if the write is split into multiple 8-bit stores, and is paired with a 32-bit load on a VM with 257 vCPUs:

    CPU0                                  CPU1
    last_boosted_vcpu = 0xff;
                                          (last_boosted_vcpu = 0x100)
                                          last_boosted_vcpu[15:8] = 0x01;
    i = (last_boosted_vcpu = 0x1ff)
                                          last_boosted_vcpu[7:0] = 0x00;
    vcpu = kvm->vcpu_array[0x1ff];

As detected by KCSAN:

    BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]

    write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
      kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
      handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
      vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
      vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
      kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
      kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
      __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
      __x64_sys_ioctl (fs/ioctl.c:890)
      x64_sys_call (arch/x86/entry/syscall_64.c:33)
      do_syscall_64 (arch/x86/entry/common.c:?)
      entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

    read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
      kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
      handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
      vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
      vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
      kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
      kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
      __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
      __x64_sys_ioctl (fs/ioctl.c:890)
      x64_sys_call (arch/x86/entry/syscall_64.c:33)
      do_syscall_64 (arch/x86/entry/common.c:?)
      entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

    value changed: 0x00000012 -> 0x00000000

Fixes: 217ece6129f2 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
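Illustrative sketch of the access pattern the fix establishes (simplified; not the literal upstream diff, and the surrounding yield/boost logic is elided):

    /* Sketch: mark the shared last_boosted_vcpu accesses in kvm_vcpu_on_spin(). */
    void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
    {
            struct kvm *kvm = me->kvm;
            /* A plain load could be torn or re-read by the compiler. */
            int last_boosted_vcpu = READ_ONCE(kvm->last_boosted_vcpu);
            struct kvm_vcpu *vcpu;
            unsigned long i;

            kvm_for_each_vcpu(i, vcpu, kvm) {
                    /* ... candidate filtering, starting near last_boosted_vcpu, elided ... */
                    if (kvm_vcpu_yield_to(vcpu) > 0) {
                            /* Single atomic store; readers never see a torn index. */
                            WRITE_ONCE(kvm->last_boosted_vcpu, i);
                            break;
                    }
            }
    }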
2024-04-03  KVM: Always flush async #PF workqueue when vCPU is being destroyed  [Sean Christopherson, 1 file, -5/+26]
[ Upstream commit 3d75b8aa5c29058a512db29da7cbee8052724157 ] Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its completion queue, e.g. when a VM and all its vCPUs is being destroyed. KVM must ensure that none of its workqueue callbacks is running when the last reference to the KVM _module_ is put. Gifting a reference to the associated VM prevents the workqueue callback from dereferencing freed vCPU/VM memory, but does not prevent the KVM module from being unloaded before the callback completes. Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will result in deadlock. async_pf_execute() can't return until kvm_put_kvm() finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes: WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm] Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: events async_pf_execute [kvm] RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm] Call Trace: <TASK> async_pf_execute+0x198/0x260 [kvm] process_one_work+0x145/0x2d0 worker_thread+0x27e/0x3a0 kthread+0xba/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 </TASK> ---[ end trace 0000000000000000 ]--- INFO: task kworker/8:1:251 blocked for more than 120 seconds. Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000 Workqueue: events async_pf_execute [kvm] Call Trace: <TASK> __schedule+0x33f/0xa40 schedule+0x53/0xc0 schedule_timeout+0x12a/0x140 __wait_for_common+0x8d/0x1d0 __flush_work.isra.0+0x19f/0x2c0 kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm] kvm_arch_destroy_vm+0x78/0x1b0 [kvm] kvm_put_kvm+0x1c1/0x320 [kvm] async_pf_execute+0x198/0x260 [kvm] process_one_work+0x145/0x2d0 worker_thread+0x27e/0x3a0 kthread+0xba/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 </TASK> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue, then there's no need to gift async_pf_execute() a reference because all invocations of async_pf_execute() will be forced to complete before the vCPU and its VM are destroyed/freed. And that in turn fixes the module unloading bug as __fput() won't do module_put() on the last vCPU reference until the vCPU has been freed, e.g. if closing the vCPU file also puts the last reference to the KVM module. Note that kvm_check_async_pf_completion() may also take the work item off the completion queue and so also needs to flush the work queue, as the work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting on the workqueue could theoretically delay a vCPU due to waiting for the work to complete, but that's a very, very small chance, and likely a very small delay. kvm_arch_async_page_present_queued() unconditionally makes a new request, i.e. will effectively delay entering the guest, so the remaining work is really just: trace_kvm_async_pf_completed(addr, cr2_or_gpa); __kvm_vcpu_wake_up(vcpu); mmput(mm); and mmput() can't drop the last reference to the page tables if the vCPU is still alive, i.e. the vCPU won't get stuck tearing down page tables. Add a helper to do the flushing, specifically to deal with "wakeup all" work items, as they aren't actually work items, i.e. 
are never placed in a workqueue. Trying to flush a bogus workqueue entry rightly makes __flush_work() complain (kudos to whoever added that sanity check). Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") *tried* to fix the module refcounting issue by having VMs grab a reference to the module, but that only made the bug slightly harder to hit as it gave async_pf_execute() a bit more time to complete before the KVM module could be unloaded. Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Reviewed-by: Xu Yilun <yilun.xu@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
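A sketch of the flushing helper described above, assuming the work->wakeup_all flag and the async_pf_cache slab used by virt/kvm/async_pf.c; treat it as illustrative rather than the exact upstream hunk:

    /* Sketch: flush a completed async #PF work item before freeing it. */
    static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
    {
            /*
             * "Wakeup all" items are never queued on a workqueue, so there is
             * nothing to flush; trying would trip __flush_work()'s sanity check.
             */
            if (work->wakeup_all) {
                    kmem_cache_free(async_pf_cache, work);
                    return;
            }

            /*
             * Wait for async_pf_execute() to finish so that no KVM module code
             * is still running when the last vCPU/VM reference is put.
             */
            flush_work(&work->work);
            kmem_cache_free(async_pf_cache, work);
    }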
2023-09-13  kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add()  [Dmitry Torokhov, 1 file, -1/+1]
[ Upstream commit 9e0f4f2918c2ff145d3dedee862d9919a6ed5812 ]

kvm_vfio_group_add() creates a kvg instance, links it to kv->group_list, and calls kvm_vfio_file_set_kvm() with kvg->file as an argument after dropping kv->lock. If group addition and deletion calls race, the kvg instance may be freed by the time we get around to calling kvm_vfio_file_set_kvm().

Previous iterations of the code did not reference kvg->file outside of the critical section, but used a temporary variable. Still, they had a similar problem: the file reference is owned by the kvg structure, and kvm_vfio_group_del() could drop it before kvm_vfio_group_add() had a chance to complete.

Fix this by moving the call to kvm_vfio_file_set_kvm() under the protection of kv->lock. We already call it while holding the same lock when a vfio group is being deleted, so it should be safe here as well.

Fixes: 2fc1bec15883 ("kvm: set/clear kvm to/from vfio_group when group add/delete")
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20230714224538.404793-1-dmitry.torokhov@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13  kvm/vfio: Prepare for accepting vfio device fd  [Yi Liu, 1 file, -57/+58]
[ Upstream commit 2f99073a722beef5f74f3b0f32bda227ba3df1e0 ] This renames kvm_vfio_group related helpers to prepare for accepting vfio device fd. No functional change is intended. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-5-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Stable-dep-of: 9e0f4f2918c2 ("kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03  KVM: Grab a reference to KVM for VM and vCPU stats file descriptors  [Sean Christopherson, 1 file, -0/+24]
commit eed3013faa401aae662398709410a59bb0646e32 upstream. Grab a reference to KVM prior to installing VM and vCPU stats file descriptors to ensure the underlying VM and vCPU objects are not freed until the last reference to any and all stats fds are dropped. Note, the stats paths manually invoke fd_install() and so don't need to grab a reference before creating the file. Fixes: ce55c049459c ("KVM: stats: Support binary stats retrieval for a VCPU") Fixes: fcfe1baeddbf ("KVM: stats: Support binary stats retrieval for a VM") Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu> Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com Cc: stable@vger.kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Message-Id: <20230711230131.648752-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-28  KVM: Avoid illegal stage2 mapping on invalid memory slot  [Gavin Shan, 1 file, -1/+19]
commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1 upstream.

We run into a guest hang in the edk2 firmware when KSM is kept running on the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash device (TYPE_PFLASH_CFI01) during the operation of sector erasing or buffered write. The status is returned by reading the memory region of the pflash device, and the read request should have been forwarded to QEMU and emulated by it. Unfortunately, the read request is covered by an illegal stage2 mapping when the guest hang issue occurs. The read request is completed with QEMU bypassed and the wrong status is fetched. The edk2 firmware runs into an infinite loop with the wrong status.

The illegal stage2 mapping is populated due to same-page sharing by KSM at (C), even though the associated memory slot has been marked as invalid at (B) when the memory slot is requested to be deleted. It's notable that the active and inactive memory slots can't be swapped when we're in the middle of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count is elevated, and kvm_swap_active_memslots() will busy loop until it reaches zero again. Besides, the swapping from the active to the inactive memory slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(), corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().

    CPU-A                                              CPU-B
    -----                                              -----
    ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
    kvm_vm_ioctl_set_memory_region
    kvm_set_memory_region
    __kvm_set_memory_region
    kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
      kvm_invalidate_memslot
        kvm_copy_memslot
        kvm_replace_memslot
        kvm_swap_active_memslots           (A)
        kvm_arch_flush_shadow_memslot      (B)
                                                       same page sharing by KSM
                                                       kvm_mmu_notifier_invalidate_range_start
                                                             :
                                                       kvm_mmu_notifier_change_pte
                                                         kvm_handle_hva_range
                                                           __kvm_handle_hva_range
                                                             kvm_set_spte_gfn   (C)
                                                             :
                                                       kvm_mmu_notifier_invalidate_range_end

Fix the issue by skipping the invalid memory slot at (C) to avoid the illegal stage2 mapping, so that the read request for the pflash's status is forwarded to QEMU and emulated by it. In this way, the correct pflash status can be returned from QEMU to break the infinite loop in the edk2 firmware.

We tried a git-bisect and the first problematic commit is cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this, clean_dcache_guest_page() is called after the memory slots are iterated in kvm_mmu_notifier_change_pte(); before this commit, clean_dcache_guest_page() was called before the iteration over the memory slots. This change literally enlarges the racy window between kvm_mmu_notifier_change_pte() and memory slot removal so that we're able to reproduce the issue in a practical test case. However, the issue has existed since commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").

Cc: stable@vger.kernel.org # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
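Purely as an illustration of "skip the invalid memory slot" while walking memslots for an hva range (the placement and exact form of the upstream check may differ):

    /* Illustrative only: ignore memslots that are mid-deletion. */
    kvm_for_each_memslot_in_hva_range(node, slots, range->start, range->end - 1) {
            struct kvm_memory_slot *slot;

            slot = container_of(node, struct kvm_memory_slot,
                                hva_node[slots->node_idx]);
            if (slot->flags & KVM_MEMSLOT_INVALID)
                    continue;       /* stale slot, don't build a stage2 mapping from it */

            /* ... convert the hva range to a gfn range and invoke the handler ... */
    }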
2023-05-24  KVM: Fix vcpu_array[0] races  [Michal Luczaj, 1 file, -6/+10]
commit afb2acb2e3a32e4d56f7fbd819769b98ed1b7520 upstream.

In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's no failure path requiring vcpu removal and destruction. Such order is important because vcpu_array accessors may end up referencing vcpu at vcpu_array[0] even before online_vcpus is set to 1.

When online_vcpus=0, any call to kvm_get_vcpu() goes through array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):

    int num_vcpus = atomic_read(&kvm->online_vcpus);
    i = array_index_nospec(i, num_vcpus);
    return xa_load(&kvm->vcpu_array, i);

Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over an "empty" range, but actually [0, ULONG_MAX]:

    xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
                      (atomic_read(&kvm->online_vcpus) - 1))

In both cases, such online_vcpus=0 edge case, even if leading to unnecessary calls to XArray API, should not be an issue; requesting unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().

However, this means that when the first vCPU is created and inserted in vcpu_array *and* before online_vcpus is incremented, code calling kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.

This should not pose a problem assuming that once a vcpu is stored in vcpu_array, it will remain there, but that's not the case: kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed from the vcpu_array, then destroyed:

    vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
    r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
    kvm_get_kvm(kvm);
    r = create_vcpu_fd(vcpu);
    if (r < 0) {
            xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
            kvm_put_kvm_no_destroy(kvm);
            goto unlock_vcpu_destroy;
    }
    atomic_inc(&kvm->online_vcpus);

This results in a possible race condition when a reference to a vcpu is acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said vcpu is destroyed.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
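A sketch of the reordered creation flow the fix implies: reserve the slot, create the fd, and only then publish the vCPU (the error label and the KVM_BUG_ON usage here are illustrative):

    /* Sketch: publish the vCPU in vcpu_array only once nothing can fail. */
    vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
    r = xa_reserve(&kvm->vcpu_array, vcpu->vcpu_idx, GFP_KERNEL_ACCOUNT);
    if (r)
            goto unlock_vcpu_destroy;

    /* Now it's all set up, let userspace reach the vCPU. */
    kvm_get_kvm(kvm);
    r = create_vcpu_fd(vcpu);
    if (r < 0) {
            xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
            kvm_put_kvm_no_destroy(kvm);
            goto unlock_vcpu_destroy;
    }

    /* Fill the reserved slot; from here the vCPU is never yanked back out. */
    if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
            r = -EINVAL;
            goto kvm_put_xa_release;
    }

    /* Publish the vcpu pointer before the incremented count is visible. */
    smp_wmb();
    atomic_inc(&kvm->online_vcpus);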
2023-03-10  KVM: Register /dev/kvm as the _very_ last thing during initialization  [Sean Christopherson, 1 file, -9/+22]
commit 2b01281273738bf2d6551da48d65db2df3f28998 upstream. Register /dev/kvm, i.e. expose KVM to userspace, only after all other setup has completed. Once /dev/kvm is exposed, userspace can start invoking KVM ioctls, creating VMs, etc... If userspace creates a VM before KVM is done with its configuration, bad things may happen, e.g. KVM will fail to properly migrate vCPU state if a VM is created before KVM has registered preemption notifiers. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
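Sketched tail of kvm_init() after the change (heavily abbreviated; assumes the usual kvm_dev/kvm_chardev_ops names from virt/kvm/kvm_main.c):

    /* ... caches, per-CPU state, async #PF, preempt notifiers all set up first ... */

    /*
     * Registration _must_ be the very last thing done, as this exposes
     * /dev/kvm to userspace, i.e. all infrastructure must be set up.
     */
    kvm_chardev_ops.owner = module;
    r = misc_register(&kvm_dev);
    if (r) {
            pr_err("kvm: misc device register failed\n");
            goto err_register;
    }

    return 0;
    /* ... error unwinding elided ... */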
2023-03-10  KVM: Destroy target device if coalesced MMIO unregistration fails  [Sean Christopherson, 1 file, -3/+5]
commit b1cb1fac22abf102ffeb29dd3eeca208a3869d54 upstream. Destroy and free the target coalesced MMIO device if unregistering said device fails. As clearly noted in the code, kvm_io_bus_unregister_dev() does not destroy the target device. BUG: memory leak unreferenced object 0xffff888112a54880 (size 64): comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s) hex dump (first 32 bytes): 38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff 8.g.....8.g..... e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff .........0g..... backtrace: [<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline] [<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline] [<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150 [<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323 [<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline] [<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline] [<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696 [<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713 [<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline] [<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline] [<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718 [<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290 [<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe BUG: leak checking failed Fixes: 5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed") Cc: stable@vger.kernel.org Reported-by: 柳菁峰 <liujingfeng@qianxin.com> Reported-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20221219171924.67989-1-seanjc@google.com Link: https://lore.kernel.org/all/20230118220003.1239032-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01  kvm/vfio: Fix potential deadlock on vfio group_lock  [Yi Liu, 1 file, -3/+3]
[ Upstream commit 51cdc8bc120ef6e42f6fb758341f5d91bc955952 ]

Currently it is possible that the final put of a KVM reference comes from vfio during its device close operation. This occurs while the vfio group lock is held; however, if the vfio device is still in the kvm device list, then the following call chain could result in a deadlock:

    VFIO holds group->group_lock/group_rwsem
      -> kvm_put_kvm
        -> kvm_destroy_vm
          -> kvm_destroy_devices
            -> kvm_vfio_destroy
              -> kvm_vfio_file_set_kvm
                -> vfio_file_set_kvm
                  -> try to hold group->group_lock/group_rwsem

The key function is kvm_destroy_devices(), which triggers the destroy cb of kvm_device_ops. It calls back into vfio and tries to take group_lock; if this path did not call back into vfio, the deadlock would be avoided. There is a way to arrange that: KVM provides another point to free the kvm-vfio device, namely when the device file descriptor is closed. This can be achieved by providing the release cb instead of the destroy cb. Also rename kvm_vfio_destroy() to kvm_vfio_release().

    /*
     * Destroy is responsible for freeing dev.
     *
     * Destroy may be called before or after destructors are called
     * on emulated I/O regions, depending on whether a reference is
     * held by a vcpu or other kvm component that gets destroyed
     * after the emulated I/O.
     */
    void (*destroy)(struct kvm_device *dev);

    /*
     * Release is an alternative method to free the device. It is
     * called when the device file descriptor is closed. Once
     * release is called, the destroy method will not be called
     * anymore as the device is removed from the device list of
     * the VM. kvm->lock is held.
     */
    void (*release)(struct kvm_device *dev);

Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/20230114000351.115444-1-mjrosato@linux.ibm.com
Link: https://lore.kernel.org/r/20230120150528.471752-1-yi.l.liu@intel.com
[aw: update comment as well, s/destroy/release/]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
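The resulting ops table then looks roughly like this (a sketch; the surrounding kvm-vfio code is elided):

    /* Sketch: the kvm-vfio device now frees itself from the fd-close path. */
    static struct kvm_device_ops kvm_vfio_ops = {
            .name = "kvm-vfio",
            .create = kvm_vfio_create,
            .release = kvm_vfio_release,    /* was .destroy = kvm_vfio_destroy */
            .set_attr = kvm_vfio_set_attr,
            .has_attr = kvm_vfio_has_attr,
    };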
2022-11-23  Merge branch 'kvm-dwmw2-fixes' into HEAD  [Paolo Bonzini, 1 file, -1/+6]
This brings in a few important fixes for Xen emulation. While nobody should be enabling it, the bug effectively allows userspace to read arbitrary memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23  KVM: Update gfn_to_pfn_cache khva when it moves within the same page  [David Woodhouse, 1 file, -1/+6]
In the case where a GPC is refreshed to a different location within the same page, we didn't bother to update it. Mostly we don't need to, but since the ->khva field also includes the offset within the page, that does have to be updated. Fixes: 3ba2c95ea180 ("KVM: Do not incorporate page offset into gfn=>pfn cache user address") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17  KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL  [David Matlack, 1 file, -3/+24]
Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt, rather than just sampling the module parameter when the VM is first created. This restores the original behavior of kvm.halt_poll_ns for VMs that have not opted into KVM_CAP_HALT_POLL.

Notably, this change restores the ability for admins to disable or change the maximum halt-polling time system-wide for VMs not using KVM_CAP_HALT_POLL.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: acd05785e48c ("kvm: add capability for halt polling")
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20221117001657.1067231-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
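A sketch of the per-halt decision this implies; the override_halt_poll_ns flag name is an assumption standing in for "has this VM enabled KVM_CAP_HALT_POLL":

    /* Sketch: use the per-VM cap only if userspace opted into KVM_CAP_HALT_POLL. */
    static unsigned int kvm_vcpu_max_halt_poll_ns(struct kvm_vcpu *vcpu)
    {
            struct kvm *kvm = vcpu->kvm;

            if (kvm->override_halt_poll_ns)
                    return READ_ONCE(kvm->max_halt_poll_ns);

            /* Otherwise the admin-controlled module parameter applies on every halt. */
            return READ_ONCE(halt_poll_ns);
    }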
2022-11-17  KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling  [David Matlack, 1 file, -6/+15]
Avoid re-reading kvm->max_halt_poll_ns multiple times during halt-polling except when it is explicitly useful, e.g. to check if the max time changed across a halt. kvm->max_halt_poll_ns can be changed at any time by userspace via KVM_CAP_HALT_POLL.

This bug is unlikely to cause any serious side-effects. In the worst case one halt polls for shorter or longer than it should, and then is fixed up on the next halt. Furthermore, this is still possible since changes to kvm->max_halt_poll_ns are not synchronized with halts.

Fixes: acd05785e48c ("kvm: add capability for halt polling")
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20221117001657.1067231-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17  KVM: Cap vcpu->halt_poll_ns before halting rather than after  [David Matlack, 1 file, -4/+6]
Cap vcpu->halt_poll_ns based on the max halt polling time just before halting, rather than after the last halt. This arguably provides better accuracy if an admin disables halt polling in between halts, although the improvement is nominal. A side-effect of this change is that grow_halt_poll_ns() no longer needs to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future commit where the max halt polling time can come from the module parameter halt_poll_ns instead. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-06  Merge tag 'kvmarm-fixes-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  [Paolo Bonzini, 1 file, -0/+3]
* Fix the pKVM stage-1 walker erroneously using the stage-2 accessor
* Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration
* Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them
* Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
* Document the boot requirements for FGT when entering the kernel at EL1
2022-10-31  KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them  [Gavin Shan, 1 file, -0/+3]
There are two capabilities related to ring-based dirty page tracking: KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are supported by x86, but arm64 advertises only KVM_CAP_DIRTY_LOG_RING_ACQ_REL, and only when the feature is supported. However, userspace is not limited to enabling a capability that has been advertised, meaning KVM_CAP_DIRTY_LOG_RING can be enabled on arm64 by userspace, which is wrong.

Fix it by double-checking that the capability has been advertised prior to enabling it; enabling a capability that hasn't been advertised is rejected.

Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com
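A sketch of the enable-path double check (the exact helper used to re-check the advertisement is an assumption):

    /* Sketch: in kvm_vm_ioctl_enable_cap_generic(). */
    case KVM_CAP_DIRTY_LOG_RING:
    case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
            /* Only allow enabling a cap that KVM_CHECK_EXTENSION actually advertised. */
            if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
                    return -EINVAL;

            return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);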
2022-10-27  KVM: Reject attempts to consume or refresh inactive gfn_to_pfn_cache  [Sean Christopherson, 1 file, -7/+34]
Reject kvm_gpc_check() and kvm_gpc_refresh() if the cache is inactive. Not checking the active flag during refresh is particularly egregious, as KVM can end up with a valid, inactive cache, which can lead to a variety of use-after-free bugs, e.g. consuming a NULL kernel pointer or missing an mmu_notifier invalidation due to the cache not being on the list of gfns to invalidate.

Note, "active" needs to be set if and only if the cache is on the list of caches, i.e. is reachable via mmu_notifier events. If a relevant mmu_notifier event occurs while the cache is "active" but not on the list, KVM will not acquire the cache's lock and so will not serialize the mmu_notifier event with active users and/or kvm_gpc_refresh().

A race between KVM_XEN_ATTR_TYPE_SHARED_INFO and KVM_XEN_HVM_EVTCHN_SEND can be exploited to trigger the bug.

1. Deactivate shinfo cache:

    kvm_xen_hvm_set_attr
      case KVM_XEN_ATTR_TYPE_SHARED_INFO
        kvm_gpc_deactivate
          kvm_gpc_unmap
            gpc->valid = false
            gpc->khva = NULL
          gpc->active = false

   Result: active = false, valid = false

2. Cause cache refresh:

    kvm_arch_vm_ioctl
      case KVM_XEN_HVM_EVTCHN_SEND
        kvm_xen_hvm_evtchn_send
          kvm_xen_set_evtchn
            kvm_xen_set_evtchn_fast
              kvm_gpc_check
                return -EWOULDBLOCK because !gpc->valid
            kvm_xen_set_evtchn_fast
              return -EWOULDBLOCK
          kvm_gpc_refresh
            hva_to_pfn_retry
              gpc->valid = true
              gpc->khva = not NULL

   Result: active = false, valid = true

3. Race ioctl KVM_XEN_HVM_EVTCHN_SEND against ioctl KVM_XEN_ATTR_TYPE_SHARED_INFO:

    kvm_arch_vm_ioctl
      case KVM_XEN_HVM_EVTCHN_SEND
        kvm_xen_hvm_evtchn_send
          kvm_xen_set_evtchn
            kvm_xen_set_evtchn_fast
              read_lock gpc->lock
                                              kvm_xen_hvm_set_attr
                                                case KVM_XEN_ATTR_TYPE_SHARED_INFO
                                                  mutex_lock kvm->lock
                                                  kvm_xen_shared_info_init
                                                    kvm_gpc_activate
                                                      gpc->khva = NULL
              kvm_gpc_check
                [ Check passes because gpc->valid is still true,
                  even though gpc->khva is already NULL. ]
              shinfo = gpc->khva
              pending_bits = shinfo->evtchn_pending
              CRASH: test_and_set_bit(..., pending_bits)

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Reported-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013211234.1318131-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
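Sketch of the added guards, with signatures simplified and locking/refresh details elided:

    /* Sketch: an inactive cache must never be consumed or refreshed. */
    bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
    {
            if (!gpc->active)
                    return false;

            /* ... existing validity / generation / length checks ... */
            return gpc->valid;
    }

    int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len)
    {
            if (!gpc->active)
                    return -EINVAL;

            /* ... existing remap/refresh logic ... */
            return 0;
    }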
2022-10-27  KVM: Initialize gfn_to_pfn_cache locks in dedicated helper  [Michal Luczaj, 1 file, -9/+12]
Move the gfn_to_pfn_cache lock initialization to another helper and call the new helper during VM/vCPU creation. There are race conditions possible due to kvm_gfn_to_pfn_cache_init()'s ability to re-initialize the cache's locks. For example, a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.

    (thread 1)                                   | (thread 2)
                                                 |
    kvm_xen_set_evtchn_fast                      |
      read_lock_irqsave(&gpc->lock, ...)         |
                                                 | kvm_gfn_to_pfn_cache_init
                                                 |   rwlock_init(&gpc->lock)
      read_unlock_irqrestore(&gpc->lock, ...)    |

Rename "cache_init" and "cache_destroy" to activate+deactivate to avoid implying that the cache really is destroyed/freed. Note, there are more races in the newly named kvm_gpc_activate() that will be addressed separately.

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
[sean: call out that this is a bug fix]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013211234.1318131-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
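A sketch of the resulting split, with signatures simplified; the point is that the locks are initialized exactly once, at VM/vCPU creation:

    /* Sketch: one-time initialization, called from VM/vCPU creation. */
    void kvm_gpc_init(struct gfn_to_pfn_cache *gpc)
    {
            rwlock_init(&gpc->lock);
            mutex_init(&gpc->refresh_lock);
    }

    /* Sketch: activation may run many times and must not touch the locks. */
    int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len)
    {
            /* ... mark active, add to the VM's cache list, then refresh ... */
            return 0;
    }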
2022-10-27  KVM: debugfs: Return retval of simple_attr_open() if it fails  [Hou Wenlong, 1 file, -7/+6]
Although simple_attr_open() fails only with -ENOMEM in the current code base, it would be nicer to return the retval of simple_attr_open() directly in kvm_debugfs_open(). No functional change intended.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22  kvm: Add support for arch compat vm ioctls  [Alexander Graf, 1 file, -0/+11]
We will introduce the first architecture specific compat vm ioctl in the next patch. Add all necessary boilerplate to allow architectures to override compat vm ioctls when necessary. Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-2-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
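A sketch of the boilerplate this describes: a weak arch hook consulted before the generic compat handling (details simplified):

    /* Sketch: arch hook with a weak fallback, tried before the generic path. */
    long __weak kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
                                         unsigned long arg)
    {
            return -ENOTTY;
    }

    static long kvm_vm_compat_ioctl(struct file *filp,
                                    unsigned int ioctl, unsigned long arg)
    {
            struct kvm *kvm = filp->private_data;
            long r;

            /* Give the architecture first crack at the compat ioctl. */
            r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
            if (r != -ENOTTY)
                    return r;

            /* ... fall through to the common compat handling ... */
            return kvm_vm_ioctl(filp, ioctl, arg);
    }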
2022-10-12  Merge tag 'vfio-v6.1-rc1' of https://github.com/awilliam/linux-vfio  [Linus Torvalds, 1 file, -13/+32]
Pull VFIO updates from Alex Williamson: - Prune private items from vfio_pci_core.h to a new internal header, fix missed function rename, and refactor vfio-pci interrupt defines (Jason Gunthorpe) - Create consistent naming and handling of ioctls with a function per ioctl for vfio-pci and vfio group handling, use proper type args where available (Jason Gunthorpe) - Implement a set of low power device feature ioctls allowing userspace to make use of power states such as D3cold where supported (Abhishek Sahu) - Remove device counter on vfio groups, which had restricted the page pinning interface to singleton groups to account for limitations in the type1 IOMMU backend. Document usage as limited to emulated IOMMU devices, ie. traditional mdev devices where this restriction is consistent (Jason Gunthorpe) - Correct function prefix in hisi_acc driver incurred during previous refactoring (Shameer Kolothum) - Correct typo and remove redundant warning triggers in vfio-fsl driver (Christophe JAILLET) - Introduce device level DMA dirty tracking uAPI and implementation in the mlx5 variant driver (Yishai Hadas & Joao Martins) - Move much of the vfio_device life cycle management into vfio core, simplifying and avoiding duplication across drivers. This also facilitates adding a struct device to vfio_device which begins the introduction of device rather than group level user support and fills a gap allowing userspace identify devices as vfio capable without implicit knowledge of the driver (Kevin Tian & Yi Liu) - Split vfio container handling to a separate file, creating a more well defined API between the core and container code, masking IOMMU backend implementation from the core, allowing for an easier future transition to an iommufd based implementation of the same (Jason Gunthorpe) - Attempt to resolve race accessing the iommu_group for a device between vfio releasing DMA ownership and removal of the device from the IOMMU driver. 
Follow-up with support to allow vfio_group to exist with NULL iommu_group pointer to support existing userspace use cases of holding the group file open (Jason Gunthorpe) - Fix error code and hi/lo register manipulation issues in the hisi_acc variant driver, along with various code cleanups (Longfang Liu) - Fix a prior regression in GVT-g group teardown, resulting in unreleased resources (Jason Gunthorpe) - A significant cleanup and simplification of the mdev interface, consolidating much of the open coded per driver sysfs interface support into the mdev core (Christoph Hellwig) - Simplification of tracking and locking around vfio_groups that fall out from previous refactoring (Jason Gunthorpe) - Replace trivial open coded f_ops tests with new helper (Alex Williamson) * tag 'vfio-v6.1-rc1' of https://github.com/awilliam/linux-vfio: (77 commits) vfio: More vfio_file_is_group() use cases vfio: Make the group FD disassociate from the iommu_group vfio: Hold a reference to the iommu_group in kvm for SPAPR vfio: Add vfio_file_is_group() vfio: Change vfio_group->group_rwsem to a mutex vfio: Remove the vfio_group->users and users_comp vfio/mdev: add mdev available instance checking to the core vfio/mdev: consolidate all the description sysfs into the core code vfio/mdev: consolidate all the available_instance sysfs into the core code vfio/mdev: consolidate all the name sysfs into the core code vfio/mdev: consolidate all the device_api sysfs into the core code vfio/mdev: remove mtype_get_parent_dev vfio/mdev: remove mdev_parent_dev vfio/mdev: unexport mdev_bus_type vfio/mdev: remove mdev_from_dev vfio/mdev: simplify mdev_type handling vfio/mdev: embedd struct mdev_parent in the parent data structure vfio/mdev: make mdev.h standalone includable drm/i915/gvt: simplify vgpu configuration management drm/i915/gvt: fix a memory leak in intel_gvt_init_vgpu_types ...
2022-10-07  vfio: Hold a reference to the iommu_group in kvm for SPAPR  [Jason Gunthorpe, 1 file, -11/+14]
SPAPR exists completely outside the normal iommu driver framework, the groups it creates are fake and are only created to enable VFIO's uAPI. Thus, it does not need to follow the iommu core rule that the iommu_group will only be touched while a driver is attached. Carry a group reference into KVM and have KVM directly manage the lifetime of this object independently of VFIO. This means KVM no longer relies on the vfio group file being valid to maintain the group reference. Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-07  vfio: Add vfio_file_is_group()  [Jason Gunthorpe, 1 file, -2/+18]
This replaces uses of vfio_file_iommu_group() which were only detecting if the file is a VFIO file, with no interest in the actual group.

The only remaining user of vfio_file_iommu_group() is in KVM for the SPAPR stuff. It passes the iommu_group into the arch code through kvm for some reason.

Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-03  Merge tag 'kvmarm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  [Paolo Bonzini, 3 files, -3/+24]
KVM/arm64 updates for v6.1
- Fixes for single-stepping in the presence of an async exception as well as the preservation of PSTATE.SS
- Better handling of AArch32 ID registers on AArch64-only systems
- Fixes for the dirty-ring API, allowing it to work on architectures with relaxed memory ordering
- Advertise the new kvmarm mailing list
- Various minor cleanups and spelling fixes
2022-09-29  KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option  [Marc Zyngier, 2 files, -1/+22]
In order to differentiate between architectures that require no extra synchronisation when accessing the dirty ring and those that do, add a new capability (KVM_CAP_DIRTY_LOG_RING_ACQ_REL) that identifies the latter sort. TSO architectures can obviously advertise both, while relaxed architectures must only advertise the ACQ_REL version.

This requires some configuration symbol rejigging, with HAVE_KVM_DIRTY_RING being only indirectly selected by two top-level config symbols:
- HAVE_KVM_DIRTY_RING_TSO for strongly ordered architectures (x86)
- HAVE_KVM_DIRTY_RING_ACQ_REL for weakly ordered architectures (arm64)

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20220926145120.27974-3-maz@kernel.org
2022-09-29  KVM: Use acquire/release semantics when accessing dirty ring GFN state  [Marc Zyngier, 1 file, -2/+2]
The current implementation of the dirty ring has an implicit requirement that stores to the dirty ring from userspace must:
- be ordered with one another
- be visible from another CPU executing a ring reset

While these implicit requirements work well for x86 (and any other TSO-like architecture), they do not work for more relaxed architectures such as arm64, where stores to different addresses can be freely reordered and loads from these addresses may not observe writes from another CPU unless the required barriers (or acquire/release semantics) are used.

In order to start fixing this, upgrade the ring reset accesses:
- the kvm_dirty_gfn_harvested() helper now uses acquire semantics so it is ordered after all previous writes, including that from userspace
- the kvm_dirty_gfn_set_invalid() helper now uses release semantics so that the next_slot and next_offset reads don't drift past the entry invalidation

This is only a partial fix as the userspace side also needs upgrading.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20220926145120.27974-2-maz@kernel.org
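What the upgraded accessors look like, roughly (virt/kvm/dirty_ring.c style; treat this as a sketch):

    /* Sketch: the reset-side accessors with acquire/release ordering. */
    static void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
    {
            /* Release: earlier reads of next_slot/next_offset can't drift past this. */
            smp_store_release(&gfn->flags, 0);
    }

    static bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
    {
            /* Acquire: ordered after userspace's store of the RESET flag. */
            return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
    }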
2022-09-26  KVM: remove KVM_REQ_UNHALT  [Paolo Bonzini, 1 file, -3/+1]
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20220921003201.1441511-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26  KVM: fix memoryleak in kvm_init()  [Miaohe Lin, 1 file, -3/+2]
When alloc_cpumask_var_node() fails for a certain cpu, there might already be some allocated cpumasks for the percpu cpu_kick_mask. We should free these cpumasks, or a memory leak will occur.

Fixes: baff59ccdc65 ("KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220823063414.59778-1-linmiaohe@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
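The shape of the fix, sketched (the allocation loop lives in kvm_init(); the error label name is illustrative):

    /* Sketch: free every cpumask allocated so far if a later allocation fails. */
    for_each_possible_cpu(cpu) {
            if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
                                        GFP_KERNEL, cpu_to_node(cpu))) {
                    r = -ENOMEM;
                    goto err_cpu_kick_mask;
            }
    }
    /* ... */

    err_cpu_kick_mask:
            for_each_possible_cpu(cpu)
                    free_cpumask_var(per_cpu(cpu_kick_mask, cpu));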
2022-08-19  KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()  [Li kunyu, 1 file, -1/+1]
The variable is initialized, but the initial value is never used: "ops" is assigned before its first read, so the initialization can be dropped.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Message-Id: <20220819021535.483702-1-kunyu@nfschina.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow()  [Li kunyu, 1 file, -1/+1]
The variable is initialized, but the initial value is never used: "npages" is assigned before its first read, so the initialization can be dropped.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Message-Id: <20220819022804.483914-1-kunyu@nfschina.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  KVM: Rename mmu_notifier_* to mmu_invalidate_*  [Chao Peng, 2 files, -33/+36]
The motivation of this renaming is to make these variables and related helper functions less mmu_notifier-bound, so they can also be used for non-mmu_notifier-based page invalidation. mmu_invalidate_* was chosen to better describe the purpose of 'invalidating' a page that those variables are used for.

- mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end.
- mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}.
- mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count.
- While here, also update kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()  [Sean Christopherson, 1 file, -5/+6]
Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating and initializing coalesced MMIO objects is separate from registering any associated devices. Moving coalesced MMIO cleans up the last oddity where KVM does VM creation/initialization after kvm_create_vm(), and more importantly after kvm_arch_post_init_vm() is called and the VM is added to the global vm_list, i.e. after the VM is fully created as far as KVM is concerned.

Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but the original implementation was completely devoid of error handling. Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling") fixed the various bugs, and in doing so rightly moved the call to after kvm_create_vm() because kvm_coalesced_mmio_init() also registered the coalesced MMIO device. Commit 2b3c246a682c ("KVM: Make coalesced mmio use a device per zone") cleaned up that mess by having each zone register a separate device, i.e. moved device registration to its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result, kvm_coalesced_mmio_init() is now a "pure" initialization helper and can be safely called from kvm_create_vm().

Opportunistically drop the #ifdef; KVM provides stubs for kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220816053937.2477106-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  KVM: Unconditionally get a ref to /dev/kvm module when creating a VM  [Sean Christopherson, 1 file, -10/+4]
Unconditionally get a reference to the /dev/kvm module when creating a VM instead of using try_get_module(), which will fail if the module is in the process of being forcefully unloaded.

The error handling when try_get_module() fails doesn't properly unwind all that has been done, e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM from the global list. Not removing VMs from the global list tends to be fatal, e.g. leads to use-after-free explosions.

The obvious alternative would be to add proper unwinding, but the justification for using try_get_module(), "rmmod --wait", is completely bogus as support for "rmmod --wait", i.e. delete_module() without O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod --wait option.") nearly a decade ago.

It's still possible for try_get_module() to fail due to the module dying (more like being killed), as the module will be tagged MODULE_STATE_GOING by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice with forced unloading is an exercise in futility and gives a false sense of security. Using try_get_module() only prevents acquiring _new_ references, it doesn't magically put the references held by other VMs, and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but guaranteed to cause spectacular fireworks; the window where KV