linux.git/virt, branch v4.9.158

kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)

2019-02-12T18:45:01+00:00

commit cfa39381173d5f969daf43582c95ad679189cbc9 upstream.

kvm_ioctl_create_device() does the following:

1. creates a device that holds a reference to the VM object (with a borrowed
   reference, the VM's refcount has not been bumped yet)
2. initializes the device
3. transfers the reference to the device to the caller's file descriptor table
4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
   reference

The ownership transfer in step 3 must not happen before the reference to the VM
becomes a proper, non-borrowed reference, which only happens in step 4.
After step 3, an attacker can close the file descriptor and drop the borrowed
reference, which can cause the refcount of the kvm object to drop to zero.

This means that we need to grab a reference for the device before
anon_inode_getfd(), otherwise the VM can disappear from under us.

Fixes: 852b6d57dc7f ("kvm: add device control API")
Cc: stable@kernel.org
Signed-off-by: Jann Horn 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

KVM: arm/arm64: Fix vgic init race

2018-09-26T06:36:34+00:00

[ Upstream commit 1d47191de7e15900f8fbfe7cccd7c6e1c2d7c31a ]

The vgic_init function can race with kvm_arch_vcpu_create() which does
not hold kvm_lock() and we therefore have no synchronization primitives
to ensure we're doing the right thing.

As the user is trying to initialize or run the VM while at the same time
creating more VCPUs, we just have to refuse to initialize the VGIC in
this case rather than silently failing with a broken VCPU.

Reviewed-by: Eric Auger 
Signed-off-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer

2018-08-24T11:12:41+00:00

commit 9432a3175770e06cb83eada2d91fac90c977cb99 upstream.

A comment warning against this bug is there, but the code is not doing what
the comment says.  Therefore it is possible that an EPOLLHUP races against
irq_bypass_register_consumer.  The EPOLLHUP handler schedules irqfd_shutdown,
and if that runs soon enough, you get a use-after-free.

Reported-by: syzbot 
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini 
Reviewed-by: David Hildenbrand 
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman

KVM: arm/arm64: Drop resource size check for GICV window

2018-08-24T11:12:30+00:00

[ Upstream commit ba56bc3a0786992755e6804fbcbdc60ef6cfc24c ]

When booting a 64 KB pages kernel on a ACPI GICv3 system that
implements support for v2 emulation, the following warning is
produced

  GICV size 0x2000 not a multiple of page size 0x10000

and support for v2 emulation is disabled, preventing GICv2 VMs
from being able to run on such hosts.

The reason is that vgic_v3_probe() performs a sanity check on the
size of the window (it should be a multiple of the page size),
while the ACPI MADT parsing code hardcodes the size of the window
to 8 KB. This makes sense, considering that ACPI does not bother
to describe the size in the first place, under the assumption that
platforms implementing ACPI will follow the architecture and not
put anything else in the same 64 KB window.

So let's just drop the sanity check altogether, and assume that
the window is at least 64 KB in size.

Fixes: 909777324588 ("KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init")
Signed-off-by: Ard Biesheuvel 
Signed-off-by: Marc Zyngier 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.

2018-07-25T09:23:57+00:00

commit b5020a8e6b54d2ece80b1e7dedb33c79a40ebd47 upstream.

Syzbot reports crashes in kvm_irqfd_assign(), caused by use-after-free
when kvm_irqfd_assign() and kvm_irqfd_deassign() run in parallel
for one specific eventfd. When the assign path hasn't finished but irqfd
has been added to kvm->irqfds.items list, another thead may deassign the
eventfd and free struct kvm_kernel_irqfd(). The assign path then uses
the struct kvm_kernel_irqfd that has been freed by deassign path. To avoid
such issue, keep irqfd under kvm->irq_srcu protection after the irqfd
has been added to kvm->irqfds.items list, and call synchronize_srcu()
in irq_shutdown() to make sure that irqfd has been fully initialized in
the assign path.

Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Signed-off-by: Tianyu Lan 
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

KVM: arm/arm64: Do not use kern_hyp_va() with kvm_vgic_global_state

2018-07-22T12:27:41+00:00

Commit 44a497abd621a71c645f06d3d545ae2f46448830 upstream.

kvm_vgic_global_state is part of the read-only section, and is
usually accessed using a PC-relative address generation (adrp + add).

It is thus useless to use kern_hyp_va() on it, and actively problematic
if kern_hyp_va() becomes non-idempotent. On the other hand, there is
no way that the compiler is going to guarantee that such access is
always PC relative.

So let's bite the bullet and provide our own accessor.

Acked-by: Catalin Marinas 
Reviewed-by: James Morse 
Signed-off-by: Marc Zyngier 
Signed-off-by: Greg Kroah-Hartman

kvm: Map PFN-type memory regions as writable (if possible)

2018-05-30T05:50:22+00:00

[ Upstream commit a340b3e229b24a56f1c7f5826b15a3af0f4b13e5 ]

For EPT-violations that are triggered by a read, the pages are also mapped with
write permissions (if their memory region is also writable). That would avoid
getting yet another fault on the same page when a write occurs.

This optimization only happens when you have a "struct page" backing the memory
region. So also enable it for memory regions that do not have a "struct page".

Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Radim Krčmář 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock

2018-05-22T14:57:56+00:00

commit bf308242ab98b5d1648c3663e753556bef9bec01 upstream.

kvm_read_guest() will eventually look up in kvm_memslots(), which requires
either to hold the kvm->slots_lock or to be inside a kvm->srcu critical
section.
In contrast to x86 and s390 we don't take the SRCU lock on every guest
exit, so we have to do it individually for each kvm_read_guest() call.

Provide a wrapper which does that and use that everywhere.

Note that ending the SRCU critical section before returning from the
kvm_read_guest() wrapper is safe, because the data has been *copied*, so
we don't need to rely on valid references to the memslot anymore.

Cc: Stable  # 4.8+
Reported-by: Jan Glauber 
Signed-off-by: Andre Przywara 
Acked-by: Christoffer Dall 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

KVM: mmu: Fix overlap between public and private memslots

2018-03-11T15:21:29+00:00

commit b28676bb8ae4569cced423dc2a88f7cb319d5379 upstream.

Reported by syzkaller:

    pte_list_remove: ffff9714eb1f8078 0->BUG
    ------------[ cut here ]------------
    kernel BUG at arch/x86/kvm/mmu.c:1157!
    invalid opcode: 0000 [#1] SMP
    RIP: 0010:pte_list_remove+0x11b/0x120 [kvm]
    Call Trace:
     drop_spte+0x83/0xb0 [kvm]
     mmu_page_zap_pte+0xcc/0xe0 [kvm]
     kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm]
     kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm]
     kvm_arch_flush_shadow_all+0xe/0x10 [kvm]
     kvm_mmu_notifier_release+0x6c/0xa0 [kvm]
     ? kvm_mmu_notifier_release+0x5/0xa0 [kvm]
     __mmu_notifier_release+0x79/0x110
     ? __mmu_notifier_release+0x5/0x110
     exit_mmap+0x15a/0x170
     ? do_exit+0x281/0xcb0
     mmput+0x66/0x160
     do_exit+0x2c9/0xcb0
     ? __context_tracking_exit.part.5+0x4a/0x150
     do_group_exit+0x50/0xd0
     SyS_exit_group+0x14/0x20
     do_syscall_64+0x73/0x1f0
     entry_SYSCALL64_slow_path+0x25/0x25

The reason is that when creates new memslot, there is no guarantee for new
memslot not overlap with private memslots. This can be triggered by the
following program:

   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 
   #include 

   long r[16];

   int main()
   {
	void *p = valloc(0x4000);

	r[2] = open("/dev/kvm", 0);
	r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);

	uint64_t addr = 0xf000;
	ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr);
	r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul);
	ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul);
	ioctl(r[6], KVM_RUN, 0);
	ioctl(r[6], KVM_RUN, 0);

	struct kvm_userspace_memory_region mr = {
		.slot = 0,
		.flags = KVM_MEM_LOG_DIRTY_PAGES,
		.guest_phys_addr = 0xf000,
		.memory_size = 0x4000,
		.userspace_addr = (uintptr_t) p
	};
	ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr);
	return 0;
   }

This patch fixes the bug by not adding a new memslot even if it
overlaps with private memslots.

Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Eric Biggers 
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li

kvm, mm: account kvm related kmem slabs to kmemcg

2017-12-25T13:23:43+00:00

[ Upstream commit 46bea48ac241fe0b413805952dda74dd0c09ba8b ]

The kvm slabs can consume a significant amount of system memory
and indeed in our production environment we have observed that
a lot of machines are spending significant amount of memory that
can not be left as system memory overhead. Also the allocations
from these slabs can be triggered directly by user space applications
which has access to kvm and thus a buggy application can leak
such memory. So, these caches should be accounted to kmemcg.

Signed-off-by: Shakeel Butt 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman