path: root/drivers/pci/controller/pci-hyperv.c
2023-11-21  x86/apic: Drop apic::delivery_mode  (Andrew Cooper, 1 file, -7/+0)
This field is set to APIC_DELIVERY_MODE_FIXED in all cases, and is read exactly once. Fold the constant in uv_program_mmr() and drop the field.

Searching for the origin of the stale HyperV comment reveals commit a31e58e129f7 ("x86/apic: Switch all APICs to Fixed delivery mode") which notes:

  As a consequence of this change, the apic::irq_delivery_mode field is now pointless, but this needs to be cleaned up in a separate patch.

6 years is long enough for this technical debt to have survived.

[ bp: Fold in https://lore.kernel.org/r/20231121123034.1442059-1-andrew.cooper3@citrix.com ]

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-1-bf049a2a0ed6@citrix.com
2023-10-14  PCI: hv: Annotate struct hv_dr_state with __counted_by  (Kees Cook, 1 file, -1/+1)
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct hv_dr_state.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Link: https://lore.kernel.org/linux-pci/20230922175257.work.900-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Krzysztof Wilczyński <kw@linux.com>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-pci@vger.kernel.org
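For reference, a minimal sketch of what such an annotation looks like; the field layout is abridged from the driver, not a verbatim copy:

    struct hv_dr_state {
            struct list_head list_entry;
            u32 device_count;
            /* Ties the flexible array's element count to device_count, so
             * CONFIG_UBSAN_BOUNDS / CONFIG_FORTIFY_SOURCE can bounds-check
             * accesses to func[] at run time. */
            struct hv_pcidev_description func[] __counted_by(device_count);
    };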
2023-08-22  PCI: hv: Fix a crash in hv_pci_restore_msi_msg() during hibernation  (Dexuan Cui, 1 file, -0/+3)
When a Linux VM with an assigned PCI device runs on Hyper-V, if the PCI device driver is not loaded yet (i.e. MSI-X/MSI is not enabled on the device yet), doing a VM hibernation triggers a panic in hv_pci_restore_msi_msg() -> msi_lock_descs(&pdev->dev), because pdev->dev.msi.data is still NULL. Avoid the panic by checking if MSI-X/MSI is enabled.

Link: https://lore.kernel.org/r/20230816175939.21566-1-decui@microsoft.com
Fixes: dc2b453290c4 ("PCI: hv: Rework MSI handling")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: sathyanarayanan.kuppuswamy@linux.intel.com
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: stable@vger.kernel.org
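A hedged sketch of the guard this describes (the exact placement inside hv_pci_restore_msi_msg() and the return handling are assumptions):

    static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
    {
            /* If the device driver never enabled MSI/MSI-X, there are no
             * descriptors to restore and pdev->dev.msi.data is still NULL,
             * so msi_lock_descs(&pdev->dev) would crash: skip the device. */
            if (!pdev->msix_enabled && !pdev->msi_enabled)
                    return 0;

            msi_lock_descs(&pdev->dev);
            /* ... re-compose and re-send each MSI message ... */
            msi_unlock_descs(&pdev->dev);
            return 0;
    }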
2023-06-18  PCI: hv: Add a per-bus mutex state_lock  (Dexuan Cui, 1 file, -3/+26)
In the case of fast device addition/removal, it's possible that hv_eject_device_work() can start to run before create_root_hv_pci_bus() starts to run; as a result, the pci_get_domain_bus_and_slot() in hv_eject_device_work() can return a 'pdev' of NULL, and hv_eject_device_work() can remove the 'hpdev', and immediately send a message PCI_EJECTION_COMPLETE to the host, and the host immediately unassigns the PCI device from the guest; meanwhile, create_root_hv_pci_bus() and the PCI device driver can be probing the dead PCI device and reporting timeout errors.

Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the mutex before powering on the PCI bus in hv_pci_enter_d0(): when hv_eject_device_work() starts to run, it's able to find the 'pdev' and call pci_stop_and_remove_bus_device(pdev):

 - if the PCI device driver has loaded, the PCI device driver's probe() function is already called in create_root_hv_pci_bus() -> pci_bus_add_devices(), and now hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to call the PCI device driver's remove() function and remove the device reliably;

 - if the PCI device driver hasn't loaded yet, the function call hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to remove the PCI device reliably and the PCI device driver's probe() function won't be called;

 - if the PCI device driver's probe() is already running (e.g., systemd-udev is loading the PCI device driver), it must be holding the per-device lock, and after the probe() finishes and releases the lock, hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to proceed to remove the device reliably.

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615044451.5580-6-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
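A rough sketch of the resulting locking pattern, simplified from the description above (error handling and surrounding code omitted):

    /* hv_pci_enter_d0(): power on the bus only while holding state_lock. */
    mutex_lock(&hbus->state_lock);
    /* ... send PCI_BUS_D0ENTRY, wait for the host's completion ... */
    mutex_unlock(&hbus->state_lock);

    /* hv_eject_device_work(): serialize ejection against bus creation
     * and driver probe by taking the same per-bus mutex. */
    mutex_lock(&hbus->state_lock);
    pdev = pci_get_domain_bus_and_slot(hbus->bridge->domain_nr, 0, wslot);
    if (pdev) {
            pci_lock_rescan_remove();
            pci_stop_and_remove_bus_device(pdev);
            pci_dev_put(pdev);
            pci_unlock_rescan_remove();
    }
    mutex_unlock(&hbus->state_lock);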
2023-06-18  Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"  (Dexuan Cui, 1 file, -37/+34)
This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.

The statement "the hv_pci_bus_exit() call releases structures of all its child devices" in commit d6af2ed29c7c is not true: in the path hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true), the parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the child "struct hv_pci_dev *hpdev" that is created earlier in pci_devices_present_work() -> new_pcichild_device().

The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7, where the old version of hv_pci_bus_exit() was used; when the commit was rebased and merged upstream, people didn't notice that it's not really necessary. The commit itself doesn't cause any issue, but it makes hv_pci_probe() more complicated. Revert it to facilitate some upcoming changes to hv_pci_probe().

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Wei Hu <weh@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615044451.5580-5-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-18  PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev  (Dexuan Cui, 1 file, -12/+0)
The hpdev->state is never really useful. The only use in hv_pci_eject_device() and hv_eject_device_work() is not really necessary.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615044451.5580-4-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-18  PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic  (Dexuan Cui, 1 file, -6/+5)
When the host tries to remove a PCI device, the host first sends a PCI_EJECT message to the guest, and the guest is supposed to gracefully remove the PCI device and send a PCI_EJECTION_COMPLETE message to the host; the host then sends a VMBus message CHANNELMSG_RESCIND_CHANNELOFFER to the guest (when the guest receives this message, the device is already unassigned from the guest) and the guest can do some final cleanup work; if the guest fails to respond to the PCI_EJECT message within one minute, the host sends the VMBus message CHANNELMSG_RESCIND_CHANNELOFFER and removes the PCI device forcibly.

In the case of fast device addition/removal, it's possible that the PCI device driver is still configuring MSI-X interrupts when the guest receives the PCI_EJECT message; the channel callback calls hv_pci_eject_device(), which sets hpdev->state to hv_pcichild_ejecting, and schedules a work hv_eject_device_work(); if the PCI device driver is calling pci_alloc_irq_vectors() -> ... -> hv_compose_msi_msg(), we can break the while loop in hv_compose_msi_msg() due to the updated hpdev->state, and leave data->chip_data with its default value of NULL; later, when the PCI device driver calls request_irq() -> ... -> hv_irq_unmask(), the guest crashes in hv_arch_irq_unmask() due to data->chip_data being NULL.

Fix the issue by not testing hpdev->state in the while loop: when the guest receives PCI_EJECT, the device is still assigned to the guest, and the guest has one minute to finish the device removal gracefully. We don't really need to (and we should not) test hpdev->state in the loop.

Fixes: de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615044451.5580-3-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
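After the fix, the wait loop keys only off the completion; a simplified sketch of the driver's polling logic (variable names approximate):

    /* The device remains assigned for up to a minute after PCI_EJECT, so
     * keep waiting for the host's response even if an ejection is in
     * flight; bailing out early would leave data->chip_data == NULL. */
    while (!try_wait_for_completion(&comp_pkt.host_event)) {
            if (irqs_disabled()) {
                    /* Interrupts are off: poll the channel ourselves so
                     * the host's response can still be delivered. */
                    hv_pci_onchannelcallback(hbus);
            } else {
                    udelay(100);
            }
    }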
2023-06-18  PCI: hv: Fix a race condition bug in hv_pci_query_relations()  (Dexuan Cui, 1 file, -0/+18)
Since day 1 of the driver, there has been a race between hv_pci_query_relations() and survey_child_resources(): during fast device hotplug, hv_pci_query_relations() may error out due to device-remove and the stack variable 'comp' is no longer valid; however, pci_devices_present_work() -> survey_child_resources() -> complete() may be running on another CPU and accessing the no-longer-valid 'comp'. Fix the race by flushing the workqueue before we exit from hv_pci_query_relations().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615044451.5580-2-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
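The shape of the fix, sketched (timeout handling around the wait is elided):

    ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message), 0,
                           VM_PKT_DATA_INBAND, 0);
    if (!ret)
            ret = wait_for_response(hdev, &comp);

    /* Even on failure, pci_devices_present_work() may already be queued
     * and about to complete() the on-stack 'comp': flush the workqueue
     * so nothing can touch 'comp' after this function returns. */
    flush_workqueue(hbus->wq);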
2023-04-21  PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg  (Dexuan Cui, 1 file, -41/+7)
4 commits are involved here:

  A (2016): commit 0de8ce3ee8e3 ("PCI: hv: Allocate physically contiguous hypercall params buffer")
  B (2017): commit be66b6736591 ("PCI: hv: Use page allocation for hbus structure")
  C (2019): commit 877b911a5ba0 ("PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer")
  D (2018): commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")

Patch D introduced the per-CPU hypercall input page "hyperv_pcpu_input_arg" in 2018. With patch D, we no longer need the per-Hyper-V-PCI-bus hypercall input page "hbus->retarget_msi_interrupt_params" that was added in patch A, and the issue addressed by patch B is no longer an issue, and we can also get rid of patch C.

The change here is required for PCI device assignment to work for Confidential VMs (CVMs) running without a paravisor, because otherwise we would have to call set_memory_decrypted() for "hbus->retarget_msi_interrupt_params" before calling the hypercall HVCALL_RETARGET_INTERRUPT.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Link: https://lore.kernel.org/r/20230421013025.17152-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
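Use of the per-CPU page then looks roughly like this (sketch; parameter setup elided):

    struct hv_retarget_device_interrupt *params;
    unsigned long flags;
    u64 res;

    /* The per-CPU hypercall input page must be used with interrupts
     * disabled so this CPU cannot be migrated or re-entered here. */
    local_irq_save(flags);
    params = *this_cpu_ptr(hyperv_pcpu_input_arg);
    memset(params, 0, sizeof(*params));
    /* ... fill in vector, target processors, device ID ... */
    res = hv_do_hypercall(HVCALL_RETARGET_INTERRUPT, params, NULL);
    local_irq_restore(flags);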
2023-04-17  PCI: hv: Enable PCI pass-thru devices in Confidential VMs  (Michael Kelley, 1 file, -64/+168)
For PCI pass-thru devices in a Confidential VM, Hyper-V requires that PCI config space be accessed via hypercalls. In normal VMs, config space accesses are trapped to the Hyper-V host and emulated. But in a confidential VM, the host can't access guest memory to decode the instruction for emulation, so an explicit hypercall must be used.

Add functions to make the new MMIO read and MMIO write hypercalls. Update the PCI config space access functions to use the hypercalls when such use is indicated by Hyper-V flags. Also, set the flag to allow the Hyper-V PCI driver to be loaded and used in a Confidential VM (a.k.a., "Isolation VM"). The driver has previously been hardened against a malicious Hyper-V host[1].

[1] https://lore.kernel.org/all/20220511223207.3386-2-parri.andrea@gmail.com/

Co-developed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://lore.kernel.org/r/1679838727-87310-13-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
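A hedged sketch of a hypercall-backed config read; the structure field names and the output handling here are assumptions based on the description, not the exact upstream definitions:

    static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa,
                                 int size, u32 *val)
    {
            struct hv_mmio_read_input *in;
            struct hv_mmio_read_output *out;
            u64 ret;

            /* Per-CPU hypercall pages; caller must have interrupts off. */
            in = *this_cpu_ptr(hyperv_pcpu_input_arg);
            out = *this_cpu_ptr(hyperv_pcpu_output_arg);
            in->gpa = gpa;          /* guest physical address to read */
            in->size = size;        /* 1, 2 or 4 bytes */

            ret = hv_do_hypercall(HVCALL_MMIO_READ, in, out);
            if (hv_result_success(ret))
                    *val = *(u32 *)out->data;
            else
                    dev_err(dev, "MMIO read hypercall failed: 0x%llx\n", ret);
    }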
2023-01-17  Drivers: hv: Make remove callback of hyperv driver void returned  (Dawei Li, 1 file, -6/+2)
Since commit fc7a6209d571 ("bus: Make remove callback return void") forces bus_type::remove to be void-returned, it doesn't make much sense for any bus-based driver implementing the remove callback to return non-void to its caller. As such, change the remove function for Hyper-V VMBus based drivers to return void.

Signed-off-by: Dawei Li <set_pte_at@outlook.com>
Link: https://lore.kernel.org/r/TYCP286MB2323A93C55526E4DF239D3ACCAFA9@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Wei Liu <wei.liu@kernel.org>
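The resulting shape in pci-hyperv, sketched:

    /* Previously: static int hv_pci_remove(struct hv_device *hdev);
     * the VMBus core ignored the return value, so drop it. */
    static void hv_pci_remove(struct hv_device *hdev)
    {
            /* ... tear down the PCI bus and free the hbus state ... */
    }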
2022-12-12  Merge tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -14/+1)
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt core and driver subsystem:

The bulk is the rework of the MSI subsystem to support per-device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design, which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device.

IMS (Interrupt Message Store) is a new specification which allows device manufacturers to provide implementation-defined storage for MSI messages (as opposed to PCI/MSI and PCI/MSI-X, which have a specified message store that is uniform across all devices). The PCI/MSI[-X] uniformity allowed us to get away with "global" PCI/MSI domains.

IMS not only makes it possible to overcome the size limitations of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device.

There have been several attempts to glue this into the current MSI code, but after lengthy discussions it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background.

When PCI/MSI[-X] support was added around 2003, interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific, and while there were attempts to create common infrastructure, the commonalities were rudimentary and just provided shared data structures and interfaces so that drivers could be written in an architecture-agnostic way.

The initial PCI/MSI[-X] support obviously plugged into this model, which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors (a pure software construct for holding data relevant for a particular MSI interrupt), but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stragglers alive.

In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel, which created yet another architecture-specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an incomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation.

At roughly the same time ARM struggled with the ever-growing SoC-specific extensions which were glued on top of the architected GIC interrupt controller.

This resulted in a fundamental redesign of interrupt management and provided the now-prevailing concept of hierarchical interrupt domains. This made it possible to disentangle the interactions between the x86 vector domain and interrupt remapping, and also allowed ARM to handle the zoo of SoC-specific interrupt components in a sane way.

The concept of hierarchical interrupt domains aims to encapsulate the functionality of particular IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The x86 encapsulation looks like this:

                                        |--- device 1
  [Vector]---[Remapping]---[PCI/MSI]--|...
                                        |--- device N

where the remapping domain is an optional component; in case it is not available, the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase, where it is obviously required to establish the proper parent relationship in the components of the hierarchy.

While in most cases the model strictly represents the chain of IP blocks and abstracts them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware, it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strictly a per-PCI-device entity.

Here we took a shortcut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains, which was possible because the PCI/MSI[-X] handling is uniform across the devices. This also allowed the existing PCI/MSI[-X] infrastructure to stay mostly unchanged, which in turn made it simple to keep the existing architecture-specific management alive.

A similar problem was created in the ARM world with support for IP-block-specific message storage. Instead of going all the way to stack an IP-block-specific domain on top of the generic MSI domain, this ended in a construct which provides a "global" platform MSI domain that allows overriding the irq_write_msi_msg() callback per allocation.

In the course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB, etc., where support for implementation-specific message storage was just mindlessly glued into the existing infrastructure. Some of this just works by chance on particular platforms but will fail in hard-to-diagnose ways when the driver is used on platforms where the underlying MSI interrupt management code does not expect the creative abuse.

Another shortcoming of today's PCI/MSI-X support is the inability to allocate or free individual vectors after the initial enablement of MSI-X. This results in a works-by-chance implementation of VFIO (PCI pass-through) where interrupts on the host side are not set up upfront to avoid resource exhaustion. They are expanded at run-time when the guest actually tries to use them. The way this is implemented is that the host disables MSI-X and then re-enables it with a larger number of vectors again. That works by chance because most device drivers set up all interrupts before the device actually will utilize them. But that's not universally true, because some drivers allocate a large enough number of vectors but do not utilize them until it's actually required, e.g. for acceleration support. At that point other interrupts of the device might be in active use, and the MSI-X disable/enable dance can just result in losing interrupts and therefore hard-to-diagnose subtle problems.

Last but not least, the "global" PCI/MSI-X domain approach prevents utilizing PCI/MSI[-X] and PCI/IMS on the same device, due to the fact that IMS no longer provides a uniform storage and configuration model.

The solution to this is to implement the missing step and switch from global PCI/MSI domains to per-device PCI/MSI domains. The resulting hierarchy then looks like this:

                            |--- [PCI/MSI] device 1
  [Vector]---[Remapping]---|...
                            |--- [PCI/MSI] device N

which in turn allows providing support for multiple domains per device:

                            |--- [PCI/MSI] device 1
                            |--- [PCI/IMS] device 1
  [Vector]---[Remapping]---|...
                            |--- [PCI/MSI] device N
                            |--- [PCI/IMS] device N

This work converts the MSI and PCI/MSI core and the x86 interrupt domains to the new model, provides new interfaces for post-enable allocation/free of MSI-X interrupts and the base framework for PCI/IMS. PCI/IMS has been verified with the work-in-progress IDXD driver.

There is work in progress to convert ARM over, which will replace the platform MSI train-wreck. The cleanup of VFIO, NTB and other creative "solutions" is in the works as well.

Drivers:

 - Updates for the LoongArch interrupt chip drivers

 - Support for MTK CIRQv2

 - The usual small fixes and updates all over the place"

* tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits)
  irqchip/ti-sci-inta: Fix kernel doc
  irqchip/gic-v2m: Mark a few functions __init
  irqchip/gic-v2m: Include arm-gic-common.h
  irqchip/irq-mvebu-icu: Fix works by chance pointer assignment
  iommu/amd: Enable PCI/IMS
  iommu/vt-d: Enable PCI/IMS
  x86/apic/msi: Enable PCI/IMS
  PCI/MSI: Provide pci_ims_alloc/free_irq()
  PCI/MSI: Provide IMS (Interrupt Message Store) support
  genirq/msi: Provide constants for PCI/IMS support
  x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
  PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
  PCI/MSI: Provide prepare_desc() MSI domain op
  PCI/MSI: Split MSI-X descriptor setup
  genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
  genirq/msi: Provide msi_domain_alloc_irq_at()
  genirq/msi: Provide msi_domain_ops::prepare_desc()
  genirq/msi: Provide msi_desc::msi_data
  genirq/msi: Provide struct msi_map
  x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
  ...
2022-11-28  PCI: hv: update comment in x86 specific hv_arch_irq_unmask  (Olaf Hering, 1 file, -3/+3)
The function hv_set_affinity was removed in commit 831c1ae7 ("PCI: hv: Make the code arch neutral by adding arch specific interfaces").

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/r/20221107171831.25283-1-olaf@aepfle.de
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-11-17  x86/apic: Remove X86_IRQ_ALLOC_CONTIGUOUS_VECTORS  (Thomas Gleixner, 1 file, -14/+1)
Now that the PCI/MSI core code does early checking for multi-MSI support, X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore. Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122015.865042356@linutronix.de
2022-11-12  PCI: hv: Only reuse existing IRTE allocation for Multi-MSI  (Dexuan Cui, 1 file, -15/+75)
Jeffrey added Multi-MSI support to the pci-hyperv driver in these 4 patches:

  08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
  455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
  b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
  a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")

It turns out that the third patch (b4b77778ecc5) causes a performance regression because all the interrupts now happen on 1 physical CPU (or two pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI devices, it may suffer from soft lockups if the workload is heavy, e.g., see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/

Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in hv_irq_unmask() -> hv_arch_irq_unmask() -> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target virtual CPU rather than the physical CPU; with b4b77778ecc5, the pCPU is determined only once in hv_compose_msi_msg(), where only vCPU0 is specified; consequently the hypervisor only uses 1 target pCPU for all the interrupts.

Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU is determined the second time, the vCPU in the effective affinity mask is used (i.e., it isn't always vCPU0), so the hypervisor chooses a different pCPU for each interrupt.

The hypercall will be fixed in the future to update the pCPU as well, but that will take quite a while, so let's restore the old behavior in hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin manner for each PCI device, so the interrupts of different devices can happen on different pCPUs, though the interrupts of each device happen on some single pCPU.

The hypercall fix may not be backported to all old versions of Hyper-V, so we want to have this guest-side change forever (or at least till we're sure the old affected versions of Hyper-V are no longer supported).

Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20221104222953.11356-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
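The per-device round-robin selection reads roughly like this (a sketch based on the description above; the helper name is an assumption):

    /* Pick the starting vCPU for each device's multi-MSI block in a
     * round-robin manner, so the interrupt blocks of different devices
     * land on different pCPUs even though the hypercall currently pins
     * each block to the pCPU chosen at compose time. */
    static int hv_compose_multi_msi_req_get_cpu(void)
    {
            static DEFINE_SPINLOCK(multi_msi_cpu_lock);
            static int cpu_next = -1;       /* -1: start from CPU 0 */
            unsigned long flags;
            int cpu;

            spin_lock_irqsave(&multi_msi_cpu_lock, flags);
            cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask,
                                         nr_cpu_ids, false);
            cpu = cpu_next;
            spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);

            return cpu;
    }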
2022-11-03  PCI: hv: Fix the definition of vector in hv_compose_msi_msg()  (Dexuan Cui, 1 file, -6/+16)
The local variable 'vector' must be u32 rather than u8: see the struct hv_msi_desc3. 'vector_count' should be u16 rather than u8: see struct hv_msi_desc, hv_msi_desc2 and hv_msi_desc3.

Fixes: a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Jeffrey Hugo <quic_jhugo@quicinc.com>
Cc: Carl Vanderlip <quic_carlv@quicinc.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://lore.kernel.org/r/20221027205256.17678-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
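The corrected declarations, per the structure fields cited above:

    /* hv_msi_desc3.vector is 32 bits wide, and the vector_count fields in
     * hv_msi_desc, hv_msi_desc2 and hv_msi_desc3 are 16 bits; u8 locals
     * would silently truncate both values. */
    u32 vector;
    u16 vector_count;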
2022-07-08  PCI: hv: Take a const cpumask in hv_compose_msi_req_get_cpu()  (Samuel Holland, 1 file, -1/+1)
The cpumask that is passed to this function ultimately comes from irq_data_get_effective_affinity_mask(), which was recently changed to return a const cpumask pointer. The first level of functions handling the affinity mask were updated, but not this helper function.

Fixes: 4d0b8298818b ("genirq: Return a const cpumask from irq_data_get_affinity_mask")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Samuel Holland <samuel@sholland.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220708004931.1672-1-samuel@sholland.org
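Since the helper only reads the mask, constifying the parameter is the whole fix; a sketch of the updated helper (the body is an assumption about what the function does with the mask):

    static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity)
    {
            /* Purely a read of the mask: return one online CPU from it. */
            return cpumask_first_and(affinity, cpu_online_mask);
    }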
2022-07-07  genirq: Return a const cpumask from irq_data_get_affinity_mask  (Samuel Holland, 1 file, -5/+5)
Now that the irq_data_update_affinity helper exists, enforce its use by returning a const cpumask from irq_data_get_affinity_mask. Since the previous commit already updated places that needed to call irq_data_update_affinity, this commit updates the remaining code that either did not modify the cpumask or immediately passed the modified mask to irq_set_affinity.

Signed-off-by: Samuel Holland <samuel@sholland.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220701200056.46555-8-samuel@sholland.org
2022-05-13  PCI: hv: Fix synchronization between channel callback and hv_pci_bus_exit()  (Andrea Parri (Microsoft), 1 file, -7/+19)
[ Similarly to commit a765ed47e4516 ("PCI: hv: Fix synchronization between channel callback and hv_compose_msi_msg()"): ]

The (on-stack) teardown packet becomes invalid once the completion timeout in hv_pci_bus_exit() has expired and hv_pci_bus_exit() has returned. Prevent the channel callback from accessing the invalid packet by removing the ID associated to such packet from the VMbus requestor in hv_pci_bus_exit().

Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Link: https://lore.kernel.org/r/20220511223207.3386-3-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-05-13  PCI: hv: Add validation for untrusted Hyper-V values  (Andrea Parri (Microsoft), 1 file, -9/+24)
For additional robustness in the face of Hyper-V errors or malicious behavior, validate all values that originate from packets that Hyper-V has sent to the guest in the host-to-guest ring buffer. Ensure that invalid values cannot cause data being copied out of the bounds of the source buffer in hv_pci_onchannelcallback().

While at it, remove a redundant validation in hv_pci_generic_compl(): hv_pci_onchannelcallback() already ensures that all processed incoming packets are "at least as large as [in fact larger than] a response".

Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Link: https://lore.kernel.org/r/20220511223207.3386-2-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
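The validation pattern this describes, sketched for one message type as a fragment of the channel callback (the exact per-message checks in the patch differ):

    case PCI_BUS_RELATIONS:
            bus_rel = (struct pci_bus_relations *)buffer;
            /* 'device_count' comes from the host and is untrusted: make
             * sure the claimed array really fits in the bytes received
             * before copying anything out of the ring buffer. */
            if (bytes_recvd < sizeof(*bus_rel) ||
                bytes_recvd < struct_size(bus_rel, func,
                                          bus_rel->device_count)) {
                    dev_err(&hbus->hdev->device,
                            "bus relations message too small\n");
                    break;
            }
            hv_pci_devices_present(hbus, bus_rel);
            break;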
2022-05-11  PCI: hv: Fix interrupt mapping for multi-MSI  (Jeffrey Hugo, 1 file, -10/+50)
According to Dexuan, the hypervisor folks believe that multi-MSI allocations are not correct. compose_msi_msg() will allocate multi-MSI one by one. However, multi-MSI is a block of related MSIs, with alignment requirements. In order for the hypervisor to allocate properly aligned and consecutive entries in the IOMMU Interrupt Remapping Table, there should be a single mapping request that requests all of the multi-MSI vectors in one shot.

Dexuan suggests detecting the multi-MSI case and composing a single request related to the first MSI. Then, for the other MSIs in the same block, use the cached information. This appears to be viable, so do it.

Suggested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1652282599-21643-1-git-send-email-quic_jhugo@quicinc.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-05-11  PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()  (Jeffrey Hugo, 1 file, -7/+9)
Currently if compose_msi_msg() is called multiple times, it will free any previous IRTE allocation, and generate a new allocation. While nothing prevents this from occurring, it is extraneous when Linux could just reuse the existing allocation and avoid a bunch of overhead.

However, when future IRTE allocations operate on blocks of MSIs instead of a single line, freeing the allocation will impact all of the lines. This could cause an issue where an allocation of N MSIs occurs, then some of the lines are retargeted, and finally the allocation is freed/reallocated. The freeing of the allocation removes all of the configuration for the entire block, which requires all the lines to be retargeted, which might not happen since some lines might already be unmasked/active.

Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1652282582-21595-1-git-send-email-quic_jhugo@quicinc.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-05-03  PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time  (Dexuan Cui, 1 file, -6/+11)
Currently when the pci-hyperv driver finishes probing and initializing the PCI device, it sets the PCI_COMMAND_MEMORY bit; later when the PCI device is registered to the core PCI subsystem, the core PCI driver's BAR detection and initialization code toggles the bit multiple times, and each toggling of the bit causes the hypervisor to unmap/map the virtual BARs from/to the physical BARs, which can be slow if the BAR sizes are huge, e.g., a Linux VM with 14 GPU devices has to spend more than 3 minutes on BAR detection and initialization, causing a long boot time.

Reduce the boot time by not setting the PCI_COMMAND_MEMORY bit when we register the PCI device (there is no need to have it set in the first place). The bit stays off till the PCI device driver calls pci_enable_device(). With this change, the boot time of such a 14-GPU VM is reduced by almost 3 minutes.

Link: https://lore.kernel.org/lkml/20220419220007.26550-1-decui@microsoft.com/
Tested-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Jake Oshins <jakeo@microsoft.com>
Link: https://lore.kernel.org/r/20220502074255.16901-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
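Nothing new is needed on the driver side to turn decoding back on; a hypothetical endpoint driver (names invented for illustration) gets it from the usual enable path:

    /* Hypothetical endpoint driver: pci_enable_device() is the point at
     * which memory decoding (PCI_COMMAND_MEMORY) is finally enabled,
     * once, instead of being toggled repeatedly during enumeration. */
    static int example_gpu_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *id)
    {
            int rc;

            rc = pci_enable_device(pdev);
            if (rc)
                    return rc;
            /* ... map BARs, request IRQs, etc. ... */
            return 0;
    }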
2022-04-28  PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI  (Jeffrey Hugo, 1 file, -8/+4)
In the multi-MSI case, hv_arch_irq_unmask() will only operate on the first MSI of the N allocated. This is because only the first msi_desc is cached and it is shared by all the MSIs of the multi-MSI block. This means that hv_arch_irq_unmask() gets the correct address, but the wrong data (always 0). This can break MSIs.

Let's assume MSI0 is vector 34 on CPU0, and MSI1 is vector 33 on CPU0. hv_arch_irq_unmask() is called on MSI0. It uses a hypercall to configure the MSI address and data (0) to vector 34 of CPU0. This is correct. Then hv_arch_irq_unmask() is called on MSI1. It uses another hypercall to configure the MSI address and data (0) to vector 33 of CPU0. This is wrong, and results in both MSI0 and MSI1 being routed to vector 33. Linux will observe extra instances of MSI1 and no instances of MSI0, despite the endpoint device behaving correctly.

For the multi-MSI case, we need unique address and data info for each MSI, but the cached msi_desc does not provide that. However, that information can be gotten from the int_desc cached in the chip_data by compose_msi_msg(). Fix the multi-MSI case to use that cached information instead. Since hv_set_msi_entry_from_desc() is no longer applicable, remove it.

Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1651068453-29588-1-git-send-email-quic_jhugo@quicinc.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
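A sketch of the fixed lookup inside hv_arch_irq_unmask(); the member names follow the description above and may differ slightly from the source:

    struct tran_int_desc *int_desc = irq_data_get_irq_chip_data(data);

    /* The shared msi_desc would give every line in the multi-MSI block
     * the same (data == 0) message; the per-interrupt int_desc cached
     * by compose_msi_msg() carries the real address/data pair. */
    params->int_entry.msi_entry.address.as_uint32 =
            int_desc->address & 0xffffffff;
    params->int_entry.msi_entry.data.as_uint32 = int_desc->data;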
2022-04-25  PCI: hv: Fix synchronization between channel callback and hv_compose_msi_msg()  (Andrea Parri (Microsoft), 1 file, -6/+27)
Dexuan wrote:

"[...] when we disable AccelNet, the host PCI VSP driver sends a PCI_EJECT message first, and the channel callback may set hpdev->state to hv_pcichild_ejecting on a different CPU. This can cause hv_compose_msi_msg() to exit from the loop and 'return', and the on-stack variable 'ctxt' is invalid. Now, if the response message from the host arrives, the channel callback will try to access the invalid 'ctxt' variable, and this may cause a crash."

Schematically:

  Hyper-V sends PCI_EJECT msg
    hv_pci_onchannelcallback()
      state = hv_pcichild_ejecting
                                        hv_compose_msi_msg()
                                          alloc and init comp_pkt
                                          state == hv_pcichild_ejecting
  Hyper-V sends VM_PKT_COMP msg
    hv_pci_onchannelcallback()
      retrieve address of comp_pkt
                                          'free' comp_pkt and return
      comp_pkt->completion_func()

Dexuan also showed how the crash can be triggered after introducing suitable delays in the driver code, thus validating the 'assumption' that the host can still normally respond to the guest's compose_msi request after the host has started to eject the PCI device.

Fix the synchronization by leveraging the requestor lock as follows:

 - Before 'return'-ing in hv_compose_msi_msg(), remove the ID (while holding the requestor lock) associated to the completion packet.

 - Retrieve the address *and* call ->completion_func() within the same (requestor) critical section in hv_pci_onchannelcallback().

Reported-by: Wei Hu <weh@microsoft.com>
Reported-by: Dexuan Cui <decui@microsoft.com>
Suggested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20220419122325.10078-7-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-04-25  PCI: hv: Use vmbus_requestor to generate transaction IDs for VMbus hardening  (Andrea Parri (Microsoft), 1 file, -10/+29)
Currently, pointers to guest memory are passed to Hyper-V as transaction IDs in hv_pci. In the face of errors or malicious behavior in Hyper-V, hv_pci should not expose or trust the transaction IDs returned by Hyper-V to be valid guest memory addresses. Instead, use small integers generated by vmbus_requestor as request (transaction) IDs.

Suggested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20220419122325.10078-3-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-04-25  PCI: hv: Fix multi-MSI to allow more than one MSI vector  (Jeffrey Hugo, 1 file, -1/+10)
If the allocation of multiple MSI vectors for multi-MSI fails in the core PCI framework, the framework will retry the allocation as a single MSI vector, assuming that meets the min_vecs specified by the requesting driver.

Hyper-V advertises that multi-MSI is supported, but reuses the VECTOR domain to implement that for x86. The VECTOR domain does not support multi-MSI, so the alloc will always fail and fall back to a single MSI allocation. In short, Hyper-V advertises a capability it does not implement.

Hyper-V can support multi-MSI because it coordinates with the hypervisor to map the MSIs in the IOMMU's interrupt remapper, which is something the VECTOR domain does not have. Therefore the fix is simple - copy what the x86 IOMMU drivers (AMD/Intel-IR) do by removing X86_IRQ_ALLOC_CONTIGUOUS_VECTORS after calling the VECTOR domain's pci_msi_prepare().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/1649856981-14649-1-git-send-email-quic_jhugo@quicinc.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
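The fix, sketched: wrap the VECTOR domain's pci_msi_prepare() and clear the flag it cannot honor, mirroring what the AMD/Intel-IR drivers do:

    static int hv_msi_prepare(struct irq_domain *domain, struct device *dev,
                              int nvec, msi_alloc_info_t *info)
    {
            int ret = pci_msi_prepare(domain, dev, nvec, info);

            /* Hyper-V remaps the MSIs in the IOMMU's interrupt remapping
             * table, so the allocated vectors need not be contiguous in
             * the VECTOR domain; dropping the flag lets multi-MSI
             * allocations succeed. */
            if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
                    info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;

            return ret;
    }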
2022-04-07  Merge tag 'hyperv-fixes-signed-20220407' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux  (Linus Torvalds, 1 file, -0/+9)
Pull hyperv fixes from Wei Liu:

 - Correctly propagate coherence information for VMbus devices (Michael Kelley)
 - Disable balloon and memory hot-add on ARM64 temporarily (Boqun Feng)
 - Use barrier to prevent reordering when reading ring buffer (Michael Kelley)
 - Use virt_store_mb in favour of smp_store_mb (Andrea Parri)
 - Fix VMbus device object initialization (Andrea Parri)
 - Deactivate sysctl_record_panic_msg on isolated guest (Andrea Parri)
 - Fix a crash when unloading VMbus module (Guilherme G. Piccoli)

* tag 'hyperv-fixes-signed-20220407' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb()
  Drivers: hv: balloon: Disable balloon and hot-add accordingly
  Drivers: hv: balloon: Support status report for larger page sizes
  Drivers: hv: vmbus: Prevent load re-ordering when reading ring buffer
  PCI: hv: Propagate coherence from VMbus device to PCI device
  Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device
  Drivers: hv: vmbus: Fix potential crash on module unload
  Drivers: hv: vmbus: Fix initialization of device object in vmbus_device_register()
  Drivers: hv: vmbus: Deactivate sysctl_record_panic_msg by default in isolated guests
2022-03-31  PCI: hv: Remove unused hv_set_msi_entry_from_desc()  (YueHaibing, 1 file, -8/+0)
Fix the following build error:

  drivers/pci/controller/pci-hyperv.c:769:13: error: ‘hv_set_msi_entry_from_desc’ defined but not used [-Werror=unused-function]
    769 | static void hv_set_msi_entry_from_desc(union hv_msi_entry *msi_entry,

The arm64 implementation of hv_set_msi_entry_from_desc() is not used after d06957d7a692 ("PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on ARM64"), so remove it.

Fixes: d06957d7a692 ("PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on ARM64")
Link: https://lore.kernel.org/r/20220317085130.36388-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
2022-03-29  PCI: hv: Propagate coherence from VMbus device to PCI device  (Michael Kelley, 1 file, -0/+9)
PCI pass-thru devices in a Hyper-V VM are represented as a VMBus device and as a PCI device. The coherence of the VMbus device is set based on the VMbus node in ACPI, but the PCI device has no ACPI node and defaults to not hardware coherent. This results in extra software coherence management overhead on ARM64 when devices are hardware coherent.

Fix this by setting up the PCI host bus so that normal PCI mechanisms will propagate the coherence of the VMbus device to the PCI device. There's no effect on x86/x64 where devices are always hardware coherent.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/1648138492-2191-3-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-03-25  Merge tag 'pci-v5.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci  (Linus Torvalds, 1 file, -111/+122)
Pull pci updates from Bjorn Helgaas:

"Enumeration:
 - Move the VGA arbiter from drivers/gpu to drivers/pci because it's PCI-specific, not GPU-specific (Bjorn Helgaas)
 - Select the default VGA device consistently whether it's enumerated before or after VGA arbiter init, which fixes arches that enumerate PCI devices late (Huacai Chen)

Resource management:
 - Support BAR sizes up to 8TB (Dongdong Liu)

PCIe native device hotplug:
 - Fix "Command Completed" tracking to avoid spurious timeouts when powering off empty slots (Liguang Zhang)
 - Quirk Qualcomm devices that don't implement Command Completed correctly, again to avoid spurious timeouts (Manivannan Sadhasivam)

Peer-to-peer DMA:
 - Add Intel 3rd Gen Intel Xeon Scalable Processors to whitelist (Michael J. Ruhl)

APM X-Gene PCIe controller driver:
 - Revert generic DT parsing changes that broke some machines in the field (Marc Zyngier)

Freescale i.MX6 PCIe controller driver:
 - Allow controller probe to succeed even when no devices currently present to allow hot-add later (Fabio Estevam)
 - Enable power management on i.MX6QP (Richard Zhu)
 - Assert CLKREQ# on i.MX8MM so enumeration doesn't hang when no device is connected (Richard Zhu)

Marvell Aardvark PCIe controller driver:
 - Fix MSI and MSI-X support (Marek Behún, Pali Rohár)
 - Add support for ERR and PME interrupts (Pali Rohár)

Marvell MVEBU PCIe controller driver:
 - Add DT binding and support for "num-lanes" (Pali Rohár)
 - Add support for INTx interrupts (Pali Rohár)

Microsoft Hyper-V host bridge driver:
 - Avoid unnecessary hypercalls when unmasking IRQs on ARM64 (Boqun Feng)

Qualcomm PCIe controller driver:
 - Add SM8450 DT binding and driver support (Dmitry Baryshkov)

Renesas R-Car PCIe controller driver:
 - Help the controller get to the L1 state since the hardware can't do it on its own (Marek Vasut)
 - Return PCI_ERROR_RESPONSE (~0) for reads that fail on PCIe (Marek Vasut)

SiFive FU740 PCIe controller driver:
 - Drop redundant '-gpios' from DT GPIO lookup (Ben Dooks)
 - Force 2.5GT/s for initial device probe (Ben Dooks)

Socionext UniPhier Pro5 controller driver:
 - Add NX1 DT binding and driver support (Kunihiko Hayashi)

Synopsys DesignWare PCIe controller driver:
 - Restore MSI configuration so MSI works after resume (Jisheng Zhang)"

* tag 'pci-v5.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (94 commits)
  x86/PCI: Add #includes to asm/pci_x86.h
  PCI: ibmphp: Remove unused assignments
  PCI: cpqphp: Remove unused assignments
  PCI: fu740: Remove unused assignments
  PCI: kirin: Remove unused assignments
  PCI: Remove unused assignments
  PCI: Declare pci_filp_private only when HAVE_PCI_MMAP
  PCI: Avoid broken MSI on SB600 USB devices
  PCI: fu740: Force 2.5GT/s for initial device probe
  PCI: xgene: Revert "PCI: xgene: Fix IB window setup"
  PCI: xgene: Revert "PCI: xgene: Use inbound resources for setup"
  PCI: imx6: Assert i.MX8MM CLKREQ# even if no device present
  PCI: imx6: Invoke the PHY exit function after PHY power off
  PCI: rcar: Use PCI_SET_ERROR_RESPONSE after read which triggered an exception
  PCI: rcar: Finish transition to L1 state in rcar_pcie_config_access()
  PCI: dwc: Restore MSI Receiver mask during resume
  PCI: fu740: Drop redundant '-gpios' from DT GPIO lookup
  PCI/VGA: Replace full MIT license text with SPDX identifier
  PCI/VGA: Use unsigned format string to print lock counts
  PCI/VGA: Log bridge control messages when adding devices
  ...
2022-03-02  PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on ARM64  (Boqun Feng, 1 file, -111/+122)
On ARM64 Hyper-V guests, SPIs are used for the interrupts of virtual PCI devices, and SPIs can be managed directly via GICD registers. Therefore the retarget interrupt hypercall is not needed on ARM64. An arch-specific interface hv_arch_irq_unmask() is introduced to handle the architecture level differences on this. For x86, the behavior remains unchanged, while for ARM64 no hypercall is invoked when unmasking an irq for virtual PCI devices.

Link: https://lore.kernel.org/r/20220217034525.1687678-1-boqun.feng@gmail.com
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
2022-02-03