path: root/drivers/vfio
Age | Commit message | Author | Files | Lines
2025-02-21 | Revert "vfio/platform: check the bounds of read/write syscalls" | Greg Kroah-Hartman | 1 | -10/+0
This reverts commit 198090eb6f5f094cf3a268c3c30ef1e9c84a6dbe. It had been committed multiple times to the tree, and isn't needed again. Link: https://lore.kernel.org/r/a082db2605514513a0a8568382d5bd2b6f1877a0.camel@cyberus-technology.de Reported-by: Stefan Nürnberger <stefan.nuernberger@cyberus-technology.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 | vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM | Ankit Agrawal | 1 | -22/+45
[ Upstream commit 6a9eb2d125ba90d13b45bcfabcddf9f61268f6a8 ] There is a HW defect on Grace Hopper (GH) to support the Multi-Instance GPU (MIG) feature [1] that necessitated the presence of a 1G region carved out from the device memory and mapped as uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3) to work around the issue. The Grace Blackwell systems (GB) differ from GH systems in the following aspects: 1. The aforementioned HW defect is fixed on GB systems. 2. There is a usable BAR1 (region 2 and 3) on GB systems for the GPUdirect RDMA feature [2]. This patch accommodates those GB changes by showing the 64b physical device BAR1 (region 2 and 3) to the VM instead of the fake one. This takes care of both differences. Moreover, the entire device memory is exposed on GB as cacheable to the VM as there is no carveout required. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2] Cc: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21 | vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem | Ankit Agrawal | 1 | -0/+28
[ Upstream commit bd53764a60ad586ad5b6ed339423ad5e67824464 ] NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a continuation of the Grace Hopper (GH) superchip that provides the CPU and GPU with cache-coherent access to each other's memory over an internal proprietary chip-to-chip cache-coherent interconnect. There is a HW defect on GH systems to support the Multi-Instance GPU (MIG) feature [1] that necessitated the presence of a 1G region with uncached mapping carved out from the device memory. The 1G region is shown as a fake BAR (comprising region 2 and 3) to work around the issue. This is fixed on the GB systems. The presence of the fix for the HW defect is communicated by the device firmware through the DVSEC PCI config register with ID 3. The module reads this to take a different codepath on GB vs GH. Scan through the DVSEC registers to identify the correct one and use it to determine the presence of the fix. Save the value in the device's nvgrace_gpu_pci_core_device structure. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] CC: Jason Gunthorpe <jgg@nvidia.com> CC: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21 | vfio/pci: Enable iowrite64 and ioread64 for vfio pci | Ramesh Thomas | 1 | -0/+1
[ Upstream commit 2b938e3db335e3670475e31a722c2bee34748c5a ] Definitions of the ioread64 and iowrite64 macros in asm/io.h, called by the vfio pci implementations, are enclosed inside a check for CONFIG_GENERIC_IOMAP. They don't get defined if CONFIG_GENERIC_IOMAP is defined. Include linux/io-64-nonatomic-lo-hi.h to define the iowrite64 and ioread64 macros when they are not defined. io-64-nonatomic-lo-hi.h maps the macros to the generic implementation in lib/iomap.c. The generic implementation does a 64-bit read/write if readq/writeq is defined for the architecture, otherwise it does two back-to-back 32-bit accesses. Note that there are two versions of the generic implementation that differ in the order the 32-bit words are written if 64-bit support is not present. This is not the little/big endian ordering, which is handled separately. This patch uses the lo-followed-by-hi word ordering, which is consistent with the current back-to-back implementation in the vfio/pci code. Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241210131938.303500-2-ramesh.thomas@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
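For orientation, a minimal sketch of the "lo-hi" fallback behavior described above, under the assumption of an architecture without native 64-bit MMIO; the real definitions live in include/linux/io-64-nonatomic-lo-hi.h and this helper is purely illustrative:

```
#include <linux/kernel.h>
#include <linux/io.h>
#include <linux/io-64-nonatomic-lo-hi.h>	/* supplies ioread64/iowrite64 when the arch lacks them */

/*
 * Illustrative sketch of the lo-hi fallback: two back-to-back 32-bit
 * accesses, low word first, matching vfio-pci's existing split accesses.
 */
static void example_lo_hi_write64(u64 val, void __iomem *addr)
{
	iowrite32(lower_32_bits(val), addr);
	iowrite32(upper_32_bits(val), addr + sizeof(u32));
}
```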
2025-02-17 | vfio/platform: check the bounds of read/write syscalls | Alex Williamson | 1 | -0/+10
commit ce9ff21ea89d191e477a02ad7eabf4f996b80a69 upstream. count and offset are passed from user space and not checked; only offset is capped to 40 bits, which can be used to read/write out of bounds of the device. Fixes: 6e3f26456009 ("vfio/platform: read and write support for the device fd") Cc: stable@vger.kernel.org Reported-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
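The added check has this general shape; the snippet below is a simplified sketch with illustrative names, not the verbatim hunk from the platform read/write handlers:

```
#include <linux/types.h>
#include <linux/errno.h>

/* Sketch: reject userspace ranges that fall outside the region. */
static int example_region_rw_check(resource_size_t region_size, loff_t off, size_t count)
{
	if (off >= region_size || count > region_size - off)
		return -EINVAL;	/* count/offset straddle or exceed the region */

	return 0;		/* safe to proceed with the MMIO access */
}
```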
2025-02-01 | vfio/platform: check the bounds of read/write syscalls | Alex Williamson | 1 | -0/+10
commit ce9ff21ea89d191e477a02ad7eabf4f996b80a69 upstream. count and offset are passed from user space and not checked; only offset is capped to 40 bits, which can be used to read/write out of bounds of the device. Fixes: 6e3f26456009 ("vfio/platform: read and write support for the device fd") Cc: stable@vger.kernel.org Reported-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-03 | vfio/pci: Fallback huge faults for unaligned pfn | Alex Williamson | 1 | -8/+9
The PFN must also be aligned to the fault order to insert a huge pfnmap. Test the alignment and fall back when unaligned. Fixes: f9e54c3a2f5b ("vfio/pci: implement huge_fault support") Link: https://bugzilla.kernel.org/show_bug.cgi?id=219619 Reported-by: Athul Krishna <athul.krishna.kr@protonmail.com> Reported-by: Precific <precification@posteo.de> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Precific <precification@posteo.de> Link: https://lore.kernel.org/r/20250102183416.1841878-1-alex.williamson@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
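A simplified sketch of the kind of check described above (names are illustrative, not the exact diff): before inserting a huge pfnmap, verify the pfn is aligned to the fault order; callers would return VM_FAULT_FALLBACK when this returns false so the fault is retried as a normal PTE fault.

```
#include <linux/kernel.h>

/*
 * Sketch: a PMD/PUD-sized insert needs the pfn aligned to the same
 * order as the faulting address; otherwise fall back to a PTE fault.
 */
static bool example_can_insert_huge(unsigned long pfn, unsigned int order)
{
	return order == 0 || IS_ALIGNED(pfn, 1UL << order);
}
```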
2024-12-11 | Merge tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio | Linus Torvalds | 1 | -12/+35
Pull vfio fix from Alex Williamson: - Fix migration dirty page tracking support in the mlx5-vfio-pci variant driver in configurations where the system page size exceeds the device maximum message size, and anticipate device updates where the opposite may also be required (Yishai Hadas) * tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio: vfio/mlx5: Align the page tracking max message size with the device capability
2024-12-05 | vfio/mlx5: Align the page tracking max message size with the device capability | Yishai Hadas | 1 | -12/+35
Align the page tracking maximum message size with the device's capability instead of relying on PAGE_SIZE. This adjustment resolves a mismatch on systems where PAGE_SIZE is 64K, but the firmware only supports a maximum message size of 4K. Now that we rely on the device's capability for max_message_size, we must account for potential future increases in its value. Key considerations include: - Supporting message sizes that exceed a single system page (e.g., an 8K message on a 4K system). - Ensuring the RQ size is adjusted to accommodate at least 4 WQEs/messages, in line with the device specification. The above has been addressed as part of the patch. Fixes: 79c3cf279926 ("vfio/mlx5: Init QP based resources for dirty tracking") Reviewed-by: Cédric Le Goater <clg@redhat.com> Tested-by: Yingshun Cui <yicui@redhat.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241205122654.235619-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
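As a rough sketch of the sizing considerations listed above (the names, the DIV_ROUND_UP arithmetic, and the constant are assumptions for illustration; the real driver derives max_message_size from the device capability and sizes its receive queue accordingly):

```
#include <linux/kernel.h>
#include <linux/mm.h>

#define EXAMPLE_MIN_RQ_MSGS 4	/* at least 4 WQEs/messages per the device spec */

/* Sketch: a message may span several system pages (e.g. an 8K message on 4K pages). */
static size_t example_rq_size(size_t dev_max_msg_size)
{
	size_t npages_per_msg = DIV_ROUND_UP(dev_max_msg_size, PAGE_SIZE);

	return EXAMPLE_MIN_RQ_MSGS * npages_per_msg * PAGE_SIZE;
}
```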
2024-12-02 | module: Convert symbol namespace to string literal | Peter Zijlstra | 6 | -7/+7
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
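The effect of the conversion on a driver source file looks like this; the symbol and namespace names below are made up for illustration, only the macro spelling change is the point:

```
#include <linux/export.h>
#include <linux/module.h>

static void example_func(void) { }

/* Before: the namespace argument was a bare token, stringified inside the macro. */
/*   EXPORT_SYMBOL_NS_GPL(example_func, EXAMPLE_NS);  */
/*   MODULE_IMPORT_NS(EXAMPLE_NS);                    */

/* After: the namespace is an ordinary string literal. */
EXPORT_SYMBOL_NS_GPL(example_func, "EXAMPLE_NS");
MODULE_IMPORT_NS("EXAMPLE_NS");
```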
2024-12-01 | Get rid of 'remove_new' relic from platform driver struct | Linus Torvalds | 1 | -1/+1
The continual trickle of small conversion patches is grating on me, and is really not helping. Just get rid of the 'remove_new' member function, which is just an alias for the plain 'remove', and had a comment to that effect: /* * .remove_new() is a relic from a prototype conversion of .remove(). * New drivers are supposed to implement .remove(). Once all drivers are * converted to not use .remove_new any more, it will be dropped. */ This was just a tree-wide 'sed' script that replaced '.remove_new' with '.remove', with some care taken to turn a subsequent tab into two tabs to make things line up. I did do some minimal manual whitespace adjustment for places that used spaces to line things up. Then I just removed the old (sic) .remove_new member function, and this is the end result. No more unnecessary conversion noise. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
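In driver code the change is mechanical; a representative (hypothetical) platform driver looks like this after the conversion, with the old line shown as a comment:

```
#include <linux/platform_device.h>

static void example_remove(struct platform_device *pdev)
{
	/* driver teardown */
}

static struct platform_driver example_driver = {
	/* before: .remove_new = example_remove, */
	.remove	= example_remove,	/* the alias is gone, plain .remove remains */
	.driver	= {
		.name = "example",
	},
};
```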
2024-11-27 | Merge tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio | Linus Torvalds | 13 | -471/+2278
Pull VFIO updates from Alex Williamson: - Constify an unmodified structure used in linking vfio and kvm (Christophe JAILLET) - Add ID for an additional hardware SKU supported by the nvgrace-gpu vfio-pci variant driver (Ankit Agrawal) - Fix incorrect signed cast in QAT vfio-pci variant driver, negating test in check_add_overflow(), though still caught by later tests (Giovanni Cabiddu) - Additional debugfs attributes exposed in hisi_acc vfio-pci variant driver for migration debugging (Longfang Liu) - Migration support is added to the virtio vfio-pci variant driver, becoming the primary feature of the driver while retaining emulation of virtio legacy support as a secondary option (Yishai Hadas) - Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered through reviews of the virtio variant driver (Yishai Hadas) - Fix an unlikely issue where a PCI device exposed to userspace with an unknown capability at the base of the extended capability chain can overflow an array index (Avihai Horon) * tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio: vfio/pci: Properly hide first-in-list PCIe extended capability vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data() vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages() vfio/virtio: Enable live migration once VIRTIO_PCI was configured vfio/virtio: Add PRE_COPY support for live migration vfio/virtio: Add support for the basic live migration functionality virtio-pci: Introduce APIs to execute device parts admin commands virtio: Manage device and driver capabilities via the admin commands virtio: Extend the admin command to include the result size virtio_pci: Introduce device parts access commands Documentation: add debugfs description for hisi migration hisi_acc_vfio_pci: register debugfs for hisilicon migration driver hisi_acc_vfio_pci: create subfunction for data reading hisi_acc_vfio_pci: extract public functions for container_of vfio/qat: fix overflow check in qat_vf_resume_write() vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table kvm/vfio: Constify struct kvm_device_ops
2024-11-25 | vfio/pci: Properly hide first-in-list PCIe extended capability | Avihai Horon | 1 | -2/+14
There are cases where a PCIe extended capability should be hidden from the user. For example, an unknown capability (i.e., capability with ID greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally chosen to be hidden from the user. Hiding a capability is done by virtualizing and modifying the 'Next Capability Offset' field of the previous capability so it points to the capability after the one that should be hidden. The special case where the first capability in the list should be hidden is handled differently because there is no previous capability that can be modified. In this case, the capability ID and version are zeroed while leaving the next pointer intact. This hides the capability and leaves an anchor for the rest of the capability list. However, today, hiding the first capability in the list is not done properly if the capability is unknown, as struct vfio_pci_core_device->pci_config_map is set to the capability ID during initialization but the capability ID is not properly checked later when used in vfio_config_do_rw(). This leads to the following warning [1] and to an out-of-bounds access to ecap_perms array. Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct read only access instead of the ecap_perms array. Note that this is safe since the above is the only case where cap_id can exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which are already checked before).

[1]
WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
(snip)
Call Trace:
 <TASK>
 ? show_regs+0x69/0x80
 ? __warn+0x8d/0x140
 ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
 ? report_bug+0x18f/0x1a0
 ? handle_bug+0x63/0xa0
 ? exc_invalid_op+0x19/0x70
 ? asm_exc_invalid_op+0x1b/0x20
 ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
 ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
 vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
 vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
 vfio_device_fops_read+0x27/0x40 [vfio]
 vfs_read+0xbd/0x340
 ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
 ? __rseq_handle_notify_resume+0xa4/0x4b0
 __x64_sys_pread64+0x96/0xc0
 x64_sys_call+0x1c3d/0x20d0
 do_syscall_64+0x4d/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20241124142739.21698-1-avihaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
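A simplified sketch of the guard the commit describes; the stand-in struct and the function below are illustrative (vfio's real perm_bits and ecap_perms table are private to vfio_pci_config.c), only the bounds check on the capability ID is the point:

```
#include <linux/types.h>
#include <linux/pci_regs.h>	/* PCI_EXT_CAP_ID_MAX */

struct example_perm_bits {	/* stand-in for vfio's private struct perm_bits */
	u8 *virt;
	u8 *write;
};

/*
 * Sketch: IDs above PCI_EXT_CAP_ID_MAX have no slot in the ecap permission
 * table, so return a plain read-only permission set instead of indexing
 * past the end of the array.
 */
static const struct example_perm_bits *
example_lookup_ecap_perms(u16 cap_id,
			  const struct example_perm_bits *ecap_perms,
			  const struct example_perm_bits *direct_ro_perms)
{
	if (cap_id > PCI_EXT_CAP_ID_MAX)
		return direct_ro_perms;

	return &ecap_perms[cap_id];
}
```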
2024-11-21 | Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd | Linus Torvalds | 1 | -11/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "Several new features and uAPI for iommufd: - IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the backing memory for an iommu mapping. To date VFIO/iommufd have used VMA's and pin_user_pages(), this now allows using memfds and memfd_pin_folios(). Notably this creates a pure folio path from the memfd to the iommu page table where memory is never broken down to PAGE_SIZE. - IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between two processes. Combined with the above this allows iommufd to support a VMM re-start using exec() where something like qemu would exec() a new version of itself and fd pass the memfds/iommufd/etc to the new process. The memfd allows DMA access to the memory to continue while the new process is getting setup, and the CHANGE_PROCESS updates all the accounting. - Support for fault reporting to userspace on non-PRI HW, such as ARM stall-mode embedded devices. - IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed virtual iommu. This will be used by VMMs to access hardware features that are contained with in a VM. The first use is to inform the kernel of the virtual SID to physical SID mapping when issuing SID based invalidation on ARM. Further uses will tie HW features that are directly accessed by the VM, such as invalidation queue assignment and others. - IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual device to physical device within a VIOMMU. Minimially this is used to translate VM issued cache invalidation commands from virtual to physical device IDs. - Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work with the VIOMMU - ARM SMMuv3 support for nested translation. Using the VIOMMU and VDEVICE the driver can model this HW's behavior for nested translation. This includes a shared branch from Will" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (51 commits) iommu/arm-smmu-v3: Import IOMMUFD module namespace iommufd: IOMMU_IOAS_CHANGE_PROCESS selftest iommufd: Add IOMMU_IOAS_CHANGE_PROCESS iommufd: Lock all IOAS objects iommufd: Export do_update_pinned iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED iommu/arm-smmu-v3: Use S2FWB for NESTED domains iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC Documentation: userspace-api: iommufd: Update vDEVICE iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command iommufd/selftest: Add mock_viommu_cache_invalidate iommufd/viommu: Add iommufd_viommu_find_dev helper iommu: Add iommu_copy_struct_from_full_user_array helper iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE iommu/viommu: Add cache_invalidate to iommufd_viommu_ops iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl ...
2024-11-14 | vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data() | Yishai Hadas | 1 | -18/+17
Fix unwind flows in mlx5vf_pci_save_device_data() and mlx5vf_pci_resume_device_data() to avoid freeing the migf pointer at the 'end' label, as this will be handled by fput(migf->filp) through mlx5vf_release_file(). To ensure mlx5vf_release_file() functions correctly, move the initialization of migf fields (such as migf->lock) to occur before any potential unwind flow, as these fields may be accessed within mlx5vf_release_file(). Fixes: 9945a67ea4b3 ("vfio/mlx5: Refactor PD usage") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241114095318.16556-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-11-14 | vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages() | Yishai Hadas | 1 | -1/+5
Fix an unwind issue in mlx5vf_add_migration_pages(). If a set of pages is allocated but fails to be added to the SG table, they need to be freed to prevent a memory leak. Any pages successfully added to the SG table will be freed as part of mlx5vf_free_data_buffer(). Fixes: 6fadb021266d ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241114095318.16556-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-11-13 | vfio/virtio: Enable live migration once VIRTIO_PCI was configured | Yishai Hadas | 5 | -401/+489
Now that the driver supports live migration, only the legacy IO functionality depends on config VIRTIO_PCI_ADMIN_LEGACY. As part of that, introduce a bool configuration option named VIRTIO_VFIO_PCI_ADMIN_LEGACY as a sub-menu under the driver's main live migration feature, to control the legacy IO functionality. This lets users configuring the kernel know which of the described features might be available in the resulting driver. Accordingly, move the legacy IO code into a separate file that is compiled only when CONFIG_VIRTIO_VFIO_PCI_ADMIN_LEGACY is configured, and let live migration depend only on VIRTIO_PCI. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-11-13 | vfio/virtio: Add PRE_COPY support for live migration | Yishai Hadas | 2 | -8/+227
Add PRE_COPY support for live migration. This functionality may reduce the downtime upon STOP_COPY by letting the target machine receive some 'initial data' from the source while the source is still in its RUNNING state, so the target can prepare itself ahead of time for the final STOP_COPY data. The Virtio specification does not support reading partial or incremental device contexts, so during the PRE_COPY state the vfio-virtio driver reads the full device state. Since the device state can change, and the benefit is highest when the pre-copy data closely matches the final data, we read it in a rate-limited mode: we avoid reading new data from the device for a specified time interval after the last read. With PRE_COPY enabled, we observed a downtime reduction of approximately 70-75% in various scenarios compared to when PRE_COPY was disabled, while keeping the total migration time nearly the same. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-11-13 | vfio/virtio: Add support for the basic live migration functionality | Yishai Hadas | 4 | -25/+1285
Add support for basic live migration functionality in VFIO over virtio-net devices, aligned with the virtio device specification 1.4. This includes the following VFIO features: VFIO_MIGRATION_STOP_COPY, VFIO_MIGRATION_P2P. The implementation registers with the VFIO subsystem using vfio_pci_core and then incorporates the virtio-specific logic for the migration process. The migration follows the definitions in uapi/vfio.h and leverages the virtio VF-to-PF admin queue command channel to execute the device-parts related commands. Additional Notes: ----------------- The kernel protocol between the source and target devices contains a header with metadata, including record size, tag, and flags. The record size allows the target to recognize and read a complete image from the source before passing the device part data. This adheres to the virtio device specification, which mandates that partial device parts cannot be supplied. The tag and flags serve as placeholders for future extensions of the kernel protocol between the source and target, ensuring backward and forward compatibility. Both the source and target comply with the virtio device specification by using a device part object with a unique ID as part of the migration process. Since this resource is limited to a maximum of 255, its lifecycle is confined to periods with an active live migration flow. According to the virtio specification, a device has only two modes: RUNNING and STOPPED. As a result, certain VFIO transitions (i.e., RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When transitioning to RUNNING_P2P, the device state is set to STOP, and it will remain STOPPED until the transition out of RUNNING_P2P->RUNNING, at which point it returns to RUNNING. During the transition to STOP, the virtio device only stops initiating outgoing requests (e.g. DMA, MSI-X, etc.) but must still accept incoming operations. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
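The commit message describes the stream as records prefixed by a small header carrying the record size, tag, and flags. A hypothetical layout of such a header, for illustration only (the real structure lives in the virtio vfio-pci variant driver and may differ in names and field order):

```
#include <linux/types.h>

/* Hypothetical on-the-wire record header for the source->target stream. */
struct example_mig_record_header {
	__le64 record_size;	/* lets the target read a complete device-part image */
	__le32 flags;		/* reserved for forward/backward compatible extensions */
	__le32 tag;		/* identifies the record type; placeholder for future use */
};
```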
2024-11-13 | hisi_acc_vfio_pci: register debugfs for hisilicon migration driver | Longfang Liu | 2 | -0/+210
When the CONFIG_VFIO_DEBUGFS option is enabled, the debug function is registered on the VFIO debugfs framework for the live migration driver of the HiSilicon accelerator device. After the accelerator device is registered on the VFIO live migration debugfs framework, a debugfs directory "hisi_acc" is created, and three debug files are then created in that directory:

vfio
|
+---<dev_name1>
|   +---migration
|       +--state
|       +--hisi_acc
|           +--dev_data
|           +--migf_data
|           +--cmd_state
|
+---<dev_name2>
    +---migration
        +--state
        +--hisi_acc
            +--dev_data
            +--migf_data
            +--cmd_state

dev_data file: read the device data that needs to be migrated from the current device in real time.
migf_data file: read the migration data of the last live migration from the current driver.
cmd_state file: used to get the cmd channel state for the device.

+----------------+      +--------------+     +---------------+
| migration dev  |      | src dev      |     | dst dev       |
+-------+--------+      +------+-------+     +-------+-------+
        |                      |                     |
        |               +------v-------+     +-------v-------+
        |               | saving_migf  |     | resuming_migf |
   read |               | file         |     | file          |
        |               +------+-------+     +-------+-------+
        |                      | copy                |
        |                      +------------+--------+
        |                                   |
+-------v--------+                  +-------v--------+
| data buffer    |                  | debug_migf     |
+-------+--------+                  +-------+--------+
        |                                   |
        | cat                               | cat
+-------v--------+                  +-------v--------+
| dev_data       |                  | migf_data      |
+----------------+                  +----------------+

When accessing debugfs, the user can obtain the most recent complete status data of the device through the "dev_data" file; if the device is currently being migrated, the read waits for the migration to complete. The data of the last completed live migration is stored in debug_migf and can be read via "migf_data". Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20241112073322.54550-4-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
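For orientation, a minimal sketch of how such debugfs entries are typically created under an existing parent directory; the names and the show callback below are illustrative assumptions, the actual driver wires these files to its migration state:

```
#include <linux/debugfs.h>
#include <linux/seq_file.h>

static int example_dev_data_show(struct seq_file *seq, void *unused)
{
	/* dump the device's current migration data here */
	seq_puts(seq, "example device state\n");
	return 0;
}
DEFINE_SHOW_ATTRIBUTE(example_dev_data);

static void example_register_debugfs(struct dentry *migration_dir, void *priv)
{
	struct dentry *dir = debugfs_create_dir("hisi_acc", migration_dir);

	debugfs_create_file("dev_data", 0400, dir, priv, &example_dev_data_fops);
}
```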
2024-11-12 | hisi_acc_vfio_pci: create subfunction for data reading | Longfang Liu | 1 | -21/+33
Factor the code that reads data from the device out into a sub-function. It can then be called both during the device status data saving phase of the live migration process and by the device status data reading function in debugfs, thereby reducing redundant code in the driver. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20241112073322.54550-3-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-11-12 | hisi_acc_vfio_pci: extract public functions for container_of | Longfang Liu | 1 | -10/+11
In the current driver, the struct hisi_acc_vf_core_device pointer is obtained from the vdev through the container_of() function. This pattern is used in many places in the driver. To reduce this repetition, extract it into a common helper function. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20241112073322.54550-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
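The helper is essentially a thin wrapper around container_of(); a sketch of the shape, using a stand-in structure since the driver's real layout may differ:

```
#include <linux/container_of.h>
#include <linux/vfio_pci_core.h>

struct example_hisi_acc_vf_core_device {
	struct vfio_pci_core_device core_device;
	/* ... migration state ... */
};

/* One place to do the container_of() instead of repeating it at every call site. */
static struct example_hisi_acc_vf_core_device *
example_get_vf_dev(struct vfio_device *vdev)
{
	return container_of(vdev, struct example_hisi_acc_vf_core_device,
			    core_device.vdev);
}
```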
2024-11-05 | vfio: Remove VFIO_TYPE1_NESTING_IOMMU | Jason Gunthorpe | 1 | -11/+1
This control causes the ARM SMMU drivers to choose a stage 2 implementation for the IO pagetable (vs the usual stage 1 default), however this choice has no significant visible impact on the VFIO user. Further, qemu never implemented this and no other userspace user is known. The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide SMMU translation services to the guest operating system", however the rest of the API to set the guest table pointer for the stage 1 and manage invalidation was never completed, or at least never upstreamed, rendering this part useless dead code. Upstream has now settled on iommufd as the uAPI for controlling nested translation. Choosing the stage 2 implementation should be done through the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation. Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the enable_nesting iommu_domain_op. Just in case some userspace is using this, continue to treat requesting it as a NOP, but do not advertise support any more. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Donald Dutile <ddutile@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-03 | assorted variants of irqfd setup: convert to CLASS(fd) | Al Viro | 1 | -13/+3
in all of those failure exits prior to fdget() are plain returns and the only thing done after fdput() is (on failure exits) a kfree(), which can be done before fdput() just fine. NOTE: in acrn_irqfd_assign() 'fail:' failure exit is wrong for eventfd_ctx_fileget() failure (we only want fdput() there) and once we stop doing that, it doesn't need to check if eventfd is NULL or ERR_PTR(...) there. NOTE: in privcmd we move fdget() up before the allocation - more to the point, before the copy_from_user() attempt. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
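The shape of such a conversion, on a made-up helper rather than a real call site, assuming the fd_file()/fd_empty() accessors from the struct fd rework:

```
#include <linux/file.h>
#include <linux/cleanup.h>
#include <linux/errno.h>

/* Before: explicit fdget()/fdput() with a manual error path. */
static int example_use_fd_old(int fd)
{
	struct fd f = fdget(fd);
	int ret;

	if (!fd_file(f))
		return -EBADF;
	ret = 0;		/* ... use fd_file(f) ... */
	fdput(f);
	return ret;
}

/* After: CLASS(fd) drops the reference automatically when 'f' goes out of scope. */
static int example_use_fd_new(int fd)
{
	CLASS(fd, f)(fd);

	if (fd_empty(f))
		return -EBADF;
	return 0;		/* ... use fd_file(f) ... */
}
```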
2024-11-03 | fdget(), more trivial conversions | Al Viro | 1 | -4/+2
all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-30 | vfio/qat: fix overflow check in qat_vf_resume_write() | Giovanni Cabiddu | 1 | -1/+1
The unsigned variable `size_t len` is cast to the signed type `loff_t` when passed to the function check_add_overflow(). This function considers the type of the destination, which is of type loff_t (signed), potentially leading to an overflow. This issue is similar to the one described in the link below. Remove the cast. Note that even if check_add_overflow() is bypassed, by setting `len` to a value that is greater than LONG_MAX (which is considered as a negative value after the cast), the function copy_from_user(), invoked a few lines later, will not perform any copy and return `len` as (len > INT_MAX) causing qat_vf_resume_write() to fail with -EFAULT. Fixes: bb208810b1ab ("vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices") CC: stable@vger.kernel.org # 6.10+ Link: https://lore.kernel.org/all/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Reported-by: Zijie Zhao <zzjas98@gmail.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reviewed-by: Xin Zeng <xin.zeng@intel.com> Link: https://lore.kernel.org/r/20241021123843.42979-1-giovanni.cabiddu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
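To see why the cast matters, a hedged sketch (not the driver's actual code) contrasting the two shapes of the check:

```
#include <linux/overflow.h>
#include <linux/types.h>
#include <linux/errno.h>

static int example_check(loff_t pos, size_t len)
{
	loff_t end;

	/*
	 * Buggy shape: casting len to the signed loff_t makes a huge length
	 * look negative, so the addition may not be reported as an overflow.
	 */
	if (check_add_overflow(pos, (loff_t)len, &end))
		return -EOVERFLOW;

	/*
	 * Fixed shape: keep len unsigned and let check_add_overflow()
	 * reason about the mixed types itself.
	 */
	if (check_add_overflow(pos, len, &end))
		return -EOVERFLOW;

	return 0;
}
```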
2024-10-30 | vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table | Ankit Agrawal | 1 | -0/+2
NVIDIA is planning to productize a new Grace Hopper superchip SKU with device ID 0x2348. Add the SKU devid to nvgrace_gpu_vfio_pci_table. Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20241013075216.19229-1-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-09-27 | [tree-wide] finally take no_llseek out | Al Viro | 4 | -8/+0
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
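Concretely, the removed lines all look like the commented-out initializer in this made-up file_operations; since no_llseek had been defined to NULL, deleting the line changes nothing at runtime:

```
#include <linux/fs.h>
#include <linux/module.h>

static ssize_t example_read(struct file *file, char __user *buf,
			    size_t count, loff_t *ppos)
{
	return 0;	/* stub, for illustration */
}

static const struct file_operations example_fops = {
	.owner	= THIS_MODULE,
	.read	= example_read,
	/* .llseek = no_llseek, */	/* removed: no_llseek was defined to NULL */
};
```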
2024-09-24 | Merge tag 'vfio-v6.12-rc1' of https://github.com/awilliam/linux-vfio | Linus Torvalds | 4 | -13/+3
Pull VFIO updates from Alex Williamson: "Just a few cleanups this cycle: - Remove several unused structure and function declarations, and unused variables (Dr. David Alan Gilbert, Yue Haibing, Zhang Zekun) - Constify unmodified structure in mdev (Hongbo Li) - Convert to unsigned type to catch overflow with less fanfare than passing a negative value to kcalloc() (Dan Carpenter)" * tag 'vfio-v6.12-rc1' of https://github.com/awilliam/linux-vfio: vfio/pci: clean up a type in vfio_pci_ioctl_pci_hot_reset_groups() vfio/mdev: Constify struct kobj_type vfio: mdev: Remove unused function declarations vfio/fsl-mc: Remove unused variable 'hwirq' vfio/pci: Remove unused struct 'vfio_pci_mmap_vma'
2024-09-23 | Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 2 | -6/+6
Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
2024-09-17 | vfio/pci: implement huge_fault support | Alex Williamson | 1 | -17/+43
With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can take advantage of PMD and PUD faults to PCI BAR mmaps and create more efficient mappings. PCI BARs are always a power of two and will typically get at least PMD alignment without userspace even trying. Userspace alignment for PUD mappings is also not too difficult. Consolidate faults through a single handler with a new wrapper for standard single page faults. The pre-faulting behavior of commit d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed in this refactoring since huge_fault will cover the bulk of the faults and results in more efficient page table usage. We also want to avoid that pre-faulted single page mappings preempt huge page mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
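In outline, the consolidated handler dispatches on the fault order; the sketch below is simplified and uses a hypothetical example_bar_pfn() helper. The real function also validates alignment and range, and the PMD/PUD arms are conditional on architecture support:

```
#include <linux/mm.h>

/* Hypothetical helper: translate the faulting address to a BAR pfn. */
static unsigned long example_bar_pfn(struct vm_fault *vmf)
{
	return vmf->pgoff;	/* placeholder only */
}

static vm_fault_t example_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	unsigned long pfn = example_bar_pfn(vmf);

	switch (order) {
	case 0:
		/* ordinary single-page fault */
		return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
	default:
		/*
		 * PMD/PUD-sized faults would call vmf_insert_pfn_pmd()/
		 * vmf_insert_pfn_pud() here when the architecture supports
		 * huge pfnmaps; otherwise fall back to single pages.
		 */
		return VM_FAULT_FALLBACK;
	}
}
```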
2024-09-17 | vfio: use the new follow_pfnmap API | Peter Xu | 1 | -10/+6
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-12 | vfio/pci: clean up a type in vfio_pci_ioctl_pci_hot_reset_groups() | Dan Carpenter | 1 | -1/+1
The "array_count" value comes from the copy_from_user() in vfio_pci_ioctl_pci_hot_reset(). If the user passes a value larger than INT_MAX then we'll pass a negative value to kcalloc() which triggers an allocation failure and a stack trace. It's better to make the type unsigned so that if (array_count > count) returns -EINVAL instead. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/262ada03-d848-4369-9c37-81edeeed2da2@stanley.mountain Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-09-06 | vfio/mdev: Constify struct kobj_type | Hongbo Li | 1 | -1/+1
This 'struct kobj_type' is not modified. It is only used in kobject_init_and_add(), which takes a 'const struct kobj_type *ktype' parameter. Constifying this structure moves it to a read-only section, which can increase overall security.
```
[Before]
   text    data     bss     dec     hex filename
   2372     600       0    2972     b9c drivers/vfio/mdev/mdev_sysfs.o

[After]
   text    data     bss     dec     hex filename
   2436     568       0    3004     bbc drivers/vfio/mdev/mdev_sysfs.o
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240904011837.2010444-1-lihongbo22@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-09-03 | vfio: mdev: Remove unused function declarations | Zhang Zekun | 1 | -3/+0
The definitions of mdev_bus_register() and mdev_bus_unregister() have been removed since commit 6c7f98b334a3 ("vfio/mdev: Remove vfio_mdev.c"). So, let's remove the unused declarations. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240812120823.10968-1-zhangzekun11@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-09-03 | vfio/fsl-mc: Remove unused variable 'hwirq' | Yue Haibing | 1 | -3/+1
Commit 7447d911af69 ("vfio/fsl-mc: Block calling interrupt handler without trigger") left this variable unused, so remove it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20240730141133.525771-1-yuehaibing@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-09-03 | vfio/pci: Remove unused struct 'vfio_pci_mmap_vma' | Dr. David Alan Gilbert | 1 | -5/+0
'vfio_pci_mmap_vma' has been unused since commit aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240727160307.1000476-1-linux@treblig.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-08-12 | introduce fd_file(), convert all accessors to it. | Al Viro | 2 | -6/+6
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
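The conversion itself is a one-for-one replacement of the member access with the accessor; a small made-up example of the pattern:

```
#include <linux/file.h>
#include <linux/errno.h>

static int example_check_fd(int fd)
{
	struct fd f = fdget(fd);
	int ret = -EBADF;

	if (fd_file(f)) {	/* was: if (f.file) */
		/* ... operate on fd_file(f) instead of f.file ... */
		ret = 0;
	}
	fdput(f);
	return ret;
}
```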
2024-07-25 | Merge tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core | Linus Torvalds | 1 | -1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core changes for 6.11-rc1. Lots of stuff in here, with not a huge diffstat, but apis are evolving which required lots of files to be touched. Highlights of the changes in here are: - platform remove callback api final fixups (Uwe took many releases to get here, finally!) - Rust bindings for basic firmware apis and initial driver-core interactions. It's not all that useful for a "write a whole driver in rust" type of thing, but the firmware bindings do help out the phy rust drivers, and the driver core bindings give a solid base on which others can start their work. There is still a long way to go here before we have a multitude of rust drivers being added, but it's a great first step. - driver core const api changes. This reached across all bus types, and there are some fix-ups for some not-common bus types that linux-next and 0-day testing shook out. This work is being done to help make the rust bindings more safe, as well as the C code, moving toward the end-goal of allowing us to put driver structures into read-only memory. We aren't there yet, but are getting closer. - minor devres cleanups and fixes found by code inspection - arch_topology minor changes - other minor driver core cleanups All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits) ARM: sa1100: make match function take a const pointer sysfs/cpu: Make crash_hotplug attribute world-readable dio: Have dio_bus_match() callback take a const * zorro: make match function take a const pointer driver core: module: make module_[add|remove]_driver take a const * driver core: make driver_find_device() take a const * driver core: make driver_[create|remove]_file take a const * firmware_loader: fix soundness issue in `request_internal` firmware_loader: annotate doctests as `no_run` devres: Correct code style for functions that return a pointer type devres: Initi