summaryrefslogtreecommitdiff
path: root/drivers/vdpa
AgeCommit message (Collapse)AuthorFilesLines
10 daysvdpa: Fix IDR memory leak in VDUSE module exitAnders Roxell1-0/+1
[ Upstream commit d9ea58b5dc6b4b50fbb6a10c73f840e8b10442b7 ] Add missing idr_destroy() call in vduse_exit() to properly free the vduse_idr radix tree nodes. Without this, module load/unload cycles leak 576-byte radix tree node allocations, detectable by kmemleak as: unreferenced object (size 576): backtrace: [<ffffffff81234567>] radix_tree_node_alloc+0xa0/0xf0 [<ffffffff81234568>] idr_get_free+0x128/0x280 The vduse_idr is initialized via DEFINE_IDR() at line 136 and used throughout the VDUSE (vDPA Device in Userspace) driver for device ID management. The fix follows the documented pattern in lib/idr.c and matches the cleanup approach used by other drivers. This leak was discovered through comprehensive module testing with cumulative kmemleak detection across 10 load/unload iterations per module. Fixes: c8a6153b6c59 ("vduse: Introduce VDUSE - vDPA Device in Userspace") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Message-Id: <20250704125335.1084649-1-anders.roxell@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysvdpa/mlx5: Fix release of uninitialized resources on error pathDragos Tatulea2-4/+9
[ Upstream commit cc51a66815999afb7e9cd845968de4fdf07567b7 ] The commit in the fixes tag made sure that mlx5_vdpa_free() is the single entrypoint for removing the vdpa device resources added in mlx5_vdpa_dev_add(), even in the cleanup path of mlx5_vdpa_dev_add(). This means that all functions from mlx5_vdpa_free() should be able to handle uninitialized resources. This was not the case though: mlx5_vdpa_destroy_mr_resources() and mlx5_cmd_cleanup_async_ctx() were not able to do so. This caused the splat below when adding a vdpa device without a MAC address. This patch fixes these remaining issues: - Makes mlx5_vdpa_destroy_mr_resources() return early if called on uninitialized resources. - Moves mlx5_cmd_init_async_ctx() early on during device addition because it can't fail. This means that mlx5_cmd_cleanup_async_ctx() also can't fail. To mirror this, move the call site of mlx5_cmd_cleanup_async_ctx() in mlx5_vdpa_free(). An additional comment was added in mlx5_vdpa_free() to document the expectations of functions called from this context. Splat: mlx5_core 0000:b5:03.2: mlx5_vdpa_dev_add:3950:(pid 2306) warning: No mac address provisioned? ------------[ cut here ]------------ WARNING: CPU: 13 PID: 2306 at kernel/workqueue.c:4207 __flush_work+0x9a/0xb0 [...] Call Trace: <TASK> ? __try_to_del_timer_sync+0x61/0x90 ? __timer_delete_sync+0x2b/0x40 mlx5_vdpa_destroy_mr_resources+0x1c/0x40 [mlx5_vdpa] mlx5_vdpa_free+0x45/0x160 [mlx5_vdpa] vdpa_release_dev+0x1e/0x50 [vdpa] device_release+0x31/0x90 kobject_cleanup+0x37/0x130 mlx5_vdpa_dev_add+0x327/0x890 [mlx5_vdpa] vdpa_nl_cmd_dev_add_set_doit+0x2c1/0x4d0 [vdpa] genl_family_rcv_msg_doit+0xd8/0x130 genl_family_rcv_msg+0x14b/0x220 ? __pfx_vdpa_nl_cmd_dev_add_set_doit+0x10/0x10 [vdpa] genl_rcv_msg+0x47/0xa0 ? __pfx_genl_rcv_msg+0x10/0x10 netlink_rcv_skb+0x53/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x27b/0x3b0 netlink_sendmsg+0x1f7/0x430 __sys_sendto+0x1fa/0x210 ? ___pte_offset_map+0x17/0x160 ? next_uptodate_folio+0x85/0x2b0 ? percpu_counter_add_batch+0x51/0x90 ? filemap_map_pages+0x515/0x660 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x7b/0x2c0 ? do_read_fault+0x108/0x220 ? do_pte_missing+0x14a/0x3e0 ? __handle_mm_fault+0x321/0x730 ? count_memcg_events+0x13f/0x180 ? handle_mm_fault+0x1fb/0x2d0 ? do_user_addr_fault+0x20c/0x700 ? syscall_exit_work+0x104/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f0c25b0feca [...] ---[ end trace 0000000000000000 ]--- Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Fixes: 83e445e64f48 ("vdpa/mlx5: Fix error path during device add") Reported-by: Wenli Quan <wquan@redhat.com> Closes: https://lore.kernel.org/virtualization/CADZSLS0r78HhZAStBaN1evCSoPqRJU95Lt8AqZNJ6+wwYQ6vPQ@mail.gmail.com/ Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20250708120424.2363354-2-dtatulea@nvidia.com> Tested-by: Wenli Quan <wquan@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysvdpa/mlx5: Fix needs_teardown flag calculationDragos Tatulea1-1/+1
[ Upstream commit 6f0f3d7fc4e05797b801ded4910a64d16db230e9 ] needs_teardown is a device flag that indicates when virtual queues need to be recreated. This happens for certain configuration changes: queue size and some specific features. Currently, the needs_teardown state can be incorrectly reset by subsequent .set_vq_num() calls. For example, for 1 rx VQ with size 512 and 1 tx VQ with size 256: .set_vq_num(0, 512) -> sets needs_teardown to true (rx queue has a non-default size) .set_vq_num(1, 256) -> sets needs_teardown to false (tx queue has a default size) This change takes into account the previous value of the needs_teardown flag when re-calculating it during VQ size configuration. Fixes: 0fe963d6fc16 ("vdpa/mlx5: Re-create HW VQs under certain conditions") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Tested-by: Si-Wei Liu<si-wei.liu@oracle.com> Message-Id: <20250604184802.2625300-1-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-25vduse: add virtio_fs to allowed dev idEugenio Pérez1-0/+1
A VDUSE device that implements virtiofs device works fine just by adding the device id to the whitelist. Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20250121103346.1030165-1-eperezma@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-02-25vdpa/mlx5: Fix oversized null mkey longer than 32bitSi-Wei Liu1-2/+5
create_user_mr() has correct code to count the number of null keys used to fill in a hole for the memory map. However, fill_indir() does not follow the same to cap the range up to the 1GB limit correspondingly. Fill in more null keys for the gaps in between, so that null keys are correctly populated. Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Cc: stable@vger.kernel.org Reported-by: Cong Meng <cong.meng@oracle.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20250220193732.521462-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2025-02-25vdpa/mlx5: Fix mlx5_vdpa_get_config() endianness on big-endian machinesKonstantin Shkolnyy1-0/+3
mlx5_vdpa_dev_add() doesn’t initialize mvdev->actual_features. It’s initialized later by mlx5_vdpa_set_driver_features(). However, mlx5_vdpa_get_config() depends on the VIRTIO_F_VERSION_1 flag in actual_features, to return data with correct endianness. When it’s called before mlx5_vdpa_set_driver_features(), the data are incorrectly returned as big-endian on big-endian machines, while QEMU then interprets them as little-endian. The fix is to initialize this VIRTIO_F_VERSION_1 as early as possible, especially considering that mlx5_vdpa_dev_add() insists on this flag to always be set anyway. Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com> Message-Id: <20250204173127.166673-1-kshk@linux.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Acked-by: Jason Wang <jasowang@redhat.com>
2025-01-27Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds5-70/+165
Pull virtio updates from Michael Tsirkin: "A small number of improvements all over the place: - vdpa/octeon support for multiple interrupts - virtio-pci support for error recovery - vp_vdpa support for notification with data - vhost/net fix to set num_buffers for spec compliance - virtio-mem now works with kdump on s390 And small cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (23 commits) virtio_blk: Add support for transport error recovery virtio_pci: Add support for PCIe Function Level Reset vhost/net: Set num_buffers for virtio 1.0 vdpa/octeon_ep: read vendor-specific PCI capability virtio-pci: define type and header for PCI vendor data vdpa/octeon_ep: handle device config change events vdpa/octeon_ep: enable support for multiple interrupts per device vdpa: solidrun: Replace deprecated PCI functions s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM) virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM virtio-mem: remember usable region size virtio-mem: mark device ready before registering callbacks in kdump mode fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel fs/proc/vmcore: factor out freeing a list of vmcore ranges fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list fs/proc/vmcore: move vmcore definitions out of kcore.h fs/proc/vmcore: prefix all pr_* with "vmcore:" fs/proc/vmcore: disallow vmcore modifications while the vmcore is open fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex ...
2025-01-27vdpa/octeon_ep: read vendor-specific PCI capabilityShijith Thotton3-2/+58
Added support to read the vendor-specific PCI capability to identify the type of device being emulated. Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-4-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27vdpa/octeon_ep: handle device config change eventsSatha Rao1-0/+8
The first interrupt of the device is used to notify the host about device configuration changes, such as link status updates. The ISR configuration area is updated to indicate a config change event when triggered. Signed-off-by: Satha Rao <skoteshwar@marvell.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-2-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27vdpa/octeon_ep: enable support for multiple interrupts per deviceShijith Thotton3-39/+62
Updated the driver to utilize all the MSI-X interrupt vectors supported by each OCTEON endpoint VF, instead of relying on a single vector. Enabling more interrupts allows packets from multiple rings to be distributed across multiple cores, improving parallelism and performance. Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-1-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27vdpa: solidrun: Replace deprecated PCI functionsPhilipp Stanner1-29/+28
The PCI functions pcim_iomap_regions() pcim_iounmap_regions() pcim_iomap_table() have been deprecated by the PCI subsystem. Replace these functions with their successors pcim_iomap_region() and pcim_iounmap_region(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Message-Id: <20241219094428.21511-2-phasta@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
2025-01-22Merge tag 'net-next-6.14' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "This is slightly smaller than usual, with the most interesting work being still around RTNL scope reduction. Core: - More core refactoring to reduce the RTNL lock contention, including preparatory work for the per-network namespace RTNL lock, replacing RTNL lock with a per device-one to protect NAPI-related net device data and moving synchronize_net() calls outside such lock. - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and more specific TCP coverage. - Reduce network namespace tear-down time by removing per-subsystems synchronize_net() in tipc and sched. - Add flow label selector support for fib rules, allowing traffic redirection based on such header field. Netfilter: - Do not remove netdev basechain when last device is gone, allowing netdev basechains without devices. - Revisit the flowtable teardown strategy, dealing better with fin, reset and re-open events. - Scale-up IP-vs connection dumping by avoiding linear search on each restart. Protocols: - A significant XDP socket refactor, consolidating and optimizing several helpers into the core - Better scaling of ICMP rate-limiting, by removing false-sharing in inet peers handling. - Introduces netlink notifications for multicast IPv4 and IPv6 address changes. - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing aggregation and fragmentation of the inner IP. - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to avoid local port exhaustion issues when the average connection lifetime is very short. - Support updating keys (re-keying) for connections using kernel TLS (for TLS 1.3 only). - Support ipv4-mapped ipv6 address clients in smc-r v2. - Add support for jumbo data packet transmission in RxRPC sockets, gluing multiple data packets in a single UDP packet. - Support RxRPC RACK-TLP to manage packet loss and retransmission in conjunction with the congestion control algorithm. Driver API: - Introduce a unified and structured interface for reporting PHY statistics, exposing consistent data across different H/W via ethtool. - Make timestamping selectable, allow the user to select the desired hwtstamp provider (PHY or MAC) administratively. - Add support for configuring a header-data-split threshold (HDS) value via ethtool, to deal with partial or buggy H/W implementation. - Consolidate DSA drivers Energy Efficiency Ethernet support. - Add EEE management to phylink, making use of the phylib implementation. - Add phylib support for in-band capabilities negotiation. - Simplify how phylib-enabled mac drivers expose the supported interfaces. Tests and tooling: - Make the YNL tool package-friendly to make it easier to deploy it separately from the kernel. - Increase TCP selftest coverage importing several packetdrill test-cases. - Regenerate the ethtool uapi header from the YNL spec, to ease maintenance and future development. - Add YNL support for decoding the link types used in net self-tests, allowing a single build to run both net and drivers/net. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - add cross E-Switch QoS support - add SW Steering support for ConnectX-8 - implement support for HW-Managed Flow Steering, improving the rule deletion/insertion rate - support for multi-host LAG - Intel (ixgbe, ice, igb): - ice: add support for devlink health events - ixgbe: add initial support for E610 chipset variant - igb: add support for AF_XDP zero-copy - Meta: - add support for basic RSS config - allow changing the number of channels - add hardware monitoring support - Broadcom (bnxt): - implement TCP data split and HDS threshold ethtool support, enabling Device Memory TCP. - Marvell Octeon: - implement egress ipsec offload support for the cn10k family - Hisilicon (HIBMC): - implement unicast MAC filtering - Ethernet NICs embedded and virtual: - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding contented atomic operations for drop counters - Freescale: - quicc: phylink conversion - enetc: support Tx and Rx checksum offload and improve TSO performances - MediaTek: - airoha: introduce support for ETS and HTB Qdisc offload - Microchip: - lan78XX USB: preparation work for phylink conversion - Synopsys (stmmac): - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45 - refactor EEE support to leverage the new driver API - optimize DMA and cache access to increase raw RX performances by 40% - TI: - icssg-prueth: add multicast filtering support for VLAN interface - netkit: - add ability to configure head/tailroom - VXLAN: - accepts packets with user-defined reserved bit - Ethernet switches: - Microchip: - lan969x: add RGMII support - lan969x: improve TX and RX performance using the FDMA engine - nVidia/Mellanox: - move Tx header handling to PCI driver, to ease XDP support - Ethernet PHYs: - Texas Instruments DP83822: - add support for GPIO2 clock output - Realtek: - 8169: add support for RTL8125D rev.b - rtl822x: add hwmon support for the temperature sensor - Microchip: - add support for RDS PTP hardware - consolidate periodic output signal generation - CAN: - several DT-bindings to DT schema conversions - tcan4x5x: - add HW standby support - support nWKRQ voltage selection - kvaser: - allowing Bus Error Reporting runtime configuration - WiFi: - the on-going Multi-Link Operation (MLO) effort continues, affecting both the stack and in drivers - mac80211/cfg80211: - Emergency Preparedness Communication Services (EPCS) station mode support - support for adding and removing station links for MLO - add support for WiFi 7/EHT mesh over 320 MHz channels - report Tx power info for each link - RealTek (rtw88): - enable USB Rx aggregation and USB 3 to improve performance - LED support - RealTek (rtw89): - refactor power save to support Multi-Link Operations - add support for RTL8922AE-VS variant - MediaTek (mt76): - single wiphy multiband support (preparation for MLO) - p2p device support - add TP-Link TXE50UH USB adapter support - Qualcomm (ath10k): - support for the QCA6698AQ IP core - Qualcomm (ath12k): - enable MLO for QCN9274 - Bluetooth: - Allow sysfs to trigger hdev reset, to allow recovering devices not responsive from user-space - MediaTek: add support for MT7922, MT7925, MT7921e devices - Realtek: add support for RTL8851BE devices - Qualcomm: add support for WCN785x devices - ISO: allow BIG re-sync" * tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits) net/rose: prevent integer overflows in rose_setsockopt() net: phylink: fix regression when binding a PHY net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL. ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL. ipv6: Move lifetime validation to inet6_rtm_newaddr(). ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr(). ipv6: Pass dev to inet6_addr_add(). ipv6: Convert inet6_ioctl() to per-netns RTNL. ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup(). ipv6: Hold rtnl_net_lock() in addrconf_dad_work(). ipv6: Hold rtnl_net_lock() in addrconf_verify_work(). ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL. ipv6: Add __in6_dev_get_rtnl_net(). net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags net: mii: Fix the Speed display when the network cable is not connected sysctl net: Remove macro checks for CONFIG_SYSCTL eth: bnxt: update header sizing defaults ...
2025-01-08treewide: Introduce kthread_run_worker[_on_cpu]()Frederic Weisbecker1-1/+1
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08vdpa/vp_vdpa: implement kick_vq_with_data callbackYuxue Liu1-0/+9
Implement the kick_vq_with_data vDPA callback. On kick, we pass the next available data to the hardware by writing it in the kick offset. Signed-off-by: Yuxue Liu <yuxue.liu@jaguarmicro.com> Message-Id: <20241203023743.1757-1-yuxue.liu@jaguarmicro.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-12-23net/mlx5: fs, add counter object to flow destinationMoshe Shemesh1-2/+2
Currently mlx5_flow_destination includes counter_id which is assigned in case we use flow counter on the flow steering rule. However, counter_id is not enough data in case of using HW Steering. Thus, have mlx5_fc object as part of mlx5_flow_destination instead of counter_id and assign it where needed. In case counter_id is received from user space, create a local counter object to represent it. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241219175841.1094544-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12vdpa/mlx5: Fix suboptimal range on iotlb iterationSi-Wei Liu1-3/+1
The starting iova address to iterate iotlb map entry within a range was set to an irrelevant value when passing to the itree_next() iterator, although luckily it doesn't affect the outcome of finding out the granule of the smallest iotlb map size. Fix the code to make it consistent with the following for-loop. Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <20241021134040.975221-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-11-12vdpa/mlx5: Fix PA offset with unaligned starting iotlb mapSi-Wei Liu1-3/+5
When calculating the physical address range based on the iotlb and mr [start,end) ranges, the offset of mr->start relative to map->start is not taken into account. This leads to some incorrect and duplicate mappings. For the case when mr->start < map->start the code is already correct: the range in [mr->start, map->start) was handled by a different iteration. Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Cc: stable@vger.kernel.org Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <20241021134040.975221-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-11-07vdpa/mlx5: Fix error path during device addDragos Tatulea1-16/+5
In the error recovery path of mlx5_vdpa_dev_add(), the cleanup is executed and at the end put_device() is called which ends up calling mlx5_vdpa_free(). This function will execute the same cleanup all over again. Most resources support being cleaned up twice, but the recent mlx5_vdpa_destroy_mr_resources() doesn't. This change drops the explicit cleanup from within the mlx5_vdpa_dev_add() and lets mlx5_vdpa_free() do its work. This issue was discovered while trying to add 2 vdpa devices with the same name: $> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.2 $> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.3 ... yields the following dump: BUG: kernel NULL pointer dereference, address: 00000000000000b8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP CPU: 4 UID: 0 PID: 2811 Comm: vdpa Not tainted 6.12.0-rc6 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:destroy_workqueue+0xe/0x2a0 Code: ... RSP: 0018:ffff88814920b9a8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff888105c10000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff888100400168 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888100120c00 R09: ffffffff828578c0 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888131fd99a0 R14: 0000000000000000 R15: ffff888105c10580 FS: 00007fdfa6b4f740(0000) GS:ffff88852ca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000b8 CR3: 000000018db09006 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x130 ? asm_exc_page_fault+0x22/0x30 ? destroy_workqueue+0xe/0x2a0 mlx5_vdpa_destroy_mr_resources+0x2b/0x40 [mlx5_vdpa] mlx5_vdpa_free+0x45/0x150 [mlx5_vdpa] vdpa_release_dev+0x1e/0x50 [vdpa] device_release+0x31/0x90 kobject_put+0x8d/0x230 mlx5_vdpa_dev_add+0x328/0x8b0 [mlx5_vdpa] vdpa_nl_cmd_dev_add_set_doit+0x2b8/0x4c0 [vdpa] genl_family_rcv_msg_doit+0xd0/0x120 genl_rcv_msg+0x180/0x2b0 ? __vdpa_alloc_device+0x1b0/0x1b0 [vdpa] ? genl_family_rcv_msg_dumpit+0xf0/0xf0 netlink_rcv_skb+0x54/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x1fc/0x2d0 netlink_sendmsg+0x1e4/0x410 __sock_sendmsg+0x38/0x60 ? sockfd_lookup_light+0x12/0x60 __sys_sendto+0x105/0x160 ? __count_memcg_events+0x53/0xe0 ? handle_mm_fault+0x100/0x220 ? do_user_addr_fault+0x40d/0x620 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x4c/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fdfa6c66b57 Code: ... RSP: 002b:00007ffeace22998 EFLAGS: 00000202 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055a498608350 RCX: 00007fdfa6c66b57 RDX: 000000000000006c RSI: 000055a498608350 RDI: 0000000000000003 RBP: 00007ffeace229c0 R08: 00007fdfa6d35200 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000202 R12: 000055a4986082a0 R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeace233f3 </TASK> Modules linked in: ... CR2: 00000000000000b8 Fixes: 62111654481d ("vdpa/mlx5: Postpone MR deletion") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <20241105185101.1323272-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-11-06vp_vdpa: fix id_table array not null terminated errorXiaoguang Wang1-3/+7
Allocate one extra virtio_device_id as null terminator, otherwise vdpa_mgmtdev_get_classes() may iterate multiple times and visit undefined memory. Fixes: ffbda8e9df10 ("vdpa/vp_vdpa : add vdpa tool support in vp_vdpa") Cc: stable@vger.kernel.org Suggested-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com> Signed-off-by: Xiaoguang Wang <lege.wang@jaguarmicro.com> Message-Id: <20241105133518.1494-1-lege.wang@jaguarmicro.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-11-06vDPA/ifcvf: Fix pci_read_config_byte() return code handlingYuan Can1-1/+1
ifcvf_init_hw() uses pci_read_config_byte() that returns PCIBIOS_* codes. The error handling, however, assumes the codes are normal errnos because it checks for < 0. Convert the error check to plain non-zero check. Fixes: 5a2414bc454e ("virtio: Intel IFC VF driver for VDPA") Signed-off-by: Yuan Can <yuancan@huawei.com> Message-Id: <20241017013812.129952-1-yuancan@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Zhu Lingshan <lingshan.zhu@kernel.org>
2024-11-06vdpa: solidrun: Fix UB bug with devresPhilipp Stanner1-4/+10
In psnet_open_pf_bar() and snet_open_vf_bar() a string later passed to pcim_iomap_regions() is placed on the stack. Neither pcim_iomap_regions() nor the functions it calls copy that string. Should the string later ever be used, this, consequently, causes undefined behavior since the stack frame will by then have disappeared. Fix the bug by allocating the strings on the heap through devm_kasprintf(). Cc: stable@vger.kernel.org # v6.3 Fixes: 51a8f9d7f587 ("virtio: vdpa: new SolidNET DPU driver.") Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Closes: https://lore.kernel.org/all/74e9109a-ac59-49e2-9b1d-d825c9c9f891@wanadoo.fr/ Suggested-by: Andy Shevchenko <andy@kernel.org> Signed-off-by: Philipp Stanner <pstanner@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20241028074357.9104-3-pstanner@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-10-07Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-6/+6
Pull virtio fixes from Michael Tsirkin: "Several small bugfixes all over the place. Most notably, fixes the vsock allocation with GFP_KERNEL in atomic context, which has been triggering warnings for lots of testers" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost/scsi: null-ptr-dereference in vhost_scsi_get_req() vsock/virtio: use GFP_ATOMIC under RCU read lock virtio_console: fix misc probe bugs virtio_ring: tag event_triggered as racy for KCSAN vdpa/octeon_ep: Fix format specifier for pointers in debug messages
2024-09-26Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds8-218/+765
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-balloon supports new stats - vdpa supports setting mac address - vdpa/mlx5 suspend/resume as well as MKEY ops are now faster - virtio_fs supports new sysfs entries for queue info - virtio/vsock performance has been improved And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits) vsock/virtio: avoid queuing packets when intermediate queue is empty vsock/virtio: refactor virtio_transport_send_pkt_work fw_cfg: Constify struct kobj_type vdpa/mlx5: Postpone MR deletion vdpa/mlx5: Introduce init/destroy for MR resources vdpa/mlx5: Rename mr_mtx -> lock vdpa/mlx5: Extract mr members in own resource struct vdpa/mlx5: Rename function vdpa/mlx5: Delete direct MKEYs in parallel vdpa/mlx5: Create direct MKEYs in parallel MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section virtio_fs: add sysfs entries for queue information virtio_fs: introduce virtio_fs_put_locked helper vdpa: Remove unused declarations vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command vdpa/mlx5: Small improvement for change_num_qps() vdpa/mlx5: Keep notifiers during suspend but ignore vdpa/mlx5: Parallelize device resume vdpa/mlx5: Parallelize device suspend vdpa/mlx5: Use async API for vq modify commands ...
2024-09-25vdpa/octeon_ep: Fix format specifier for pointers in debug messagesSrujana Challa1-6/+6
Updates the debug messages in octep_vdpa_hw.c to use the %p format specifier for pointers instead of casting them to u64. Fixes smatch warning: octep_hw_caps_read() warn: argument 3 to %016llx specifier is cast from pointer Fixes: 8b6c724cdab8 ("virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202409160431.bRhZWhiU-lkp@intel.com/ Signed-off-by: Srujana Challa <schalla@marvell.com> Message-Id: <20240916162255.677774-1-schalla@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Postpone MR deletionDragos Tatulea3-5/+64
Currently, when a new MR is set up, the old MR is deleted. MR deletion is about 30-40% the time of MR creation. As deleting the old MR is not important for the process of setting up the new MR, this operation can be postponed. This series adds a workqueue that does MR garbage collection at a later point. If the MR lock is taken, the handler will back off and reschedule. The exception during shutdown: then the handler must not postpone the work. Note that this is only a speculative optimization: if there is some mapping operation that is triggered while the garbage collector handler has the lock taken, this operation it will have to wait for the handler to finish. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-9-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Introduce init/destroy for MR resourcesDragos Tatulea4-5/+26
There's currently not a lot of action happening during the init/destroy of MR resources. But more will be added in the upcoming patches. As the mr mutex lock init/destroy has been moved to these new functions, the lifetime has now shifted away from mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources() into these new functions. However, the lifetime at the outer scope remains the same: mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free() Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-8-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Rename mr_mtx -> lockDragos Tatulea4-16/+16
Now that the mr resources have their own namespace in the struct, give the lock a clearer name. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-7-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Extract mr members in own resource structDragos Tatulea4-41/+44
Group all mapping related resources into their own structure. Upcoming patches will add more members in this new structure. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-6-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Rename functionDragos Tatulea3-6/+6
A followup patch will use this name for something else. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-5-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Delete direct MKEYs in parallelDragos Tatulea1-0/+64
Use the async interface to issue MTT MKEY deletion. This makes destroy_user_mr() on average 8x times faster. This number is also dependent on the size of the MR being deleted. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Create direct MKEYs in parallelDragos Tatulea1-22/+98
Use the async interface to issue MTT MKEY creation. Extra care is taken at the allocation of FW input commands due to the MTT tables having variable sizes depending on MR. The indirect MKEY is still created synchronously at the end as the direct MKEYs need to be filled in. This makes create_user_mr() 3-5x faster, depending on the size of the MR. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa: Remove unused declarationsYue Haibing2-4/+0
There is no caller and implementation in tree. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Message-Id: <20240819140930.122019-1-yuehaibing@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Zhu Lingshan <lingshan.zhu@kernel.org> Reviewed-by: Shannon Nelson &lt;<a href="mailto:shannon.nelson@amd.com" target="_blank">shannon.nelson@amd.com</a>&gt;<br> Reviewed-by: Zhu Lingshan <lingshan.zhu@kernel.org>
2024-09-25vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ commandDragos Tatulea1-10/+12
change_num_qps() is still suspending/resuming VQs one by one. This change switches to parallel suspend/resume. When increasing the number of queues the flow has changed a bit for simplicity: the setup_vq() function will always be called before resume_vqs(). If the VQ is initialized, setup_vq() will exit early. If the VQ is not initialized, setup_vq() will create it and resume_vqs() will resume it. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-11-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Small improvement for change_num_qps()Dragos Tatulea1-10/+11
change_num_qps() has a lot of multiplications by 2 to convert the number of VQ pairs to number of VQs. This patch simplifies the code by doing the VQP -> VQ count conversion at the beginning in a variable. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-10-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Keep notifiers during suspend but ignoreDragos Tatulea1-2/+4
Unregistering notifiers is a costly operation. Instead of removing the notifiers during device suspend and adding them back at resume, simply ignore the call when the device is suspended. At resume time call queue_link_work() to make sure that the device state is propagated in case there were changes. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~13 ms to ~2.5 ms. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-9-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Parallelize device resumeDragos Tatulea1-26/+14
Currently device resume works on vqs serially. Building up on previous changes that converted vq operations to the async api, this patch parallelizes the device resume. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device resume time is reduced from ~16 ms to ~4.5 ms. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-8-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Parallelize device suspendDragos Tatulea1-27/+29
Currently device suspend works on vqs serially. Building up on previous changes that converted vq operations to the async api, this patch parallelizes the device suspend: 1) Suspend all active vqs parallel. 2) Query suspended vqs in parallel. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~37 ms to ~13 ms. A later patch will remove the link unregister operation which will make it even faster. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-7-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Use async API for vq modify commandsDragos Tatulea1-48/+106
Switch firmware vq modify command to be issued via the async API to allow future parallelization. The new refactored function applies the modify on a range of vqs and waits for their execution to complete. For now the command is still used in a serial fashion. A later patch will switch to modifying multiple vqs in parallel. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-6-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Use async API for vq query commandDragos Tatulea2-25/+78
Switch firmware vq query command to be issued via the async API to allow future parallelization. For now the command is still serial but the infrastructure is there to issue commands in parallel, including ratelimiting the number of issued async commands to firmware. A later patch will switch to issuing more commands at a time. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-5-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Introduce async fw command wrapperDragos Tatulea2-0/+88
Introduce a new function mlx5_vdpa_exec_async_cmds() which wraps the mlx5_core async firmware command API in a way that will be used to parallelize certain operation in this driver. The wrapper deals with the case when mlx5_cmd_exec_cb() returns EBUSY due to the command being throttled. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Introduce error logging functionDragos Tatulea2-12/+17
mlx5_vdpa_err() was missing. This patch adds it and uses it in the necessary places. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds2-8/