path: root/drivers/infiniband/core
Age | Commit message | Author | Files | Lines
2024-07-14RDMA: Fix netdev tracker in ib_device_set_netdevDavid Ahern1-6/+2
If a netdev has already been assigned, ib_device_set_netdev needs to release the reference on the old netdev, but the release is mistakenly being done on the new netdev. Fix it, and in the process use netdev_put to be symmetrical with the netdev_hold. Fixes: 09f530f0c6d6 ("RDMA: Add netdevice_tracker to ib_device_set_netdev()") Signed-off-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240710203310.19317-1-dsahern@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
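A minimal sketch of the symmetric hold/put pattern this fix restores; the function and variable names are illustrative, not the actual ib_device_set_netdev() internals:

    #include <linux/netdevice.h>

    /* Illustrative only: swap a tracked netdev reference symmetrically. */
    static void example_swap_tracked_netdev(struct net_device **slot,
                                            netdevice_tracker *tracker,
                                            struct net_device *new_ndev)
    {
            /* Drop the reference taken on the previously assigned netdev
             * (netdev_put() is a no-op for a NULL netdev). */
            netdev_put(*slot, tracker);

            *slot = new_ndev;

            /* Take the new reference with the same tracker, mirroring the put. */
            netdev_hold(new_ndev, tracker, GFP_KERNEL);
    }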
2024-07-04RDMA/core: Introduce "name_assign_type" for an IB deviceMark Zhang1-0/+5
The name_assign_type indicates how the name is provided. Currently these types are supported: - RDMA_NAME_ASSIGN_TYPE_UNKNOWN: Unknown or not set; - RDMA_NAME_ASSIGN_TYPE_USER: Name is provided by the user; user-created sub devices and the rxe and siw devices have this type. When filling nl device info, it is set in the new attribute RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE. User-space tools like the udev "rdma_rename" helper can check this attribute to determine whether the device needs to be renamed. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/522591bef9a369cc8e5dcb77787e017bffee37fe.1719837610.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
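A minimal sketch of reporting this via netlink, assuming the attribute name from the text above; the surrounding fill function is illustrative:

    #include <net/netlink.h>

    /* Illustrative only: put the assign type into an nldev reply message. */
    static int example_fill_name_assign_type(struct sk_buff *msg, u8 assign_type)
    {
            /* assign_type is RDMA_NAME_ASSIGN_TYPE_USER or ..._UNKNOWN */
            if (nla_put_u8(msg, RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE, assign_type))
                    return -EMSGSIZE;
            return 0;
    }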
2024-07-01RDMA/nldev: Add support to dump device type and parent device if existsMark Zhang1-0/+10
If a device has a specific type or a parent device, dump them as well. Example: $ rdma dev show smi1 3: smi1: node_type ca fw 20.38.1002 node_guid 9803:9b03:009f:d5ef sys_image_guid 9803:9b03:009f:d5ee type smi parent ibp8s0f1 Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/4c022e3e34b5de1254a3b367d502a362cdd0c53a.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01RDMA/nldev: Add support to add/delete a sub IB device through netlinkMark Zhang1-0/+59
Add new netlink commands and attributes to support adding and deleting a sub IB device with admin privilege. Examples: $ rdma dev add smi1 type SMI parent ibp8s0f1 $ rdma dev del smi1 Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/77cbf1b36359642be8a8d8c5c2f4e585b544282f.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01RDMA/core: Create GSI QP only when CM is supportedMark Zhang2-13/+28
A GSI QP is not needed if the port doesn't support connection management. In the following patches mlx5 is going to support IB ports that don't support CM. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/c449ebd955923b0e54c58832fd322f9d461b37a0.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
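A minimal sketch of the kind of gate this implies, using the existing rdma_cap_ib_cm() predicate; the surrounding helper is illustrative:

    #include <rdma/ib_verbs.h>

    /* Illustrative only: skip GSI QP setup on ports without CM support. */
    static bool example_port_needs_gsi_qp(struct ib_device *dev, u32 port)
    {
            return rdma_cap_ib_cm(dev, port);
    }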
2024-07-01RDMA/core: Support IB sub device with type "SMI"Mark Zhang2-1/+71
This patch adds 2 APIs, as well as driver operations, to support adding and deleting an IB sub device, which provides part of the functionality of its parent. A sub device has a type; a sub device with type "SMI" provides the SMI capability through umad for its parent, meaning uverbs is not supported. A sub device cannot live without a parent, so when a parent is released, all its sub devices are released as well. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/44253f7508b21eb2caefea3980c2bc072869116c.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01RDMA/core: Create "issm*" device nodes only when SMI is supportedMark Zhang1-11/+18
For an IB port, create its issm device node only when it has SMI capability. In the following patches mlx5 is going to support IB devices without this capability. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/359f73c9a388d5e3ae971e40d8507888b1ba6f93.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01RDMA/device: Return error earlier if port is not validLeon Romanovsky1-3/+3
There is no need to allocate port data if the provided port is not valid. Fixes: c2261dd76b54 ("RDMA/device: Add ib_device_set_netdev() as an alternative to get_netdev") Link: https://lore.kernel.org/r/022047a8b16988fc88d4426da50bf60a4833311b.1719235449.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
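A minimal sketch of the validate-first ordering, using the existing rdma_is_port_valid() helper; the allocation helper named here is hypothetical:

    #include <rdma/ib_verbs.h>

    int example_alloc_port_data(struct ib_device *dev, u32 port); /* hypothetical */

    /* Illustrative only: reject an invalid port before doing any allocation. */
    static int example_set_port_data(struct ib_device *ib_dev, u32 port)
    {
            if (!rdma_is_port_valid(ib_dev, port))
                    return -EINVAL;

            return example_alloc_port_data(ib_dev, port);
    }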
2024-06-27RDMA: Pass entire uverbs attr bundle to create cq functionAkiva Goldberger2-2/+2
Changes the create_cq verb signature by sending the entire uverbs attr bundle as a parameter. This allows drivers to send driver specific attrs through ioctl for the create_cq verb and access them in their driver specific code. Also adds a new enum value for driver specific ioctl attributes for methods already supporting UHW. Link: https://lore.kernel.org/r/ed147343987c0d43fd391c1b2f85e2f425747387.1719512393.git.leon@kernel.org Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
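A hedged sketch of the reshaped driver op this describes; the exact member layout of the real struct ib_device_ops may differ:

    #include <rdma/ib_verbs.h>

    struct uverbs_attr_bundle;

    /* Illustrative only: create_cq now receives the whole attr bundle instead of a
     * bare struct ib_udata, so drivers can parse driver-specific ioctl attrs too. */
    struct example_cq_ops {
            int (*create_cq)(struct ib_cq *cq, const struct ib_cq_init_attr *attr,
                             struct uverbs_attr_bundle *attrs);
    };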
2024-06-26IB/core: add support for draining Shared receive queuesMax Gurtovoy1-1/+81
To avoid leakage for QPs associated with an SRQ, according to the IB spec (section 10.3.1): "Note, for QPs that are associated with an SRQ, the Consumer should take the QP through the Error State before invoking a Destroy QP or a Modify QP to the Reset State. The Consumer may invoke the Destroy QP without first performing a Modify QP to the Error State and waiting for the Affiliated Asynchronous Last WQE Reached Event. However, if the Consumer does not wait for the Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment leakage may occur. Therefore, it is good programming practice to teardown a QP that is associated with an SRQ by using the following process:
- Put the QP in the Error State;
- wait for the Affiliated Asynchronous Last WQE Reached Event;
- either:
  - drain the CQ by invoking the Poll CQ verb and either wait for CQ to be empty or the number of Poll CQ operations has exceeded CQ capacity size; or
  - post another WR that completes on the same CQ and wait for this WR to return as a WC;
- and then invoke a Destroy QP or Reset QP."
Catch the Last WQE Reached Event in the core layer during the drain QP flow. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Link: https://lore.kernel.org/r/20240619171153.34631-2-mgurtovoy@nvidia.com Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
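A minimal sketch of that teardown sequence from a consumer's point of view, with error handling and the event wait left out; after this change the core drain flow catches the Last WQE Reached event itself:

    #include <rdma/ib_verbs.h>

    /* Illustrative only: spec-recommended teardown of a QP attached to an SRQ. */
    static void example_teardown_srq_qp(struct ib_qp *qp)
    {
            struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };

            /* 1. Move the QP to the Error state. */
            ib_modify_qp(qp, &attr, IB_QP_STATE);

            /* 2. Wait for IB_EVENT_QP_LAST_WQE_REACHED via the QP's event handler. */

            /* 3. Drain the CQ (or wait for a marker WR to flush), then destroy. */
            ib_destroy_qp(qp);
    }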
2024-06-21RDMA/cache: Release GID table even if leak is detectedLeon Romanovsky1-9/+5
When the table is released, we nullify the pointer to the GID table, which means that if a GID entry leak is detected, we will leak the table too. Delete the code that prevents table destruction. Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts") Link: https://lore.kernel.org/r/a62560af06ba82c88ef9194982bfa63d14768ff9.1716900410.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09RDMA/iwcm: Fix a use-after-free related to destroying CM IDsBart Van Assche1-4/+7
iw_conn_req_handler() associates a new struct rdma_id_private (conn_id) with an existing struct iw_cm_id (cm_id) as follows:
conn_id->cm_id.iw = cm_id;
cm_id->context = conn_id;
cm_id->cm_handler = cma_iw_handler;
rdma_destroy_id() frees both the cm_id and the struct rdma_id_private. Make sure that cm_work_handler() does not trigger a use-after-free by only freeing the struct rdma_id_private after all pending work has finished. Cc: stable@vger.kernel.org Fixes: 59c68ac31e15 ("iw_cm: free cm_id resources on the last deref") Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240605145117.397751-6-bvanassche@acm.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09RDMA/iwcm: Simplify cm_work_handler()Bart Van Assche1-6/+1
Instead of complicating the code to avoid a spin_lock_irqsave() / spin_lock_irqrestore() pair before returning, simplify the code by removing the local variable 'empty'. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240605145117.397751-5-bvanassche@acm.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09RDMA/iwcm: Simplify cm_event_handler()Bart Van Assche1-5/+2
queue_work() can test efficiently whether or not work is pending. Hence, simplify cm_event_handler() by always calling queue_work() instead of only if the list with pending work is empty. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240605145117.397751-4-bvanassche@acm.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
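A minimal sketch of why the list check can go away: queue_work() itself refuses to double-queue pending work. The context struct here is illustrative:

    #include <linux/workqueue.h>

    struct example_ctx {
            struct workqueue_struct *wq;
            struct work_struct event_work;
    };

    /* Illustrative only: call queue_work() unconditionally; it returns false
     * (and does nothing) if the work item is already pending. */
    static void example_event_handler(struct example_ctx *ctx)
    {
            queue_work(ctx->wq, &ctx->event_work);
    }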
2024-06-09RDMA/iwcm: Change the return type of iwcm_deref_id()Bart Van Assche1-4/+4
Since iwcm_deref_id() returns either 0 or 1, change its return type from 'int' into 'bool'. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240605145117.397751-3-bvanassche@acm.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09RDMA/iwcm: Use list_first_entry() where appropriateBart Van Assche1-4/+4
Improve source code readability by using list_first_entry() where appropriate. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240605145117.397751-2-bvanassche@acm.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
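A minimal sketch of the pattern, assuming an illustrative list element type:

    #include <linux/list.h>

    struct example_work {
            struct list_head list;
    };

    /* Illustrative only: take the head element with list_first_entry() instead
     * of open-coding container_of(head->next, ...). */
    static struct example_work *example_pop_first(struct list_head *head)
    {
            if (list_empty(head))
                    return NULL;
            return list_first_entry(head, struct example_work, list);
    }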
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)1-2/+2
With the rework of how __string() handles dynamic strings, where it saves off the source string in a field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with:
__string(field, mystring)
which used to be assigned with __assign_str(field, mystring), the second parameter is no longer needed and is unused. With this, __assign_str() will now only get a single parameter. There are over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script:
git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done
I then searched for __assign_str() calls that did not end with ';' as those were multi-line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len(), __assign_rel_str(), __assign_rel_str_len(). I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
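A minimal sketch of what a trace event assignment looks like after this change; the event and field names are placeholders, and the fragment is assumed to live in a normal trace header included via CREATE_TRACE_POINTS:

    /* Illustrative only: single-parameter __assign_str() after the rework. */
    TRACE_EVENT(example_event,
            TP_PROTO(const char *name),
            TP_ARGS(name),
            TP_STRUCT__entry(
                    __string(name, name)
            ),
            TP_fast_assign(
                    __assign_str(name);     /* was: __assign_str(name, name) */
            ),
            TP_printk("name=%s", __get_str(name))
    );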
2024-05-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds7-74/+53
Pull rdma updates from Jason Gunthorpe: "Aside from the usual things this has an arch update for __iowrite64_copy() used by the RDMA drivers. This API was intended to generate large 64 byte MemWr TLPs on PCI. These days most processors had done this by just repeating writel() in a loop. S390 and some new ARM64 designs require a special helper to get this to generate.
- Small improvements and fixes for erdma, efa, hfi1, bnxt_re
- Fix a UAF crash after module unload on leaking restrack entry
- Continue adding full RDMA support in mana with support for EQs, GID's and CQs
- Improvements to the mkey cache in mlx5
- DSCP traffic class support in hns and several bug fixes
- Cap the maximum number of MADs in the receive queue to avoid OOM
- Another batch of rxe bug fixes from large scale testing
- __iowrite64_copy() optimizations for write combining MMIO memory
- Remove NULL checks before dev_put/hold()
- EFA support for receive with immediate
- Fix a recent memleaking regression in a cma error path"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits)
  RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
  RDMA/IPoIB: Fix format truncation compilation errors
  bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq
  RDMA/efa: Support QP with unsolicited write w/ imm. receive
  IB/hfi1: Remove generic .ndo_get_stats64
  IB/hfi1: Do not use custom stat allocator
  RDMA/hfi1: Use RMW accessors for changing LNKCTL2
  RDMA/mana_ib: implement uapi for creation of rnic cq
  RDMA/mana_ib: boundary check before installing cq callbacks
  RDMA/mana_ib: introduce a helper to remove cq callbacks
  RDMA/mana_ib: create and destroy RNIC cqs
  RDMA/mana_ib: create EQs for RNIC CQs
  RDMA/core: Remove NULL check before dev_{put, hold}
  RDMA/ipoib: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.
  RDMA/core: Add an option to display driver-specific QPs in the rdmatool
  RDMA/efa: Add shutdown notifier
  RDMA/mana_ib: Fix missing ret value
  IB/mlx5: Use __iowrite64_copy() for write combining stores
  ...
2024-05-12RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siwZhu Yanjun1-1/+3
When running blktests nvme/rdma, the following kmemleak issue will appear.
kmemleak: Kernel memory leak detector initialized (mempool available:36041)
kmemleak: Automatic memory scanning thread started
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 8 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 17 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
unreferenced object 0xffff88855da53400 (size 192):
  comm "rdma", pid 10630, jiffies 4296575922
  hex dump (first 32 bytes):
    37 00 00 00 00 00 00 00 c0 ff ff ff 1f 00 00 00  7...............
    10 34 a5 5d 85 88 ff ff 10 34 a5 5d 85 88 ff ff  .4.].....4.]....
  backtrace (crc 47f66721):
    [<ffffffff911251bd>] kmalloc_trace+0x30d/0x3b0
    [<ffffffffc2640ff7>] alloc_gid_entry+0x47/0x380 [ib_core]
    [<ffffffffc2642206>] add_modify_gid+0x166/0x930 [ib_core]
    [<ffffffffc2643468>] ib_cache_update.part.0+0x6d8/0x910 [ib_core]
    [<ffffffffc2644e1a>] ib_cache_setup_one+0x24a/0x350 [ib_core]
    [<ffffffffc263949e>] ib_register_device+0x9e/0x3a0 [ib_core]
    [<ffffffffc2a3d389>] 0xffffffffc2a3d389
    [<ffffffffc2688cd8>] nldev_newlink+0x2b8/0x520 [ib_core]
    [<ffffffffc2645fe3>] rdma_nl_rcv_msg+0x2c3/0x520 [ib_core]
    [<ffffffffc264648c>] rdma_nl_rcv_skb.constprop.0.isra.0+0x23c/0x3a0 [ib_core]
    [<ffffffff9270e7b5>] netlink_unicast+0x445/0x710
    [<ffffffff9270f1f1>] netlink_sendmsg+0x761/0xc40
    [<ffffffff9249db29>] __sys_sendto+0x3a9/0x420
    [<ffffffff9249dc8c>] __x64_sys_sendto+0xdc/0x1b0
    [<ffffffff92db0ad3>] do_syscall_64+0x93/0x180
    [<ffffffff92e00126>] entry_SYSCALL_64_after_hwframe+0x71/0x79
The root cause: rdma_put_gid_attr is not called when sgid_attr is set to ERR_PTR(-ENODEV). Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/19bf5745-1b3b-4b8a-81c2-20d945943aaf@linux.dev/T/ Fixes: f8ef1be816bf ("RDMA/cma: Avoid GID lookups on iWARP devices") Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20240510211247.31345-1-yanjun.zhu@linux.dev Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05RDMA/core: Remove NULL check before dev_{put, hold}Jules Irenge3-11/+5
Coccinelle reports a warning: "WARNING: NULL check before dev_{put, hold} functions is not needed". The reason is that the netdev_{put, hold} calls inside dev_{put, hold} already check for NULL, so there is no need to check before using dev_{put, hold}. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Link: https://lore.kernel.org/r/ZjF1Eedxwhn4JSkz@octinomon.home Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-30inet: introduce dst_rtable() helperEric Dumazet1-9/+3
I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper") This patch does a similar change for IPv4. Instead of (struct rtable *)dst casts, we can use : #define dst_rtable(_ptr) \ container_of_const(_ptr, struct rtable, dst) Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
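A minimal usage sketch of the new helper defined above; the accessor function around it is illustrative:

    #include <net/route.h>

    /* Illustrative only: use dst_rtable() instead of an open-coded cast.
     * rt_gw4 is only meaningful when the route actually has an IPv4 gateway. */
    static __be32 example_ipv4_gateway(const struct dst_entry *dst)
    {
            const struct rtable *rt = dst_rtable(dst);  /* was: (struct rtable *)dst */

            return rt->rt_gw4;
    }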
2024-04-30RDMA/core: Add an option to display driver-specific QPs in the rdmatoolChiara Meiohas2-6/+29
Utilize the -dd flag (driver-specific details) in the rdmatool to view driver-specific QPs which are not exposed yet. Add the netlink attribute to mark a request to convey driver details and use it to return the QP subtype as a string.
$ rdma resource show qp link ibp8s0f1
link ibp8s0f1/1 lqpn 360 type UD state RTS sq-psn 0 comm [mlx5_ib]
link ibp8s0f1/1 lqpn 0 type SMI state RTS sq-psn 0 comm [ib_core]
link ibp8s0f1/1 lqpn 1 type GSI state RTS sq-psn 0 comm [ib_core]
$ rdma resource show qp link ibp8s0f1 -dd
link ibp8s0f1/1 lqpn 360 type UD state RTS sq-psn 0 comm [mlx5_ib]
link ibp8s0f1/1 lqpn 465 type DRIVER subtype REG_UMR state RTS sq-psn 0 comm [mlx5_ib]
link ibp8s0f1/1 lqpn 0 type SMI state RTS sq-psn 0 comm [ib_core]
link ibp8s0f1/1 lqpn 1 type GSI state RTS sq-psn 0 comm [ib_core]
$ rdma resource show
0: ibp8s0f0: pd 3 cq 4 qp 3 cm_id 0 mr 0 ctx 0 srq 2
1: ibp8s0f1: pd 3 cq 4 qp 3 cm_id 0 mr 0 ctx 0 srq 2
$ rdma resource show -dd
0: ibp8s0f0: pd 3 cq 4 qp 4 cm_id 0 mr 0 ctx 0 srq 2
1: ibp8s0f1: pd 3 cq 4 qp 4 cm_id 0 mr 0 ctx 0 srq 2
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://lore.kernel.org/r/2607bb3ddec3cae3443c2ea19e9f700825d20a98.1713268997.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-29ipv6: introduce dst_rt6_info() helperEric Dumazet1-3/+3
Instead of (struct rt6_info *)dst casts, we can use : #define dst_rt6_info(_ptr) \ container_of_const(_ptr, struct rt6_info, dst) Some places needed missing const qualifiers : ip6_confirm_neigh(), ipv6_anycast_destination(), ipv6_unicast_destination(), has_gateway() v2: added missing parts (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-21IB/core: Implement a limit on UMAD receive ListMichael Guralnik1-6/+15
The existing behavior of ib_umad, which maintains received MAD packets in an unbounded list, poses a risk of uncontrolled growth. As user-space applications extract packets from this list, the rate of extraction may not match the rate of incoming packets, leading to potential list overflow. To address this, we introduce a limit to the size of the list. After considering typical scenarios, such as OpenSM processing, which can handle approximately 100k packets per second, and the 1-second retry timeout for most packets, we set the list size limit to 200k. Packets received beyond this limit are dropped, assuming they are likely timed out by the time they are handled by user-space. Notably, packets queued on the receive list due to reasons like timed-out sends are preserved even when the list is full. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/7197cb58a7d9e78399008f25036205ceab07fbd5.1713268818.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
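A minimal sketch of the bounded-list idea described above; the 200k limit follows the text, while the structure and helper names are illustrative:

    #include <linux/list.h>
    #include <linux/atomic.h>

    #define EXAMPLE_MAX_UMAD_RECV_LIST_SIZE 200000

    struct example_umad_file {
            struct list_head recv_list;
            atomic_t recv_list_size;
    };

    /* Illustrative only: drop newly received MADs once the backlog is full,
     * on the assumption they would have timed out before user space reads them. */
    static bool example_queue_recv_mad(struct example_umad_file *file,
                                       struct list_head *packet)
    {
            if (atomic_read(&file->recv_list_size) >=
                EXAMPLE_MAX_UMAD_RECV_LIST_SIZE)
                    return false;

            list_add_tail(packet, &file->recv_list);
            atomic_inc(&file->recv_list_size);
            return true;
    }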
2024-04-01RDMA/cm: Print the old state when cm_destroy_id gets timeoutMark Zhang1-4/+7
The old state is helpful for debugging, as the current state is always IB_CM_IDLE when timeout happens. Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/20240322112049.2022994-1-markzhang@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01RDMA/restrack: Fix potential invalid address accessWenchao Hao1-50/+1
struct rdma_restrack_entry's kern_name was set to KBUILD_MODNAME in ib_create_cq(); if the module exited but forgot to delete this rdma_restrack_entry, it would cause an invalid address access in rdma_restrack_clean() when printing the owner of this rdma_restrack_entry. This code was used to help find a forgotten PD release in one of the ULPs, but it is not needed anymore, so delete it. Signed-off-by: Wenchao Hao <haowenchao2@huawei.com> Link: https://lore.kernel.org/r/20240318092320.1215235-1-haowenchao2@huawei.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-10RDMA/cm: add timeout to cm_destroy_id waitManjunath Patil1-1/+19
Add a timeout to cm_destroy_id, so that userspace can trigger any data collection that would help in analyzing the cause of delay in destroying the cm_id. A new noinline function helps dtrace/ebpf programs to hook on to it. Existing functionality isn't changed except triggering a probe-able new function at every timeout interval. We have seen cases where CM messages got stuck in the MAD layer (either due to a software bug or a faulty HCA), leading to the cm_id getting stuck in the following call stack. This patch helps in resolving such issues faster.
kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds. ...
Call Trace:
 __schedule+0x2bc/0x895
 schedule+0x36/0x7c
 schedule_timeout+0x1f6/0x31f
 ? __slab_free+0x19c/0x2ba
 wait_for_completion+0x12b/0x18a
 ? wake_up_q+0x80/0x73
 cm_destroy_id+0x345/0x610 [ib_cm]
 ib_destroy_cm_id+0x10/0x20 [ib_cm]
 rdma_destroy_id+0xa8/0x300 [rdma_cm]
 ucma_destroy_id+0x13e/0x190 [rdma_ucm]
 ucma_write+0xe0/0x160 [rdma_ucm]
 __vfs_write+0x3a/0x16d
 vfs_write+0xb2/0x1a1
 ? syscall_trace_enter+0x1ce/0x2b8
 SyS_write+0x5c/0xd3
 do_syscall_64+0x79/0x1b9
 entry_SYSCALL_64_after_hwframe+0x16d/0x0
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Link: https://lore.kernel.org/r/20240309063323.458102-1-manjunath.b.patil@oracle.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
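A minimal sketch of the wait-with-probe pattern this describes; the function names and the timeout interval are illustrative:

    #include <linux/completion.h>
    #include <linux/jiffies.h>

    /* Illustrative only: an empty noinline function gives dtrace/eBPF a stable
     * symbol to attach to each time the wait times out. */
    static noinline void example_cm_destroy_id_wait_timeout(void *cm_id)
    {
    }

    static void example_wait_for_destroy(struct completion *comp, void *cm_id)
    {
            while (!wait_for_completion_timeout(comp, msecs_to_jiffies(10000)))
                    example_cm_destroy_id_wait_timeout(cm_id);
    }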
2024-03-03RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warningsGustavo A. R. Silva1-33/+45
-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting ready to enable it globally. There are currently a couple of objects (`alloc_head` and `bundle`) in `struct bundle_priv` that contain a couple of flexible structures:
struct bundle_priv {
	/* Must be first */
	struct bundle_alloc_head alloc_head;
	...
	/*
	 * Must be last. bundle ends in a flex array which overlaps
	 * internal_buffer.
	 */
	struct uverbs_attr_bundle bundle;
	u64 internal_buffer[32];
};
So, in order to avoid ending up with a couple of flexible-array members in the middle of a struct, we use the `struct_group_tagged()` helper to separate the flexible array from the rest of the members in the flexible structures:
struct uverbs_attr_bundle {
	struct_group_tagged(uverbs_attr_bundle_hdr, hdr,
		... the rest of the members
	);
	struct uverbs_attr attrs[];
};
With the change described above, we now declare objects of the type of the tagged struct without embedding flexible arrays in the middle of another struct:
struct bundle_priv {
	/* Must be first */
	struct bundle_alloc_head_hdr alloc_head;
	...
	struct uverbs_attr_bundle_hdr bundle;
	u64 internal_buffer[32];
};
We also use `container_of()` whenever we need to retrieve a pointer to the flexible structures. Notice that the `bundle_size` computed in `uapi_compute_bundle_size()` remains the same. So, with these changes, fix the following warnings:
drivers/infiniband/core/uverbs_ioctl.c:45:34: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
   45 |         struct bundle_alloc_head alloc_head;
      |                                  ^~~~~~~~~~
drivers/infiniband/core/uverbs_ioctl.c:67:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
   67 |         struct uverbs_attr_bundle bundle;
      |                                   ^~~~~~
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/ZeIgeZ5Sb0IZTOyt@neat Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-21RDMA/uverbs: Remove flexible arrays from struct *_filterErick Archer1-8/+8
When a struct containing a flexible array is included in another struct, and there is a member after the struct-with-flex-array, there is a possibility of memory overlap. These cases must be audited [1]. See:
struct inner {
	...
	int flex[];
};
struct outer {
	...
	struct inner header;
	int overlap;
	...
};
This is the scenario for all the "struct *_filter" structures that are included in the following "struct ib_flow_spec_*" structures:
struct ib_flow_spec_eth
struct ib_flow_spec_ib
struct ib_flow_spec_ipv4
struct ib_flow_spec_ipv6
struct ib_flow_spec_tcp_udp
struct ib_flow_spec_tunnel
struct ib_flow_spec_esp
struct ib_flow_spec_gre
struct ib_flow_spec_mpls
The pattern is like the one shown below:
struct *_filter {
	...
	u8 real_sz[];
};
struct ib_flow_spec_* {
	...
	struct *_filter val;
	struct *_filter mask;
};
In this case, the trailing flexible array "real_sz" is never allocated and is only used to calculate the size of the structures. Here the use of the "offsetof" helper can be changed by the "sizeof" operator because the goal is to get the size of these structures. Therefore, the trailing flexible arrays can also be removed. However, due to the trailing padding that can be induced in structs it is possible that the:
offsetof(struct *_filter, real_sz) != sizeof(struct *_filter)
This situation happens with the "struct ib_flow_ipv6_filter" and to avoid it the "__packed" macro is used in this structure. But now, the "sizeof(struct ib_flow_ipv6_filter)" has changed. This is not a problem since this size is not used in the code. The situation now is that "sizeof(struct ib_flow_spec_ipv6)" has also changed (this struct contains the struct ib_flow_ipv6_filter). This is also not a problem since it is only used to set the size of the "union ib_flow_spec", which can store all the "ib_flow_spec_*" structures. Link: https://lore.kernel.org/r/20240217142913.4285-1-erick.archer@gmx.com Signed-off-by: Erick Archer <erick.archer@gmx.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-02-21RDMA/device: Fix a race between mad_client and cm_client initShifeng Li1-15/+22
The mad_client will be initialized in enable_device_and_get(), while the devices_rwsem will be downgraded to a read semaphore. There is a window that leads to a failed initialization of the cm_client, since it can not get a matched mad port from ib_mad_port_list, and the matched mad port will be added to the list after that.
mad_client                                | cm_client
------------------------------------------|------------------------------------------
ib_register_device                        |
enable_device_and_get                     |
down_write(&devices_rwsem)                |
xa_set_mark(&devices, DEVICE_REGISTERED)  |
downgrade_write(&devices_rwsem)           |
                                          | ib_cm_init
                                          | ib_register_client(&cm_client)
                                          | down_read(&devices_rwsem)
                                          | xa_for_each_marked (&devices, DEVICE_REGISTERED)
                                          | add_client_context
                                          | cm_add_one
                                          | ib_register_mad_agent
                                          | ib_get_mad_port
                                          | __ib_get_mad_port
                                          | list_for_each_entry(entry, &ib_mad_port_list, port_list)
                                          | return NULL
                                          | up_read(&devices_rwsem)
add_client_context                        |
ib_mad_init_device                        |
ib_mad_port_open                          |
list_add_tail(&port_priv->port_list, &ib_mad_port_list)
up_read(&devices_rwsem)                   |
Fix it by using down_write(&devices_rwsem) in ib_register_client(). Fixes: d0899892edd0 ("RDMA/device: Provide APIs from the core code to help unregistration") Link: https://lore.kernel.org/r/20240203035313.98991-1-lishifeng@sangfor.com.cn Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgszMike Marciniszyn1-6/+0
64k pages introduce the situation in this diagram when the HCA 4k page size is being used:
+-------------------------------------------+ <--- 64k aligned VA
|                                           |
|                HCA 4k page                |
|                                           |
+-------------------------------------------+
|                     o                     |
|                                           |
|                     o                     |
|                                           |
|                     o                     |
+-------------------------------------------+
|                                           |
|                HCA 4k page                |
|                                           |
+-------------------------------------------+ <--- Live HCA page
|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
|                                           | <--- VA
|                  MR data                  |
+-------------------------------------------+
|                                           |
|                HCA 4k page                |
|                                           |
+-------------------------------------------+
|                     o                     |
|                                           |
|                     o                     |
|                                           |
|                     o                     |
+-------------------------------------------+
|                                           |
|                HCA 4k page                |
|                                           |
+-------------------------------------------+
The VA addresses coming from rdma-core in this diagram can be arbitrary, but for 64k pages, the VA may be offset by some number of HCA 4k pages and followed by some number of HCA 4k pages. The current iterator doesn't account for either the preceding 4k pages or the following 4k pages. Fix the issue by extending the ib_block_iter to contain the number of DMA pages like comment [1] says and by using __sg_advance to start the iterator at the first live HCA page. The changes are contained in a parallel set of iterator start and next functions that are umem aware and specific to umem since there is one user of the rdma_for_each_block() without umem. These two fixes prevent the extra pages before and after the user MR data. Fix the preceding pages by using the __sg_advance field to start at the first 4k page containing MR data. Fix the following pages by saving the number of pgsz blocks in the iterator state and downcounting on each next. This fix allows for the elimination of the small page crutch noted in the Fixes. Fixes: 10c75ccb54e4 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()") Link: https://lore.kernel.org/r/20231129202143.1434-2-shiraz.saleem@intel.com Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-11-13RDMA/core: Fix uninit-value access in ib_get_eth_speed()Shigeru Yoshida1-1/+1
KMSAN reported the following uninit-value access issue:
lo speed is unknown, defaulting to 1000
=====================================================
BUG: KMSAN: uninit-value in ib_get_width_and_speed drivers/infiniband/core/verbs.c:1889 [inline]
BUG: KMSAN: uninit-value in ib_get_eth_speed+0x546/0xaf0 drivers/infiniband/core/verbs.c:1998
 ib_get_width_and_speed drivers/infiniband/core/verbs.c:1889 [inline]
 ib_get_eth_speed+0x546/0xaf0 drivers/infiniband/core/verbs.c:1998
 siw_query_port drivers/infiniband/sw/siw/siw_verbs.c:173 [inline]
 siw_get_port_immutable+0x6f/0x120 drivers/infiniband/sw/siw/siw_verbs.c:203
 setup_port_data drivers/infiniband/core/device.c:848 [inline]
 setup_device drivers/infiniband/core/device.c:1244 [inline]
 ib_register_device+0x1589/0x1df0 drivers/infiniband/core/device.c:1383
 siw_device_register drivers/infiniband/sw/siw/siw_main.c:72 [inline]
 siw_newlink+0x129e/0x13d0 drivers/infiniband/sw/siw/siw_main.c:490
 nldev_newlink+0x8fd/0xa60 drivers/infiniband/core/nldev.c:1763
 rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
 rdma_nl_rcv+0xe8a/0x1120 drivers/infiniband/core/netlink.c:259
 netlink_unicast_kernel net/netlink/af_netlink.c:1342 [inline]
 netlink_unicast+0xf4b/0x1230 net/netlink/af_netlink.c:1368
 netlink_sendmsg+0x1242/0x1420 net/netlink/af_netlink.c:1910
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg net/socket.c:745 [inline]
 ____sys_sendmsg+0x997/0xd60 net/socket.c:2588
 ___sys_sendmsg+0x271/0x3b0 net/socket.c:2642
 __sys_sendmsg net/socket.c:2671 [inline]
 __do_sys_sendmsg net/socket.c:2680 [inline]
 __se_sys_sendmsg net/socket.c:2678 [inline]
 __x64_sys_sendmsg+0x2fa/0x4a0 net/socket.c:2678
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b
Local variable lksettings created at:
 ib_get_eth_speed+0x4b/0xaf0 drivers/infiniband/core/verbs.c:1974
 siw_query_port drivers/infiniband/sw/siw/siw_verbs.c:173 [inline]
 siw_get_port_immutable+0x6f/0x120 drivers/infiniband/sw/siw/siw_verbs.c:203
CPU: 0 PID: 11257 Comm: syz-executor.1 Not tainted 6.6.0-14500-g1c41041124bd #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
=====================================================
If __ethtool_get_link_ksettings() fails, `netdev_speed` is set to the default value, SPEED_1000. In this case, if `lanes` field of struct ethtool_link_ksettings is not initialized, an uninitialized value is passed to ib_get_width_and_speed(). This causes the above issue. This patch resolves the issue by initializing `lanes` to 0. Fixes: cb06b6b3f6cb ("RDMA/core: Get IB width and speed from netdev") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://lore.kernel.org/r/20231108143113.1360567-1-syoshida@redhat.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
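A minimal sketch of the defensive pattern, assuming the caller only needs the lane count; the surrounding helper is illustrative:

    #include <linux/ethtool.h>
    #include <linux/netdevice.h>

    /* Illustrative only: zero-initialize the ksettings so a failed query cannot
     * leave .lanes (or anything else) holding an uninitialized value. */
    static u32 example_query_lanes(struct net_device *ndev)
    {
            struct ethtool_link_ksettings ks = {};

            if (__ethtool_get_link_ksettings(ndev, &ks))
                    return 0;       /* query failed: report zero lanes */

            return ks.lanes;
    }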
2023-11-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds12-27/+112
Pull rdma updates from Jason Gunthorpe: "Nothing exciting this cycle, most of the diffstat is changing SPDX 'or' to 'OR'.
Summary:
- Bugfixes for hns, mlx5, and hfi1
- Hardening patches for size_*, counted_by, strscpy
- rts fixes from static analysis
- Dump SRQ objects in rdma netlink, with hns support
- Fix a performance regression in mlx5 MR deregistration
- New XDR (200Gb/lane) link speed
- SRQ record doorbell latency optimization for hns
- IPSEC support for mlx5 multi-port mode
- ibv_rereg_mr() support for irdma
- Affiliated event support for bnxt_re
- Opt out for the spec compliant qkey security enforcement as we discovered SW that breaks under enforcement
- Comment and trivial updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (50 commits)
  IB/mlx5: Fix init stage error handling to avoid double free of same QP and UAF
  RDMA/mlx5: Fix mkey cache WQ flush
  RDMA/hfi1: Workaround truncation compilation error
  IB/hfi1: Fix potential deadlock on &irq_src_lock and &dd->uctxt_lock
  RDMA/core: Remove NULL check before dev_{put, hold}
  RDMA/hfi1: Remove redundant assignment to pointer ppd
  RDMA/mlx5: Change the key being sent for MPV device affiliation
  RDMA/bnxt_re: Fix clang -Wimplicit-fallthrough in bnxt_re_handle_cq_async_error()
  RDMA/hns: Fix init failure of RoCE VF and HIP08
  RDMA/hns: Fix unnecessary port_num transition in HW stats allocation
  RDMA/hns: The UD mode can only be configured with DCQCN
  RDMA/hns: Add check for SL
  RDMA/hns: Fix signed-unsigned mixed comparisons
  RDMA/hns: Fix uninitialized ucmd in hns_roce_create_qp_common()
  RDMA/hns: Fix printing level of asynchronous events
  RDMA/core: Add support to set privileged QKEY parameter
  RDMA/bnxt_re: Do not report SRQ error in srq notification
  RDMA/bnxt_re: Report async events and errors
  RDMA/bnxt_re: Update HW interface headers
  IB/mlx5: Fix rdma counter binding for RAW QP
  ...
2023-11-01Merge tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linuxLinus Torvalds2-2/+0
Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month.
It is worth re-iterating the value:
- this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer incurred now when we move sysctls out from kernel/sysctl.c to their own files
For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now"
* tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits)
  watchdog: move softlockup_panic back to early_param
  proc: sysctl: prevent aliased sysctls from getting passed to init
  intel drm: Remove now superfluous sentinel element from ctl_table array
  Drivers: hv: Remove now superfluous sentinel element from ctl_table array
  raid: Remove now superfluous sentinel element from ctl_table array
  fw loader: Remove the now superfluous sentinel element from ctl_table array
  sgi-xp: Remove the now superfluous sentinel element from ctl_table array
  vrf: Remove the now superfluous sentinel element from ctl_table array
  char-misc: Remove the now superfluous sentinel element from ctl_table array
  infiniband: Remove the now superfluous sentinel element from ctl_table array
  macintosh: Remove the now superfluous sentinel element from ctl_table array
  parport: Remove the now superfluous sentinel element from ctl_table array
  scsi: Remove now superfluous sentinel element from ctl_table array
  tty: Remove now superfluous sentinel element from ctl_table array
  xen: Remove now superfluous sentinel element from ctl_table array
  hpet: Remove now superfluous sentinel element from ctl_table array
  c-sky: Remove now superfluous sentinel element from ctl_talbe array
  powerpc: Remove now superfluous sentinel element from ctl_table arrays
  riscv: Remove now superfluous sentinel element from ctl_table array
  x86/vdso: Remove now superfluous sentinel element from ctl_table array
  ...
2023-10-31Merge tag 'v6.6' into rdma.git for-nextJason Gunthorpe4-3/+4
Resolve conflict by taking the spin_lock hunk from for-next: https://lore.kernel.org/r/20230928113851.5197a1ec@canb.auug.org.au Required for the next patch. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24RDMA/core: Remove NULL check before dev_{put, hold}Yang Li2-4/+2
The netdev_{put, hold} calls inside dev_{put, hold} already check for NULL, so there is no need to check before using dev_{put, hold}; remove the check to silence the warning: ./drivers/infiniband/core/nldev.c:375:2-9: WARNING: NULL check before dev_{put, hold} functions is not needed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7047 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20231024003815.89742-1-yang.lee@linux.alibaba.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-19RDMA/core: Add support to set privileged QKEY parameterPatrisious Haddad3-9/+58
Add a netlink command that enables/disables privileged QKEY by default. It is disabled by default, since according to the IB spec only privileged users are allowed to use privileged QKEYs. According to the IB specification rel-1.6, section 3.5.3: "QKEYs with the most significant bit set are considered controlled QKEYs, and a HCA does not allow a consumer to arbitrarily specify a controlled QKEY."
Using the rdma tool:
$ rdma system set privileged-qkey on
When enabled, non-privileged users would be able to use controlled QKEYs, which are considered privileged.
$ rdma system set privileged-qkey off
When disabled, only privileged users would be able to use controlled QKEYs.
You can also use the command below to check the parameter state:
$ rdma system show
netns shared privileged-qkey off copy-on-fork on
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://lore.kernel.org/r/90398be70a9d23d2aa9d0f9fd11d2c264c1be534.1696848201.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-11infiniband: Remove the now superfluous sentinel element from ctl_table arrayJoel Granados2-2/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from iwcm_ctl_table and ucma_ctl_table Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
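A minimal sketch of the pattern being removed, with illustrative table contents; the point is simply that the trailing empty element goes away:

    #include <linux/sysctl.h>

    static int example_param;

    /* Illustrative only: a ctl_table without the trailing {} sentinel. */
    static struct ctl_table example_ctl_table[] = {
            {
                    .procname       = "example_param",
                    .data           = &example_param,
                    .maxlen         = sizeof(example_param),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec,
            },
            /* previously a {} sentinel entry sat here */
    };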
2023-10-05RDMA/core: Require admin capabilities to set system parametersLeon Romanovsky1-0/+1
Like any other set command, require admin permissions to do it. Cc: stable@vger.kernel.org Fixes: 2b34c5580226 ("RDMA/core: Add command to set ib_core device net namspace sharing mode") Link: https://lore.kernel.org/r/75d329fdd7381b52cbdf87910bef16c9965abb1f.1696443438.git.leon@kernel.org Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-10-04RDMA/core: Fix a couple of obvious typos in commentsChuck Lever1-1/+1
Fix typos. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169643338101.8035.6826446669479247727.stgit@manet.1015granger.net Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-02RDMA/core: Annotate struct ib_pkey_cache with __counted_byKees Cook1-1/+1