| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 65936bf25f90fe440bb2d11624c7d10fab266639 ]
ipoib_mcast_carrier_on_task() insanely open codes a rtnl_lock() such that
the only time flush_workqueue() can be called is if it also clears
IPOIB_FLAG_OPER_UP.
Thus the flush inside ipoib_flush_ah() will deadlock if it gets unlucky
enough, and lockdep doesn't help us to find it early:
CPU0 CPU1 CPU2
__ipoib_ib_dev_flush()
down_read(vlan_rwsem)
ipoib_vlan_add()
rtnl_trylock()
down_write(vlan_rwsem)
ipoib_mcast_carrier_on_task()
while (!rtnl_trylock())
msleep(20);
ipoib_flush_ah()
flush_workqueue(priv->wq)
Clean up the ah_reaper related functions and lifecycle to make sense:
- Start/Stop of the reaper should only be done in open/stop NDOs, not in
any other places
- cancel and flush of the reaper should only happen in the stop NDO.
cancel is only functional when combined with IPOIB_STOP_REAPER.
- Non-stop places were flushing the AH's just need to flush out dead AH's
synchronously and ignore the background task completely. It is fully
locked and harmless to leave running.
Which ultimately fixes the ABBA deadlock by removing the unnecessary
flush_workqueue() from the problematic place under the vlan_rwsem.
Fixes: efc82eeeae4e ("IB/ipoib: No longer use flush as a parameter")
Link: https://lore.kernel.org/r/20200625174219.290842-1-kamalheib1@gmail.com
Reported-by: Kamal Heib <kheib@redhat.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 95a5631f6c9f3045f26245e6045244652204dfdb ]
The return value from ipoib_ib_dev_stop() is always 0 - change it to be
void.
Link: https://lore.kernel.org/r/20200623105236.18683-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1acba6a817852d4aa7916d5c4f2c82f702ee9224 ]
When connected mode is set, and we have connected and datagram traffic in
parallel, ipoib might crash with double free of datagram skb.
The current mechanism assumes that the order in the completion queue is
the same as the order of sent packets for all QPs. Order is kept only for
specific QP, in case of mixed UD and CM traffic we have few QPs (one UD and
few CM's) in parallel.
The problem:
----------------------------------------------------------
Transmit queue:
-----------------
UD skb pointer kept in queue itself, CM skb kept in spearate queue and
uses transmit queue as a placeholder to count the number of total
transmitted packets.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127
------------------------------------------------------------
NL ud1 UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ...........
------------------------------------------------------------
^ ^
tail head
Completion queue (problematic scenario) - the order not the same as in
the transmit queue:
1 2 3 4 5 6 7 8 9
------------------------------------
ud1 CM1 UD2 ud3 cm2 cm3 ud4 cm4 ud5
------------------------------------
1. CM1 'wc' processing
- skb freed in cm separate ring.
- tx_tail of transmit queue increased although UD2 is not freed.
Now driver assumes UD2 index is already freed and it could be used for
new transmitted skb.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127
------------------------------------------------------------
NL NL UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ...........
------------------------------------------------------------
^ ^ ^
(Bad)tail head
(Bad - Could be used for new SKB)
In this case (due to heavy load) UD2 skb pointer could be replaced by new
transmitted packet UD_NEW, as the driver assumes its free. At this point
we will have to process two 'wc' with same index but we have only one
pointer to free.
During second attempt to free the same skb we will have NULL pointer
exception.
2. UD2 'wc' processing
- skb freed according the index we got from 'wc', but it was already
overwritten by mistake. So actually the skb that was released is the
skb of the new transmitted packet and not the original one.
3. UD_NEW 'wc' processing
- attempt to free already freed skb. NUll pointer exception.
The fix:
-----------------------------------------------------------------------
The fix is to stop using the UD ring as a placeholder for CM packets, the
cyclic ring variables tx_head and tx_tail will manage the UD tx_ring, a
new cyclic variables global_tx_head and global_tx_tail are introduced for
managing and counting the overall outstanding sent packets, then the send
queue will be stopped and waken based on these variables only.
Note that no locking is needed since global_tx_head is updated in the xmit
flow and global_tx_tail is updated in the NAPI flow only. A previous
attempt tried to use one variable to count the outstanding sent packets,
but it did not work since xmit and NAPI flows can run at the same time and
the counter will be updated wrongly. Thus, we use the same simple cyclic
head and tail scheme that we have today for the UD tx_ring.
Fixes: 2c104ea68350 ("IB/ipoib: Get rid of the tx_outstanding variable in all modes")
Link: https://lore.kernel.org/r/20200527134705.480068-1-leon@kernel.org
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
logout"
commit 76261ada16dcc3be610396a46d35acc3efbda682 upstream.
Since commit 04060db41178 introduces soft lockups when toggling network
interfaces, revert it.
Link: https://marc.info/?l=target-devel&m=158157054906196
Cc: Rahul Kundu <rahul.kundu@chelsio.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Reported-by: Dakshaja Uppalapati <dakshaja@chelsio.com>
Fixes: 04060db41178 ("scsi: RDMA/isert: Fix a recently introduced regression related to logout")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 04060db41178c7c244f2c7dcd913e7fd331de915 upstream.
iscsit_close_connection() calls isert_wait_conn(). Due to commit
e9d3009cb936 both functions call target_wait_for_sess_cmds() although that
last function should be called only once. Fix this by removing the
target_wait_for_sess_cmds() call from isert_wait_conn() and by only calling
isert_wait_conn() after target_wait_for_sess_cmds().
Fixes: e9d3009cb936 ("scsi: target: iscsi: Wait for all commands to finish before freeing a session").
Link: https://lore.kernel.org/r/20200116044737.19507-1-bvanassche@acm.org
Reported-by: Rahul Kundu <rahul.kundu@chelsio.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c1545f1a200f4adc4ef8dd534bf33e2f1aa22c2f ]
The retured value from ib_dma_map_sg saved in dma_nents variable. To avoid
future mismatch between types, define dma_nents as an integer instead of
unsigned.
Fixes: 57b26497fabe ("IB/iser: Pass the correct number of entries for dma mapped SGL")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 57b26497fabe1b9379b59fbc7e35e608e114df16 ]
ib_dma_map_sg() augments the SGL into a 'dma mapped SGL'. This process may
change the number of entries and the lengths of each entry.
Code that touches dma_address is iterating over the 'dma mapped SGL' and
must use dma_nents which returned from ib_dma_map_sg().
ib_sg_to_pages() and ib_map_mr_sg() are using dma_address so they must use
dma_nents.
Fixes: 39405885005a ("IB/iser: Port to new fast registration API")
Fixes: bfe066e256d5 ("IB/iser: Reuse ib_sg_to_pages")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit e88982ad1bb12db699de96fbc07096359ef6176c upstream.
The code added by this patch is similar to the code that already exists in
ibmvscsis_determine_resid(). This patch has been tested by running the
following command:
strace sg_raw -r 1k /dev/sdb 12 00 00 00 60 00 -o inquiry.bin |&
grep resid=
Link: https://lore.kernel.org/r/20191105214632.183302-1-bvanassche@acm.org
Fixes: a42d985bd5b2 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Honggang Li <honli@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7718cf03c3ce4b6ebd90107643ccd01c952a1fce ]
In case we don't set the sg_prot_tablesize, the scsi layer assign the
default size (65535 entries). We should limit this size since we should
take into consideration the underlaying device capability. This cap is
considered when calculating the sg_tablesize. Otherwise, for example,
we can get that /sys/block/sdb/queue/max_segments is 128 and
/sys/block/sdb/queue/max_integrity_segments is 65535.
Link: https://lore.kernel.org/r/1569359027-10987-1-git-send-email-maxg@mellanox.com
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2ee00f6a98c36f7e4ba07cc33f24cc5a69060cc9 ]
This patch avoids that the SCSI mid-layer keeps retrying forever if
ib_post_send() fails. This was discovered while testing immediate
data support and passing a too large num_sge value to ib_post_send().
Cc: Sergey Gorenko <sergeygo@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3144533bf667c8e53bb20656b78295960073e57b ]
The dlid assignment made by looking into the u_ucast_dlid array does not
do an explicit check for the size of the array. The code path to arrive at
def_port, the index value is long and complicated so its best to just have
an explicit check here.
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 65f07f5a09dacf3b60619f196f096ea3671a5eda ]
In case target remote invalidates bogus rkey and signature is not used,
pi_ctx is NULL deref.
The commit also fails the connection on bogus remote invalidation.
Fixes: 59caaed7a72a ("IB/iser: Support the remote invalidation exception")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 142a9c287613560edf5a03c8d142c8b6ebc1995b ]
It is illegal to change MTU to a value lower than the minimum MTU
stated in ethernet spec. In addition to that we need to add 4 bytes
for encapsulation header (IPOIB_ENCAP_LEN).
Before "ifconfig ib0 mtu 0" command, succeeds while it obviously shouldn't.
Signed-off-by: Muhammad Sammar <muhammads@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bcef5b7215681250c4bf8961dfe15e9e4fef97d0 ]
The function srp_parse_in() is used both for parsing source address
specifications and for target address specifications. Target addresses
must have a port number. Having to specify a port number for source
addresses is inconvenient. Make sure that srp_parse_in() supports again
parsing addresses with no port number.
Cc: <stable@vger.kernel.org>
Fixes: c62adb7def71 ("IB/srp: Fix IPv6 address parsing")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e37df2d5b569390e3b80ebed9a73fd5b9dcda010 ]
This patch avoids that a warning is reported when building with W=1.
Cc: Sergey Gorenko <sergeygo@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 91b01061fef9c57d2f5b712a6322ef51061f4efd ]
Despite failure in ipoib_dev_init() we continue with initialization flow
and creation of child device. It causes to the situation where this child
device is added too early to parent device list.
Change the logic, so in case of failure we properly return error from
ipoib_dev_init() and add child only in success path.
Fixes: eaeb39842508 ("IB/ipoib: Move init code to ndo_init")
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 64d701c608fea362881e823b666327f5d28d7ffd ]
in the case of IPoIB with SRIOV enabled hardware
ip link show command incorrecly prints
0 instead of a VF hardware address.
Before:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state disable,
trust off, query_rss off
...
After:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off
v1->v2: just copy an address without modifing ifla_vf_mac
v2->v3: update the changelog
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 40ca8757291ca7a8775498112d320205b2a2e571 upstream.
Make sure that the next time a response is sent to the initiator that the
credit it had allocated for the aborted request gets freed.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 131e6abc674e ("target: Add TFO->abort_task for aborted task resources release") # v3.15
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6ab4aba00f811a5265acc4d3eb1863bb3ca60562 ]
The following BUG was reported by kasan:
BUG: KASAN: use-after-free in ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
Read of size 80 at addr ffff88034c30bcd0 by task kworker/u16:1/24020
Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
kasan_report+0x18a/0x2e0
? ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
memcpy+0x1f/0x50
ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
? kvm_clock_read+0x1f/0x30
? ipoib_cm_skb_reap+0x610/0x610 [ib_ipoib]
? __lock_is_held+0xc2/0x170
? process_one_work+0x880/0x1960
? process_one_work+0x912/0x1960
process_one_work+0x912/0x1960
? wq_pool_ids_show+0x310/0x310
? lock_acquire+0x145/0x440
worker_thread+0x87/0xbb0
? process_one_work+0x1960/0x1960
kthread+0x314/0x3d0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50
Allocated by task 0:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc_trace+0x168/0x3e0
path_rec_create+0xa2/0x1f0 [ib_ipoib]
ipoib_start_xmit+0xa98/0x19e0 [ib_ipoib]
dev_hard_start_xmit+0x159/0x8d0
sch_direct_xmit+0x226/0xb40
__dev_queue_xmit+0x1d63/0x2950
neigh_update+0x889/0x1770
arp_process+0xc47/0x21f0
arp_rcv+0x462/0x760
__netif_receive_skb_core+0x1546/0x2da0
netif_receive_skb_internal+0xf2/0x590
napi_gro_receive+0x28e/0x390
ipoib_ib_handle_rx_wc_rss+0x873/0x1b60 [ib_ipoib]
ipoib_rx_poll_rss+0x17d/0x320 [ib_ipoib]
net_rx_action+0x427/0xe30
__do_softirq+0x28e/0xc42
Freed by task 26680:
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x360
ipoib_flush_paths+0x532/0x9d0 [ib_ipoib]
ipoib_set_mode_rss+0x1ad/0x560 [ib_ipoib]
set_mode+0xc8/0x150 [ib_ipoib]
kernfs_fop_write+0x279/0x440
__vfs_write+0xd8/0x5c0
vfs_write+0x15e/0x470
ksys_write+0xb8/0x180
do_syscall_64+0x9b/0x420
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff88034c30bcc8
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 8 bytes inside of
512-byte region [ffff88034c30bcc8, ffff88034c30bec8)
The buggy address belongs to the page:
The following race between change mode and xmit flow is the reason for
this use-after-free:
Change mode Send packet 1 to GID XX Send packet 2 to GID XX
| | |
start | |
| | |
| | |
| Create new path for GID XX |
| and update neigh path |
| | |
| | |
| | |
flush_paths | |
| |
queue_work(cm.start_task) |
| Path for GID XX not found
| create new path
|
|
start_task runs with old
released path
There is no locking to protect the lifetime of the path through the
ipoib_cm_tx struct, so delete it entirely and always use the newly looked
up path under the priv->lock.
Fixes: 546481c2816e ("IB/ipoib: Fix memory corruption in ipoib cm mode connect flow")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 48396e80fb6526ea5ed267bd84f028bae56d2f9e upstream.
Since .scsi_done() must only be called after scsi_queue_rq() has
finished, make sure that the SRP initiator driver does not call
.scsi_done() while scsi_queue_rq() is in progress. Although
invoking sg_reset -d while I/O is in progress works fine with kernel
v4.20 and before, that is not the case with kernel v5.0-rc1. This
patch avoids that the following crash is triggered with kernel
v5.0-rc1:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000138
CPU: 0 PID: 360 Comm: kworker/0:1H Tainted: G B 5.0.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:blk_mq_dispatch_rq_list+0x116/0xb10
Call Trace:
blk_mq_sched_dispatch_requests+0x2f7/0x300
__blk_mq_run_hw_queue+0xd6/0x180
blk_mq_run_work_fn+0x27/0x30
process_one_work+0x4f1/0xa20
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
Cc: <stable@vger.kernel.org>
Fixes: 94a9174c630c ("IB/srp: reduce lock coverage of command completion")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ed041919f0d23c109d52cde8da6ddc211c52d67e upstream.
This patch avoids that KASAN sporadically reports the following:
BUG: KASAN: use-after-free in rxe_run_task+0x1e/0x60 [rdma_rxe]
Read of size 1 at addr ffff88801c50d8f4 by task check/24830
CPU: 4 PID: 24830 Comm: check Not tainted 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_address_description+0x71/0x239
kasan_report.cold.5+0x242/0x301
__asan_load1+0x47/0x50
rxe_run_task+0x1e/0x60 [rdma_rxe]
rxe_post_send+0x4bd/0x8d0 [rdma_rxe]
srpt_zerolength_write+0xe1/0x160 [ib_srpt]
srpt_close_ch+0x8b/0xe0 [ib_srpt]
srpt_set_enabled+0xe7/0x150 [ib_srpt]
srpt_tpg_enable_store+0xc0/0x100 [ib_srpt]
configfs_write_file+0x157/0x1d0
__vfs_write+0xd7/0x3d0
vfs_write+0x102/0x290
ksys_write+0xab/0x130
__x64_sys_write+0x43/0x50
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Allocated by task 13856:
save_stack+0x43/0xd0
kasan_kmalloc+0xc7/0xe0
kasan_slab_alloc+0x11/0x20
kmem_cache_alloc+0x105/0x320
rxe_alloc+0xff/0x1f0 [rdma_rxe]
rxe_create_qp+0x9f/0x160 [rdma_rxe]
ib_create_qp+0xf5/0x690 [ib_core]
rdma_create_qp+0x6a/0x140 [rdma_cm]
srpt_cm_req_recv.cold.59+0x1588/0x237b [ib_srpt]
srpt_rdma_cm_req_recv.isra.35+0x1d5/0x220 [ib_srpt]
srpt_rdma_cm_handler+0x6f/0x100 [ib_srpt]
cma_listen_handler+0x59/0x60 [rdma_cm]
cma_ib_req_handler+0xd5b/0x2570 [rdma_cm]
cm_process_work+0x2e/0x110 [ib_cm]
cm_work_handler+0x2aae/0x502b [ib_cm]
process_one_work+0x481/0x9e0
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
Freed by task 3440:
save_stack+0x43/0xd0
__kasan_slab_free+0x139/0x190
kasan_slab_free+0xe/0x10
kmem_cache_free+0xbc/0x330
rxe_elem_release+0x66/0xe0 [rdma_rxe]
rxe_destroy_qp+0x3f/0x50 [rdma_rxe]
ib_destroy_qp+0x140/0x360 [ib_core]
srpt_release_channel_work+0xdc/0x310 [ib_srpt]
process_one_work+0x481/0x9e0
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
Cc: Sergey Gorenko <sergeygo@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 24c3456c8d5ee6fc1933ca40f7b4406130682668 upstream.
If for some reason we failed to query the mr status, we need to make sure
to provide sufficient information for an ambiguous error (guard error on
sector 0).
Fixes: 0a7a08ad6f5f ("IB/iser: Implement check_protection")
Cc: <stable@vger.kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9b8b2a323008aedd39a8debb861b825707f01420 ]
Some InfiniBand network devices have multiple ports on the same PCI
function. This initializes the `dev_port' sysfs field of those
network interfaces with their port number.
Prior to this the kernel erroneously used the `dev_id' sysfs
field of those network interfaces to convey the port number to userspace.
The use of `dev_id' was considered correct until Linux 3.15,
when another field, `dev_port', was defined for this particular
purpose and `dev_id' was reserved for distinguishing stacked ifaces
(e.g: VLANs) with the same hardware address as their parent device.
Similar fixes to net/mlx4_en and many other drivers, which started
exporting this information through `dev_id' before 3.15, were accepted
into the kernel 4 years ago.
See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4d6e4d12da2c308f8f976d3955c45ee62539ac98 ]
IPCB should be cleared before icmp_send, since it may contain data from
previous layers and the data could be misinterpreted as ip header options,
which later caused the ihl to be set to an invalid value and resulted in
the following stack corruption:
[ 1083.031512] ib0: packet len 57824 (> 2048) too long to send, dropping
[ 1083.031843] ib0: packet len 37904 (> 2048) too long to send, dropping
[ 1083.032004] ib0: packet len 4040 (> 2048) too long to send, dropping
[ 1083.032253] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.032481] ib0: packet len 23960 (> 2048) too long to send, dropping
[ 1083.033149] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.033439] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.033700] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.034124] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.034387] ==================================================================
[ 1083.034602] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xf08/0x1310
[ 1083.034798] Write of size 4 at addr ffff880353457c5f by task kworker/u16:0/7
[ 1083.034990]
[ 1083.035104] CPU: 7 PID: 7 Comm: kworker/u16:0 Tainted: G O 4.19.0-rc5+ #1
[ 1083.035316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 1083.035573] Workqueue: ipoib_wq ipoib_cm_skb_reap [ib_ipoib]
[ 1083.035750] Call Trace:
[ 1083.035888] dump_stack+0x9a/0xeb
[ 1083.036031] print_address_description+0xe3/0x2e0
[ 1083.036213] kasan_report+0x18a/0x2e0
[ 1083.036356] ? __ip_options_echo+0xf08/0x1310
[ 1083.036522] __ip_options_echo+0xf08/0x1310
[ 1083.036688] icmp_send+0x7b9/0x1cd0
[ 1083.036843] ? icmp_route_lookup.constprop.9+0x1070/0x1070
[ 1083.037018] ? netif_schedule_queue+0x5/0x200
[ 1083.037180] ? debug_show_all_locks+0x310/0x310
[ 1083.037341] ? rcu_dynticks_curr_cpu_in_eqs+0x85/0x120
[ 1083.037519] ? debug_locks_off+0x11/0x80
[ 1083.037673] ? debug_check_no_obj_freed+0x207/0x4c6
[ 1083.037841] ? check_flags.part.27+0x450/0x450
[ 1083.037995] ? debug_check_no_obj_freed+0xc3/0x4c6
[ 1083.038169] ? debug_locks_off+0x11/0x80
[ 1083.038318] ? skb_dequeue+0x10e/0x1a0
[ 1083.038476] ? ipoib_cm_skb_reap+0x2b5/0x650 [ib_ipoib]
[ 1083.038642] ? netif_schedule_queue+0xa8/0x200
[ 1083.038820] ? ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
[ 1083.038996] ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
[ 1083.039174] process_one_work+0x912/0x1830
[ 1083.039336] ? wq_pool_ids_show+0x310/0x310
[ 1083.039491] ? lock_acquire+0x145/0x3a0
[ 1083.042312] worker_thread+0x87/0xbb0
[ 1083.045099] ? process_one_work+0x1830/0x1830
[ 1083.047865] kthread+0x322/0x3e0
[ 1083.050624] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 1083.053354] ret_from_fork+0x3a/0x50
For instance __ip_options_echo is failing to proceed with invalid srr and
optlen passed from another layer via IPCB
[ 762.139568] IPv4: __ip_options_echo rr=0 ts=0 srr=43 cipso=0
[ 762.139720] IPv4: ip_options_build: IPCB 00000000f3cd969e opt 000000002ccb3533
[ 762.139838] IPv4: __ip_options_echo in srr: optlen 197 soffset 84
[ 762.139852] IPv4: ip_options_build srr=0 is_frag=0 rr_needaddr=0 ts_needaddr=0 ts_needtime=0 rr=0 ts=0
[ 762.140269] ==================================================================
[ 762.140713] IPv4: __ip_options_echo rr=0 ts=0 srr=0 cipso=0
[ 762.141078] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x12ec/0x1680
[ 762.141087] Write of size 4 at addr ffff880353457c7f by task kworker/u16:0/7
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Use different loop variables for the inner and outer loop. This avoids
that an infinite loop occurs if there are more RDMA channels than
target->req_ring_size.
Fixes: d92c0da71a35 ("IB/srp: Add multichannel support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Inside of start_xmit() the call to check if the connection is up and the
queueing of the packets for later transmission is not atomic which leaves
a window where cm_rep_handler can run, set the connection up, dequeue
pending packets and leave the subsequently queued packets by start_xmit()
sitting on neigh->queue until they're dropped when the connection is torn
down. This only applies to connected mode. These dropped packets can
really upset TCP, for example, and cause multi-minute delays in
transmission for open connections.
Here's the code in start_xmit where we check to see if the connection is
up:
if (ipoib_cm_get(neigh)) {
if (ipoib_cm_up(neigh)) {
ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
goto unref;
}
}
The race occurs if cm_rep_handler execution occurs after the above
connection check (specifically if it gets to the point where it acquires
priv->lock to dequeue pending skb's) but before the below code snippet in
start_xmit where packets are queued.
if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
push_pseudo_header(skb, phdr->hwaddr);
spin_lock_irqsave(&priv->lock, flags);
__skb_queue_tail(&neigh->queue, skb);
spin_unlock_irqrestore(&priv->lock, flags);
} else {
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
}
The patch acquires the netif tx lock in cm_rep_handler for the section
where it sets the connection up and dequeues and retransmits deferred
skb's.
Fixes: 839fcaba355a ("IPoIB: Connected mode experimental support")
Cc: stable@vger.kernel.org
Signed-off-by: Aaron Knister <aaron.s.knister@nasa.gov>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
rdma.git merge resolution for the 4.19 merge window
Conflicts:
drivers/infiniband/core/rdma_core.c
- Use the rdma code and revise with the new spelling for
atomic_fetch_add_unless
drivers/nvme/host/rdma.c
- Replace max_sge with max_send_sge in new blk code
drivers/nvme/target/rdma.c
- Use the blk code and revise to use NULL for ib_post_recv when
appropriate
- Replace max_sge with max_recv_sge in new blk code
net/rds/ib_send.c
- Use the net code and revise to use NULL for ib_post_recv when
appropriate
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Pull SCSI updates from James Bottomley:
"This is mostly updates to the usual drivers: mpt3sas, lpfc, qla2xxx,
hisi_sas, smartpqi, megaraid_sas, arcmsr.
In addition, with the continuing absence of Nic we have target updates
for tcmu and target core (all with reviews and acks).
The biggest observable change is going to be that we're (again) trying
to switch to mulitqueue as the default (a user can still override the
setting on the kernel command line).
Other major core stuff is the removal of the remaining Microchannel
drivers, an update of the internal timers and some reworks of
completion and result handling"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (203 commits)
scsi: core: use blk_mq_run_hw_queues in scsi_kick_queue
scsi: ufs: remove unnecessary query(DM) UPIU trace
scsi: qla2xxx: Fix issue reported by static checker for qla2x00_els_dcmd2_sp_done()
scsi: aacraid: Spelling fix in comment
scsi: mpt3sas: Fix calltrace observed while running IO & reset
scsi: aic94xx: fix an error code in aic94xx_init()
scsi: st: remove redundant pointer STbuffer
scsi: qla2xxx: Update driver version to 10.00.00.08-k
scsi: qla2xxx: Migrate NVME N2N handling into state machine
scsi: qla2xxx: Save frame payload size from ICB
scsi: qla2xxx: Fix stalled relogin
scsi: qla2xxx: Fix race between switch cmd completion and timeout
scsi: qla2xxx: Fix Management Server NPort handle reservation logic
scsi: qla2xxx: Flush mailbox commands on chip reset
scsi: qla2xxx: Fix unintended Logout
scsi: qla2xxx: Fix session state stuck in Get Port DB
scsi: qla2xxx: Fix redundant fc_rport registration
scsi: qla2xxx: Silent erroneous message
scsi: qla2xxx: Prevent sysfs access when chip is down
scsi: qla2xxx: Add longer window for chip reset
...
|
|
Pull networking updates from David Miller:
"Highlights:
- Gustavo A. R. Silva keeps working on the implicit switch fallthru
changes.
- Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
Luca Coelho.
- Re-enable ASPM in r8169, from Kai-Heng Feng.
- Add virtual XFRM interfaces, which avoids all of the limitations of
existing IPSEC tunnels. From Steffen Klassert.
- Convert GRO over to use a hash table, so that when we have many
flows active we don't traverse a long list during accumluation.
- Many new self tests for routing, TC, tunnels, etc. Too many
contributors to mention them all, but I'm really happy to keep
seeing this stuff.
- Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.
- Lots of cleanups and fixes in L2TP code from Guillaume Nault.
- Add IPSEC offload support to netdevsim, from Shannon Nelson.
- Add support for slotting with non-uniform distribution to netem
packet scheduler, from Yousuk Seung.
- Add UDP GSO support to mlx5e, from Boris Pismenny.
- Support offloading of Team LAG in NFP, from John Hurley.
- Allow to configure TX queue selection based upon RX queue, from
Amritha Nambiar.
- Support ethtool ring size configuration in aquantia, from Anton
Mikaev.
- Support DSCP and flowlabel per-transport in SCTP, from Xin Long.
- Support list based batching and stack traversal of SKBs, this is
very exciting work. From Edward Cree.
- Busyloop optimizations in vhost_net, from Toshiaki Makita.
- Introduce the ETF qdisc, which allows time based transmissions. IGB
can offload this in hardware. From Vinicius Costa Gomes.
- Add parameter support to devlink, from Moshe Shemesh.
- Several multiplication and division optimizations for BPF JIT in
nfp driver, from Jiong Wang.
- Lots of prepatory work to make more of the packet scheduler layer
lockless, when possible, from Vlad Buslov.
- Add ACK filter and NAT awareness to sch_cake packet scheduler, from
Toke Høiland-Jørgensen.
- Support regions and region snapshots in devlink, from Alex Vesker.
- Allow to attach XDP programs to both HW and SW at the same time on
a given device, with initial support in nfp. From Jakub Kicinski.
- Add TLS RX offload and support in mlx5, from Ilya Lesokhin.
- Use PHYLIB in r8169 driver, from Heiner Kallweit.
- All sorts of changes to support Spectrum 2 in mlxsw driver, from
Ido Schimmel.
- PTP support in mv88e6xxx DSA driver, from Andrew Lunn.
- Make TCP_USER_TIMEOUT socket option more accurate, from Jon
Maxwell.
- Support for templates in packet scheduler classifier, from Jiri
Pirko.
- IPV6 support in RDS, from Ka-Cheong Poon.
- Native tproxy support in nf_tables, from Máté Eckl.
- Maintain IP fragment queue in an rbtree, but optimize properly for
in-order frags. From Peter Oskolkov.
- Improvde handling of ACKs on hole repairs, from Yuchung Cheng"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
hv/netvsc: Fix NULL dereference at single queue mode fallback
net: filter: mark expected switch fall-through
xen-netfront: fix warn message as irq device name has '/'
cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
net: dsa: mv88e6xxx: missing unlock on error path
rds: fix building with IPV6=m
inet/connection_sock: prefer _THIS_IP_ to current_text_addr
net: dsa: mv88e6xxx: bitwise vs logical bug
net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
ieee802154: hwsim: using right kind of iteration
net: hns3: Add vlan filter setting by ethtool command -K
net: hns3: Set tx ring' tc info when netdev is up
net: hns3: Remove tx ring BD len register in hns3_enet
net: hns3: Fix desc num set to default when setting channel
net: hns3: Fix for phy link issue when using marvell phy driver
net: hns3: Fix for information of phydev lost problem when down/up
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
net: hns3: Add support for serdes loopback selftest
bnxt_en: take coredump_record structure off stack
...
|
|
Move all the checking for pkey and other validity to the __ipoib_vlan_add
function. This removes the last difference from the control flow
of the __ipoib_vlan_add to make the overall design simpler to
understand.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This fixes a bug in the netlink path where the vlan_rwsem was not
held around __ipoib_vlan_add causing the child_intfs to be manipulated
unsafely.
In the process this greatly simplifies the vlan_rwsem write side locking
to only cover a single non-sleeping statement.
This also further increases the safety of the removal ordering by holding
the netdev of the parent while the child is active to ensure most bugs
become either an oops on a NULL priv or a deadlock on the netdev refcount.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Switching to priv_destructor and needs_free_netdev created a subtle
ordering problem in ipoib_remove_one.
Now that unregister_netdev frees the netdev and priv we must ensure that
the children are unregistered before trying to unregister the parent,
or child unregister will use after free.
The solution is to unregister the children, then parent, in the same batch
all while holding the rtnl_lock. This closes all the races where a new
child could have been added and ensures proper ordering.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This mutex was introduced to deal with the deadlock formed by calling
unregister_netdev from within the sysfs callback of a netdev.
Now that we have priv_destructor and needs_free_netdev we can switch
to the more targeted solution of running the unregister from a
work queue. This avoids the deadlock and gets rid of the mutex.
The next patch in the series needs this mutex eliminated to create
atomicity of unregisteration.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Now that the unregister_netdev flow for IPoIB no longer relies on external
code we can now introduce the use of priv_destructor and
needs_free_netdev.
The rdma_netdev flow is switched to use the netdev common priv_destructor
instead of the special free_rdma_netdev and the IPOIB ULP adjusted:
- priv_destructor needs to switch to point to the ULP's destructor
which will then call the rdma_ndev's in the right order
- We need to be careful around the error unwind of register_netdev
as it sometimes calls priv_destructor on failure
- ULPs need to use ndo_init/uninit to ensure proper ordering
of failures around register_netdev
Switching to priv_destructor is a necessary pre-requisite to using
the rtnl new_link mechanism.
The VNIC user for rdma_netdev should also be revised, but that is left for
another patch.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Now that we have a proper ndo_uninit, move code that naturally pairs
with the ndo_uninit into ndo_init. This allows the netdev core to natually
handle ordering.
This fixes the situation where register_netdev can fail before calling
ndo_init, in which case it wouldn't call ndo_uninit either.
Also move a bunch of duplicated init code that is shared between child
and parent for clarity. Now the child and parent register functions look
very similar.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Currently uninit is sometimes done twice in error flows, and is sprinkled
a bit all over the place.
Improve the clarity of the design by moving all uninit only into
ndo_uinit.
Some duplication is removed:
- Sometimes IPOIB_STOP_NEIGH_GC was done before unregister, but
this duplicates the process in ipoib_neigh_hash_init
- Flushing priv->wq was sometimes done before unregister,
but that duplicates what has been done in ndo_uninit
Uniniting the IB event queue must remain before unregister_netdev as it
requires the RTNL lock to be dropped, this is moved to a helper to make
that flow really clear and remove some duplication in error flows.
If register_netdev fails (and ndo_init is NULL) then it almost always
calls ndo_uninit, which lets us remove all the extra code from the error
unwinds. The next patch in the series will close the 'almost always' hole
by pairing a proper ndo_init with ndo_uninit.
Signed-off-by: Jason Gun |