linux.git/drivers/nvme, branch v4.14.98

nvmet-rdma: fix null dereference under heavy load

2019-01-31T07:13:47+00:00

commit 5cbab6303b4791a3e6713dfe2c5fda6a867f9adc upstream.

Under heavy load, if we don't have any pre-allocated rsps left, we
dynamically allocate a rsp, but we are not actually allocating memory
for nvme_completion (rsp->req.rsp). In that case, accessing pointer
fields (req->rsp->status) in nvmet_req_init() will result in a crash.

To fix this, allocate the memory for nvme_completion by calling
nvmet_rdma_alloc_rsp().
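
As a rough illustration of the failure mode, the user-space sketch below
models a dynamically allocated response whose nested completion has to be
allocated separately. All names (sketch_completion, sketch_rsp,
sketch_rsp_get_dynamic, sketch_rsp_put_dynamic) are hypothetical stand-ins
rather than the driver's definitions, and calloc()/free() stand in for the
kernel allocators.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct nvme_completion. */
struct sketch_completion {
	uint16_t status;
};

/* Hypothetical response: it holds only a POINTER to the completion. */
struct sketch_rsp {
	struct sketch_completion *cqe;
	bool allocated;	/* true when not taken from the pre-allocated pool */
};

/* Dynamic path: allocate the rsp AND the completion it points to. */
struct sketch_rsp *sketch_rsp_get_dynamic(void)
{
	struct sketch_rsp *rsp = calloc(1, sizeof(*rsp));

	if (!rsp)
		return NULL;

	/* Without this step rsp->cqe stays NULL and rsp->cqe->status faults. */
	rsp->cqe = calloc(1, sizeof(*rsp->cqe));
	if (!rsp->cqe) {
		free(rsp);
		return NULL;
	}
	rsp->allocated = true;
	return rsp;
}

/* Release both allocations made by sketch_rsp_get_dynamic(). */
void sketch_rsp_put_dynamic(struct sketch_rsp *rsp)
{
	free(rsp->cqe);
	free(rsp);
}
```

The point of the sketch is only that the outer and inner allocations must
be paired on both the get and the put side.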

Fixes: 8407879c ("nvmet-rdma: fix possible bogus dereference under heavy load")

Cc: 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Raju Rangoju 
Signed-off-by: Sagi Grimberg 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

nvmet-rdma: Add unlikely for response allocated check

2019-01-31T07:13:47+00:00

commit ad1f824948e4ed886529219cf7cd717d078c630d upstream.

Signed-off-by: Israel Rukshin 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Max Gurtovoy 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Cc: Raju Rangoju
Signed-off-by: Greg Kroah-Hartman

nvmet-rdma: fix response use after free

2018-12-21T13:13:18+00:00

[ Upstream commit d7dcdf9d4e15189ecfda24cc87339a3425448d5c ]

nvmet_rdma_release_rsp() may free the response before it is used in the
error flow.

Fixes: 8407879 ("nvmet-rdma: fix possible bogus dereference under heavy load")
Signed-off-by: Israel Rukshin 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Max Gurtovoy 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Sasha Levin

nvme: flush namespace scanning work just before removing namespaces

2018-12-17T08:28:53+00:00

[ Upstream commit f6c8e432cb0479255322c5d0335b9f1699a0270c ]

nvme_stop_ctrl can also be called in the reset flow, where there is no
need to flush the scan_work because namespaces are not being removed.
Flushing there can cause a deadlock in the rdma, fc and loop drivers,
since nvme_stop_ctrl barriers before controller teardown (and
specifically I/O cancellation of the scan_work itself) takes place. The
scan_work will be blocked anyway, so there is no need to flush it.

Instead, move scan_work flush to nvme_remove_namespaces() where it really
needs to flush.

Reported-by: Ming Lei 
Signed-off-by: Sagi Grimberg 
Reviewed-by: Keith Busch 
Reviewed-by: James Smart
Tested-by: Ewan D. Milne 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Sasha Levin

nvme-loop: fix kernel oops in case of unhandled command

2018-11-21T08:24:17+00:00

commit 11d9ea6f2ca69237d35d6c55755beba3e006b106 upstream.

When nvmet_req_init() fails, __nvmet_req_complete() is called
to handle the target request via .queue_response(), so
nvme_loop_queue_response() shouldn't be called again for
handling the failure.

This patch fixes this case in the following way:

- move blk_mq_start_request() before nvmet_req_init(), so
nvme_loop_queue_response() may work well to complete this
host request

- don't call nvme_cleanup_cmd() which is done in nvme_loop_complete_rq()

- don't call nvme_loop_queue_response() which is done via
.queue_response()

Signed-off-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
[trimmed changelog]
Signed-off-by: Keith Busch 
Signed-off-by: Jens Axboe 
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman

nvme_fc: fix ctrl create failures racing with workq items

2018-10-13T07:27:28+00:00

commit cf25809bec2c7df4b45df5b2196845d9a4a3c89b upstream.

If there are errors during initial controller create, the transport
will tear down the partially initialized controller struct and free
the ctrl memory.  Trouble is - most of those errors can occur due
to asynchronous events such as io timeouts and subsystem
connectivity failures. Those failures invoke async workq items to
reset the controller and attempt reconnect.  Those may be in progress
as the main thread frees the ctrl memory, resulting in a NULL ptr oops.

Prevent this from happening by having the main ctrl failure thread
change state to DELETING and then synchronously cancel any
pending queued work item. The change of state will prevent the
scheduling of resets or reconnect events.
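
The ordering described above can be modelled with a small single-threaded
sketch. It is not the transport's code: the names (sketch_ctrl,
sketch_schedule_reset, sketch_fail_create) are hypothetical, the
reset_pending flag merely stands in for a queued work item, and a real
implementation would need proper locking plus a synchronous cancel
primitive for the work items.

```c
#include <stdbool.h>

enum sketch_ctrl_state {
	SKETCH_CTRL_NEW,
	SKETCH_CTRL_RESETTING,
	SKETCH_CTRL_DELETING,
};

struct sketch_ctrl {
	enum sketch_ctrl_state state;
	bool reset_pending;	/* models a queued reset/reconnect work item */
};

/* Called from async error paths; refuses once the ctrl is DELETING. */
bool sketch_schedule_reset(struct sketch_ctrl *ctrl)
{
	if (ctrl->state == SKETCH_CTRL_DELETING)
		return false;
	ctrl->state = SKETCH_CTRL_RESETTING;
	ctrl->reset_pending = true;
	return true;
}

/* Create-failure path: block new work first, then cancel what is queued. */
void sketch_fail_create(struct sketch_ctrl *ctrl)
{
	ctrl->state = SKETCH_CTRL_DELETING;	/* step 1: no new resets */
	ctrl->reset_pending = false;		/* step 2: models a sync cancel */
	/* Only after both steps would it be safe to free the ctrl memory. */
}
```

Changing state before cancelling matters: cancelling first would leave a
window in which a new reset could still be scheduled.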

Signed-off-by: James Smart 
Signed-off-by: Keith Busch 
Signed-off-by: Jens Axboe 
Signed-off-by: Amit Pundir 
Signed-off-by: Greg Kroah-Hartman

nvmet-rdma: fix possible bogus dereference under heavy load

2018-10-10T06:54:24+00:00

[ Upstream commit 8407879c4e0d7731f6e7e905893cecf61a7762c7 ]

Currently we always repost the recv buffer before we send a response
capsule back to the host. Since ordering is not guaranteed for send
and recv completions, it is possible that we will receive a new request
from the host before we got a send completion for the response capsule.

Today, we pre-allocate twice as many rsps as the length of the queue, but
in reality, under heavy load nothing really prevents the gap from
growing until we exhaust all our rsps.

To fix this, if we don't have any pre-allocated rsps left, we dynamically
allocate a rsp and make sure to free it when we are done. If under memory
pressure we fail to allocate a rsp, we silently drop the command and
wait for the host to retry.
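
A minimal user-space sketch of this "pool first, allocate on exhaustion"
pattern follows. The names (sketch_pool, sketch_pool_rsp, sketch_pool_get,
sketch_pool_put) are hypothetical, the free list is a plain singly linked
list without locking, and calloc()/free() stand in for the kernel
allocators.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_pool_rsp {
	struct sketch_pool_rsp *next;	/* free-list linkage */
	bool allocated;			/* true for dynamically allocated rsps */
};

struct sketch_pool {
	struct sketch_pool_rsp *free_list;	/* pre-allocated rsps */
};

/* Take a pre-allocated rsp if one is left, else fall back to the heap. */
struct sketch_pool_rsp *sketch_pool_get(struct sketch_pool *pool)
{
	struct sketch_pool_rsp *rsp = pool->free_list;

	if (rsp) {
		pool->free_list = rsp->next;
		rsp->next = NULL;
		return rsp;
	}

	rsp = calloc(1, sizeof(*rsp));
	if (!rsp)
		return NULL;	/* caller drops the command; host retries */
	rsp->allocated = true;
	return rsp;
}

/* Free dynamic rsps; return pre-allocated ones to the free list. */
void sketch_pool_put(struct sketch_pool *pool, struct sketch_pool_rsp *rsp)
{
	if (rsp->allocated) {
		free(rsp);
		return;
	}
	rsp->next = pool->free_list;
	pool->free_list = rsp;
}
```

The allocated flag is what lets the put side tell the two kinds of rsp
apart, so a dynamic rsp is never pushed back onto the pre-allocated list.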

Reported-by: Steve Wise 
Tested-by: Steve Wise 
Signed-off-by: Sagi Grimberg 
[hch: dropped a superfluous assignment]
Signed-off-by: Christoph Hellwig 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

nvme-fcloop: Fix dropped LS's to removed target port

2018-10-04T00:00:59+00:00

[ Upstream commit afd299ca996929f4f98ac20da0044c0cdc124879 ]

When a targetport is removed from the config, fcloop will avoid calling
the LS done() routine thinking the targetport is gone. This leaves the
initiator reset/reconnect hanging as it waits for a status on the
Create_Association LS for the reconnect.

Change the filter in the LS callback path. If tport is null (set when
validation failed before "sending to remote port"), be sure to call
done. This was the main bug. But keep the logic that skips done when
tport was set but there is no remoteport (e.g. the case where the
remoteport has been removed, so the host doesn't expect a completion).
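
The filter can be summarized as a small predicate. This is a sketch under
the reading given above, not the fcloop source: sketch_tport and
sketch_should_call_done are hypothetical names, and the remoteport member
is reduced to an opaque pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fcloop target port. */
struct sketch_tport {
	void *remoteport;	/* NULL once the remoteport has been removed */
};

/*
 * Decide whether the LS done() callback should run:
 *  - tport == NULL: validation failed before sending, the initiator
 *    is still waiting for a status, so call done;
 *  - tport set and remoteport present: normal completion, call done;
 *  - tport set but remoteport gone: the host expects no completion.
 */
bool sketch_should_call_done(const struct sketch_tport *tport)
{
	return tport == NULL || tport->remoteport != NULL;
}
```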

Signed-off-by: James Smart 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

nvme-rdma: unquiesce queues when deleting the controller

2018-09-26T06:38:02+00:00

[ Upstream commit 90140624e8face94207003ac9a9d2a329b309d68 ]

If the controller is going away, we need to unquiesce the IO queues so
that all pending requests can fail gracefully before moving forward with
controller deletion. Do that before we destroy the IO queues so
blk_cleanup_queue won't block in freeze.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event

2018-09-05T07:26:36+00:00

commit f1ed3df20d2d223e0852cc4ac1f19bba869a7e3c upstream.

On many architectures, loads may be reordered with older stores to
different locations.  In the nvme driver the following two operations
could be reordered:

 - Write shadow doorbell (dbbuf_db) into memory.
 - Read EventIdx (dbbuf_ei) from memory.

This can result in a race condition between the driver and the VM host
processing requests (if the given virtual NVMe controller supports
shadow doorbells).  If that occurs, the NVMe controller may decide to
wait for an MMIO doorbell from the guest operating system, and the guest
driver may decide not to issue an MMIO doorbell on any subsequent commands.
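
The store/load ordering at stake can be sketched in user space with C11
atomics. This is an analogy rather than the driver code: the function
names are hypothetical, atomic_thread_fence(memory_order_seq_cst) stands
in for a full kernel memory barrier, and the wrap-around comparison is the
usual event-index check, assumed here to be what decides whether an MMIO
doorbell is needed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* True if event_idx lies in the half-open window (old_idx, new_idx]. */
bool sketch_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/*
 * Publish the new shadow doorbell value, then read EventIdx.
 * Returns true when an MMIO doorbell write would be required.
 */
bool sketch_update_and_check_event(uint16_t value,
				   _Atomic uint32_t *shadow_db,
				   _Atomic uint32_t *event_idx)
{
	uint16_t old_value =
		(uint16_t)atomic_load_explicit(shadow_db, memory_order_relaxed);

	/* 1. Write the shadow doorbell. */
	atomic_store_explicit(shadow_db, value, memory_order_relaxed);

	/*
	 * 2. Full fence: keeps the load below from being reordered
	 *    before the store above (store-load ordering).
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* 3. Read EventIdx only after the doorbell value is visible. */
	uint16_t ei =
		(uint16_t)atomic_load_explicit(event_idx, memory_order_relaxed);

	return sketch_need_event(ei, value, old_value);
}
```

Without the fence in step 2, the load in step 3 could observe a stale
EventIdx, which is the missed-doorbell scenario described above.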

This issue is purely timing-dependent, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "Oracle IO
Numbers" (orion) that is shipped with Oracle DB:

orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
	concat -write 40 -duration 120 -matrix row -testname nvme_test

Where nvme_test is a .lun file that contains a list of NVMe block
devices to run the test against. Limiting the number of vCPUs assigned
to a given VM instance seems to increase the chances of this bug
occurring. In a test environment with a VM that had 4 NVMe drives and
1 vCPU assigned, the virtual NVMe controller hang could be observed
within 10-20 minutes. That corresponds to about 400-500k IO operations
processed (or about 100GB of IO reads/writes).

The Orion tool was used for validation and set to run in a loop for 36
hours (the equivalent of pushing 550M IO operations). No issues were
observed, which suggests that the patch fixes the issue.

Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski 
Reviewed-by: Keith Busch 
Reviewed-by: Sagi Grimberg 
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig 
Signed-off-by: Greg Kroah-Hartman