author     David S. Miller <davem@davemloft.net>  2021-06-03 14:11:22 -0700
committer  David S. Miller <davem@davemloft.net>  2021-06-03 14:11:22 -0700
commit     5ff5622ea1f16d535f1be4e478e712ef48fe183b
tree       59cd234f8b1561db0d38c4d0990661d6fc2f34d2
parent     ae1d9cc31244407710131b7ca531e7a8be3381c2
parent     35155e2626dcae187df7071550fbfd94b7113d6c
Merge branch 'NVMeTCP-Offload-ULP'

Shai Malin says:

====================
NVMeTCP Offload ULP

With the goal of enabling a generic infrastructure that allows NVMe/TCP
offload devices like NICs to seamlessly plug into the NVMe-oF stack, this
patch series introduces the nvme-tcp-offload ULP host layer, which will
be a new transport type called "tcp-offload" and will serve as an
abstraction layer to work with vendor specific nvme-tcp offload drivers.

NVMeTCP offload is a full offload of the NVMeTCP protocol; this includes
both the TCP level and the NVMeTCP level.

The nvme-tcp-offload transport can co-exist with the existing tcp and
other transports. The tcp offload was designed so that stack changes are
kept to a bare minimum: only registering new transports.
All other APIs, ops etc. are identical to the regular tcp transport.
Representing the TCP offload as a new transport allows clear and manageable
differentiation between the connections which should use the offload path
and those that are not offloaded (even on the same device).

The nvme-tcp-offload layers and API compared to nvme-tcp and nvme-rdma:

* NVMe layer: *

       [ nvme/nvme-fabrics/blk-mq ]
             |
        (nvme API and blk-mq API)
             |
             |
* Vendor agnostic transport layer: *

      [ nvme-rdma ] [ nvme-tcp ] [ nvme-tcp-offload ]
             |        |             |
           (Verbs)
             |        |             |
             |     (Socket)
             |        |             |
             |        |        (nvme-tcp-offload API)
             |        |             |
             |        |             |
* Vendor Specific Driver: *

             |        |             |
           [ qedr ]
                      |             |
                   [ qede ]
                                    |
                                  [ qedn ]

Performance:
============
With this implementation on top of the Marvell qedn driver (using the
Marvell FastLinQ NIC), we were able to demonstrate the following CPU
utilization improvement:

On AMD EPYC 7402, 2.80GHz, 28 cores:
- For 16K queued read IOs, 16 jobs, 4 qd (50Gbps line rate):
  Improved the CPU utilization from 15.1% with NVMeTCP SW to 4.7% with
  NVMeTCP offload.

On Intel(R) Xeon(R) Gold 5122 CPU, 3.60GHz, 16 cores:
- For 512K queued read IOs, 16 jobs, 4 qd (25Gbps line rate):
  Improved the CPU utilization from 16.3% with NVMeTCP SW to 1.1% with
  NVMeTCP offload.

In addition, we were able to demonstrate the following latency improvement:
- For 200K read IOPS (16 jobs, 16 qd, with fio rate limiter):
  Improved the average latency from 105 usec with NVMeTCP SW to 39 usec
  with NVMeTCP offload.

  Improved the 99.99 tail latency from 570 usec with NVMeTCP SW to 91 usec
  with NVMeTCP offload.

The end-to-end offload latency was measured from fio while running against
a null-device back end.

Upstream plan:
==============
The RFC series "NVMeTCP Offload ULP and QEDN Device Driver"
https://lore.kernel.org/netdev/20210531225222.16992-1-smalin@marvell.com/
was designed in a modular way so that part 1 (nvme-tcp-offload) and
part 2 (qed) are independent and part 3 (qedn) depends on both parts 1+2.

- Part 1 (RFC patches 1-8): NVMeTCP Offload ULP
  The nvme-tcp-offload patches will be sent to
  'linux-nvme@lists.infradead.org'.

- Part 2 (RFC patches 9-15): QED NVMeTCP Offload
  The qed infrastructure will be sent to 'netdev@vger.kernel.org'.

Once parts 1 and 2 are accepted:

- Part 3 (RFC patches 16-27): QEDN NVMeTCP Offload
  The qedn patches will be sent to 'linux-nvme@lists.infradead.org'.

Marvell is fully committed to maintain, test, and address issues with
the new nvme-tcp-offload layer.

Usage:
======
With the Marvell NVMeTCP offload design, the network-device (qede) and the
offload-device (qedn) are paired on each port, logically similar to the
RDMA model.
The user will interact with the network-device in order to configure
the ip/vlan. The NVMeTCP configuration is populated as part of the
nvme connect command.

Example:
Assign IP to the net-device (from any existing Linux tool):

    ip addr add 100.100.0.101/24 dev p1p1

This IP will be used by both net-device (qede) and offload-device (qedn).

In order to connect from "sw" nvme-tcp through the net-device (qede):

    nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn

In order to connect from "offload" nvme-tcp through the offload-device (qedn):

    nvme connect -t tcp_offload -s 4420 -a 100.100.0.100 -n testnqn

An alternative approach, as a future enhancement that will not impact this
series, would be to modify nvme-cli with a new flag that determines whether
"-t tcp" should be the regular nvme-tcp (which will be the default)
or nvme-tcp-offload.
Example:

    nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn -[new flag]

Queue Initialization Design:
============================
The nvme-tcp-offload ULP module shall register with the existing
nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following ops:
- claim_dev() - in order to resolve the route to the target according to
                the paired net_dev.
- create_queue() - in order to create offloaded nvme-tcp queue.

The nvme-tcp-offload ULP module shall manage all the controller level
functionalities, call claim_dev and based on the return values shall call
the relevant module create_queue in order to create the admin queue and
the IO queues.

IO-path Design:
===============
The nvme-tcp-offload shall work at the IO-level - the nvme-tcp-offload
ULP module shall pass the request (the IO) to the nvme-tcp-offload vendor
driver and later, the nvme-tcp-offload vendor driver returns the request
completion (the IO completion).
No additional handling is needed in between; this design will reduce the
CPU utilization as we will describe below.

The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
- send_req() - in order to pass the request to the handling of the
               offload driver that shall pass it to the vendor specific device.
- poll_queue()

Once the IO completes, the nvme-tcp-offload vendor driver shall call
command.done() that will invoke the nvme-tcp-offload ULP layer to
complete the request.

TCP events:
===========
The Marvell FastLinQ NIC HW engine handles all the TCP re-transmissions
and OOO events.

Teardown and errors:
====================
In case of NVMeTCP queue error the nvme-tcp-offload vendor driver shall
call the nvme_tcp_ofld_report_queue_err.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following teardown ops:
- drain_queue()
- destroy_queue()

The Marvell FastLinQ NIC HW engine:
====================================
The Marvell NIC HW engine is capable of offloading the entire TCP/IP
stack and managing up to 64K connections per PF. Already implemented and
upstream use cases for this include iWARP (by the Marvell qedr driver)
and iSCSI (by the Marvell qedi driver).
In addition, the Marvell NIC HW engine offloads the NVMeTCP queue layer
and is able to manage the IO level also in case of TCP re-transmissions
and OOO events.
The HW engine enables direct data placement (including the data digest CRC
calculation and validation) and direct data transmission (including data
digest CRC calculation).

The Marvell qedn driver:
========================
The new driver will be added under "drivers/nvme/hw" and will be enabled
by the Kconfig "Marvell NVM Express over Fabrics TCP offload".
As part of the qedn init, the driver will register as a pci device driver
and will work with the Marvell FastLinQ NIC.
As part of the probe, the driver will register to the nvme_tcp_offload
(ULP) and to the qed module (qed_nvmetcp_ops) - similar to other
"qed_*_ops" which are used by the qede, qedr, qedf and qedi device
drivers.

nvme-tcp-offload Future work:
=============================
- NVMF_OPT_HOST_IFACE Support.

Changes since RFC v1:
=====================
- nvme-tcp-offload: Fix nvme_tcp_ofld_ops return values.
- nvme-tcp-offload: Remove NVMF_TRTYPE_TCP_OFFLOAD.
- nvme-tcp-offload: Add nvme_tcp_ofld_poll() implementation.
- nvme-tcp-offload: Fix nvme_tcp_ofld_queue_rq() to check map_sg() and
  send_req() return values.

Changes since RFC v2:
=====================
- nvme-tcp-offload: Fixes in controller and queue level (patches 3-6).
- qedn: Add Marvell's NVMeTCP HW offload vendor driver init and probe
  (patches 8-11).

Changes since RFC v3:
=====================
- nvme-tcp-offload: Add the full implementation of the nvme-tcp-offload layer
  including the new ops: setup_ctrl(), release_ctrl(), commit_rqs() and new
  flows (ASYNC and timeout).
- nvme-tcp-offload: Add device maximums: max_hw_sectors, max_segments.
- nvme-tcp-offload: layer design and optimization changes.

Changes since RFC v4:
=====================
(Many thanks to Hannes Reinecke for his feedback)
- nvme_tcp_offload: Add num_hw_vectors in order to limit the number of queues.
- nvme_tcp_offload: Add per device private_data.
- nvme_tcp_offload: Fix header digest, data digest and tos initialization.

Changes since RFC v5:
=====================
(Many thanks to Sagi Grimberg for his feedback)
- nvme-fabrics: Expose nvmf_check_required_opts() globally (as a new patch).
- nvme_tcp_offload: Remove io-queues BLK_MQ_F_BLOCKING.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_stop_queue (drain_queue) flow.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_free_queue (destroy_queue) flow.
- nvme_tcp_offload: Change rwsem to mutex.
- nvme_tcp_offload: Remove redundant fields.
- nvme_tcp_offload: Remove the "new" from setup_ctrl().
- nvme_tcp_offload: Remove the init_req() and commit_rqs() ops.
- nvme_tcp_offload: Minor fixes in nvme_tcp_ofld_create_ctrl() and
  nvme_tcp_ofld_free_queue().
- nvme_tcp_offload: Patch 8 (timeout and async) was squashed into
  patch 7 (io level).

Changes since RFC v6:
=====================
- No changes in nvme_tcp_offload (only in qedn).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
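
The queue initialization design in the cover letter (a vendor driver registers an ops table, the ULP calls claim_dev() to pick a device, then calls create_queue() for the admin queue and the I/O queues) can be modeled in a few lines of userspace C. This is only an illustrative sketch: every `sketch_*` and `demo_*` name, the subnet-prefix matching rule, and the singly linked registry are assumptions made for the example and are not the kernel API added by this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct sketch_dev;

struct sketch_ops {
	const char *name;
	/* Nonzero means "this device can reach the target address". */
	int (*claim_dev)(struct sketch_dev *dev, const char *traddr);
	/* Returns 0 on success, a negative errno value on failure. */
	int (*create_queue)(struct sketch_dev *dev, int qid, int depth);
};

struct sketch_dev {
	const struct sketch_ops *ops;
	struct sketch_dev *next;   /* singly linked registry (assumption) */
	const char *subnet_prefix; /* illustrative routing rule (assumption) */
	int queues_created;
};

static struct sketch_dev *sketch_registry;

/* Reject devices whose ops table is missing a mandatory callback. */
int sketch_register_dev(struct sketch_dev *dev)
{
	assert(dev != NULL && dev->ops != NULL);
	if (!dev->ops->claim_dev || !dev->ops->create_queue)
		return -EINVAL;
	dev->next = sketch_registry; /* prepend for brevity */
	sketch_registry = dev;
	return 0;
}

/* Ask each registered device in turn; the first one to claim wins. */
struct sketch_dev *sketch_lookup_dev(const char *traddr)
{
	struct sketch_dev *dev;

	for (dev = sketch_registry; dev; dev = dev->next)
		if (dev->ops->claim_dev(dev, traddr))
			return dev;
	return NULL;
}

/* Create the admin queue (qid 0) followed by nr_io I/O queues. */
int sketch_connect(const char *traddr, int nr_io, int depth)
{
	struct sketch_dev *dev = sketch_lookup_dev(traddr);
	int qid, rc;

	if (!dev)
		return -ENOENT;
	for (qid = 0; qid <= nr_io; qid++) {
		rc = dev->ops->create_queue(dev, qid, depth);
		if (rc)
			return rc;
	}
	return 0;
}

/* Example "vendor" callbacks used to exercise the model. */
int demo_claim(struct sketch_dev *dev, const char *traddr)
{
	size_t n = strlen(dev->subnet_prefix);

	return strncmp(traddr, dev->subnet_prefix, n) == 0;
}

int demo_create_queue(struct sketch_dev *dev, int qid, int depth)
{
	(void)qid;
	(void)depth;
	dev->queues_created++;
	return 0;
}

const struct sketch_ops demo_ops = {
	.name = "demo",
	.claim_dev = demo_claim,
	.create_queue = demo_create_queue,
};
```

In this model, registering a device whose ops table lacks create_queue fails with -EINVAL, which mirrors the NULL checks at the top of nvme_tcp_ofld_register_dev() in the patch below.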
-rw-r--r--  MAINTAINERS                          8
-rw-r--r--  drivers/nvme/host/Kconfig           17
-rw-r--r--  drivers/nvme/host/Makefile           3
-rw-r--r--  drivers/nvme/host/fabrics.c         12
-rw-r--r--  drivers/nvme/host/fabrics.h          9
-rw-r--r--  drivers/nvme/host/tcp-offload.c   1318
-rw-r--r--  drivers/nvme/host/tcp-offload.h    206
7 files changed, 1564 insertions, 9 deletions
diff --git a/MAINTAINERS b/MAINTAINERS
index 9cbc3766fd74..d8e882229a48 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13107,6 +13107,14 @@ F: drivers/nvme/host/
F: include/linux/nvme.h
F: include/uapi/linux/nvme_ioctl.h
+NVM EXPRESS TCP OFFLOAD TRANSPORT DRIVERS
+M: Shai Malin <smalin@marvell.com>
+M: Ariel Elior <aelior@marvell.com>
+L: linux-nvme@lists.infradead.org
+S: Supported
+F: drivers/nvme/host/tcp-offload.c
+F: drivers/nvme/host/tcp-offload.h
+
NVM EXPRESS FC TRANSPORT DRIVERS
M: James Smart <james.smart@broadcom.com>
L: linux-nvme@lists.infradead.org
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index a44d49d63968..caedc35e1f0d 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -84,3 +84,20 @@ config NVME_TCP
from https://github.com/linux-nvme/nvme-cli.
If unsure, say N.
+
+config NVME_TCP_OFFLOAD
+ tristate "NVM Express over Fabrics TCP offload common layer"
+ default m
+ depends on BLOCK
+ depends on INET
+ select NVME_CORE
+ select NVME_FABRICS
+ help
+ This provides support for the NVMe over Fabrics protocol using
+ the TCP offload transport. This allows you to use remote block devices
+ exported using the NVMe protocol set.
+
+ To configure a NVMe over Fabrics controller use the nvme-cli tool
+ from https://github.com/linux-nvme/nvme-cli.
+
+ If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index cbc509784b2e..3c3fdf83ce38 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_NVME_FABRICS) += nvme-fabrics.o
obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o
obj-$(CONFIG_NVME_FC) += nvme-fc.o
obj-$(CONFIG_NVME_TCP) += nvme-tcp.o
+obj-$(CONFIG_NVME_TCP_OFFLOAD) += nvme-tcp-offload.o
nvme-core-y := core.o ioctl.o
nvme-core-$(CONFIG_TRACING) += trace.o
@@ -26,3 +27,5 @@ nvme-rdma-y += rdma.o
nvme-fc-y += fc.o
nvme-tcp-y += tcp.o
+
+nvme-tcp-offload-y += tcp-offload.o
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index a2bb7fc63a73..ceb263eb50fb 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -860,8 +860,8 @@ out:
return ret;
}
-static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
- unsigned int required_opts)
+int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
+ unsigned int required_opts)
{
if ((opts->mask & required_opts) != required_opts) {
int i;
@@ -879,6 +879,7 @@ static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
return 0;
}
+EXPORT_SYMBOL_GPL(nvmf_check_required_opts);
bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
struct nvmf_ctrl_options *opts)
@@ -942,13 +943,6 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
}
EXPORT_SYMBOL_GPL(nvmf_free_options);
-#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
-#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
- NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
- NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
- NVMF_OPT_DISABLE_SQFLOW |\
- NVMF_OPT_FAIL_FAST_TMO)
-
static struct nvme_ctrl *
nvmf_create_ctrl(struct device *dev, const char *buf)
{
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index d7f7974dc208..8399fcc063ef 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -68,6 +68,13 @@ enum {
NVMF_OPT_FAIL_FAST_TMO = 1 << 20,
};
+#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
+#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
+ NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
+ NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
+ NVMF_OPT_DISABLE_SQFLOW |\
+ NVMF_OPT_FAIL_FAST_TMO)
+
/**
* struct nvmf_ctrl_options - Used to hold the options specified
* with the parsing opts enum.
@@ -186,5 +193,7 @@ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size);
bool nvmf_should_reconnect(struct nvme_ctrl *ctrl);
bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
struct nvmf_ctrl_options *opts);
+int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
+ unsigned int required_opts);
#endif /* _NVME_FABRICS_H */
diff --git a/drivers/nvme/host/tcp-offload.c b/drivers/nvme/host/tcp-offload.c
new file mode 100644
index 000000000000..c76822e5ada7
--- /dev/null
+++ b/drivers/nvme/host/tcp-offload.c
@@ -0,0 +1,1318 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2021 Marvell. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+/* Kernel includes */
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+/* Driver includes */
+#include "tcp-offload.h"
+
+static LIST_HEAD(nvme_tcp_ofld_devices);
+static DEFINE_MUTEX(nvme_tcp_ofld_devices_mutex);
+static LIST_HEAD(nvme_tcp_ofld_ctrl_list);
+static DEFINE_MUTEX(nvme_tcp_ofld_ctrl_mutex);
+static struct blk_mq_ops nvme_tcp_ofld_admin_mq_ops;
+static struct blk_mq_ops nvme_tcp_ofld_mq_ops;
+
+static inline struct nvme_tcp_ofld_ctrl *to_tcp_ofld_ctrl(struct nvme_ctrl *nctrl)
+{
+ return container_of(nctrl, struct nvme_tcp_ofld_ctrl, nctrl);
+}
+
+static inline int nvme_tcp_ofld_qid(struct nvme_tcp_ofld_queue *queue)
+{
+ return queue - queue->ctrl->queues;
+}
+
+/**
+ * nvme_tcp_ofld_register_dev() - NVMeTCP Offload Library registration
+ * function.
+ * @dev: NVMeTCP offload device instance to be registered to the
+ * common tcp offload instance.
+ *
+ * API function that registers the type of vendor specific driver
+ * being implemented to the common NVMe over TCP offload library. Part of
+ * the overall init sequence of starting up an offload driver.
+ */
+int nvme_tcp_ofld_register_dev(struct nvme_tcp_ofld_dev *dev)
+{
+ struct nvme_tcp_ofld_ops *ops = dev->ops;
+
+ if (!ops->claim_dev ||
+ !ops->setup_ctrl ||
+ !ops->release_ctrl ||
+ !ops->create_queue ||
+ !ops->drain_queue ||
+ !ops->destroy_queue ||
+ !ops->poll_queue ||
+ !ops->send_req)
+ return -EINVAL;
+
+ mutex_lock(&nvme_tcp_ofld_devices_mutex);
+ list_add_tail(&dev->entry, &nvme_tcp_ofld_devices);
+ mutex_unlock(&nvme_tcp_ofld_devices_mutex);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_tcp_ofld_register_dev);
+
+/**
+ * nvme_tcp_ofld_unregister_dev() - NVMeTCP Offload Library unregistration
+ * function.
+ * @dev: NVMeTCP offload device instance to be unregistered from the
+ * common tcp offload instance.
+ *
+ * API function that unregisters the type of vendor specific driver being
+ * implemented from the common NVMe over TCP offload library.
+ * Part of the overall exit sequence of unloading the implemented driver.
+ */
+void nvme_tcp_ofld_unregister_dev(struct nvme_tcp_ofld_dev *dev)
+{
+ mutex_lock(&nvme_tcp_ofld_devices_mutex);
+ list_del(&dev->entry);
+ mutex_unlock(&nvme_tcp_ofld_devices_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_tcp_ofld_unregister_dev);
+
+/**
+ * nvme_tcp_ofld_error_recovery() - NVMeTCP Offload library error recovery
+ * function.
+ * @nctrl: NVMe controller instance to change to resetting.
+ *
+ * API function that changes the controller state to resetting.
+ * Part of the overall controller reset sequence.
+ */
+void nvme_tcp_ofld_error_recovery(struct nvme_ctrl *nctrl)
+{
+ if (!nvme_change_ctrl_state(nctrl, NVME_CTRL_RESETTING))
+ return;
+
+ queue_work(nvme_reset_wq, &to_tcp_ofld_ctrl(nctrl)->err_work);
+}
+EXPORT_SYMBOL_GPL(nvme_tcp_ofld_error_recovery);
+
+/**
+ * nvme_tcp_ofld_report_queue_err() - NVMeTCP Offload report error event
+ * callback function. Pointed to by nvme_tcp_ofld_queue->report_err.
+ * @queue: NVMeTCP offload queue instance on which the error has occurred.
+ *
+ * API function that allows the vendor specific offload driver to report errors
+ * to the common offload layer, to invoke error recovery.
+ */
+int nvme_tcp_ofld_report_queue_err(struct nvme_tcp_ofld_queue *queue)
+{
+ pr_err("nvme-tcp-offload queue error\n");
+ nvme_tcp_ofld_error_recovery(&queue->ctrl->nctrl);
+
+ return 0;
+}
+
+/**
+ * nvme_tcp_ofld_req_done() - NVMeTCP Offload request done callback
+ * function. Pointed to by nvme_tcp_ofld_req->done.
+ * Handles both NVME_TCP_F_DATA_SUCCESS flag and NVMe CQ.
+ * @req: NVMeTCP offload request to complete.
+ * @result: The nvme_result.
+ * @status: The completion status.
+ *
+ * API function that allows the vendor specific offload driver to report request
+ * completions to the common offload layer.
+ */
+void nvme_tcp_ofld_req_done(struct nvme_tcp_ofld_req *req,
+ union nvme_result *result,
+ __le16 status)
+{
+ struct request *rq = blk_mq_rq_from_pdu(req);
+
+ if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), *result))
+ nvme_complete_rq(rq);
+}
+
+/**
+ * nvme_tcp_ofld_async_req_done() - NVMeTCP Offload request done callback
+ * function for async request. Pointed to by nvme_tcp_ofld_req->done.
+ * Handles both NVME_TCP_F_DATA_SUCCESS flag and NVMe CQ.
+ * @req: NVMeTCP offload request to complete.
+ * @result: The nvme_result.
+ * @status: The completion status.
+ *
+ * API function that allows the vendor specific offload driver to report request
+ * completions to the common offload layer.
+ */
+void nvme_tcp_ofld_async_req_done(struct nvme_tcp_ofld_req *req,
+ union nvme_result *result, __le16 status)
+{
+ struct nvme_tcp_ofld_queue *queue = req->queue;
+ struct nvme_tcp_ofld_ctrl *ctrl = queue->ctrl;
+
+ nvme_complete_async_event(&ctrl->nctrl, status, result);
+}
+
+static struct nvme_tcp_ofld_dev *
+nvme_tcp_ofld_lookup_dev(struct nvme_tcp_ofld_ctrl *ctrl)
+{
+ struct nvme_tcp_ofld_dev *dev;
+
+ mutex_lock(&nvme_tcp_ofld_devices_mutex);
+ list_for_each_entry(dev, &nvme_tcp_ofld_devices, entry) {
+ if (dev->ops->claim_dev(dev, ctrl))
+ goto out;
+ }
+
+ dev = NULL;
+out:
+ mutex_unlock(&nvme_tcp_ofld_devices_mutex);
+
+ return dev;
+}
+
+static struct blk_mq_tag_set *
+nvme_tcp_ofld_alloc_tagset(struct nvme_ctrl *nctrl, bool admin)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct blk_mq_tag_set *set;
+ int rc;
+
+ if (admin) {
+ set = &ctrl->admin_tag_set;
+ memset(set, 0, sizeof(*set));
+ set->ops = &nvme_tcp_ofld_admin_mq_ops;
+ set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+ set->reserved_tags = NVMF_RESERVED_TAGS;
+ set->numa_node = nctrl->numa_node;
+ set->flags = BLK_MQ_F_BLOCKING;
+ set->cmd_size = sizeof(struct nvme_tcp_ofld_req);
+ set->driver_data = ctrl;
+ set->nr_hw_queues = 1;
+ set->timeout = NVME_ADMIN_TIMEOUT;
+ } else {
+ set = &ctrl->tag_set;
+ memset(set, 0, sizeof(*set));
+ set->ops = &nvme_tcp_ofld_mq_ops;
+ set->queue_depth = nctrl->sqsize + 1;
+ set->reserved_tags = NVMF_RESERVED_TAGS;
+ set->numa_node = nctrl->numa_node;
+ set->flags = BLK_MQ_F_SHOULD_MERGE;
+ set->cmd_size = sizeof(struct nvme_tcp_ofld_req);
+ set->driver_data = ctrl;
+ set->nr_hw_queues = nctrl->queue_count - 1;
+ set->timeout = NVME_IO_TIMEOUT;
+ set->nr_maps = nctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2;
+ }
+
+ rc = blk_mq_alloc_tag_set(set);
+ if (rc)
+ return ERR_PTR(rc);
+
+ return set;
+}
+
+static void __nvme_tcp_ofld_stop_queue(struct nvme_tcp_ofld_queue *queue)
+{
+ queue->dev->ops->drain_queue(queue);
+}
+
+static void nvme_tcp_ofld_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvme_tcp_ofld_queue *queue = &ctrl->queues[qid];
+
+ mutex_lock(&queue->queue_lock);
+ if (test_and_clear_bit(NVME_TCP_OFLD_Q_LIVE, &queue->flags))
+ __nvme_tcp_ofld_stop_queue(queue);
+ mutex_unlock(&queue->queue_lock);
+}
+
+static void nvme_tcp_ofld_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+ int i;
+
+ for (i = 1; i < ctrl->queue_count; i++)
+ nvme_tcp_ofld_stop_queue(ctrl, i);
+}
+
+static void __nvme_tcp_ofld_free_queue(struct nvme_tcp_ofld_queue *queue)
+{
+ queue->dev->ops->destroy_queue(queue);
+}
+
+static void nvme_tcp_ofld_free_queue(struct nvme_ctrl *nctrl, int qid)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvme_tcp_ofld_queue *queue = &ctrl->queues[qid];
+
+ if (test_and_clear_bit(NVME_TCP_OFLD_Q_ALLOCATED, &queue->flags)) {
+ __nvme_tcp_ofld_free_queue(queue);
+ mutex_destroy(&queue->queue_lock);
+ }
+}
+
+static void
+nvme_tcp_ofld_free_io_queues(struct nvme_ctrl *nctrl)
+{
+ int i;
+
+ for (i = 1; i < nctrl->queue_count; i++)
+ nvme_tcp_ofld_free_queue(nctrl, i);
+}
+
+static void nvme_tcp_ofld_destroy_io_queues(struct nvme_ctrl *nctrl, bool remove)
+{
+ nvme_tcp_ofld_stop_io_queues(nctrl);
+ if (remove) {
+ blk_cleanup_queue(nctrl->connect_q);
+ blk_mq_free_tag_set(nctrl->tagset);
+ }
+ nvme_tcp_ofld_free_io_queues(nctrl);
+}
+
+static void nvme_tcp_ofld_destroy_admin_queue(struct nvme_ctrl *nctrl, bool remove)
+{
+ nvme_tcp_ofld_stop_queue(nctrl, 0);
+ if (remove) {
+ blk_cleanup_queue(nctrl->admin_q);
+ blk_cleanup_queue(nctrl->fabrics_q);
+ blk_mq_free_tag_set(nctrl->admin_tagset);
+ }
+ nvme_tcp_ofld_free_queue(nctrl, 0);
+}
+
+static int nvme_tcp_ofld_start_queue(struct nvme_ctrl *nctrl, int qid)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvme_tcp_ofld_queue *queue = &ctrl->queues[qid];
+ int rc;
+
+ queue = &ctrl->queues[qid];
+ if (qid) {
+ queue->cmnd_capsule_len = nctrl->ioccsz * 16;
+ rc = nvmf_connect_io_queue(nctrl, qid, false);
+ } else {
+ queue->cmnd_capsule_len = sizeof(struct nvme_command) + NVME_TCP_ADMIN_CCSZ;
+ rc = nvmf_connect_admin_queue(nctrl);
+ }
+
+ if (!rc) {
+ set_bit(NVME_TCP_OFLD_Q_LIVE, &queue->flags);
+ } else {
+ if (test_bit(NVME_TCP_OFLD_Q_ALLOCATED, &queue->flags))
+ __nvme_tcp_ofld_stop_queue(queue);
+ dev_err(nctrl->device,
+ "failed to connect queue: %d ret=%d\n", qid, rc);
+ }
+
+ return rc;
+}
+
+static int nvme_tcp_ofld_configure_admin_queue(struct nvme_ctrl *nctrl,
+ bool new)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvme_tcp_ofld_queue *queue = &ctrl->queues[0];
+ int rc;
+
+ mutex_init(&queue->queue_lock);
+
+ rc = ctrl->dev->ops->create_queue(queue, 0, NVME_AQ_DEPTH);
+ if (rc)
+ return rc;
+
+ set_bit(NVME_TCP_OFLD_Q_ALLOCATED, &queue->flags);
+ if (new) {
+ nctrl->admin_tagset =
+ nvme_tcp_ofld_alloc_tagset(nctrl, true);
+ if (IS_ERR(nctrl->admin_tagset)) {
+ rc = PTR_ERR(nctrl->admin_tagset);
+ nctrl->admin_tagset = NULL;
+ goto out_free_queue;
+ }
+
+ nctrl->fabrics_q = blk_mq_init_queue(nctrl->admin_tagset);
+ if (IS_ERR(nctrl->fabrics_q)) {
+ rc = PTR_ERR(nctrl->fabrics_q);
+ nctrl->fabrics_q = NULL;
+ goto out_free_tagset;
+ }
+
+ nctrl->admin_q = blk_mq_init_queue(nctrl->admin_tagset);
+ if (IS_ERR(nctrl->admin_q)) {
+ rc = PTR_ERR(nctrl->admin_q);
+ nctrl->admin_q = NULL;
+ goto out_cleanup_fabrics_q;
+ }
+ }
+
+ rc = nvme_tcp_ofld_start_queue(nctrl, 0);
+ if (rc)
+ goto out_cleanup_queue;
+
+ rc = nvme_enable_ctrl(nctrl);
+ if (rc)
+ goto out_stop_queue;
+
+ blk_mq_unquiesce_queue(nctrl->admin_q);
+
+ rc = nvme_init_ctrl_finish(nctrl);
+ if (rc)
+ goto out_quiesce_queue;
+
+ return 0;
+
+out_quiesce_queue:
+ blk_mq_quiesce_queue(nctrl->admin_q);
+ blk_sync_queue(nctrl->admin_q);
+out_stop_queue:
+ nvme_tcp_ofld_stop_queue(nctrl, 0);
+ nvme_cancel_admin_tagset(nctrl);
+out_cleanup_queue:
+ if (new)
+ blk_cleanup_queue(nctrl->admin_q);
+out_cleanup_fabrics_q:
+ if (new)
+ blk_cleanup_queue(nctrl->fabrics_q);
+out_free_tagset:
+ if (new)
+ blk_mq_free_tag_set(nctrl->admin_tagset);
+out_free_queue:
+ nvme_tcp_ofld_free_queue(nctrl, 0);
+
+ return rc;
+}
+
+static unsigned int nvme_tcp_ofld_nr_io_queues(struct nvme_ctrl *nctrl)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvme_tcp_ofld_dev *dev = ctrl->dev;
+ u32 hw_vectors = dev->num_hw_vectors;
+ u32 nr_write_queues, nr_poll_queues;
+ u32 nr_io_queues, nr_total_queues;
+
+ nr_io_queues = min3(nctrl->opts->nr_io_queues, num_online_cpus(),
+ hw_vectors);
+ nr_write_queues = min3(nctrl->opts->nr_write_queues, num_online_cpus(),
+ hw_vectors);
+ nr_poll_queues = min3(nctrl->opts->nr_poll_queues, num_online_cpus(),
+ hw_vectors);
+
+ nr_total_queues = nr_io_queues + nr_write_queues + nr_poll_queues;
+
+ return nr_total_queues;
+}
+
+static void
+nvme_tcp_ofld_set_io_queues(struct nvme_ctrl *nctrl, unsigned int nr_io_queues)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvmf_ctrl_options *opts = nctrl->opts;
+
+ if (opts->nr_write_queues && opts->nr_io_queues < nr_io_queues) {
+ /*
+ * separate read/write queues
+ * hand out dedicated default queues only after we have
+ * sufficient read queues.
+ */
+ ctrl->io_queues[HCTX_TYPE_READ] = opts->nr_io_queues;
+ nr_io_queues -= ctrl->io_queues[HCTX_TYPE_READ];
+ ctrl->io_queues[HCTX_TYPE_DEFAULT] =
+ min(opts->nr_write_queues, nr_io_queues);
+ nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];
+ } else {
+ /*
+ * shared read/write queues
+ * either no write queues were requested, or we don't have
+ * sufficient queue count to have dedicated default queues.
+ */
+ ctrl->io_queues[HCTX_TYPE_DEFAULT] =
+ min(opts->nr_io_queues, nr_io_queues);
+ nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];
+ }
+
+ if (opts->nr_poll_queues && nr_io_queues) {
+ /* map dedicated poll queues only if we have queues left */
+ ctrl->io_queues[HCTX_TYPE_POLL] =
+ min(opts->nr_poll_queues, nr_io_queues);
+ }
+}
+
+static int nvme_tcp_ofld_create_io_queues(struct nvme_ctrl *nctrl)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ int i, rc;
+
+ for (i = 1; i < nctrl->queue_count; i++) {
+ mutex_init(&ctrl->queues[i].queue_lock);
+
+ rc = ctrl->dev->ops->create_queue(&ctrl->queues[i],
+ i, nctrl->sqsize + 1);
+ if (rc)
+ goto out_free_queues;
+
+ set_bit(NVME_TCP_OFLD_Q_ALLOCATED, &ctrl->queues[i].flags);
+ }
+
+ return 0;
+
+out_free_queues:
+ for (i--; i >= 1; i--)
+ nvme_tcp_ofld_free_queue(nctrl, i);
+
+ return rc;
+}
+
+static int nvme_tcp_ofld_alloc_io_queues(struct nvme_ctrl *nctrl)
+{
+ unsigned int nr_io_queues;
+ int rc;
+
+ nr_io_queues = nvme_tcp_ofld_nr_io_queues(nctrl);
+ rc = nvme_set_queue_count(nctrl, &nr_io_queues);
+ if (rc)
+ return rc;
+
+ nctrl->queue_count = nr_io_queues + 1;
+ if (nctrl->queue_count < 2) {
+ dev_err(nctrl->device,
+ "unable to set any I/O queues\n");
+
+ return -ENOMEM;
+ }
+
+ dev_info(nctrl->device, "creating %d I/O queues.\n", nr_io_queues);
+ nvme_tcp_ofld_set_io_queues(nctrl, nr_io_queues);
+
+ return nvme_tcp_ofld_create_io_queues(nctrl);
+}
+
+static int nvme_tcp_ofld_start_io_queues(struct nvme_ctrl *nctrl)
+{
+ int i, rc = 0;
+
+ for (i = 1; i < nctrl->queue_count; i++) {
+ rc = nvme_tcp_ofld_start_queue(nctrl, i);
+ if (rc)
+ goto out_stop_queues;
+ }
+
+ return 0;
+
+out_stop_queues:
+ for (i--; i >= 1; i--)
+ nvme_tcp_ofld_stop_queue(nctrl, i);
+
+ return rc;
+}
+
+static int
+nvme_tcp_ofld_configure_io_queues(struct nvme_ctrl *nctrl, bool new)
+{
+ int rc = nvme_tcp_ofld_alloc_io_queues(nctrl);
+
+ if (rc)
+ return rc;
+
+ if (new) {
+ nctrl->tagset = nvme_tcp_ofld_alloc_tagset(nctrl, false);
+ if (IS_ERR(nctrl->tagset)) {
+ rc = PTR_ERR(nctrl->tagset);
+ nctrl->tagset = NULL;
+ goto out_free_io_queues;
+ }
+
+ nctrl->connect_q = blk_mq_init_queue(nctrl->tagset);
+ if (IS_ERR(nctrl->connect_q)) {
+ rc = PTR_ERR(nctrl->connect_q);
+ nctrl->connect_q = NULL;
+ goto out_free_tag_set;
+ }
+ }
+
+ rc = nvme_tcp_ofld_start_io_queues(nctrl);
+ if (rc)
+ goto out_cleanup_connect_q;
+
+ if (!new) {
+ nvme_start_queues(nctrl);
+ if (!nvme_wait_freeze_timeout(nctrl, NVME_IO_TIMEOUT)) {
+ /*
+ * If we timed out waiting for freeze we are likely to
+ * be stuck. Fail the controller initialization just
+ * to be safe.
+ */
+ rc = -ENODEV;
+ goto out_wait_freeze_timed_out;
+ }
+ blk_mq_update_nr_hw_queues(nctrl->tagset, nctrl->queue_count - 1);
+ nvme_unfreeze(nctrl);
+ }
+
+ return 0;
+
+out_wait_freeze_timed_out:
+ nvme_stop_queues(nctrl);
+ nvme_sync_io_queues(nctrl);
+ nvme_tcp_ofld_stop_io_queues(nctrl);
+out_cleanup_connect_q:
+ nvme_cancel_tagset(nctrl);
+ if (new)
+ blk_cleanup_queue(nctrl->connect_q);
+out_free_tag_set:
+ if (new)
+ blk_mq_free_tag_set(nctrl->tagset);
+out_free_io_queues:
+ nvme_tcp_ofld_free_io_queues(nctrl);
+
+ return rc;
+}
+
+static void nvme_tcp_ofld_reconnect_or_remove(struct nvme_ctrl *nctrl)
+{
+ /* If we are resetting/deleting then do nothing */
+ if (nctrl->state != NVME_CTRL_CONNECTING) {
+ WARN_ON_ONCE(nctrl->state == NVME_CTRL_NEW ||
+ nctrl->state == NVME_CTRL_LIVE);
+
+ return;
+ }
+
+ if (nvmf_should_reconnect(nctrl)) {
+ dev_info(nctrl->device, "Reconnecting in %d seconds...\n",
+ nctrl->opts->reconnect_delay);
+ queue_delayed_work(nvme_wq,
+ &to_tcp_ofld_ctrl(nctrl)->connect_work,
+ nctrl->opts->reconnect_delay * HZ);
+ } else {
+ dev_info(nctrl->device, "Removing controller...\n");
+ nvme_delete_ctrl(nctrl);
+ }
+}
+
+static int
+nvme_tcp_ofld_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = data;
+
+ hctx->driver_data = &ctrl->queues[0];
+
+ return 0;
+}
+
+static int nvme_tcp_ofld_setup_ctrl(struct nvme_ctrl *nctrl, bool new)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvmf_ctrl_options *opts = nctrl->opts;
+ int rc = 0;
+
+ rc = ctrl->dev->ops->setup_ctrl(ctrl);
+ if (rc)
+ return rc;
+
+ rc = nvme_tcp_ofld_configure_admin_queue(nctrl, new);
+ if (rc)
+ goto out_release_ctrl;
+
+ if (nctrl->icdoff) {
+ dev_err(nctrl->device, "icdoff is not supported!\n");
+ rc = -EINVAL;
+ goto destroy_admin;
+ }
+
+ if (!(nctrl->sgls & ((1 << 0) | (1 << 1)))) {
+ dev_err(nctrl->device, "Mandatory sgls are not supported!\n");
+ goto destroy_admin;
+ }
+
+ if (opts->queue_size > nctrl->sqsize + 1)
+ dev_warn(nctrl->device,
+ "queue_size %zu > ctrl sqsize %u, clamping down\n",
+ opts->queue_size, nctrl->sqsize + 1);
+
+ if (nctrl->sqsize + 1 > nctrl->maxcmd) {
+ dev_warn(nctrl->device,
+ "sqsize %u > ctrl maxcmd %u, clamping down\n",
+ nctrl->sqsize + 1, nctrl->maxcmd);
+ nctrl->sqsize = nctrl->maxcmd - 1;
+ }
+
+ if (nctrl->queue_count > 1) {
+ rc = nvme_tcp_ofld_configure_io_queues(nctrl, new);
+ if (rc)
+ goto destroy_admin;
+ }
+
+ if (!nvme_change_ctrl_state(nctrl, NVME_CTRL_LIVE)) {
+ /*
+ * state change failure is ok if we started ctrl delete,
+ * unless we're during creation of a new controller to
+ * avoid races with teardown flow.
+ */
+ WARN_ON_ONCE(nctrl->state != NVME_CTRL_DELETING &&
+ nctrl->state != NVME_CTRL_DELETING_NOIO);
+ WARN_ON_ONCE(new);
+ rc = -EINVAL;
+ goto destroy_io;
+ }
+
+ nvme_start_ctrl(nctrl);
+
+ return 0;
+
+destroy_io:
+ if (nctrl->queue_count > 1) {
+ nvme_stop_queues(nctrl);
+ nvme_sync_io_queues(nctrl);
+ nvme_tcp_ofld_stop_io_queues(nctrl);
+ nvme_cancel_tagset(nctrl);
+ nvme_tcp_ofld_destroy_io_queues(nctrl, new);
+ }
+destroy_admin:
+ blk_mq_quiesce_queue(nctrl->admin_q);
+ blk_sync_queue(nctrl->admin_q);
+ nvme_tcp_ofld_stop_queue(nctrl, 0);
+ nvme_cancel_admin_tagset(nctrl);
+ nvme_tcp_ofld_destroy_admin_queue(nctrl, new);
+out_release_ctrl:
+ ctrl->dev->ops->release_ctrl(ctrl);
+
+ return rc;
+}
+
+static int
+nvme_tcp_ofld_check_dev_opts(struct nvmf_ctrl_options *opts,
+ struct nvme_tcp_ofld_ops *ofld_ops)
+{
+ unsigned int nvme_tcp_ofld_opt_mask = NVMF_ALLOWED_OPTS |
+ ofld_ops->allowed_opts | ofld_ops->required_opts;
+ struct nvmf_ctrl_options dev_opts_mask;
+
+ if (opts->mask & ~nvme_tcp_ofld_opt_mask) {
+ pr_warn("One or more nvmf options missing from ofld drvr %s.\n",
+ ofld_ops->name);
+
+ dev_opts_mask.mask = nvme_tcp_ofld_opt_mask;
+
+ return nvmf_check_required_opts(&dev_opts_mask, opts->mask);
+ }
+
+ return 0;
+}
+
+static void nvme_tcp_ofld_free_ctrl(struct nvme_ctrl *nctrl)
+{
+ struct nvme_tcp_ofld_ctrl *ctrl = to_tcp_ofld_ctrl(nctrl);
+ struct nvme_tcp_ofld_dev *dev = ctrl->dev;
+
+ if (list_empty(&ctrl->list))
+ goto free_ctrl;
+
+ ctrl->dev->ops->release_ctrl(ctrl);
+
+ mutex_lock(&nvme_tcp_ofld_ctrl_mutex);
+ list_del(&ctrl->list);
+ mutex_unlock(&nvme_tcp_ofld_ctrl_mutex);
+
+ nvmf_free_options(nctrl->opts);
+free_ctrl:
+ module_put(dev->ops->module);
+ kfree(ctrl->queues);
+ kfree(ctrl);
+}
+
+static void nvme_tcp_ofld_set_sg_null(struct nvme_command *c)
+{
+ struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+ sg->addr = 0;
+ sg->length = 0;
+ sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) | NVME_SGL_FMT_TRANSPORT_A;
+}
+
+inline void nvme_tcp_ofld_set_sg_inline(struct nvme_tcp_ofld_queue *queue,
+ struct nvme