From 1ce4da3dd99e98bd4a8b396c291041080e0fe85e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 12 Jun 2025 10:34:30 +0200 Subject: docs: use parser_yaml extension to handle Netlink specs Instead of manually calling ynl_gen_rst.py, use a Sphinx extension. This way, no .rst files would be written to the Kernel source directories. We are using here a toctree with :glob: property. This way, there is no need to touch the netlink/specs/index.rst file every time a new Netlink spec is added/renamed/removed. Signed-off-by: Mauro Carvalho Chehab Reviewed-by: Donald Hunter --- Documentation/networking/index.rst | 2 +- Documentation/networking/netlink_spec/readme.txt | 4 ---- 2 files changed, 1 insertion(+), 5 deletions(-) delete mode 100644 Documentation/networking/netlink_spec/readme.txt (limited to 'Documentation/networking') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index ac90b82f3ce9..b7a4969e9bc9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -57,7 +57,7 @@ Contents: filter generic-hdlc generic_netlink - netlink_spec/index + ../netlink/specs/index gen_stats gtp ila diff --git a/Documentation/networking/netlink_spec/readme.txt b/Documentation/networking/netlink_spec/readme.txt deleted file mode 100644 index 030b44aca4e6..000000000000 --- a/Documentation/networking/netlink_spec/readme.txt +++ /dev/null @@ -1,4 +0,0 @@ -SPDX-License-Identifier: GPL-2.0 - -This file is populated during the build of the documentation (htmldocs) by the -tools/net/ynl/pyynl/ynl_gen_rst.py script. -- cgit v1.2.3 From 6abc773747a80dfcbf8aa05fce7bb8d5d815fdb3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 10 Jun 2025 10:44:45 +0200 Subject: docs: netlink: remove obsolete .gitignore from unused directory The previous code was generating source rst files under Documentation/networking/netlink_spec/. With the Sphinx YAML parser, this is now gone. So, stop ignoring *.rst files inside netlink specs directory. Signed-off-by: Mauro Carvalho Chehab Reviewed-by: Donald Hunter --- Documentation/networking/netlink_spec/.gitignore | 1 - 1 file changed, 1 deletion(-) delete mode 100644 Documentation/networking/netlink_spec/.gitignore (limited to 'Documentation/networking') diff --git a/Documentation/networking/netlink_spec/.gitignore b/Documentation/networking/netlink_spec/.gitignore deleted file mode 100644 index 30d85567b592..000000000000 --- a/Documentation/networking/netlink_spec/.gitignore +++ /dev/null @@ -1 +0,0 @@ -*.rst -- cgit v1.2.3 From 5213ff086344394ddb586298ad1d408c93406621 Mon Sep 17 00:00:00 2001 From: Mohsin Bashir Date: Wed, 13 Aug 2025 15:13:18 -0700 Subject: eth: fbnic: Collect packet statistics for XDP Add support for XDP statistics collection and reporting via rtnl_link and netdev_queue API. For XDP programs without frags support, fbnic requires MTU to be less than the HDS threshold. If an over-sized frame is received, the frame is dropped and recorded as rx_length_errors reported via ip stats to highlight that this is an error. Signed-off-by: Jakub Kicinski Signed-off-by: Mohsin Bashir Link: https://patch.msgid.link/20250813221319.3367670-9-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni --- .../networking/device_drivers/ethernet/meta/fbnic.rst | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst index afb8353daefd..fb6559fa4be4 100644 --- a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst +++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst @@ -160,3 +160,14 @@ behavior and potential performance bottlenecks. credit exhaustion - ``pcie_ob_rd_no_np_cred``: Read requests dropped due to non-posted credit exhaustion + +XDP Length Error: +~~~~~~~~~~~~~~~~~ + +For XDP programs without frags support, fbnic tries to make sure that MTU fits +into a single buffer. If an oversized frame is received and gets fragmented, +it is dropped and the following netlink counters are updated + + - ``rx-length``: number of frames dropped due to lack of fragmentation + support in the attached XDP program + - ``rx-errors``: total number of packets with errors received on the interface -- cgit v1.2.3 From 2f3dd6ec901f29aef5fff3d7a63b1371d67c1760 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Mon, 18 Aug 2025 13:54:25 -0700 Subject: sctp: Convert cookie authentication to use HMAC-SHA256 Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice of the legacy algorithms HMAC-MD5 and HMAC-SHA1. Simplify and optimize the code by using the HMAC-SHA256 library instead of crypto_shash, and by preparing the HMAC key when it is generated instead of per-operation. This doesn't break compatibility, since the cookie format is an implementation detail, not part of the SCTP protocol itself. Note that the cookie size doesn't change either. The HMAC field was already 32 bytes, even though previously at most 20 bytes were actually compared. 32 bytes exactly fits an untruncated HMAC-SHA256 value. So, although we could safely truncate the MAC to something slightly shorter, for now just keep the cookie size the same. I also considered SipHash, but that would generate only 8-byte MACs. An 8-byte MAC *might* suffice here. However, there's quite a lot of information in the SCTP cookies: more than in TCP SYN cookies. So absent an analysis that occasional forgeries of all that information is okay in SCTP, I errored on the side of caution. Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256 option is just better. It's faster as well as more secure. For example, benchmarking on x86_64, cookie authentication is now nearly 3x as fast as the previous default choice and implementation of HMAC-MD5. Also just make the kernel always support cookie authentication if SCTP is supported at all, rather than making it optional in the build. (It was sort of optional before, but it didn't really work properly. E.g., a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5 cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5 happened to be enabled in the kconfig for other reasons.) Acked-by: Xin Long Signed-off-by: Eric Biggers Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 9756d16e3df1..3d6782683eee 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -3508,16 +3508,13 @@ cookie_hmac_alg - STRING a listening sctp socket to a connecting client in the INIT-ACK chunk. Valid values are: - * md5 - * sha1 + * sha256 * none - Ability to assign md5 or sha1 as the selected alg is predicated on the - configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and - CONFIG_CRYPTO_SHA1). + md5 and sha1 are also accepted for backwards compatibility, but cause + sha256 to be selected. - Default: Dependent on configuration. MD5 if available, else SHA1 if - available, else none. + Default: sha256 rcvbuf_policy - INTEGER Determines if the receive buffer is attributed to the socket or to -- cgit v1.2.3 From d5a253702add0da3e1e19252ae2a251ee24b486d Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Mon, 18 Aug 2025 13:54:26 -0700 Subject: sctp: Stop accepting md5 and sha1 for net.sctp.cookie_hmac_alg The upgrade of the cookie authentication algorithm to HMAC-SHA256 kept some backwards compatibility for the net.sctp.cookie_hmac_alg sysctl by still accepting the values 'md5' and 'sha1'. Those algorithms are no longer actually used, but rather those values were just treated as requests to enable cookie authentication. As requested at https://lore.kernel.org/netdev/CADvbK_fmCRARc8VznH8cQa-QKaCOQZ6yFbF=1-VDK=zRqv_cXw@mail.gmail.com/ and https://lore.kernel.org/netdev/20250818084345.708ac796@kernel.org/ , go further and start rejecting 'md5' and 'sha1' completely. Signed-off-by: Eric Biggers Link: https://patch.msgid.link/20250818205426.30222-6-ebiggers@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 3 --- 1 file changed, 3 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 3d6782683eee..43badb338d22 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -3511,9 +3511,6 @@ cookie_hmac_alg - STRING * sha256 * none - md5 and sha1 are also accepted for backwards compatibility, but cause - sha256 to be selected. - Default: sha256 rcvbuf_policy - INTEGER -- cgit v1.2.3 From a6d4f25888b83b8300aef28d9ee22765c1cc9b34 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 19 Aug 2025 17:40:30 +0000 Subject: net: set net.core.rmem_max and net.core.wmem_max to 4 MB SO_RCVBUF and SO_SNDBUF have limited range today, unless distros or system admins change rmem_max and wmem_max. Even iproute2 uses 1 MB SO_RCVBUF which is capped by the kernel. Decouple [rw]mem_max and [rw]mem_default and increase [rw]mem_max to 4 MB. Before: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 212992 net.core.wmem_default = 212992 net.core.wmem_max = 212992 After: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 4194304 net.core.wmem_default = 212992 net.core.wmem_max = 4194304 Signed-off-by: Eric Dumazet Reviewed-by: Neal Cardwell Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 43badb338d22..9f5891c9b07b 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -209,7 +209,7 @@ neigh/default/unres_qlen_bytes - INTEGER Setting negative value is meaningless and will return error. - Default: SK_WMEM_MAX, (same as net.core.wmem_default). + Default: SK_WMEM_DEFAULT, (same as net.core.wmem_default). Exact value depends on architecture and kernel options, but should be enough to allow queuing 256 packets @@ -805,8 +805,8 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max This value results in initial window of 65535. max: maximal size of receive buffer allowed for automatically - selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + selected receiver buffers for TCP socket. + Calling setsockopt() with SO_RCVBUF disables automatic tuning of that socket's receive buffer size, in which case this value is ignored. Default: between 131072 and 32MB, depending on RAM size. -- cgit v1.2.3 From 6b9f301985a35997338b0a0429859b49aff7f352 Mon Sep 17 00:00:00 2001 From: Lei Wei Date: Mon, 18 Aug 2025 21:14:26 +0800 Subject: docs: networking: Add PPE driver documentation for Qualcomm IPQ9574 SoC Add description and high-level diagram for PPE, driver overview and module enable/debug information. Signed-off-by: Lei Wei Signed-off-by: Luo Jie Link: https://patch.msgid.link/20250818-qcom_ipq_ppe-v8-2-1d4ff641fce9@quicinc.com Signed-off-by: Paolo Abeni --- .../networking/device_drivers/ethernet/index.rst | 1 + .../device_drivers/ethernet/qualcomm/ppe/ppe.rst | 194 +++++++++++++++++++++ 2 files changed, 195 insertions(+) create mode 100644 Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst (limited to 'Documentation/networking') diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 40ac552641a3..0b0a3eef6aae 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -50,6 +50,7 @@ Contents: neterion/s2io netronome/nfp pensando/ionic + qualcomm/ppe/ppe smsc/smc9 stmicro/stmmac ti/cpsw diff --git a/Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst b/Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst new file mode 100644 index 000000000000..4ab299a28969 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst @@ -0,0 +1,194 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== +PPE Ethernet Driver for Qualcomm IPQ SoC Family +=============================================== + +Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries. + +Author: Lei Wei + + +Contents +======== + +- `PPE Overview`_ +- `PPE Driver Overview`_ +- `PPE Driver Supported SoCs`_ +- `Enabling the Driver`_ +- `Debugging`_ + + +PPE Overview +============ + +IPQ (Qualcomm Internet Processor) SoC (System-on-Chip) series is Qualcomm's series of +networking SoC for Wi-Fi access points. The PPE (Packet Process Engine) is the Ethernet +packet process engine in the IPQ SoC. + +Below is a simplified hardware diagram of IPQ9574 SoC which includes the PPE engine and +other blocks which are in the SoC but outside the PPE engine. These blocks work together +to enable the Ethernet for the IPQ SoC:: + + +------+ +------+ +------+ +------+ +------+ +------+ start +-------+ + |netdev| |netdev| |netdev| |netdev| |netdev| |netdev|<------|PHYLINK| + +------+ +------+ +------+ +------+ +------+ +------+ stop +-+-+-+-+ + | | | ^ + +-------+ +-------------------------+--------+----------------------+ | | | + | GCC | | | EDMA | | | | | + +---+---+ | PPE +---+----+ | | | | + | clk | | | | | | + +-------->| +-----------------------+------+-----+---------------+ | | | | + | | Switch Core |Port0 | |Port7(EIP FIFO)| | | | | + | | +---+--+ +------+--------+ | | | | + | | | | | | | | | + +-------+ | | +------+---------------+----+ | | | | | + |CMN PLL| | | +---+ +---+ +----+ | +--------+ | | | | | | + +---+---+ | | |BM | |QM | |SCH | | | L2/L3 | ....... | | | | | | + | | | | +---+ +---+ +----+ | +--------+ | | | | | | + | | | | +------+--------------------+ | | | | | + | | | | | | | | | | + | v | | +-----+-+-----+-+-----+-+-+---+--+-----+-+-----+ | | | | | + | +------+ | | |Port1| |Port2| |Port3| |Port4| |Port5| |Port6| | | | | | + | |NSSCC | | | +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ | | mac| | | + | +-+-+--+ | | |MAC0 | |MAC1 | |MAC2 | |MAC3 | |MAC4 | |MAC5 | | |<---+ | | + | ^ | |clk | | +-----+-+-----+-+-----+-+-----+--+-----+-+-----+ | | ops | | + | | | +------>| +----|------|-------|-------|---------|--------|-----+ | | | + | | | +---------------------------------------------------------+ | | + | | | | | | | | | | | + | | | MII clk | QSGMII USXGMII USXGMII | | + | | +--------------->| | | | | | | | + | | +-------------------------+ +---------+ +---------+ | | + | |125/312.5MHz clk| (PCS0) | | (PCS1) | | (PCS2) | pcs ops | | + | +----------------+ UNIPHY0 | | UNIPHY1 | | UNIPHY2 |<--------+ | + +----------------->| | | | | | | + | 31.25MHz ref clk +-------------------------+ +---------+ +---------+ | + | | | | | | | | + | +-----------------------------------------------------+ | + |25/50MHz ref clk| +-------------------------+ +------+ +------+ | link | + +--------------->| | QUAD PHY | | PHY4 | | PHY5 | |---------+ + | +-------------------------+ +------+ +------+ | change + | | + | MDIO bus | + +-----------------------------------------------------+ + +The CMN (Common) PLL, NSSCC (Networking Sub System Clock Controller) and GCC (Global +Clock Controller) blocks are in the SoC and act as clock providers. + +The UNIPHY block is in the SoC and provides the PCS (Physical Coding Sublayer) and +XPCS (10-Gigabit Physical Coding Sublayer) functions to support different interface +modes between the PPE MAC and the external PHY. + +This documentation focuses on the descriptions of PPE engine and the PPE driver. + +The Ethernet functionality in the PPE (Packet Process Engine) is comprised of three +components: the switch core, port wrapper and Ethernet DMA. + +The Switch core in the IPQ9574 PPE has maximum of 6 front panel ports and two FIFO +interfaces. One of the two FIFO interfaces is used for Ethernet port to host CPU +communication using Ethernet DMA. The other one is used to communicate to the EIP +engine which is used for IPsec offload. On the IPQ9574, the PPE includes 6 GMAC/XGMACs +that can be connected with external Ethernet PHY. Switch core also includes BM (Buffer +Management), QM (Queue Management) and SCH (Scheduler) modules for supporting the +packet processing. + +The port wrapper provides connections from the 6 GMAC/XGMACS to UNIPHY (PCS) supporting +various modes such as SGMII/QSGMII/PSGMII/USXGMII/10G-BASER. There are 3 UNIPHY (PCS) +instances supported on the IPQ9574. + +Ethernet DMA is used to transmit and receive packets between the Ethernet subsystem +and ARM host CPU. + +The following lists the main blocks in the PPE engine which will be driven by this +PPE driver: + +- BM + BM is the hardware buffer manager for the PPE switch ports. +- QM + Queue Manager for managing the egress hardware queues of the PPE switch ports. +- SCH + The scheduler which manages the hardware traffic scheduling for the PPE switch ports. +- L2 + The L2 block performs the packet bridging in the switch core. The bridge domain is + represented by the VSI (Virtual Switch Instance) domain in PPE. FDB learning can be + enabled based on the VSI domain and bridge forwarding occurs within the VSI domain. +- MAC + The PPE in the IPQ9574 supports up to six MACs (MAC0 to MAC5) which are corresponding + to six switch ports (port1 to port6). The MAC block is connected with external PHY + through the UNIPHY PCS block. Each MAC block includes the GMAC and XGMAC blocks and + the switch port can select to use GMAC or XMAC through a MUX selection according to + the external PHY's capability. +- EDMA (Ethernet DMA) + The Ethernet DMA is used to transmit and receive Ethernet packets between the PPE + ports and the ARM cores. + +The received packet on a PPE MAC port can be forwarded to another PPE MAC port. It can +be also forwarded to internal switch port0 so that the packet can be delivered to the +ARM cores using the Ethernet DMA (EDMA) engine. The Ethernet DMA driver will deliver the +packet to the corresponding 'netdevice' interface. + +The software instantiations of the PPE MAC (netdevice), PCS and external PHYs interact +with the Linux PHYLINK framework to manage the connectivity between the PPE ports and +the connected PHYs, and the port link states. This is also illustrated in above diagram. + + +PPE Driver Overview +=================== +PPE driver is Ethernet driver for the Qualcomm IPQ SoC. It is a single platform driver +which includes the PPE part and Ethernet DMA part. The PPE part initializes and drives the +various blocks in PPE switch core such as BM/QM/L2 blocks and the PPE MACs. The EDMA part +drives the Ethernet DMA for packet transfer between PPE ports and ARM cores, and enables +the netdevice driver for the PPE ports. + +The PPE driver files in drivers/net/ethernet/qualcomm/ppe/ are listed as below: + +- Makefile +- ppe.c +- ppe.h +- ppe_config.c +- ppe_config.h +- ppe_debugfs.c +- ppe_debugfs.h +- ppe_regs.h + +The ppe.c file contains the main PPE platform driver and undertakes the initialization of +PPE switch core blocks such as QM, BM and L2. The configuration APIs for these hardware +blocks are provided in the ppe_config.c file. + +The ppe.h defines the PPE device data structure which will be used by PPE driver functions. + +The ppe_debugfs.c enables the PPE statistics counters such as PPE port Rx and Tx counters, +CPU code counters and queue counters. + + +PPE Driver Supported SoCs +========================= + +The PPE driver supports the following IPQ SoC: + +- IPQ9574 + + +Enabling the Driver +=================== + +The driver is located in the menu structure at:: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet driver support + -> Qualcomm devices + -> Qualcomm Technologies, Inc. PPE Ethernet support + +If the driver is built as a module, the module will be called qcom-ppe. + +The PPE driver functionally depends on the CMN PLL and NSSCC clock controller drivers. +Please make sure the dependent modules are installed before installing the PPE driver +module. + + +Debugging +========= + +The PPE hardware counters can be accessed using debugfs interface from the +``/sys/kernel/debug/ppe/`` directory. -- cgit v1.2.3 From da0e2197645c8e01bb6080c7a2b86d9a56cc64a9 Mon Sep 17 00:00:00 2001 From: Shahar Shitrit Date: Sun, 24 Aug 2025 11:43:53 +0300 Subject: devlink: Make health reporter burst period configurable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Enable configuration of the burst period — a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error. This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient burst period could allow all affected TX queues to be recovered within that window. Without this period, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins. Configuration example: $ devlink health set pci/0000:00:09.0 reporter tx burst_period 500 Configuration example with ynl: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/devlink.yaml \ --do health-reporter-set --json '{ "bus-name": "auxiliary", "dev-name": "mlx5_core.eth.0", "port-index": 65535, "health-reporter-name": "tx", "health-reporter-burst-period": 500 }' Signed-off-by: Shahar Shitrit Reviewed-by: Jiri Pirko Reviewed-by: Dragos Tatulea Reviewed-by: Carolina Jubran Signed-off-by: Mark Bloch Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/devlink-health.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst index e0b8cfed610a..4d10536377ab 100644 --- a/Documentation/networking/devlink/devlink-health.rst +++ b/Documentation/networking/devlink/devlink-health.rst @@ -50,7 +50,7 @@ Once an error is reported, devlink health will perform the following actions: * Auto recovery attempt is being done. Depends on: - Auto-recovery configuration - - Grace period vs. time passed since last recover + - Grace period (and burst period) vs. time passed since last recover Devlink formatted message ========================= -- cgit v1.2.3 From 23a6037ce76cb44d93cfea23aec5c7f3971227d4 Mon Sep 17 00:00:00 2001 From: Jay Vosburgh Date: Fri, 29 Aug 2025 17:08:37 -0700 Subject: bonding: Remove support for use_carrier Remove the implementation of use_carrier, the link monitoring method that utilizes ethtool or ioctl to determine the link state of an interface in a bond. Bonding will always behaves as if use_carrier=1, which relies on netif_carrier_ok() to determine the link state of interfaces. To avoid acquiring RTNL many times per second, bonding inspects link state under RCU, but not under RTNL. However, ethtool implementations in drivers may sleep, and therefore this strategy is unsuitable for use with calls into driver ethtool functions. The use_carrier option was introduced in 2003, to provide backwards compatibility for network device drivers that did not support the then-new netif_carrier_ok/on/off system. Device drivers are now expected to support netif_carrier_*, and the use_carrier backwards compatibility logic is no longer necessary. The option itself remains, but when queried always returns 1, and may only be set to 1. Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid Signed-off-by: Jay Vosburgh Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com Acked-by: Stanislav Fomichev Link: https://patch.msgid.link/2029487.1756512517@famine Signed-off-by: Jakub Kicinski --- Documentation/networking/bonding.rst | 79 ++++++------------------------------ 1 file changed, 12 insertions(+), 67 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index f8f5766703d4..a2b42ae719d2 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -582,10 +582,8 @@ miimon This determines how often the link state of each slave is inspected for link failures. A value of zero disables MII link monitoring. A value of 100 is a good starting point. - The use_carrier option, below, affects how the link state is - determined. See the High Availability section for additional - information. The default value is 100 if arp_interval is not - set. + + The default value is 100 if arp_interval is not set. min_links @@ -896,25 +894,14 @@ updelay use_carrier - Specifies whether or not miimon should use MII or ETHTOOL - ioctls vs. netif_carrier_ok() to determine the link - status. The MII or ETHTOOL ioctls are less efficient and - utilize a deprecated calling sequence within the kernel. The - netif_carrier_ok() relies on the device driver to maintain its - state with netif_carrier_on/off; at this writing, most, but - not all, device drivers support this facility. - - If bonding insists that the link is up when it should not be, - it may be that your network device driver does not support - netif_carrier_on/off. The default state for netif_carrier is - "carrier on," so if a driver does not support netif_carrier, - it will appear as if the link is always up. In this case, - setting use_carrier to 0 will cause bonding to revert to the - MII / ETHTOOL ioctl method to determine the link state. - - A value of 1 enables the use of netif_carrier_ok(), a value of - 0 will use the deprecated MII / ETHTOOL ioctls. The default - value is 1. + Obsolete option that previously selected between MII / + ETHTOOL ioctls and netif_carrier_ok() to determine link + state. + + All link state checks are now done with netif_carrier_ok(). + + For backwards compatibility, this option's value may be inspected + or set. The only valid setting is 1. xmit_hash_policy @@ -2036,22 +2023,8 @@ depending upon the device driver to maintain its carrier state, by querying the device's MII registers, or by making an ethtool query to the device. -If the use_carrier module parameter is 1 (the default value), -then the MII monitor will rely on the driver for carrier state -information (via the netif_carrier subsystem). As explained in the -use_carrier parameter information, above, if the MII monitor fails to -detect carrier loss on the device (e.g., when the cable is physically -disconnected), it may be that the driver does not support -netif_carrier. - -If use_carrier is 0, then the MII monitor will first query the -device's (via ioctl) MII registers and check the link state. If that -request fails (not just that it returns carrier down), then the MII -monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain -the same information. If both methods fail (i.e., the driver either -does not support or had some error in processing both the MII register -and ethtool requests), then the MII monitor will assume the link is -up. +The MII monitor relies on the driver for carrier state information (via +the netif_carrier subsystem). 8. Potential Sources of Trouble =============================== @@ -2135,34 +2108,6 @@ This will load tg3 and e1000 modules before loading the bonding one. Full documentation on this can be found in the modprobe.d and modprobe manual pages. -8.3. Painfully Slow Or No Failed Link Detection By Miimon ---------------------------------------------------------- - -By default, bonding enables the use_carrier option, which -instructs bonding to trust the driver to maintain carrier state. - -As discussed in the options section, above, some drivers do -not support the netif_carrier_on/_off link state tracking system. -With use_carrier enabled, bonding will always see these links as up, -regardless of their actual state. - -Additionally, other drivers do support netif_carrier, but do -not maintain it in real time, e.g., only polling the link state at -some fixed interval. In this case, miimon will detect failures, but -only after some long period of time has expired. If it appears that -miimon is very slow in detecting link failures, try specifying -use_carrier=0 to see if that improves the failure detection time. If -it does, then it may be that the driver checks the carrier state at a -fixed interval, but does not cache the MII register values (so the -use_carrier=0 method of querying the registers directly works). If -use_carrier=0 does not improve the failover, then the driver may cache -the registers, or the problem may be elsewhere. - -Also, remember that miimon only checks for the device's -carrier state. It has no way to determine the state of devices on or -beyond other ports of a switch, or if a switch is refusing to pass -traffic while still maintaining carrier on. - 9. SNMP agents =============== -- cgit v1.2.3 From 6b6dc81ee7e8ca87c71a533e1d69cf96a4f1e986 Mon Sep 17 00:00:00 2001 From: Hangbin Liu Date: Tue, 2 Sep 2025 06:44:59 +0000 Subject: bonding: add support for per-port LACP actor priority Introduce a new netlink attribute 'actor_port_prio' to allow setting the LACP actor port priority on a per-slave basis. This extends the existing bonding infrastructure to support more granular control over LACP negotiations. The priority value is embedded in LACPDU packets and will be used by subsequent patches to influence aggregator selection policies. Signed-off-by: Hangbin Liu Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com Signed-off-by: Paolo Abeni --- Documentation/networking/bonding.rst | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index a2b42ae719d2..345b60992446 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -193,6 +193,15 @@ ad_actor_sys_prio This parameter has effect only in 802.3ad mode and is available through SysFs interface. +actor_port_prio + + In an AD system, this specifies the port priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 255 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + netlink interface. + ad_actor_system In an AD system, this specifies the mac-address for the actor in -- cgit v1.2.3 From e5a6643435fa4ad1e104323ec7d3e6215e2d832c Mon Sep 17 00:00:00 2001 From: Hangbin Liu Date: Tue, 2 Sep 2025 06:45:00 +0000 Subject: bonding: support aggregator selection based on port priority Add a new ad_select policy 'port_priority' that uses the per-port actor priority values (set via ad_actor_port_prio) to determine aggregator selection. This allows administrators to influence which ports are preferred for aggregation by assigning different priority values, providing more flexible load balancing control in LACP configurations. Signed-off-by: Hangbin Liu Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com Signed-off-by: Paolo Abeni --- Documentation/networking/bonding.rst | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index 345b60992446..e700bf1d095c 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -250,10 +250,18 @@ ad_select ports (slaves). Reselection occurs as described under the "bandwidth" setting, above. - The bandwidth and count selection policies permit failover of - 802.3ad aggregations when partial failure of the active aggregator - occurs. This keeps the aggregator with the highest availability - (either in bandwidth or in number of ports) active at all times. + actor_port_prio or 3 + + The active aggregator is chosen by the highest total sum of + actor port priorities across its active ports. Note this + priority is actor_port_prio, not per port prio, which is + used for primary reselect. + + The bandwidth, count and actor_port_prio selection policies permit + failover of 802.3ad aggregations when partial failure of the active + aggregator occurs. This keeps the aggregator with the highest + availability (either in bandwidth, number of ports, or total value + of port priorities) active at all times. This option was added in bonding version 3.4.0. -- cgit v1.2.3 From 30549eebc4d84f844c62965f2e1d362bc1accdce Mon Sep 17 00:00:00 2001 From: Geliang Tang Date: Sun, 7 Sep 2025 17:32:42 +0200 Subject: mptcp: make ADD_ADDR retransmission timeout adaptive Currently the ADD_ADDR option is retransmitted with a fixed timeout. This patch makes the retransmission timeout adaptive by using the maximum RTO among all the subflows, while still capping it at the configured maximum value (add_addr_timeout_max). This improves responsiveness when establishing new subflows. Specifically: 1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive timeout. 2. Uses maximum subflow RTO (icsk_rto) when available. 3. Applies exponential backoff based on retransmission count. 4. Maintains fallback to configured max timeout when no RTO data exists. This slightly changes the behaviour of the MPTCP "add_addr_timeout" sysctl knob to be used as a maximum instead of a fixed value. But this is seen as an improvement: the ADD_ADDR might be sent quicker than before to improve the overall MPTCP connection. Also, the default value is set to 2 min, which was already way too long, and caused the ADD_ADDR not to be retransmitted for connections shorter than 2 minutes. Suggested-by: Matthieu Baerts (NGI0) Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576 Reviewed-by: Christoph Paasch Signed-off-by: Geliang Tang Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20250907-net-next-mptcp-add_addr-retrans-adapt-v1-1-824cc805772b@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/mptcp-sysctl.rst | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index 1683c139821e..1eb6af26b4a7 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -8,9 +8,11 @@ MPTCP Sysfs variables =============================== add_addr_timeout - INTEGER (seconds) - Set the timeout after which an ADD_ADDR control message will be - resent to an MPTCP peer that has not acknowledged a previous - ADD_ADDR message. + Set the maximum value of timeout after which an ADD_ADDR control message + will be resent to an MPTCP peer that has not acknowledged a previous + ADD_ADDR message. A dynamically estimated retransmission timeout based + on the estimated connection round-trip-time is used if this value is + lower than the maximum one. Do not retransmit if set to 0. -- cgit v1.2.3 From ce0b015e2619ae64b7d33fb24a6b6cadcd70c317 Mon Sep 17 00:00:00 2001 From: Vlad Dumitrescu Date: Sat, 6 Sep 2025 18:29:43 -0700 Subject: devlink: Add 'total_vfs' generic device param NICs are typically configured with total_vfs=0, forcing users to rely on external tools to enable SR-IOV (a widely used and essential feature). Add total_vfs parameter to devlink for SR-IOV max VF configurability. Enables standard kernel tools to manage SR-IOV, addressing the need for flexible VF configuration. Signed-off-by: Vlad Dumitrescu Tested-by: Kamal Heib Reviewed-by: Jiri Pirko Signed-off-by: Saeed Mahameed Reviewed-by: Simon Horman Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/devlink-params.rst | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst index 211b58177e12..c51da4fba7e7 100644 --- a/Documentation/networking/devlink/devlink-params.rst +++ b/Documentation/networking/devlink/devlink-params.rst @@ -143,3 +143,8 @@ own name. * - ``clock_id`` - u64 - Clock ID used by the device for registering DPLL devices and pins. + * - ``total_vfs`` + - u32 + - The max number of Virtual Functions (VFs) exposed by the PF. + after reboot/pci reset, 'sriov_totalvfs' entry under the device's sysfs + directory will report this value. -- cgit v1.2.3 From bf2da4799fdb6eb58d9c9541b7dc1096c260499d Mon Sep 17 00:00:00 2001 From: Saeed Mahameed Date: Sat, 6 Sep 2025 18:29:44 -0700 Subject: net/mlx5: Implement cqe_compress_type via devlink params Selects which algorithm should be used by the NIC in order to decide rate of CQE compression dependeng on PCIe bus conditions. Supported values: 1) balanced, merges fewer CQEs, resulting in a moderate compression ratio but maintaining a balance between bandwidth savings and performance 2) aggressive, merges more CQEs into a single entry, achieving a higher compression rate and maximizing performance, particularly under high traffic loads. Signed-off-by: Saeed Mahameed Reviewed-by: Jiri Pirko Reviewed-by: Simon Horman Link: https://patch.msgid.link/20250907012953.301746-3-saeed@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/mlx5.rst | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 7febe0aecd53..2edc842b620d 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -117,6 +117,16 @@ parameters. - driverinit - Control the size (in packets) of the hairpin queues. + * - ``cqe_compress_type`` + - string + - permanent + - Configure which mechanism/algorithm should be used by the NIC that will + affect the rate (aggressiveness) of compressed CQEs depending on PCIe bus + conditions and other internal NIC factors. This mode affects all queues + that enable compression. + * ``balanced`` : Merges fewer CQEs, resulting in a moderate compression ratio but maintaining a balance between bandwidth savings and performance + * ``aggressive`` : Merges more CQEs into a single entry, achieving a higher compression rate and maximizing performance, particularly under high traffic loads + The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` Info versions -- cgit v1.2.3 From 95a0af146dff5437acb4ea27eacc05aa22c7bb54 Mon Sep 17 00:00:00 2001 From: Vlad Dumitrescu Date: Sat, 6 Sep 2025 18:29:45 -0700 Subject: net/mlx5: Implement devlink enable_sriov parameter Example usage: devlink dev param set pci/0000:01:00.0 name enable_sriov value {true, false} cmode permanent devlink dev reload pci/0000:01:00.0 action fw_activate echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove echo 1 >/sys/bus/pci/rescan grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_* Signed-off-by: Vlad Dumitrescu Tested-by: Kamal Heib Reviewed-by: Jiri Pirko Signed-off-by: Saeed Mahameed Reviewed-by: Simon Horman Link: https://patch.msgid.link/20250907012953.301746-4-saeed@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/mlx5.rst | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 2edc842b620d..c3610f7c1d4b 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -15,23 +15,31 @@ Parameters * - Name - Mode - Validation + - Notes * - ``enable_roce`` - driverinit - - Type: Boolean - - If the device supports RoCE disablement, RoCE enablement state controls + - Boolean + - If the device supports RoCE disablement, RoCE enablement state controls device support for RoCE capability. Otherwise, the control occurs in the driver stack. When RoCE is disabled at the driver level, only raw ethernet QPs are supported. * - ``io_eq_size`` - driverinit - The range is between 64 and 4096. + - * - ``event_eq_size`` - driverinit - The range is between 64 and 4096. + - * - ``max_macs`` - driverinit - The range is between 1 and 2^31. Only power of 2 values are supported. + - + * - ``enable_sriov`` + - permanent + - Boolean + - Applies to each physical function (PF) independently, if the device + supports it. Otherwise, it applies symmetrically to all PFs. The ``mlx5`` driver also implements the following driver-specific parameters. -- cgit v1.2.3 From a4c49611cf4f7018ee80f02bded12fd4002ef95c Mon Sep 17 00:00:00 2001 From: Vlad Dumitrescu Date: Sat, 6 Sep 2025 18:29:46 -0700 Subject: net/mlx5: Implement devlink total_vfs parameter Some devices support both symmetric (same value for all PFs) and asymmetric, while others only support symmetric configuration. This implementation prefers asymmetric, since it is closer to the devlink model (per function settings), but falls back to symmetric when needed. Example usage: devlink dev param set pci/0000:01:00.0 name total_vfs value cmode permanent devlink dev reload pci/0000:01:00.0 action fw_activate echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove echo 1 >/sys/bus/pci/rescan cat /sys/bus/pci/devices/0000:01:00.0/sriov_totalvfs Signed-off-by: Vlad Dumitrescu Reviewed-by: Jiri Pirko Tested-by: Kamal Heib Signed-off-by: Saeed Mahameed Reviewed-by: Simon Horman Link: https://patch.msgid.link/20250907012953.301746-5-saeed@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/mlx5.rst | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index c3610f7c1d4b..07b1424cbfbb 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -40,6 +40,28 @@ Parameters - Boolean - Applies to each physical function (PF) independently, if the device supports it. Otherwise, it applies symmetrically to all PFs. + * - ``total_vfs`` + - permanent + - The range is between 1 and a device-specific max. + - Applies to each physical function (PF) independently, if the device + supports it. Otherwise, it applies symmetrically to all PFs. + +Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require FW reset to take effect + +.. code-block:: bash + + # setup parameters + devlink dev param set pci/0000:01:00.0 name enable_sriov value true cmode permanent + devlink dev param set pci/0000:01:00.0 name total_vfs value 8 cmode permanent + + # Fw reset + devlink dev reload pci/0000:01:00.0 action fw_activate + + # for PCI related config such as sriov PCI reset/rescan is required: + echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove + echo 1 >/sys/bus/pci/rescan + grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_* + The ``mlx5`` driver also implements the following driver-specific parameters. -- cgit v1.2.3 From f4053490a6f651476124903dbe0777e3c24ac8cb Mon Sep 17 00:00:00 2001 From: Dragos Tatulea Date: Sun, 7 Sep 2025 12:39:35 +0300 Subject: net/mlx5e: Make PCIe congestion event thresholds configurable Add devlink driverinit parameters for configuring the thresholds for PCIe congestion events. These parameters are registered only when the firmware supports this feature. Update the mlx5 devlink docs as well on these new params. Signed-off-by: Dragos Tatulea Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Link: https://patch.msgid.link/1757237976-531416-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/mlx5.rst | 52 +++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 07b1424cbfbb..60cc9fedf1ef 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -146,6 +146,58 @@ parameters. - u32 - driverinit - Control the size (in packets) of the hairpin queues. + * - ``pcie_cong_inbound_high`` + - u16 + - driverinit + - High threshold configuration for PCIe congestion events. The firmware + will send an event once device side inbound PCIe traffic went + above the configured high threshold for a long enough period (at least + 200ms). + + See pci_bw_inbound_high ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_inbound_low < pcie_cong_inbound_high. + Default value: 9000 (Corresponds to 90%). + * - ``pcie_cong_inbound_low`` + - u16 + - driverinit + - Low threshold configuration for PCIe congestion events. The firmware + will send an event once device side inbound PCIe traffic went + below the configured low threshold, only after having been previously in + a congested state. + + See pci_bw_inbound_low ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_inbound_low < pcie_cong_inbound_high. + Default value: 7500. + * - ``pcie_cong_outbound_high`` + - u16 + - driverinit + - High threshold configuration for PCIe congestion events. The firmware + will send an event once device side outbound PCIe traffic went + above the configured high threshold for a long enough period (at least + 200ms). + + See pci_bw_outbound_high ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_outbound_low < pcie_cong_outbound_high. + Default value: 9000 (Corresponds to 90%). + * - ``pcie_cong_outbound_low`` + - u16 + - driverinit + - Low threshold configuration for PCIe congestion events. The firmware + will send an event once device side outbound PCIe traffic went + below the configured low threshold, only after having been previously in + a congested state. + + See pci_bw_outbound_low ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_outbound_low < pcie_cong_outbound_high. + Default value: 7500. * - ``cqe_compress_type`` - string -- cgit v1.2.3 From cdc492746e3f6d73a9e6a6a9962c9f1f7b7961b5 Mon Sep 17 00:00:00 2001 From: Dragos Tatulea Date: Sun, 7 Sep 2025 12:39:36 +0300 Subject: net/mlx5e: Add stale counter for PCIe congestion events This ethtool counter is meant to help with observing how many times the congestion event was triggered but on query there was no state change. This would help to indicate when a work item was scheduled to run too late and in the meantime the congestion state changed back to previous state. While at it, do a driveby typo fix in documentation for pci_bw_inbound_high. Signed-off-by: Dragos Tatulea Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Link: https://patch.msgid.link/1757237976-531416-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- .../networking/device_drivers/ethernet/mellanox/mlx5/counters.rst | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index 754c81436408..cc498895f92e 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -1348,7 +1348,7 @@ Device Counters is in a congested state. If pci_bw_inbound_high == pci_bw_inbound_low then the device is not congested. If pci_bw_inbound_high > pci_bw_inbound_low then the device is congested. - - Tnformative + - Informative * - `pci_bw_inbound_low` - The number of times the device crossed the low inbound PCIe bandwidth @@ -1373,3 +1373,8 @@ Device Counters If pci_bw_outbound_high == pci_bw_outbound_low then the device is not congested. If pci_bw_outbound_high > pci_bw_outbound_low then the device is congested. - Informative + + * - `pci_bw_stale_event` + - The number of times the device fired a PCIe congestion event but on query + there was no change in state. + - Informative -- cgit v1.2.3 From 1f24a240974589ce42f70502ccb3ff3f5189d69a Mon Sep 17 00:00:00 2001 From: "Matthieu Baerts (NGI0)" Date: Tue, 9 Sep 2025 19:46:09 +0200 Subject: doc: mptcp: fix Netlink specs link The Netlink specs RST files are no longer generated inside the source tree. In other words, the path to mptcp_pm.rst has changed, and needs to be updated to the new location. Fixes: 1ce4da3dd99e ("docs: use parser_yaml extension to handle Netlink specs") Reported-by: Kory Maincent Closes: https://lore.kernel.org/20250828185037.07873d04@kmaincent-XPS-13-7390 Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20250909-net-next-mptcp-pm-link-v1-1-0f1c4b8439c6@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/networking/mptcp.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst index 17f2bab61164..fdc7bfd5d5c5 100644 --- a/Documentation/networking/mptcp.rst +++ b/Documentation/networking/mptcp.rst @@ -66,7 +66,7 @@ same rules are applied for all the connections (see: ``ip mptcp``) ; and the userspace one (type ``1``), controlled by a userspace daemon (i.e. `mptcpd `_) where different rules can be applied for each connection. The path managers can be controlled via a Netlink API; see -netlink_spec/mptcp_pm.rst. +../netlink/specs/mptcp_pm.rst. To be able to use multiple IP addresses on a host to create multiple *subflows* (paths), the default in-kernel MPTCP path-manager needs to know which IP -- cgit v1.2.3 From a1e891fe4ae8df3ba17d75c270f1877e282c9d2c Mon Sep 17 00:00:00 2001 From: Ivan Vecera Date: Tue, 9 Sep 2025 11:15:32 +0200 Subject: dpll: zl3073x: Implement devlink flash callback Use the introduced functionality to read firmware files and flash their contents into the device's internal flash memory to implement the devlink flash update callback. Sample output on EDS2 development board: # devlink -j dev info i2c/1-0070 | jq '.[][]["versions"]["running"]' { "fw": "6026" } # devlink dev flash i2c/1-0070 file firmware_fw2.hex [utility] Prepare flash mode [utility] Downloading image 100% [utility] Flash mode enabled [firmware1-part1] Downloading image 100% [firmware1-part1] Flashing image [firmware1-part2] Downloading image 100% [firmware1-part2] Flashing image [firmware1] Flashing done [firmware2] Downloading image 100% [firmware2] Flashing image 100% [firmware2] Flashing done [utility] Leaving flash mode Flashing done # devlink -j dev info i2c/1-0070 | jq '.[][]["versions"]["running"]' { "fw": "7006" } Reviewed-by: Przemek Kitszel Signed-off-by: Ivan Vecera Link: https://patch.msgid.link/20250909091532.11790-6-ivecera@redhat.com Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/zl3073x.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/zl3073x.rst b/Documentation/networking/devlink/zl3073x.rst index 4b6cfaf38643..fc5a8dc272a7 100644 --- a/Documentation/networking/devlink/zl3073x.rst +++ b/Documentation/networking/devlink/zl3073x.rst @@ -49,3 +49,17 @@ The ``zl3073x`` driver reports the following versions - running - 1.3.0.1 - Device configuration version customized by OEM + +Flash Update +============ + +The ``zl3073x`` driver implements support for flash update using the +``devlink-flash`` interface. It supports updating the device flash using a +combined flash image ("bundle") that contains multiple components (firmware +parts and configurations). + +During the flash procedure, the standard firmware interface is not available, +so the driver unregisters all DPLLs and associated pins, and re-registers them +once the flash procedure is complete. + +The driver does not support any overwrite mask flags. -- cgit v1.2.3 From 7cfbe1c3397c9368d2bd7fbf5345a5bf4c43df0d Mon Sep 17 00:00:00 2001 From: "Kory Maincent (Dent Project)" Date: Mon, 15 Sep 2025 19:06:28 +0200 Subject: docs: devlink: Sort table of contents alphabetically Sort devlink documentation table of contents alphabetically to improve readability and make it easier to locate specific chapters. Signed-off-by: Kory Maincent Link: https://patch.msgid.link/20250915-feature_poe_permanent_conf-v3-3-78871151088b@bootlin.com Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/index.rst | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index 270a65a01411..0c58e5c729d9 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -56,18 +56,18 @@ general. :maxdepth: 1 devlink-dpipe + devlink-eswitch-attr + devlink-flash devlink-health devlink-info - devlink-flash + devlink-linecard devlink-params devlink-port devlink-region - devlink-resource devlink-reload + devlink-resource devlink-selftests devlink-trap - devlink-linecard - devlink-eswitch-attr Driver-specific documentation ----------------------------- @@ -78,12 +78,14 @@ parameters, info versions, and other features it supports. .. toctree:: :maxdepth: 1 + am65-nuss-cpsw-switch bnxt etas_es58x hns3 i40e - ionic ice + ionic + iosm ixgbe kvaser_pciefd kvaser_usb @@ -93,11 +95,9 @@ parameters, info versions, and other features it supports. mv88e6xxx netdevsim nfp - qed - ti-cpsw-switch - am65-nuss-cpsw-switch - prestera - iosm octeontx2 + prestera + qed sfc + ti-cpsw-switch zl3073x -- cgit v1.2.3 From 6bdcb735fec6cb866b0d40634d4f23effba81074 Mon Sep 17 00:00:00 2001 From: Cosmin Ratiu Date: Tue, 16 Sep 2025 17:11:43 +0300 Subject: devlink: Add a 'num_doorbells' driverinit param This parameter can be used by drivers to configure a different number of doorbells. Signed-off-by: Cosmin Ratiu Reviewed-by: Dragos Tatulea Reviewed-by: Jiri Pirko Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/devlink-params.rst | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst index c51da4fba7e7..0a9c20d70122 100644 --- a/Documentation/networking/devlink/devlink-params.rst +++ b/Documentation/networking/devlink/devlink-params.rst @@ -148,3 +148,6 @@ own name. - The max number of Virtual Functions (VFs) exposed by the PF. after reboot/pci reset, 'sriov_totalvfs' entry under the device's sysfs directory will report this value. + * - ``num_doorbells`` + - u32 + - Controls the number of doorbells used by the device. -- cgit v1.2.3 From 11bbcfb7668c6f4d97260f7caaefea22678bc31e Mon Sep 17 00:00:00 2001 From: Cosmin Ratiu Date: Tue, 16 Sep 2025 17:11:44 +0300 Subject: net/mlx5e: Use the 'num_doorbells' devlink param Use the new devlink param to control how many doorbells mlx5e devices allocate and use. The maximum number of doorbells configurable is capped to the maximum number of channels. This only applies to the Ethernet part, the RDMA devices using mlx5 manage their own doorbells. Signed-off-by: Cosmin Ratiu Reviewed-by: Dragos Tatulea Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Signed-off-by: Jakub Kicinski --- Documentation/networking/devlink/mlx5.rst | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 60cc9fedf1ef..41c9b716699e 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -62,6 +62,15 @@ Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require FW echo 1 >/sys/bus/pci/rescan grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_* + * - ``num_doorbells`` + - driverinit + - This controls the number of channel doorbells used by the netdev. In all + cases, an additional doorbell is allocated and used for non-channel + communication (e.g. for PTP, HWS, etc.). Supported values are: + + - 0: No channel-specific doorbells, use the global one for everything. + - [1, max_num_channels]: Spread netdev channels equally across these + doorbells. The ``mlx5`` driver also implements the following driver-specific parameters. -- cgit v1.2.3 From 542a495cbaa6dc57a310da62b501fdf318657cad Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= Date: Tue, 16 Sep 2025 10:24:25 +0200 Subject: tcp: AccECN core MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This change implements Accurate ECN without negotiation and AccECN Option (that will be added by later changes). Based on AccECN specifications: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN allows feeding back the number of CE (congestion experienced) marks accurately to the sender in contrast to RFC3168 ECN that can only signal one marks-seen-yes/no per RTT. Congestion control algorithms can take advantage of the accurate ECN information to fine-tune their congestion response to avoid drastic rate reduction when only mild congestion is encountered. With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps track of how many segments have arrived with a CE mark. Accurate ECN uses ACE field (ECE, CWR, AE) to communicate the value back to the sender which updates tp->delivered_ce (s.cep) based on the feedback. This signalling channel is lossy when ACE field overflow occurs. Conservative strategy is selected here to deal with the ACE overflow, however, some strategies using the AccECN option later in the overall patchset mitigate against false overflows detected. The ACE field values on the wire are offset by TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the real CE marks rather than forcing all downstream users to adapt to the wire offset. This patch uses the first 1-byte hole and the last 4-byte hole of the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'. Also, the group size of tcp_sock_write_txrx is increased from 91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are the trimmed pahole outcomes before and after this patch. [BEFORE THIS PATCH] struct tcp_sock { [...] __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */ u8 nonagle:4; /* 2521: 0 1 */ u8 rate_app_limited:1; /* 2521: 4 1 */ /* XXX 3 bits hole, try to pack */ /* XXX 2 bytes hole, try to pack */ [...] u32 delivered_ce; /* 2576 4 */ u32 app_limited; /* 2580 4 */ u32