diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-08-29 11:33:01 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-08-29 11:33:01 -0700 |
| commit | bd6c11bc43c496cddfc6cf603b5d45365606dbd5 (patch) | |
| tree | 36318fa68f784d397111991177d65bd6325189c4 /Documentation | |
| parent | 68cf01760bc0891074e813b9bb06d2696cac1c01 (diff) | |
| parent | c873512ef3a39cc1a605b7a5ff2ad0a33d619aa8 (diff) | |
| download | linux-bd6c11bc43c496cddfc6cf603b5d45365606dbd5.tar.gz linux-bd6c11bc43c496cddfc6cf603b5d45365606dbd5.tar.bz2 linux-bd6c11bc43c496cddfc6cf603b5d45365606dbd5.zip | |
Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Increase size limits for to-be-sent skb frag allocations. This
allows tun, tap devices and packet sockets to better cope with
large writes operations
- Store netdevs in an xarray, to simplify iterating over netdevs
- Refactor nexthop selection for multipath routes
- Improve sched class lifetime handling
- Add backup nexthop ID support for bridge
- Implement drop reasons support in openvswitch
- Several data races annotations and fixes
- Constify the sk parameter of routing functions
- Prepend kernel version to netconsole message
Protocols:
- Implement support for TCP probing the peer being under memory
pressure
- Remove hard coded limitation on IPv6 specific info placement inside
the socket struct
- Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
socket scaling factor
- Scaling-up the IPv6 expired route GC via a separated list of
expiring routes
- In-kernel support for the TLS alert protocol
- Better support for UDP reuseport with connected sockets
- Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
header size
- Get rid of additional ancillary per MPTCP connection struct socket
- Implement support for BPF-based MPTCP packet schedulers
- Format MPTCP subtests selftests results in TAP
- Several new SMC 2.1 features including unique experimental options,
max connections per lgr negotiation, max links per lgr negotiation
BPF:
- Multi-buffer support in AF_XDP
- Add multi uprobe BPF links for attaching multiple uprobes and usdt
probes, which is significantly faster and saves extra fds
- Implement an fd-based tc BPF attach API (TCX) and BPF link support
on top of it
- Add SO_REUSEPORT support for TC bpf_sk_assign
- Support new instructions from cpu v4 to simplify the generated code
and feature completeness, for x86, arm64, riscv64
- Support defragmenting IPv(4|6) packets in BPF
- Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
perf+libbpf issue related to custom section handling
- Introduce bpf map element count and enable it for all program types
- Add a BPF hook in sys_socket() to change the protocol ID from
IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy
- Introduce bpf_me_mcache_free_rcu() and fix OOM under stress
- Add uprobe support for the bpf_get_func_ip helper
- Check skb ownership against full socket
- Support for up to 12 arguments in BPF trampoline
- Extend link_info for kprobe_multi and perf_event links
Netfilter:
- Speed-up process exit by aborting ruleset validation if a fatal
signal is pending
- Allow NLA_POLICY_MASK to be used with BE16/BE32 types
Driver API:
- Page pool optimizations, to improve data locality and cache usage
- Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
need for raw ioctl() handling in drivers
- Simplify genetlink dump operations (doit/dumpit) providing them the
common information already populated in struct genl_info
- Extend and use the yaml devlink specs to [re]generate the split ops
- Introduce devlink selective dumps, to allow SF filtering SF based
on handle and other attributes
- Add yaml netlink spec for netlink-raw families, allow route, link
and address related queries via the ynl tool
- Remove phylink legacy mode support
- Support offload LED blinking to phy
- Add devlink port function attributes for IPsec
New hardware / drivers:
- Ethernet:
- Broadcom ASP 2.0 (72165) ethernet controller
- MediaTek MT7988 SoC
- Texas Instruments AM654 SoC
- Texas Instruments IEP driver
- Atheros qca8081 phy
- Marvell 88Q2110 phy
- NXP TJA1120 phy
- WiFi:
- MediaTek mt7981 support
- Can:
- Kvaser SmartFusion2 PCI Express devices
- Allwinner T113 controllers
- Texas Instruments tcan4552/4553 chips
- Bluetooth:
- Intel Gale Peak
- Qualcomm WCN3988 and WCN7850
- NXP AW693 and IW624
- Mediatek MT2925
Drivers:
- Ethernet NICs:
- nVidia/Mellanox:
- mlx5:
- support UDP encapsulation in packet offload mode
- IPsec packet offload support in eswitch mode
- improve aRFS observability by adding new set of counters
- extends MACsec offload support to cover RoCE traffic
- dynamic completion EQs
- mlx4:
- convert to use auxiliary bus instead of custom interface
logic
- Intel
- ice:
- implement switchdev bridge offload, even for LAG
interfaces
- implement SRIOV support for LAG interfaces
- igc:
- add support for multiple in-flight TX timestamps
- Broadcom:
- bnxt:
- use the unified RX page pool buffers for XDP and non-XDP
- use the NAPI skb allocation cache
- OcteonTX2:
- support Round Robin scheduling HTB offload
- TC flower offload support for SPI field
- Freescale:
- add XDP_TX feature support
- AMD:
- ionic: add support for PCI FLR event
- sfc:
- basic conntrack offload
- introduce eth, ipv4 and ipv6 pedit offloads
- ST Microelectronics:
- stmmac: maximze PTP timestamping resolution
- Virtual NICs:
- Microsoft vNIC:
- batch ringing RX queue doorbell on receiving packets
- add page pool for RX buffers
- Virtio vNIC:
- add per queue interrupt coalescing support
- Google vNIC:
- add queue-page-list mode support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add port range matching tc-flower offload
- permit enslavement to netdevices with uppers
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- convert to phylink_pcs
- Renesas:
- r8A779fx: add speed change support
- rzn1: enables vlan support
- Ethernet PHYs:
- convert mv88e6xxx to phylink_pcs
- WiFi:
- Qualcomm Wi-Fi 7 (ath12k):
- extremely High Throughput (EHT) PHY support
- RealTek (rtl8xxxu):
- enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
RTL8192EU and RTL8723BU
- RealTek (rtw89):
- Introduce Time Averaged SAR (TAS) support
- Connector:
- support for event filtering"
* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
net: stmmac: clarify difference between "interface" and "phy_interface"
r8152: add vendor/device ID pair for D-Link DUB-E250
devlink: move devlink_notify_register/unregister() to dev.c
devlink: move small_ops definition into netlink.c
devlink: move tracepoint definitions into core.c
devlink: push linecard related code into separate file
devlink: push rate related code into separate file
devlink: push trap related code into separate file
devlink: use tracepoint_enabled() helper
devlink: push region related code into separate file
devlink: push param related code into separate file
devlink: push resource related code into separate file
devlink: push dpipe related code into separate file
devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
devlink: push shared buffer related code into separate file
devlink: push port related code into separate file
devlink: push object register/unregister notifications into separate helpers
inet: fix IP_TRANSPARENT error handling
...
Diffstat (limited to 'Documentation')
61 files changed, 4452 insertions, 603 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index 38372a956d65..eb19c945f4d5 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -140,11 +140,6 @@ A: Because if we picked one-to-one relationship to x64 it would have made it more complicated to support on arm64 and other archs. Also it needs div-by-zero runtime check. -Q: Why there is no BPF_SDIV for signed divide operation? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A: Because it would be rarely used. llvm errors in such case and -prints a suggestion to use unsigned divide instead. - Q: Why BPF has implicit prologue and epilogue? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: Because architectures like sparc have register windows and in general diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index 609b71f5747d..de27e1620821 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -635,12 +635,12 @@ test coverage. Q: clang flag for target bpf? ----------------------------- -Q: In some cases clang flag ``-target bpf`` is used but in other cases the +Q: In some cases clang flag ``--target=bpf`` is used but in other cases the default clang target, which matches the underlying architecture, is used. What is the difference and when I should use which? A: Although LLVM IR generation and optimization try to stay architecture -independent, ``-target <arch>`` still has some impact on generated code: +independent, ``--target=<arch>`` still has some impact on generated code: - BPF program may recursively include header file(s) with file scope inline assembly codes. The default target can handle this well, @@ -658,7 +658,7 @@ independent, ``-target <arch>`` still has some impact on generated code: The clang option ``-fno-jump-tables`` can be used to disable switch table generation. -- For clang ``-target bpf``, it is guaranteed that pointer or long / +- For clang ``--target=bpf``, it is guaranteed that pointer or long / unsigned long types will always have a width of 64 bit, no matter whether underlying clang binary or default target (or kernel) is 32 bit. However, when native clang target is used, then it will @@ -668,7 +668,7 @@ independent, ``-target <arch>`` still has some impact on generated code: while the BPF LLVM back end still operates in 64 bit. The native target is mostly needed in tracing for the case of walking ``pt_regs`` or other kernel structures where CPU's register width matters. - Otherwise, ``clang -target bpf`` is generally recommended. + Otherwise, ``clang --target=bpf`` is generally recommended. You should use default target when: @@ -685,7 +685,7 @@ when: into these structures is verified by the BPF verifier and may result in verification failures if the native architecture is not aligned with the BPF architecture, e.g. 64-bit. An example of this is - BPF_PROG_TYPE_SK_MSG require ``-target bpf`` + BPF_PROG_TYPE_SK_MSG require ``--target=bpf`` .. Links diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index 7cd7c5415a99..f32db1f44ae9 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -990,7 +990,7 @@ format.:: } g2; int main() { return 0; } int test() { return 0; } - -bash-4.4$ clang -c -g -O2 -target bpf t2.c + -bash-4.4$ clang -c -g -O2 --target=bpf t2.c -bash-4.4$ readelf -S t2.o ...... [ 8] .BTF PROGBITS 0000000000000000 00000247 @@ -1000,7 +1000,7 @@ format.:: [10] .rel.BTF.ext REL 0000000000000000 000007e0 0000000000000040 0000000000000010 16 9 8 ...... - -bash-4.4$ clang -S -g -O2 -target bpf t2.c + -bash-4.4$ clang -S -g -O2 --target=bpf t2.c -bash-4.4$ cat t2.s ...... .section .BTF,"",@progbits diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index dbb39e8f9889..1ff177b89d66 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -12,9 +12,9 @@ that goes into great technical depth about the BPF Architecture. .. toctree:: :maxdepth: 1 - instruction-set verifier libbpf/index + standardization/index btf faq syscall_api @@ -29,7 +29,6 @@ that goes into great technical depth about the BPF Architecture. bpf_licensing test_debug clang-notes - linux-notes other redirect diff --git a/Documentation/bpf/llvm_reloc.rst b/Documentation/bpf/llvm_reloc.rst index e4a777a6a3a2..450e6403fe3d 100644 --- a/Documentation/bpf/llvm_reloc.rst +++ b/Documentation/bpf/llvm_reloc.rst @@ -28,7 +28,7 @@ For example, for the following code:: return g1 + g2 + l1 + l2; } -Compiled with ``clang -target bpf -O2 -c test.c``, the following is +Compiled with ``clang --target=bpf -O2 -c test.c``, the following is the code with ``llvm-objdump -dr test.o``:: 0: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll @@ -157,7 +157,7 @@ and ``call`` instructions. For example:: return gfunc(a, b) + lfunc(a, b) + global; } -Compiled with ``clang -target bpf -O2 -c test.c``, we will have +Compiled with ``clang --target=bpf -O2 -c test.c``, we will have following code with `llvm-objdump -dr test.o``:: Disassembly of section .text: @@ -203,7 +203,7 @@ The following is an example to show how R_BPF_64_ABS64 could be generated:: int global() { return 0; } struct t { void *g; } gbl = { global }; -Compiled with ``clang -target bpf -O2 -g -c test.c``, we will see a +Compiled with ``clang --target=bpf -O2 -g -c test.c``, we will see a relocation below in ``.data`` section with command ``llvm-readelf -r test.o``:: diff --git a/Documentation/bpf/standardization/index.rst b/Documentation/bpf/standardization/index.rst new file mode 100644 index 000000000000..09c6ba055fd7 --- /dev/null +++ b/Documentation/bpf/standardization/index.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) + +=================== +BPF Standardization +=================== + +This directory contains documents that are being iterated on as part of the BPF +standardization effort with the IETF. See the `IETF BPF Working Group`_ page +for the working group charter, documents, and more. + +.. toctree:: + :maxdepth: 1 + + instruction-set + linux-notes + +.. Links: +.. _IETF BPF Working Group: https://datatracker.ietf.org/wg/bpf/about/ diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst index 6644842cd3ea..4f73e9dc8d9e 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/standardization/instruction-set.rst @@ -10,9 +10,92 @@ This document specifies version 1.0 of the eBPF instruction set. Documentation conventions ========================= -For brevity, this document uses the type notion "u64", "u32", etc. -to mean an unsigned integer whose width is the specified number of bits, -and "s32", etc. to mean a signed integer of the specified number of bits. +For brevity and consistency, this document refers to families +of types using a shorthand syntax and refers to several expository, +mnemonic functions when describing the semantics of instructions. +The range of valid values for those types and the semantics of those +functions are defined in the following subsections. + +Types +----- +This document refers to integer types with the notation `SN` to specify +a type's signedness (`S`) and bit width (`N`), respectively. + +.. table:: Meaning of signedness notation. + + ==== ========= + `S` Meaning + ==== ========= + `u` unsigned + `s` signed + ==== ========= + +.. table:: Meaning of bit-width notation. + + ===== ========= + `N` Bit width + ===== ========= + `8` 8 bits + `16` 16 bits + `32` 32 bits + `64` 64 bits + `128` 128 bits + ===== ========= + +For example, `u32` is a type whose valid values are all the 32-bit unsigned +numbers and `s16` is a types whose valid values are all the 16-bit signed +numbers. + +Functions +--------- +* `htobe16`: Takes an unsigned 16-bit number in host-endian format and + returns the equivalent number as an unsigned 16-bit number in big-endian + format. +* `htobe32`: Takes an unsigned 32-bit number in host-endian format and + returns the equivalent number as an unsigned 32-bit number in big-endian + format. +* `htobe64`: Takes an unsigned 64-bit number in host-endian format and + returns the equivalent number as an unsigned 64-bit number in big-endian + format. +* `htole16`: Takes an unsigned 16-bit number in host-endian format and + returns the equivalent number as an unsigned 16-bit number in little-endian + format. +* `htole32`: Takes an unsigned 32-bit number in host-endian format and + returns the equivalent number as an unsigned 32-bit number in little-endian + format. +* `htole64`: Takes an unsigned 64-bit number in host-endian format and + returns the equivalent number as an unsigned 64-bit number in little-endian + format. +* `bswap16`: Takes an unsigned 16-bit number in either big- or little-endian + format and returns the equivalent number with the same bit width but + opposite endianness. +* `bswap32`: Takes an unsigned 32-bit number in either big- or little-endian + format and returns the equivalent number with the same bit width but + opposite endianness. +* `bswap64`: Takes an unsigned 64-bit number in either big- or little-endian + format and returns the equivalent number with the same bit width but + opposite endianness. + + +Definitions +----------- + +.. glossary:: + + Sign Extend + To `sign extend an` ``X`` `-bit number, A, to a` ``Y`` `-bit number, B ,` means to + + #. Copy all ``X`` bits from `A` to the lower ``X`` bits of `B`. + #. Set the value of the remaining ``Y`` - ``X`` bits of `B` to the value of + the most-significant bit of `A`. + +.. admonition:: Example + + Sign extend an 8-bit number ``A`` to a 16-bit number ``B`` on a big-endian platform: + :: + + A: 10000110 + B: 11111111 10000110 Registers and calling convention ================================ @@ -154,24 +237,27 @@ otherwise identical operations. The 'code' field encodes the operation as below, where 'src' and 'dst' refer to the values of the source and destination registers, respectively. -======== ===== ========================================================== -code value description -======== ===== ========================================================== -BPF_ADD 0x00 dst += src -BPF_SUB 0x10 dst -= src -BPF_MUL 0x20 dst \*= src -BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0 -BPF_OR 0x40 dst \|= src -BPF_AND 0x50 dst &= src -BPF_LSH 0x60 dst <<= (src & mask) -BPF_RSH 0x70 dst >>= (src & mask) -BPF_NEG 0x80 dst = ~src -BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst -BPF_XOR 0xa0 dst ^= src -BPF_MOV 0xb0 dst = src -BPF_ARSH 0xc0 sign extending dst >>= (src & mask) -BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below) -======== ===== ========================================================== +========= ===== ======= ========================================================== +code value offset description +========= ===== ======= ========================================================== +BPF_ADD 0x00 0 dst += src +BPF_SUB 0x10 0 dst -= src +BPF_MUL 0x20 0 dst \*= src +BPF_DIV 0x30 0 dst = (src != 0) ? (dst / src) : 0 +BPF_SDIV 0x30 1 dst = (src != 0) ? (dst s/ src) : 0 +BPF_OR 0x40 0 dst \|= src +BPF_AND 0x50 0 dst &= src +BPF_LSH 0x60 0 dst <<= (src & mask) +BPF_RSH 0x70 0 dst >>= (src & mask) +BPF_NEG 0x80 0 dst = -dst +BPF_MOD 0x90 0 dst = (src != 0) ? (dst % src) : dst +BPF_SMOD 0x90 1 dst = (src != 0) ? (dst s% src) : dst +BPF_XOR 0xa0 0 dst ^= src +BPF_MOV 0xb0 0 dst = src +BPF_MOVSX 0xb0 8/16/32 dst = (s8,s16,s32)src +BPF_ARSH 0xc0 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask) +BPF_END 0xd0 0 byte swap operations (see `Byte swap instructions`_ below) +========= ===== ======= ========================================================== Underflow and overflow are allowed during arithmetic operations, meaning the 64-bit or 32-bit value will wrap. If eBPF program execution would @@ -198,47 +284,75 @@ where '(u32)' indicates that the upper 32 bits are zeroed. dst = dst ^ imm32 -Also note that the division and modulo operations are unsigned. Thus, for -``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas -for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result -interpreted as an unsigned 64-bit value. There are no instructions for -signed division or modulo. +Note that most instructions have instruction offset of 0. Only three instructions +(``BPF_SDIV``, ``BPF_SMOD``, ``BPF_MOVSX``) have |
