summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2023-04-22netfilter: nf_tables: do not store pktinfo in traceinfo structureFlorian Westphal1-3/+2
pass it as argument. No change in object size. stack usage decreases by 8 byte: nf_tables_core.c:254 nft_do_chain 320 static Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: make validation state per tableFlorian Westphal1-1/+2
We only need to validate tables that saw changes in the current transaction. The existing code revalidates all tables, but this isn't needed as cross-table jumps are not allowed (chains have table scope). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: don't store chain address on jumpFlorian Westphal1-2/+12
Now that the rule trailer/end marker and the rcu head reside in the same structure, we no longer need to save/restore the chain pointer when performing/returning from a jump. We can simply let the trace infra walk the evaluated rule until it hits the end marker and then fetch the chain pointer from there. When the rule is NULL (policy tracing), then chain and basechain pointers were already identical, so just use the basechain. This cuts size of jumpstack in half, from 256 to 128 bytes in 64bit, scripts/stackusage says: nf_tables_core.c:251 nft_do_chain 328 static Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-21bpf: minimal support for programs hooked into netfilter frameworkFlorian Westphal1-0/+5
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack. Invocation incurs an indirect call. This is not a necessity: Its possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already done for xdp progs. This isn't done here to keep the size of this chunk down. Verifier restricts verdicts to either DROP or ACCEPT. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21bpf: add bpf_link support for BPF_NETFILTER programsFlorian Westphal1-0/+10
Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet, if a program is attached only a c-stub is called and not the actual bpf program. Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig. Uapi example usage: union bpf_attr attr = { }; attr.link_create.prog_fd = progfd; attr.link_create.attach_type = 0; /* unused */ attr.link_create.netfilter.pf = PF_INET; attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN; attr.link_create.netfilter.priority = -128; err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr)); ... this would attach progfd to ipv4:input hook. Such hook gets removed automatically if the calling program exits. BPF_NETFILTER program invocation is added in followup change. NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it allows to tell userspace which program is attached at the given hook when user runs 'nft hook list' command rather than just the priority and not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. Will also be used to disallow registration of two bpf programs with same priority in a followup patch. v4: arm32 cmpxchg only supports 32bit operand s/prio/priority/ v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc). Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21Merge tag 'wireless-next-2023-04-21' of ↵Jakub Kicinski1-20/+6
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.4 Most likely the last -next pull request for v6.4. We have changes all over. rtw88 now supports SDIO bus and iwlwifi continues to work on Wi-Fi 7 support. Not much stack changes this time. Major changes: cfg80211/mac80211 - fix some Fine Time Measurement (FTM) frames not being bufferable - flush frames before key removal to avoid potential unencrypted transmission depending on the hardware design iwlwifi - preparation for Wi-Fi 7 EHT and multi-link support rtw88 - SDIO bus support - RTL8822BS, RTL8822CS and RTL8821CS SDIO chipset support rtw89 - framework firmware backwards compatibility brcmfmac - Cypress 43439 SDIO support mt76 - mt7921 P2P support - mt7996 mesh A-MSDU support - mt7996 EHT support - mt7996 coredump support wcn36xx - support for pronto v3 hardware ath11k - PCIe DeviceTree bindings - WCN6750: enable SAR support ath10k - convert DeviceTree bindings to YAML * tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (261 commits) wifi: rtw88: Update spelling in main.h wifi: airo: remove ISA_DMA_API dependency wifi: rtl8xxxu: Simplify setting the initial gain wifi: rtl8xxxu: Add rtl8xxxu_write{8,16,32}_{set,clear} wifi: rtl8xxxu: Don't print the vendor/product/serial wifi: rtw88: Fix memory leak in rtw88_usb wifi: rtw88: call rtw8821c_switch_rf_set() according to chip variant wifi: rtw88: set pkg_type correctly for specific rtw8821c variants wifi: rtw88: rtw8821c: Fix rfe_option field width wifi: rtw88: usb: fix priority queue to endpoint mapping wifi: rtw88: 8822c: add iface combination wifi: rtw88: handle station mode concurrent scan with AP mode wifi: rtw88: prevent scan abort with other VIFs wifi: rtw88: refine reserved page flow for AP mode wifi: rtw88: disallow PS during AP mode wifi: rtw88: 8822c: extend reserved page number wifi: rtw88: add port switch for AP mode wifi: rtw88: add bitmap for dynamic port settings wifi: rtw89: mac: use regular int as return type of DLE buffer request wifi: mac80211: remove return value check of debugfs_create_dir() ... ==================== Link: https://lore.kernel.org/r/20230421104726.800BCC433D2@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21sctp: delete the nested flexible array peer_initXin Long1-1/+1
This patch deletes the flexible-array peer_init[] from the structure sctp_cookie to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/sm_make_chunk.c: note: in included file (through include/net/sctp/sctp.h): ./include/net/sctp/structs.h:1588:28: warning: nested flexible array ./include/net/sctp/structs.h:343:28: warning: nested flexible array Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array skipXin Long1-2/+2
This patch deletes the flexible-array skip[] from the structure sctp_ifwdtsn/fwdtsn_hdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/stream_interleave.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:611:32: warning: nested flexible array ./include/linux/sctp.h:628:33: warning: nested flexible array Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array paramsXin Long1-4/+4
This patch deletes the flexible-array params[] from the structure sctp_inithdr, sctp_addiphdr and sctp_reconf_chunk to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/input.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:278:29: warning: nested flexible array ./include/linux/sctp.h:675:30: warning: nested flexible array This warning is reported if a structure having a flexible array member is included by other structures. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-20mac80211: use the new drop reasons infrastructureJohannes Berg1-0/+12
It can be really hard to analyse or debug why packets are going missing in mac80211, so add the needed infrastructure to use use the new per-subsystem drop reasons. We actually use two drop reason subsystems here because of the different handling of frames that are dropped but still go to monitor for old versions of hostapd, and those that are just completely unusable (e.g. crypto failed.) Annotate a few reasons here just to illustrate this, we'll need to go through and annotate more of them later. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20net: extend drop reasons for multiple subsystemsJohannes Berg2-4/+41
Extend drop reasons to make them usable by subsystems other than core by reserving the high 16 bits for a new subsystem ID, of which 0 of course is used for the existing reasons immediately. To still be able to have string reasons, restructure that code a bit to make the loopup under RCU, the only user of this (right now) is drop_monitor. Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20net: move dropreason.h to dropreason-core.hJohannes Berg2-4/+5
This will, after the next patch, hold only the core drop reasons and minimal infrastructure. Fix a small kernel-doc issue while at it, to avoid the move triggering a checker. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6Mahesh Bandewar1-0/+1
ICMPv6 error packets are not sent to the anycast destinations and this prevents things like traceroute from working. So create a setting similar to ECHO when dealing with Anycast sources (icmpv6_echo_ignore_anycast). Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20230419013238.2691167-1-maheshb@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20flow_dissector: Address kdoc warningsSimon Horman1-18/+20
Address a number of warnings flagged by ./scripts/kernel-doc -none include/net/flow_dissector.h include/net/flow_dissector.h:23: warning: Function parameter or member 'addr_type' not described in 'flow_dissector_key_control' include/net/flow_dissector.h:23: warning: Function parameter or member 'flags' not described in 'flow_dissector_key_control' include/net/flow_dissector.h:46: warning: Function parameter or member 'padding' not described in 'flow_dissector_key_basic' include/net/flow_dissector.h:145: warning: Function parameter or member 'tipckey' not described in 'flow_dissector_key_addrs' include/net/flow_dissector.h:157: warning: cannot understand function prototype: 'struct flow_dissector_key_arp ' include/net/flow_dissector.h:171: warning: cannot understand function prototype: 'struct flow_dissector_key_ports ' include/net/flow_dissector.h:203: warning: cannot understand function prototype: 'struct flow_dissector_key_icmp ' Also improve indentation on adjacent lines to those changed to address the above. No functional changes intended. Signed-off-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230419-flow-dissector-kdoc-v1-1-1aa0cca1118b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20page_pool: unlink from napi during destroyJakub Kicinski1-0/+5
Jesper points out that we must prevent recycling into cache after page_pool_destroy() is called, because page_pool_destroy() is not synchronized with recycling (some pages may still be outstanding when destroy() gets called). I assumed this will not happen because NAPI can't be scheduled if its page pool is being destroyed. But I missed the fact that NAPI may get reused. For instance when user changes ring configuration driver may allocate a new page pool, stop NAPI, swap, start NAPI, and then destroy the old pool. The NAPI is running so old page pool will think it can recycle to the cache, but the consumer at that point is the destroy() path, not NAPI. To avoid extra synchronization let the drivers do "unlinking" during the "swap" stage while NAPI is indeed disabled. Fixes: 8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI") Reported-by: Jesper Dangaard Brouer <jbrouer@redhat.com> Link: https://lore.kernel.org/all/e8df2654-6a5b-3c92-489d-2fe5e444135f@redhat.com/ Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20230419182006.719923-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+4
Adjacent changes: net/mptcp/protocol.h 63740448a32e ("mptcp: fix accept vs worker race") 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19net/handshake: Add a kernel API for requesting a TLSv1.3 handshakeChuck Lever1-0/+43
To enable kernel consumers of TLS to request a TLS handshake, add support to net/handshake/ to request a handshake upcall. This patch also acts as a template for adding handshake upcall support for other kernel transport layer security providers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19net: skbuff: hide wifi_acked when CONFIG_WIRELESS not setJakub Kicinski1-1/+1
Datacenter kernel builds will very likely not include WIRELESS, so let them shave 2 bits off the skb by hiding the wifi fields. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19netfilter: conntrack: fix wrong ct->timeout valueTzung-Bi Shih1-1/+5
(struct nf_conn)->timeout is an interval before the conntrack confirmed. After confirmed, it becomes a timestamp. It is observed that timeout of an unconfirmed conntrack: - Set by calling ctnetlink_change_timeout(). As a result, `nfct_time_stamp` was wrongly added to `ct->timeout` twice. - Get by calling ctnetlink_dump_timeout(). As a result, `nfct_time_stamp` was wrongly subtracted. Call Trace: <TASK> dump_stack_lvl ctnetlink_dump_timeout __ctnetlink_glue_build ctnetlink_glue_build __nfqnl_enqueue_packet nf_queue nf_hook_slow ip_mc_output ? __pfx_ip_finish_output ip_send_skb ? __pfx_dst_output udp_send_skb udp_sendmsg ? __pfx_ip_generic_getfrag sock_sendmsg Separate the 2 cases in: - Setting `ct->timeout` in __nf_ct_set_timeout(). - Getting `ct->timeout` in ctnetlink_dump_timeout(). Pablo appends: Update ctnetlink to set up the timeout _after_ the IPS_CONFIRMED flag is set on, otherwise conntrack creation via ctnetlink breaks. Note that the problem described in this patch occurs since the introduction of the nfnetlink_queue conntrack support, select a sufficiently old Fixes: tag for -stable kernel to pick up this fix. Fixes: a4b4766c3ceb ("netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info") Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski1-0/+4
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Unbreak br_netfilter physdev match support, from Florian Westphal. 2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian. 3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal. 4) Fix validation of catch-all set element. 5) Tighten requirements for catch-all set elements. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements netfilter: nf_tables: validate catch-all set elements netfilter: nf_tables: fix ifdef to also consider nf_tables=m netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT netfilter: br_netfilter: fix recent physdev match breakage ==================== Link: https://lore.kernel.org/r/20230418145048.67270-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18wifi: mac80211: remove ieee80211_tx_status_8023Felix Fietkau1-20/+0
It is unused and should not be used. In order to avoid limitations in 4-address mode, the driver should always use ieee80211_tx_status_ext for 802.3 frames with a valid sta pointer. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230417133751.79160-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-04-18net: add macro netif_subqueue_completed_wakeHeiner Kallweit1-0/+10
Add netif_subqueue_completed_wake, complementing the subqueue versions netif_subqueue_try_stop and netif_subqueue_maybe_stop. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-18netfilter: nf_tables: validate catch-all set elementsPablo Neira Ayuso1-0/+4
catch-all set element might jump/goto to chain that uses expressions that require validation. Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-17sctp: delete the obsolete code for the host name address paramXin Long1-1/+0
In the latest RFC9260, the Host Name Address param has been deprecated. For INIT chunk: Note 3: An INIT chunk MUST NOT contain the Host Name Address parameter. The receiver of an INIT chunk containing a Host Name Address parameter MUST send an ABORT chunk and MAY include an "Unresolvable Address" error cause. For Supported Address Types: The value indicating the Host Name Address parameter MUST NOT be used when sending this parameter and MUST be ignored when receiving this parameter. Currently Linux SCTP doesn't really support Host Name Address param, but only saves some flag and print debug info, which actually won't even be triggered due to the verification in sctp_verify_param(). This patch is to delete those dead code. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14page_pool: allow caching from safely localized NAPIJakub Kicinski1-1/+2
Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14net: mana: Add support for jumbo frameHaiyang Zhang2-0/+18
During probe, get the hardware-allowed max MTU by querying the device configuration. Users can select MTU up to the device limit. When XDP is in use, limit MTU settings so the buffer size is within one page. And, when MTU is set to a too large value, XDP is not allowed to run. Also, to prevent changing MTU fails, and leaves the NIC in a bad state, pre-allocate all buffers before starting the change. So in low memory condition, it will return error, without affecting the NIC. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14net: mana: Enable RX path to handle various MTU sizesHaiyang Zhang1-0/+7
Update RX data path to allocate and use RX queue DMA buffers with proper size based on potentially various MTU sizes. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14net: mana: Refactor RX buffer allocation code to prepare for various MTUHaiyang Zhang1-5/+1
Move out common buffer allocation code from mana_process_rx_cqe() and mana_alloc_rx_wqe() to helper functions. Refactor related variables so they can be changed in one place, and buffer sizes are in sync. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-13net/sched: mqprio: allow per-TC user input of FP adminStatusVladimir Oltean1-0/+1
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each packet priority can be assigned to a "frame preemption status" value of either "express" or "preemptible". Express priorities are transmitted by the local device through the eMAC, and preemptible priorities through the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge layer). The FP adminStatus is defined per packet priority, but 802.1Q clause 12.30.1.1.1 framePreemptionAdminStatus also says that: | Priorities that all map to the same traffic class should be | constrained to use the same value of preemption status. It is impossible to ignore the cognitive dissonance in the standard here, because it practically means that the FP adminStatus only takes distinct values per traffic class, even though it is defined per priority. I can see no valid use case which is prevented by having the kernel take the FP adminStatus as input per traffic class (what we do here). In addition, this also enforces the above constraint by construction. User space network managers which wish to expose FP adminStatus per priority are free to do so; they must only observe the prio_tc_map of the netdev (which presumably is also under their control, when constructing the mqprio netlink attributes). The reason for configuring frame preemption as a property of the Qdisc layer is that the information about "preemptible TCs" is closest to the place which handles the num_tc and prio_tc_map of the netdev. If the UAPI would have been any other layer, it would be unclear what to do with the FP information when num_tc collapses to 0. A key assumption is that only mqprio/taprio change the num_tc and prio_tc_map of the netdev. Not sure if that's a great assumption to make. Having FP in tc-mqprio can be seen as an implementation of the use case defined in 802.1Q Annex S.2 "Preemption used in isolation". There will be a separate implementation of FP in tc-taprio, for the other use cases. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13net/sched: pass netlink extack to mqprio and taprio offloadVladimir Oltean1-0/+2
With the multiplexed ndo_setup_tc() model which lacks a first-class struct netlink_ext_ack * argument, the only way to pass the netlink extended ACK message down to the device driver is to embed it within the offload structure. Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload. Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload structure, and since device drivers might effectively reuse their mqprio implementation for the mqprio portion of taprio, we make taprio set the extack in both offload structures to point at the same netlink extack message. In fact, the taprio handling is a bit more tricky, for 2 reasons. First is because the offload structure has a longer lifetime than the extack structure. The driver is supposed to populate the extack synchronously from ndo_setup_tc() and leave it alone afterwards. To not have any use-after-free surprises, we zero out the extack pointer when we leave taprio_enable_offload(). The second reason is because taprio does overwrite the extack message on ndo_setup_tc() error. We need to switch to the weak form of setting an extack message, which preserves a potential message set by the driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Daniel Borkmann says:Jakub Kicinski5-48/+22
==================== pull-request: bpf-next 2023-04-13 We've added 260 non-merge commits during the last 36 day(s) which contain a total of 356 files changed, 21786 insertions(+), 11275 deletions(-). The main changes are: 1) Rework BPF verifier log behavior and implement it as a rotating log by default with the option to retain old-style fixed log behavior, from Andrii Nakryiko. 2) Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params, from Christian Ehrig. 3) Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton, from Alexei Starovoitov. 4) Optimize hashmap lookups when key size is multiple of 4, from Anton Protopopov. 5) Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps, from David Vernet. 6) Add support for stashing local BPF kptr into a map value via bpf_kptr_xchg(). This is useful e.g. for rbtree node creation for new cgroups, from Dave Marchevsky. 7) Fix BTF handling of is_int_ptr to skip modifiers to work around tracing issues where a program cannot be attached, from Feng Zhou. 8) Migrate a big portion of test_verifier unit tests over to test_progs -a verifier_* via inline asm to ease {read,debug}ability, from Eduard Zingerman. 9) Several updates to the instruction-set.rst documentation which is subject to future IETF standardization (https://lwn.net/Articles/926882/), from Dave Thaler. 10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register known bits information propagation, from Daniel Borkmann. 11) Add skb bitfield compaction work related to BPF with the overall goal to make more of the sk_buff bits optional, from Jakub Kicinski. 12) BPF selftest cleanups for build id extraction which stand on its own from the upcoming integration work of build id into struct file object, from Jiri Olsa. 13) Add fixes and optimizations for xsk descriptor validation and several selftest improvements for xsk sockets, from Kal Conley. 14) Add BPF links for struct_ops and enable switching implementations of BPF TCP cong-ctls under a given name by replacing backing struct_ops map, from Kui-Feng Lee. 15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable offset stack read as earlier Spectre checks cover this, from Luis Gerhorst. 16) Fix issues in copy_from_user_nofault() for BPF and other tracers to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner and Alexei Starovoitov. 17) Add --json-summary option to test_progs in order for CI tooling to ease parsing of test results, from Manu Bretelle. 18) Batch of improvements and refactoring to prep for upcoming bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator, from Martin KaFai Lau. 19) Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations, from Quentin Monnet. 20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting the module name from BTF of the target and searching kallsyms of the correct module, from Viktor Malik. 21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better detect whether in particular jmp32 branches are taken, from Yonghong Song. 22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock. A built-in cc or one from a kernel module is already able to write to app_limited, from Yixin Shen. Conflicts: Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") 0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ include/net/ip_tunnels.h bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts") ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices") https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/ net/bpf/test_run.c e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") 294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES") https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/ ==================== Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-2/+54
Conflicts: tools/testing/selftests/net/config 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash typeJesper Dangaard Brouer1-0/+2
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type via mapping table. The mlx5 hardware can also identify and RSS hash IPSEC. This indicate hash includes SPI (Security Parameters Index) as part of IPSEC hash. Extend xdp core enum xdp_rss_hash_type with IPSEC hash type. Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13xdp: rss hash types representationJesper Dangaard Brouer1-0/+45
The RSS hash type specifies what portion of packet data NIC hardware used when calculating RSS hash value. The RSS types are focused on Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4 primarily TCP vs UDP, but some hardware supports SCTP. Hardware RSS types are differently encoded for each hardware NIC. Most hardware represent RSS hash type as a number. Determining L3 vs L4 often requires a mapping table as there often isn't a pattern or sorting according to ISO layer. The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that contains both BITs for the L3/L4 types, and combinations to be used by drivers for their mapping tables. The enum xdp_rss_type_bits get exposed to BPF via BTF, and it is up to the BPF-programmer to match using these defines. This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding a pointer value argument for provide the RSS hash type. Change signature for all xmo_rx_hash calls in drivers to make it compile. The RSS type implementations for each driver comes as separate patches. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13wifi: mac80211: add flush_sta methodJohannes Berg1-0/+6
Some drivers like iwlwifi might have per-STA queues, so we may want to flush/drop just those queues rather than all when removing a station. Add a separate method for that. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-04-12bpf,fou: Add bpf_skb_{set,get}_fou_encap kfuncsChristian Ehrig1-0/+2
Add two new kfuncs that allow a BPF tc-hook, installed on an ipip device in collect-metadata mode, to control FOU encap parameters on a per-packet level. The set of kfuncs is registered with the fou module. The bpf_skb_set_fou_encap kfunc is supposed to be used in tandem and after a successful call to the bpf_skb_set_tunnel_key bpf-helper. UDP source and destination ports can be controlled by passing a struct bpf_fou_encap. A source port of zero will auto-assign a source port. enum bpf_fou_encap_type is used to specify if the egress path should FOU or GUE encap the packet. On the ingress path bpf_skb_get_fou_encap can be used to read UDP source and destination ports from the receiver's point of view and allows for packet multiplexing across different destination ports within a single BPF program and ipip device. Signed-off-by: Christian Ehrig <cehrig@cloudflare.com> Link: https://lore.kernel.org/r/e17c94a646b63e78ce0dbf3f04b2c33dc948a32d.1680874078.git.cehrig@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devicesChristian Ehrig1-13/+15
Today ipip devices in collect-metadata mode don't allow for sending FOU or GUE encapsulated packets. This patch lifts the restriction by adding a struct ip_tunnel_encap to the tunnel metadata. On the egress path, the members of this struct can be set by the bpf_skb_set_fou_encap kfunc via a BPF tc-hook. Instead of dropping packets wishing to use additional UDP encapsulation, ip_md_tunnel_xmit now evaluates the contents of this struct and adds the corresponding FOU or GUE header. Furthermore, it is making sure that additional header bytes are taken into account for PMTU discovery. On the ingress path, an ipip device in collect-metadata mode will fill this struct and a BPF tc-hook can obtain the information via a call to the bpf_skb_get_fou_encap kfunc. The minor change to ip_tunnel_encap, which now takes a pointer to struct ip_tunnel_encap instead of struct ip_tunnel, allows us to control FOU encap type and parameters on a per packet-level. Signed-off-by: Christian Ehrig <cehrig@cloudflare.com> Link: https://lore.kernel.org/r/cfea47de655d0f870248abf725932f851b53960a.1680874078.git.cehrig@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-10net: piggy back on the memory barrier in bql when waking queuesJakub Kicinski1-10/+29
Drivers call netdev_tx_completed_queue() right before netif_txq_maybe_wake(). If BQL is enabled netdev_tx_completed_queue() should issue a memory barrier, so we can depend on that separating the stop check from the consumer index update, instead of adding another barrier in netif_txq_maybe_wake(). This matters more than the barriers on the xmit path, because the wake condition is almost always true. So we issue the consumer side barrier often. Wrap netdev_tx_completed_queue() in a local helper to issue the barrier even if BQL is disabled. Keep the same semantics as netdev_tx_completed_queue() (barrier only if bytes != 0) to make it clear that the barrier is conditional. Plus since macro gets pkt/byte counts as arguments now - we can skip waking if there were no packets completed. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10net: provide macros for commonly copied lockless queue stop/wake codeJakub Kicinski1-0/+144
A lot of drivers follow the same scheme to stop / start queues without introducing locks between xmit and NAPI tx completions. I'm guessing they all copy'n'paste each other's code. The original code dates back all the way to e1000 and Linux 2.6.19. Smaller drivers shy away from the scheme and introduce a lock which may cause deadlocks in netpoll. Provide macros which encapsulate the necessary logic. The macros do not prevent false wake ups, the extra barrier required to close that race is not worth it. See discussion in: https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10Bluetooth: Fix printing errors if LE Connection times outLuiz Augusto von Dentz1-0/+1
This fixes errors like bellow when LE Connection times out since that is actually not a controller error: Bluetooth: hci0: Opcode 0x200d failed: -110 Bluetooth: hci0: request failed to create LE connection: err -110 Instead the code shall properly detect if -ETIMEDOUT is returned and send HCI_OP_LE_CREATE_CONN_CANCEL to give up on the connection. Link: https://github.com/bluez/bluez/issues/340 Fixes: 8e8b92ee60de ("Bluetooth: hci_sync: Add hci_le_create_conn_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-09net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stubVladimir Oltean1-0/+48
There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master"), due to a desire to convert DSA's attempt to deny TX timestamping on a DSA master to something that doesn't block the kernel-wide API conversion from ndo_eth_ioctl() to ndo_hwtstamp_set(). What was required was a mechanism that did not depend on ndo_eth_ioctl(), and what was provided was a mechanism that did not depend on ndo_eth_ioctl(), while at the same time introducing something that wasn't absolutely necessary - a new netdev notifier. There have been objections from Jakub Kicinski that using notifiers in general when they are not absolutely necessary creates complications to the control flow and difficulties to maintainers who look at the code. So there is a desire to not use notifiers. In addition to that, the notifier chain gets called even if there is no DSA in the system and no one is interested in applying any restriction. Take the model of udp_tunnel_nic_ops and introduce a stub mechanism, through which net/core/dev_ioctl.c can call into DSA even when CONFIG_NET_DSA=m. Compared to the code that existed prior to the notifier conversion, aka what was added in commits: - 4cfab3566710 ("net: dsa: Add wrappers for overloaded ndo_ops") - 3369afba1e46 ("net: Call into DSA netdevice_ops wrappers") this is different because we are not overloading any struct net_device_ops of the DSA master anymore, but rather, we are exposing a rather specific functionality which is orthogonal to which API is used to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set(). Also, what is similar is that both approaches use function pointers to get from built-in code to DSA. There is no point in replicating the function pointers towards __dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr). Instead, it is sufficient to introduce a singleton struct dsa_stubs, built into the kernel, which contains a single function pointer to __dsa_master_hwtstamp_validate(). I find this approach preferable to what we had originally, because dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any attempts to add any data encapsulation and hide DSA data structures from the outside world. Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-07bonding: fix ns validation on backup slavesHangbin Liu1-2/+6
When arp_validate is set to 2, 3, or 6, validation is performed for backup slaves as well. As stated in the bond documentation, validation involves checking the broadcast ARP request sent out via the active slave. This helps determine which slaves are more likely to function in the event of an active slave failure. However, when the target is an IPv6 address, the NS message sent from the active interface is not checked on backup slaves. Additionally, based on the bond_arp_rcv() rule b, we must reverse the saddr and daddr when checking the NS message. Note that when checking the NS message, the destination address is a multicast address. Therefore, we must convert the target address to solicited multicast in the bond_get_targets_ip6() function. Prior to the fix, the backup slaves had a mii status of "down", but after the fix, all of the slaves' mii status was updated to "UP". Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets") Reviewed-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+2
Conflicts: drivers/net/ethernet/google/gve/gve.h 3ce934558097 ("gve: Secure enough bytes in the first TX desc for all TCP pkts") 75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format") https://lore.kernel.org/all/20230406104927.45d176f5@canb.auug.org.au/ https://lore.kernel.org/all/c5872985-1a95-0bc8-9dcc-b6f23b439e9d@tessares.net/ Adjacent changes: net/can/isotp.c 051737439eae ("can: isotp: fix race between isotp_sendsmg() and isotp_release()") 96d1c81e6a04 ("can: isotp: add module parameter for maximum pdu size") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-06xsk: Fix unaligned descriptor validationKal Conley1-7/+2
Make sure unaligned descriptors that straddle the end of the UMEM are considered invalid. Currently, descriptor validation is broken for zero-copy mode which only checks descriptors at page granularity. For example, descriptors in zero-copy mode that overrun the end of the UMEM but not a page boundary are (incorrectly) considered valid. The UMEM boundary check needs to happen before the page boundary and contiguity checks in xp_desc_crosses_non_contig_pg(). Do this check in xp_unaligned_validate_desc() instead like xp_check_unaligned() already does. Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API") Signed-off-by: Kal Conley <kal.conley@dectris.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230405235920.7305-2-kal.conley@dectris.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-04-04raw: Fix NULL deref in raw_get_next().Kuniyuki Iwash