summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)AuthorFilesLines
2024-11-07netfilter: nf_tables: wait for rcu grace period on net_device removalPablo Neira Ayuso1-7/+34
8c873e219970 ("netfilter: core: free hooks with call_rcu") removed synchronize_net() call when unregistering basechain hook, however, net_device removal event handler for the NFPROTO_NETDEV was not updated to wait for RCU grace period. Note that 835b803377f5 ("netfilter: nf_tables_netdev: unregister hooks on net_device removal") does not remove basechain rules on device removal, I was hinted to remove rules on net_device removal later, see 5ebe0b0eec9d ("netfilter: nf_tables: destroy basechain and rules on netdevice removal"). Although NETDEV_UNREGISTER event is guaranteed to be handled after synchronize_net() call, this path needs to wait for rcu grace period via rcu callback to release basechain hooks if netns is alive because an ongoing netlink dump could be in progress (sockets hold a reference on the netns). Note that nf_tables_pre_exit_net() unregisters and releases basechain hooks but it is possible to see NETDEV_UNREGISTER at a later stage in the netns exit path, eg. veth peer device in another netns: cleanup_net() default_device_exit_batch() unregister_netdevice_many_notify() notifier_call_chain() nf_tables_netdev_event() __nft_release_basechain() In this particular case, same rule of thumb applies: if netns is alive, then wait for rcu grace period because netlink dump in the other netns could be in progress. Otherwise, if the other netns is going away then no netlink dump can be in progress and basechain hooks can be released inmediately. While at it, turn WARN_ON() into WARN_ON_ONCE() for the basechain validation, which should not ever happen. Fixes: 835b803377f5 ("netfilter: nf_tables_netdev: unregister hooks on net_device removal") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-31netfilter: nft_payload: sanitize offset and length before calling skb_checksum()Pablo Neira Ayuso1-0/+3
If access to offset + length is larger than the skbuff length, then skb_checksum() triggers BUG_ON(). skb_checksum() internally subtracts the length parameter while iterating over skbuff, BUG_ON(len) at the end of it checks that the expected length to be included in the checksum calculation is fully consumed. Fixes: 7ec3f7b47b8d ("netfilter: nft_payload: add packet mangling support") Reported-by: Slavin Liu <slavin-ayu@qq.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-30netfilter: Fix use-after-free in get_info()Dong Chenchen1-1/+1
ip6table_nat module unload has refcnt warning for UAF. call trace is: WARNING: CPU: 1 PID: 379 at kernel/module/main.c:853 module_put+0x6f/0x80 Modules linked in: ip6table_nat(-) CPU: 1 UID: 0 PID: 379 Comm: ip6tables Not tainted 6.12.0-rc4-00047-gc2ee9f594da8-dirty #205 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:module_put+0x6f/0x80 Call Trace: <TASK> get_info+0x128/0x180 do_ip6t_get_ctl+0x6a/0x430 nf_getsockopt+0x46/0x80 ipv6_getsockopt+0xb9/0x100 rawv6_getsockopt+0x42/0x190 do_sock_getsockopt+0xaa/0x180 __sys_getsockopt+0x70/0xc0 __x64_sys_getsockopt+0x20/0x30 do_syscall_64+0xa2/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Concurrent execution of module unload and get_info() trigered the warning. The root cause is as follows: cpu0 cpu1 module_exit //mod->state = MODULE_STATE_GOING ip6table_nat_exit xt_unregister_template kfree(t) //removed from templ_list getinfo() t = xt_find_table_lock list_for_each_entry(tmpl, &xt_templates[af]...) if (strcmp(tmpl->name, name)) continue; //table not found try_module_get list_for_each_entry(t, &xt_net->tables[af]...) return t; //not get refcnt module_put(t->me) //uaf unregister_pernet_subsys //remove table from xt_net list While xt_table module was going away and has been removed from xt_templates list, we couldnt get refcnt of xt_table->me. Check module in xt_net->tables list re-traversal to fix it. Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-24Merge tag 'net-6.12-rc5' of ↵Linus Torvalds4-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfiler, xfrm and bluetooth. Oddly this includes a fix for a posix clock regression; in our previous PR we included a change there as a pre-requisite for networking one. That fix proved to be buggy and requires the follow-up included here. Thomas suggested we should send it, given we sent the buggy patch. Current release - regressions: - posix-clock: Fix unbalanced locking in pc_clock_settime() - netfilter: fix typo causing some targets not to load on IPv6 Current release - new code bugs: - xfrm: policy: remove last remnants of pernet inexact list Previous releases - regressions: - core: fix races in netdev_tx_sent_queue()/dev_watchdog() - bluetooth: fix UAF on sco_sock_timeout - eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event - eth: usbnet: fix name regression - eth: be2net: fix potential memory leak in be_xmit() - eth: plip: fix transmit path breakage Previous releases - always broken: - sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers - netfilter: bpf: must hold reference on net namespace - eth: virtio_net: fix integer overflow in stats - eth: bnxt_en: replace ptp_lock with irqsave variant - eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx() Misc: - MAINTAINERS: add Simon as an official reviewer" * tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits) net: dsa: mv88e6xxx: support 4000ps cycle counter period net: dsa: mv88e6xxx: read cycle counter period from hardware net: dsa: mv88e6xxx: group cycle counter coefficients net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x Bluetooth: ISO: Fix UAF on iso_sock_timeout Bluetooth: SCO: Fix UAF on sco_sock_timeout Bluetooth: hci_core: Disable works on hci_unregister_dev posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime() r8169: avoid unsolicited interrupts net: sched: use RCU read-side critical section in taprio_dump() net: sched: fix use-after-free in taprio_change() net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers net: usb: usbnet: fix name regression mlxsw: spectrum_router: fix xa_store() error checking virtio_net: fix integer overflow in stats net: fix races in netdev_tx_sent_queue()/dev_watchdog() net: wwan: fix global oob in wwan_rtnl_policy netfilter: xtables: fix typo causing some targets not to load on IPv6 ...
2024-10-21netfilter: xtables: fix typo causing some targets not to load on IPv6Pablo Neira Ayuso3-2/+3
- There is no NFPROTO_IPV6 family for mark and NFLOG. - TRACE is also missing module autoload with NFPROTO_IPV6. This results in ip6tables failing to restore a ruleset. This issue has been reported by several users providing incomplete patches. Very similar to Ilya Katsnelson's patch including a missing chunk in the TRACE extension. Fixes: 0bfcb7b71e73 ("netfilter: xtables: avoid NFPROTO_UNSPEC where needed") Reported-by: Ignat Korchagin <ignat@cloudflare.com> Reported-by: Ilya Katsnelson <me@0upti.me> Reported-by: Krzysztof Olędzki <ole@ans.pl> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-18Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds1-1/+2
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to not affect subreg_def marks in its range propagation (Eduard Zingerman) - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx (Dimitar Kanaliev) - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition (Daniel Borkmann) - Fix a NULL pointer dereference in BPF devmap due to missing rxq information (Florian Kauer) - Fix a memory leak in bpf_core_apply (Jiri Olsa) - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs (Hou Tao) - Fix build ID fetching where memory areas backing the file were created with memfd_secret (Andrii Nakryiko) - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid (Jordan Rome) - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks (Michal Luczaj) - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri) - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall (Pu Lehui) - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved (Thomas Weißschuh) - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned (Toke Høiland-Jørgensen) - Fix a BPF selftest compilation error in cgroup-related tests with musl libc (Tony Ambardar) - Several fixes to BPF link info dumps to fill missing fields (Tyrone Wu) - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called (Simon Sundberg) - Ensure that internal and user-facing bpf_redirect flags don't overlap (Toke Høiland-Jørgensen) - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van Riel) - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT (Wander Lairson Costa) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits) lib/buildid: Handle memfd_secret() files in build_id_parse() selftests/bpf: Add test case for delta propagation bpf: Fix print_reg_state's constant scalar dump bpf: Fix incorrect delta propagation between linked registers bpf: Properly test iter/task tid filtering bpf: Fix iter/task tid filtering riscv, bpf: Make BPF_CMPXCHG fully ordered bpf, vsock: Drop static vsock_bpf_prot initialization vsock: Update msg_count on read_skb() vsock: Update rx_bytes on read_skb() bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock selftests/bpf: Add asserts for netfilter link info bpf: Fix link info netfilter flags to populate defrag flag selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx() selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx() bpf: Fix truncation bug in coerce_reg_to_size_sx() selftests/bpf: Assert link info uprobe_multi count & path_size if unset bpf: Fix unpopulated path_size when uprobe_multi fields unset selftests/bpf: Fix cross-compiling urandom_read selftests/bpf: Add test for kfunc module order ...
2024-10-17netfilter: bpf: must hold reference on net namespaceFlorian Westphal1-0/+4
BUG: KASAN: slab-use-after-free in __nf_unregister_net_hook+0x640/0x6b0 Read of size 8 at addr ffff8880106fe400 by task repro/72= bpf_nf_link_release+0xda/0x1e0 bpf_link_free+0x139/0x2d0 bpf_link_release+0x68/0x80 __fput+0x414/0xb60 Eric says: It seems that bpf was able to defer the __nf_unregister_net_hook() after exit()/close() time. Perhaps a netns reference is missing, because the netns has been dismantled/freed already. bpf_nf_link_attach() does : link->net = net; But I do not see a reference being taken on net. Add such a reference and release it after hook unreg. Note that I was unable to get syzbot reproducer to work, so I do not know if this resolves this splat. Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs") Diagnosed-by: Eric Dumazet <edumazet@google.com> Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-16bpf: Fix link info netfilter flags to populate defrag flagTyrone Wu1-1/+2
This fix correctly populates the `bpf_link_info.netfilter.flags` field when user passes the `BPF_F_NETFILTER_IP_DEFRAG` flag. Fixes: 91721c2d02d3 ("netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link") Signed-off-by: Tyrone Wu <wudevelops@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Florian Westphal <fw@strlen.de> Cc: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/bpf/20241011193252.178997-1-wudevelops@gmail.com
2024-10-09netfilter: xtables: avoid NFPROTO_UNSPEC where neededFlorian Westphal16-165/+422
syzbot managed to call xt_cluster match via ebtables: WARNING: CPU: 0 PID: 11 at net/netfilter/xt_cluster.c:72 xt_cluster_mt+0x196/0x780 [..] ebt_do_table+0x174b/0x2a40 Module registers to NFPROTO_UNSPEC, but it assumes ipv4/ipv6 packet processing. As this is only useful to restrict locally terminating TCP/UDP traffic, register this for ipv4 and ipv6 family only. Pablo points out that this is a general issue, direct users of the set/getsockopt interface can call into targets/matches that were only intended for use with ip(6)tables. Check all UNSPEC matches and targets for similar issues: - matches and targets are fine except if they assume skb_network_header() is valid -- this is only true when called from inet layer: ip(6) stack pulls the ip/ipv6 header into linear data area. - targets that return XT_CONTINUE or other xtables verdicts must be restricted too, they are incompatbile with the ebtables traverser, e.g. EBT_CONTINUE is a completely different value than XT_CONTINUE. Most matches/targets are changed to register for NFPROTO_IPV4/IPV6, as they are provided for use by ip(6)tables. The MARK target is also used by arptables, so register for NFPROTO_ARP too. While at it, bail out if connbytes fails to enable the corresponding conntrack family. This change passes the selftests in iptables.git. Reported-by: syzbot+256c348558aa5cf611a9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netfilter-devel/66fec2e2.050a0220.9ec68.0047.GAE@google.com/ Fixes: 0269ea493734 ("netfilter: xtables: add cluster match") Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro6-6/+6
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-26netfilter: nfnetlink_queue: remove old clash resolution logicFlorian Westphal2-86/+0
For historical reasons there are two clash resolution spots in netfilter, one in nfnetlink_queue and one in conntrack core. nfnetlink_queue one was added first: If a colliding entry is found, NAT NAT transformation is reversed by calling nat engine again with altered tuple. See commit 368982cd7d1b ("netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks") for details. One problem is that nf_reroute() won't take an action if the queueing doesn't occur in the OUTPUT hook, i.e. when queueing in forward or postrouting, packet will be sent via the wrong path. Another problem is that the scenario addressed (2nd UDP packet sent with identical addresses while first packet is still being processed) can also occur without any nfqueue involvement due to threaded resolvers doing A and AAAA requests back-to-back. This lead us to add clash resolution logic to the conntrack core, see commit 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries"). Instead of fixing the nfqueue based logic, lets remove it and let conntrack core handle this instead. Retain the ->update hook for sake of nfqueue based conntrack helpers. We could axe this hook completely but we'd have to split confirm and helper logic again, see commit ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again"). This SHOULD NOT be backported to kernels earlier than v5.6; they lack adequate clash resolution handling. Patch was originally written by Pablo Neira Ayuso. Reported-by: Antonio Ojea <aojea@google.com> Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1766 Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Antonio Ojea <aojea@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_tables: missing objects with no memcg accountingPablo Neira Ayuso7-15/+17
Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for memory accounting, update them. This includes: - catchall elements - compat match large info area - log prefix - meta secctx - numgen counters - pipapo set backend datastructure - tunnel private objects Fixes: 33758c891479 ("memcg: enable accounting for nft objects") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_tables: use rcu chain hook list iterator from netlink dump pathPablo Neira Ayuso1-1/+1
Lockless iteration over hook list is possible from netlink dump path, use rcu variant to iterate over the hook list as is done with flowtable hooks. Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain") Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: ctnetlink: compile ctnetlink_label_size with ↵Simon Horman1-5/+2
CONFIG_NF_CONNTRACK_EVENTS Only provide ctnetlink_label_size when it is used, which is when CONFIG_NF_CONNTRACK_EVENTS is configured. Flagged by clang-18 W=1 builds as: .../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function] 385 | static inline int ctnetlink_label_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~ The condition on CONFIG_NF_CONNTRACK_LABELS being removed by this patch guards compilation of non-trivial implementations of ctnetlink_dump_labels() and ctnetlink_label_size(). However, this is not necessary as each of these functions will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined as each function starts with the equivalent of: struct nf_conn_labels *labels = nf_ct_labels_find(ct); if (!labels) return 0; And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS is not enabled. So I believe that the compiler optimises the code away in such cases anyway. Found by inspection. Compile tested only. Originally splitted in two patches, Pablo Neira Ayuso collapsed them and added Fixes: tag. Fixes: 0ceabd83875b ("netfilter: ctnetlink: deliver labels to userspace") Link: https://lore.kernel.org/netfilter-devel/20240909151712.GZ2097826@kernel.org/ Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_tables: Keep deleted flowtable hooks until after RCUPhil Sutter1-1/+1
Documentation of list_del_rcu() warns callers to not immediately free the deleted list item. While it seems not necessary to use the RCU-variant of list_del() here in the first place, doing so seems to require calling kfree_rcu() on the deleted item as well. Fixes: 3f0465a9ef02 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: ctnetlink: Guard possible unused functionsAndy Shevchenko1-1/+1
Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n and CONFIG_NF_CONNTRACK_EVENTS=n), it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: net/netfilter/nf_conntrack_netlink.c:657:22: error: unused function 'ctnetlink_acct_size' [-Werror,-Wunused-function] 657 | static inline size_t ctnetlink_acct_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:667:19: error: unused function 'ctnetlink_secctx_size' [-Werror,-Wunused-function] 667 | static inline int ctnetlink_secctx_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:683:22: error: unused function 'ctnetlink_timestamp_size' [-Werror,-Wunused-function] 683 | static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~~~~~ Fix this by guarding possible unused functions with ifdeffery. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: conntrack: add clash resolution for reverse collisionsFlorian Westphal1-5/+51
Given existing entry: ORIGIN: a:b -> c:d REPLY: c:d -> a:b And colliding entry: ORIGIN: c:d -> a:b REPLY: a:b -> c:d The colliding ct (and the associated skb) get dropped on insert. Permit this by checking if the colliding entry matches the reply direction. Happens when both ends send packets at same time, both requests are picked up as NEW, rather than NEW for the 'first' and 'ESTABLISHED' for the second packet. This is an esoteric condition, as ruleset must permit NEW connections in either direction and both peers must already have a bidirectional traffic flow at the time conntrack gets enabled. Allow the 'reverse' skb to pass and assign the existing (clashing) entry. While at it, also drop the extra 'dying' check, this is already tested earlier by the calling function. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_nat: don't try nat source port reallocation for reverse dir clashFlorian Westphal1-2/+118
A conntrack entry can be inserted to the connection tracking table if there is no existing entry with an identical tuple in either direction. Example: INITIATOR -> NAT/PAT -> RESPONDER Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite). Then, later, NAT/PAT machine itself also wants to connect to RESPONDER. This will not work if the SNAT done earlier has same IP:PORT source pair. Conntrack table has: ORIGINAL: $IP_INITATOR:$SPORT -> $IP_RESPONDER:$DPORT REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT and new locally originating connection wants: ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT This is handled by the NAT engine which will do a source port reallocation for the locally originating connection that is colliding with an existing tuple by attempting a source port rewrite. This is done even if this new connection attempt did not go through a masquerade/snat rule. There is a rare race condition with connection-less protocols like UDP, where we do the port reallocation even though its not needed. This happens when new packets from the same, pre-existing flow are received in both directions at the exact same time on different CPUs after the conntrack table was flushed (or conntrack becomes active for first time). With strict ordering/single cpu, the first packet creates new ct entry and second packet is resolved as established reply packet. With parallel processing, both packets are picked up as new and both get their own ct entry. In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by NAT engine because a port collision is detected. This change isn't enough to prevent a packet drop later during nf_conntrack_confirm(), the existing clash resolution strategy will not detect such reverse clash case. This is resolved by a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+2
Merge in late fixes to prepare for the 6.12 net-next PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-15netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in ↵Dan Carpenter1-2/+2
nft_socket_cgroup_subtree_level() The cgroup_get_from_path() function never returns NULL, it returns error pointers. Update the error handling to match. Fixes: 7f3287db6543 ("netfilter: nft_socket: make cgroupsv2 matching work with namespaces") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://patch.msgid.link/bbc0c4e0-05cc-4f44-8797-2f4b3920a820@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-7/+49
Cross-merge networking fixes after downstream PR. No conflicts (sort of) and no adjacent changes. This merge reverts commit b3c9e65eb227 ("net: hsr: remove seqnr_lock") from net, as it was superseded by commit 430d67bdcb04 ("net: hsr: Use the seqnr lock for frames received via interlink port.") in net-next. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12net: netfilter: move nf flowtable bpf initialization in ↵Lorenzo Bianconi2-1/+7
nf_flow_table_module_init() Move nf flowtable bpf initialization in nf_flow_table module load routine since nf_flow_table_bpf is part of nf_flow_table module and not nf_flow_table_inet one. This patch allows to avoid the following kernel warning running the reproducer below: $modprobe nf_flow_table_inet $rmmod nf_flow_table_inet $modprobe nf_flow_table_inet modprobe: ERROR: could not insert 'nf_flow_table_inet': Invalid argument [ 184.081501] ------------[ cut here ]------------ [ 184.081527] WARNING: CPU: 0 PID: 1362 at kernel/bpf/btf.c:8206 btf_populate_kfunc_set+0x23c/0x330 [ 184.081550] CPU: 0 UID: 0 PID: 1362 Comm: modprobe Kdump: loaded Not tainted 6.11.0-0.rc5.22.el10.x86_64 #1 [ 184.081553] Hardware name: Red Hat OpenStack Compute, BIOS 1.14.0-1.module+el8.4.0+8855+a9e237a9 04/01/2014 [ 184.081554] RIP: 0010:btf_populate_kfunc_set+0x23c/0x330 [ 184.081558] RSP: 0018:ff22cfb38071fc90 EFLAGS: 00010202 [ 184.081559] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 [ 184.081560] RDX: 000000000000006e RSI: ffffffff95c00000 RDI: ff13805543436350 [ 184.081561] RBP: ffffffffc0e22180 R08: ff13805543410808 R09: 000000000001ec00 [ 184.081562] R10: ff13805541c8113c R11: 0000000000000010 R12: ff13805541b83c00 [ 184.081563] R13: ff13805543410800 R14: 0000000000000001 R15: ffffffffc0e2259a [ 184.081564] FS: 00007fa436c46740(0000) GS:ff1380557ba00000(0000) knlGS:0000000000000000 [ 184.081569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 184.081570] CR2: 000055e7b3187000 CR3: 0000000100c48003 CR4: 0000000000771ef0 [ 184.081571] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 184.081572] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 184.081572] PKRU: 55555554 [ 184.081574] Call Trace: [ 184.081575] <TASK> [ 184.081578] ? show_trace_log_lvl+0x1b0/0x2f0 [ 184.081580] ? show_trace_log_lvl+0x1b0/0x2f0 [ 184.081582] ? __register_btf_kfunc_id_set+0x199/0x200 [ 184.081585] ? btf_populate_kfunc_set+0x23c/0x330 [ 184.081586] ? __warn.cold+0x93/0xed [ 184.081590] ? btf_populate_kfunc_set+0x23c/0x330 [ 184.081592] ? report_bug+0xff/0x140 [ 184.081594] ? handle_bug+0x3a/0x70 [ 184.081596] ? exc_invalid_op+0x17/0x70 [ 184.081597] ? asm_exc_invalid_op+0x1a/0x20 [ 184.081601] ? btf_populate_kfunc_set+0x23c/0x330 [ 184.081602] __register_btf_kfunc_id_set+0x199/0x200 [ 184.081605] ? __pfx_nf_flow_inet_module_init+0x10/0x10 [nf_flow_table_inet] [ 184.081607] do_one_initcall+0x58/0x300 [ 184.081611] do_init_module+0x60/0x230 [ 184.081614] __do_sys_init_module+0x17a/0x1b0 [ 184.081617] do_syscall_64+0x7d/0x160 [ 184.081620] ? __count_memcg_events+0x58/0xf0 [ 184.081623] ? handle_mm_fault+0x234/0x350 [ 184.081626] ? do_user_addr_fault+0x347/0x640 [ 184.081630] ? clear_bhb_loop+0x25/0x80 [ 184.081633] ? clear_bhb_loop+0x25/0x80 [ 184.081634] ? clear_bhb_loop+0x25/0x80 [ 184.081637] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 184.081639] RIP: 0033:0x7fa43652e4ce [ 184.081647] RSP: 002b:00007ffe8213be18 EFLAGS: 00000246 ORIG_RAX: 00000000000000af [ 184.081649] RAX: ffffffffffffffda RBX: 000055e7b3176c20 RCX: 00007fa43652e4ce [ 184.081650] RDX: 000055e7737fde79 RSI: 0000000000003990 RDI: 000055e7b3185380 [ 184.081651] RBP: 000055e7737fde79 R08: 0000000000000007 R09: 000055e7b3179bd0 [ 184.081651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000040000 [ 184.081652] R13: 000055e7b3176fa0 R14: 0000000000000000 R15: 000055e7b3179b80 Fixes: 391bb6594fd3 ("netfilter: Add bpf_xdp_flow_lookup kfunc") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://patch.msgid.link/20240911-nf-flowtable-bpf-modprob-fix-v1-1-f9fc075aafc3@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-12netfilter: nft_socket: make cgroupsv2 matching work with namespacesFlorian Westphal1-3/+38
When running in container environmment, /sys/fs/cgroup/ might not be the real root node of the sk-attached cgroup. Example: In container: % stat /sys//fs/cgroup/ Device: 0,21 Inode: 2214 .. % stat /sys/fs/cgroup/foo Device: 0,21 Inode: 2264 .. The expectation would be for: nft add rule .. socket cgroupv2 level 1 "foo" counter to match traffic from a process that got added to "foo" via "echo $pid > /sys/fs/cgroup/foo/cgroup.procs". However, 'level 3' is needed to make this work. Seen from initial namespace, the complete hierarchy is: % stat /sys/fs/cgroup/system.slice/docker-.../foo Device: 0,21 Inode: 2264 .. i.e. hierarchy is 0 1 2 3 / -> system.slice -> docker-1... -> foo ... but the container doesn't know that its "/" is the "docker-1.." cgroup. Current code will retrieve the 'system.slice' cgroup node and store its kn->id in the destination register, so compare with 2264 ("foo" cgroup id) will not match. Fetch "/" cgroup from ->init() and add its level to the level we try to extract. cgroup root-level is 0 for the init-namespace or the level of the ancestor that is exposed as the cgroup root inside the container. In the above case, cgrp->level of "/" resolved in the container is 2 (docker-1...scope/) and request for 'level 1' will get adjusted to fetch the actual level (3). v2: use CONFIG_SOCK_CGROUP_DATA, eval function depends on it. (kernel test robot) Cc: cgroups@vger.kernel.org Fixes: e0bb96db96f8 ("netfilter: nft_socket: add support for cgroupsv2") Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-12netfilter: nft_socket: fix sk refcount leaksFlorian Westphal1-3/+4
We must put 'sk' reference before returning. Fixes: 039b1f4f24ec ("netfilter: nft_socket: fix erroneous socket assignment") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-09netfilter: nft_flow_offload: Unmask upper DSCP bits in nft_flow_route()Ido Schimmel1-1/+2
Unmask the upper DSCP bits when calling nf_route() which eventually calls ip_route_output_key() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-06Merge tag 'nf-next-24-09-06' of ↵Jakub Kicinski26-148/+165
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: Patch #1 adds ctnetlink support for kernel side filtering for deletions, from Changliang Wu. Patch #2 updates nft_counter support to Use u64_stats_t, from Sebastian Andrzej Siewior. Patch #3 uses kmemdup_array() in all xtables frontends, from Yan Zhen. Patch #4 is a oneliner to use ERR_CAST() in nf_conntrack instead opencoded casting, from Shen Lichuan. Patch #5 removes unused argument in nftables .validate interface, from Florian Westphal. Patch #6 is a oneliner to correct a typo in nftables kdoc, from Simon Horman. Patch #7 fixes missing kdoc in nftables, also from Simon. Patch #8 updates nftables to handle timeout less than CONFIG_HZ. Patch #9 rejects element expiration if timeout is zero, otherwise it is silently ignored. Patch #10 disallows element expiration larger than timeout. Patch #11 removes unnecessary READ_ONCE annotation while mutex is held. Patch #12 adds missing READ_ONCE/WRITE_ONCE annotation in dynset. Patch #13 annotates data-races around element expiration. Patch #14 allocates timeout and expiration in one single set element extension, they are tighly couple, no reason to keep them separated anymore. Patch #15 updates nftables to interpret zero timeout element as never times out. Note that it is already possible to declare sets with elements that never time out but this generalizes to all kind of set with timeouts. Patch #16 supports for element timeout and expiration updates. * tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: set element timeout update support netfilter: nf_tables: zero timeout means element never times out netfilter: nf_tables: consolidate timeout extension for elements netfilter: nf_tables: annotate data-races around element expiration netfilter: nft_dynset: annotate data-races around set timeout netfilter: nf_tables: remove annotation to access set timeout while holding lock netfilter: nf_tables: reject expiration higher than timeout netfilter: nf_tables: reject element expiration with no timeout netfilter: nf_tables: elements with timeout below CONFIG_HZ never expire netfilter: nf_tables: Add missing Kernel doc netfilter: nf_tables: Correct spelling in nf_tables.h netfilter: nf_tables: drop unused 3rd argument from validate callback ops netfilter: conntrack: Convert to use ERR_CAST() netfilter: Use kmemdup_array instead of kmemdup for multiple allocation netfilter: nft_counter: Use u64_stats_t for statistic. netfilter: ctnetlink: support CTA_FILTER for flush ==================== Link: https://patch.msgid.link/20240905232920.5481-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-06net/netfilter: make use of the helper macro LIST_HEAD()Hongbo Li1-3/+1
list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Here we can simplify the code. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://patch.msgid.link/20240904093243.3345012-4-lihongbo22@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03netfilter: nf_tables: set element timeout update supportPablo Neira Ayuso2-5/+44
Store new timeout and expiration in transaction object, use them to update elements from .commit path. Otherwise, discard update if .abort path is exercised. Use update_flags in the transaction to note whether the timeout, expiration, or both need to be updated. Annotate access to timeout extension now that it can be updated while lockless read access is possible. Reject timeout updates on elements with no timeout extension. Element transaction remains in the 96 bytes kmalloc slab on x86_64 after this update. This patch requires ("netfilter: nf_tables: use timestamp to check for set element timeout") to make sure an element does not expire while transaction is ongoing. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: zero timeout means element never times outPablo Neira Ayuso2-17/+25
This patch uses zero as timeout marker for those elements that never expire when the element is created. If userspace provides no timeout for an element, then the default set timeout applies. However, if no default set timeout is specified and timeout flag is set on, then timeout extension is allocated and timeout is set to zero to allow for future updates. Use of zero a never timeout marker has been suggested by Phil Sutter. Note that, in older kernels, it is already possible to define elements that never expire by declaring a set with the set timeout flag set on and no global set timeout, in this case, new element with no explicit timeout never expire do not allocate the timeout extension, hence, they never expire. This approach makes it complicated to accomodate element timeout update, because element extensions do not support reallocations. Therefore, allocate the timeout extension and use the new marker for this case, but do not expose it to userspace to retain backward compatibility in the set listing. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: consolidate timeout extension for elementsPablo Neira Ayuso2-34/+22
Expiration and timeout are stored in separated set element extensions, but they are tightly coupled. Consolidate them in a single extension to simplify and prepare for set element updates. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: annotate data-races around element expirationPablo Neira Ayuso2-2/+2
element expiration can be read-write locklessly, it can be written by dynset and read from netlink dump, add annotation. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nft_dynset: annotate data-races around set timeoutPablo Neira Ayuso1-3/+3
set timeout can be read locklessly while being updated from control plane, add annotation. Fixes: 123b99619cca ("netfilter: nf_tables: honor set timeout and garbage collection updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: remove annotation to access set timeout while holding lockPablo Neira Ayuso1-2/+2
Mutex is held when adding an element, no need for READ_ONCE, remove it. Fixes: 123b99619cca ("netfilter: nf_tables: honor set timeout and garbage collection updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: reject expiration higher than timeoutPablo Neira Ayuso1-0/+3
Report ERANGE to userspace if user specifies an expiration larger than the timeout. Fixes: 8e1102d5a159 ("netfilter: nf_tables: support timeouts larger than 23 days") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: reject element expiration with no timeoutPablo Neira Ayuso1-0/+3
If element timeout is unset and set provides no default timeout, the element expiration is silently ignored, reject this instead to let user know this is unsupported. Also prepare for supporting timeout that never expire, where zero timeout and expiration must be also rejected. Fixes: 8e1102d5a159 ("netfilter: nf_tables: support timeouts larger than 23 days") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: elements with timeout below CONFIG_HZ never expirePablo Neira Ayuso1-1/+1
Element timeout that is below CONFIG_HZ never expires because the timeout extension is not allocated given that nf_msecs_to_jiffies64() returns 0. Set timeout to the minimum value to honor timeout. Fixes: 8e1102d5a159 ("netfilter: nf_tables: support timeouts larger than 23 days") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nf_tables: drop unused 3rd argument from validate callback opsFlorian Westphal21-46/+23
Since commit a654de8fdc18 ("netfilter: nf_tables: fix chain dependency validation") the validate() callback no longer needs the return pointer argument. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: conntrack: Convert to use ERR_CAST()Shen Lichuan1-1/+1
Use the ERR_CAST macro to clearly indicate that this is a pointer to an error value and that a type conversion was performed. Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: Use kmemdup_array instead of kmemdup for multiple allocationYan Zhen1-1/+1
When we are allocating an array, using kmemdup_array() to take care about multiplication and possible overflows. Also it makes auditing the code easier. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: nft_counter: Use u64_stats_t for statistic.Sebastian Andrzej Siewior1-44/+46
The nft_counter uses two s64 counters for statistics. Those two are protected by a seqcount to ensure that the 64bit variable is always properly seen during updates even on 32bit architectures where the store is performed by two writes. A side effect is that the two counter (bytes and packet) are written and read together in the same window. This can be replaced with u64_stats_t. write_seqcount_begin()/ end() is replaced with u64_stats_update_begin()/ end() and behaves the same way as with seqcount_t on 32bit architectures. Additionally there is a preempt_disable on PREEMPT_RT to ensure that a reader does not preempt a writer. On 64bit architectures the macros are removed and the reads happen without any retries. This also means that the reader can observe one counter (bytes) from before the update and the other counter (packets) but that is okay since there is no requirement to have both counter from the same update window. Convert the statistic to u64_stats_t. There is one optimisation: nft_counter_do_init() and nft_counter_clone() allocate a new per-CPU counter and assign a value to it. During this assignment preemption is disabled which is not needed because the counter is not yet exposed to the system so there can not be another writer or reader. Therefore disabling preemption is omitted and raw_cpu_ptr() is used to obtain a pointer to a counter for the assignment. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03netfilter: ctnetlink: support CTA_FILTER for flushChangliang Wu1-6/+3
From cb8aa9a, we can use kernel side filtering for dump, but this capability is not available for flush. This Patch allows advanced filter with CTA_FILTER for flush Performace 1048576 ct flows in total, delete 50,000 flows by origin src ip 3.06s -> dump all, compare and delete 584ms -> directly flush with filter Signed-off-by: Changliang Wu <changliang.wu@smartx.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-26Merge tag 'nf-next-24-08-23' of ↵Jakub Kicinski25-70/+125
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: Patch #1 fix checksum calculation in nfnetlink_queue with SCTP, segment GSO packet since skb_zerocopy() does not support GSO_BY_FRAGS, from Antonio Ojea. Patch #2 extend nfnetlink_queue coverage to handle SCTP packets, from Antonio Ojea. Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink, from Donald Hunter. Patch #4 adds a dedicate commit list for sets to speed up intra-transaction lookups, from Florian Westphal. Patch #5 skips removal of element from abort path for the pipapo backend, ditching the shadow copy of this datastructure is sufficient. Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to let users of conncoiunt decide when to enable conntrack, this is needed by openvswitch, from Xin Long. Patch #7 pass context to all nft_parse_register_load() in preparation for the next patch. Patches #8 and #9 reject loads from uninitialized registers from control plane to remove register initialization from datapath. From Florian Westphal. * tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: don't initialize registers in nft_do_chain() netfilter: nf_tables: allow loads only when register is initialized netfilter: nf_tables: pass context structure to nft_parse_register_load netfilter: move nf_ct_netns_get out of nf_conncount_init netfilter: nf_tables: do not remove elements if set backend implements .abort netfilter: nf_tables: store new sets in dedicated list netfilter: nfnetlink: convert kfree_skb to consume_skb selftests: netfilter: nft_queue.sh: sctp coverage netfilter: nfnetlink_queue: unbreak SCTP traffic ==================== Link: https://patch.msgid.link/202408222219