Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 053c8fdf2c930efdff5496960842bbb5c34ad43a ]
The xfrm{4,6}_beet_gso_segment() functions did not correctly set the
SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 gso types for the address family
tunneling case. Fix this by setting these gso types.
Fixes: 384a46ea7bdc7 ("esp4: add gso_segment for esp4 beet mode")
Fixes: 7f9e40eb18a99 ("esp6: add gso_segment for esp6 beet mode")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit a6d95c5a628a09be129f25d5663a7e9db8261f51 upstream.
This reverts commit b515d2637276a3810d6595e10ab02c13bfd0b63a.
Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm: xfrm_state_mtu
should return at least 1280 for ipv6") in v5.14 breaks the TCP MSS
calculation in ipsec transport mode, resulting complete stalls of TCP
connections. This happens when the (P)MTU is 1280 or slighly larger.
The desired formula for the MSS is:
MSS = (MTU - ESP_overhead) - IP header - TCP header
However, the above commit clamps the (MTU - ESP_overhead) to a
minimum of 1280, turning the formula into
MSS = max(MTU - ESP overhead, 1280) - IP header - TCP header
With the (P)MTU near 1280, the calculated MSS is too large and the
resulting TCP packets never make it to the destination because they
are over the actual PMTU.
The above commit also causes suboptimal double fragmentation in
xfrm tunnel mode, as described in
https://lore.kernel.org/netdev/20210429202529.codhwpc7w6kbudug@dwarf.suse.cz/
The original problem the above commit was trying to fix is now fixed
by commit 6596a0229541270fb8d38d989f91b78838e5e9da ("xfrm: fix MTU
regression").
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit de7b2efacf4e83954aed3f029d347dfc0b7a4f49 upstream.
This test is checking if we exited the list via break or not. However
if it did not exit via a break then "node" does not point to a valid
udp_tunnel_nic_shared_node struct. It will work because of the way
the structs are laid out it's the equivalent of
"if (info->shared->udp_tunnel_nic_info != dev)" which will always be
true, but it's not the right way to test.
Fixes: 74cc6d182d03 ("udp_tunnel: add the ability to share port tables")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cc20cced0598d9a5ff91ae4ab147b3b5e99ee819 upstream.
We encounter a tcp drop issue in our cloud environment. Packet GROed in
host forwards to a VM virtio_net nic with net_failover enabled. VM acts
as a IPVS LB with ipip encapsulation. The full path like:
host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat
-> ipip encap -> net_failover tx -> virtio_net tx
When net_failover transmits a ipip pkt (gso_type = 0x0103, which means
SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), there is no gso
did because it supports TSO and GSO_IPXIP4. But network_header points to
inner ip header.
Call Trace:
tcp4_gso_segment ------> return NULL
inet_gso_segment ------> inner iph, network_header points to
ipip_gso_segment
inet_gso_segment ------> outer iph
skb_mac_gso_segment
Afterwards virtio_net transmits the pkt, only inner ip header is modified.
And the outer one just keeps unchanged. The pkt will be dropped in remote
host.
Call Trace:
inet_gso_segment ------> inner iph, outer iph is skipped
skb_mac_gso_segment
__skb_gso_segment
validate_xmit_skb
validate_xmit_skb_list
sch_direct_xmit
__qdisc_run
__dev_queue_xmit ------> virtio_net
dev_hard_start_xmit
__dev_queue_xmit ------> net_failover
ip_finish_output2
ip_output
iptunnel_xmit
ip_tunnel_xmit
ipip_tunnel_xmit ------> ipip
dev_hard_start_xmit
__dev_queue_xmit
ip_finish_output2
ip_output
ip_forward
ip_rcv
__netif_receive_skb_one_core
netif_receive_skb_internal
napi_gro_receive
receive_buf
virtnet_poll
net_rx_action
The root cause of this issue is specific with the rare combination of
SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option.
SKB_GSO_DODGY is set from external virtio_net. We need to reset network
header when callbacks.gso_segment() returns NULL.
This patch also includes ipv6_gso_segment(), considering SIT, etc.
Fixes: cb32f511a70b ("ipip: add GSO/TSO support")
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cd33bdcbead882c2e58fdb4a54a7bd75b610a452 upstream.
As Jakub noticed, prints should be avoided on the datapath.
Also, as packets would never come to the else branch in
ping_lookup(), remove pr_err() from ping_lookup().
Fixes: 35a79e64de29 ("ping: fix the dif and sdif check in ping_lookup")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1ef3f2fcd31bd681a193b1fcf235eee1603819bd.1645674068.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 35a79e64de29e8d57a5989aac57611c0cd29e13e upstream.
When 'ping' changes to use PING socket instead of RAW socket by:
# sysctl -w net.ipv4.ping_group_range="0 100"
There is another regression caused when matching sk_bound_dev_if
and dif, RAW socket is using inet_iif() while PING socket lookup
is using skb->dev->ifindex, the cmd below fails due to this:
# ip link add dummy0 type dummy
# ip link set dummy0 up
# ip addr add 192.168.111.1/24 dev dummy0
# ping -I dummy0 192.168.111.1 -c1
The issue was also reported on:
https://github.com/iputils/iputils/issues/104
But fixed in iputils in a wrong way by not binding to device when
destination IP is on device, and it will cause some of kselftests
to fail, as Jianlin noticed.
This patch is to use inet(6)_iif and inet(6)_sdif to get dif and
sdif for PING socket, and keep consistent with RAW socket.
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5611a00697c8ecc5aad04392bea629e9d6a20463 ]
ip[6]mr_free_table() can only be called under RTNL lock.
RTNL: assertion failed at net/core/dev.c (10367)
WARNING: CPU: 1 PID: 5890 at net/core/dev.c:10367 unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367
Modules linked in:
CPU: 1 PID: 5890 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller-11627-g422ee58dc0ef #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367
Code: 0f 85 9b ee ff ff e8 69 07 4b fa ba 7f 28 00 00 48 c7 c6 00 90 ae 8a 48 c7 c7 40 90 ae 8a c6 05 6d b1 51 06 01 e8 8c 90 d8 01 <0f> 0b e9 70 ee ff ff e8 3e 07 4b fa 4c 89 e7 e8 86 2a 59 fa e9 ee
RSP: 0018:ffffc900046ff6e0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888050f51d00 RSI: ffffffff815fa008 RDI: fffff520008dfece
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815f3d6e R11: 0000000000000000 R12: 00000000fffffff4
R13: dffffc0000000000 R14: ffffc900046ff750 R15: ffff88807b7dc000
FS: 00007f4ab736e700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee0b4f8990 CR3: 000000001e7d2000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
mroute_clean_tables+0x244/0xb40 net/ipv6/ip6mr.c:1509
ip6mr_free_table net/ipv6/ip6mr.c:389 [inline]
ip6mr_rules_init net/ipv6/ip6mr.c:246 [inline]
ip6mr_net_init net/ipv6/ip6mr.c:1306 [inline]
ip6mr_net_init+0x3f0/0x4e0 net/ipv6/ip6mr.c:1298
ops_init+0xaf/0x470 net/core/net_namespace.c:140
setup_net+0x54f/0xbb0 net/core/net_namespace.c:331
copy_net_ns+0x318/0x760 net/core/net_namespace.c:475
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
copy_namespaces+0x391/0x450 kernel/nsproxy.c:178
copy_process+0x2e0c/0x7300 kernel/fork.c:2167
kernel_clone+0xe7/0xab0 kernel/fork.c:2555
__do_sys_clone+0xc8/0x110 kernel/fork.c:2672
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4ab89f9059
Code: Unable to access opcode bytes at RIP 0x7f4ab89f902f.
RSP: 002b:00007f4ab736e118 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 00007f4ab8b0bf60 RCX: 00007f4ab89f9059
RDX: 0000000020000280 RSI: 0000000020000270 RDI: 0000000040200000
RBP: 00007f4ab8a5308d R08: 0000000020000300 R09: 0000000020000300
R10: 00000000200002c0 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffc3977cc1f R14: 00007f4ab736e300 R15: 0000000000022000
</TASK>
Fixes: f243e5a7859a ("ipmr,ip6mr: call ip6mr_free_table() on failure path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cong.wang@bytedance.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220208053451.2885398-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b67985be400969578d4d4b17299714c0e5d2c07b upstream.
tcp_shift_skb_data() might collapse three packets into a larger one.
P_A, P_B, P_C -> P_ABC
Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
because it was enough.
In commit 85712484110d ("tcp: coalesce/collapse must respect MPTCP extensions"),
this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
But the now needed test over P_C has been missed.
This probably broke MPTCP.
Then later, commit 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
added an extra condition to tcp_skb_can_collapse(), but the missing call
from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
might have different skb_zcopy_pure() status.
Fixes: 85712484110d ("tcp: coalesce/collapse must respect MPTCP extensions")
Fixes: 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 970a5a3ea86da637471d3cd04d513a0755aba4bf ]
In commit 431280eebed9 ("ipv4: tcp: send zero IPID for RST and
ACK sent in SYN-RECV and TIME-WAIT state") we took care of some
ctl packets sent by TCP.
It turns out we need to use a similar strategy for SYNACK packets.
By default, they carry IP_DF and IPID==0, but there are ways
to ask them to use the hashed IP ident generator and thus
be used to build off-path attacks.
(Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)
One of this way is to force (before listener is started)
echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc
Another way is using forged ICMP ICMP_FRAG_NEEDED
with a very small MTU (like 68) to force a false return from
ip_dont_fragment()
In this patch, ip_build_and_send_pkt() uses the following
heuristics.
1) Most SYNACK packets are smaller than IPV4_MIN_MTU and therefore
can use IP_DF regardless of the listener or route pmtu setting.
2) In case the SYNACK packet is bigger than IPV4_MIN_MTU,
we use prandom_u32() generator instead of the IPv4 hashed ident one.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ray Che <xijiache@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Geoff Alexander <alexandg@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 153a0d187e767c68733b8e9f46218eb1f41ab902 ]
For some reason, raw_bind() forgot to lock the socket.
BUG: KCSAN: data-race in __ip4_datagram_connect / raw_bind
write to 0xffff8881170d4308 of 4 bytes by task 5466 on cpu 0:
raw_bind+0x1b0/0x250 net/ipv4/raw.c:739
inet_bind+0x56/0xa0 net/ipv4/af_inet.c:443
__sys_bind+0x14b/0x1b0 net/socket.c:1697
__do_sys_bind net/socket.c:1708 [inline]
__se_sys_bind net/socket.c:1706 [inline]
__x64_sys_bind+0x3d/0x50 net/socket.c:1706
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff8881170d4308 of 4 bytes by task 5468 on cpu 1:
__ip4_datagram_connect+0xb7/0x7b0 net/ipv4/datagram.c:39
ip4_datagram_connect+0x2a/0x40 net/ipv4/datagram.c:89
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0x0003007f
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 5468 Comm: syz-executor.5 Not tainted 5.17.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 27a8caa59babb96c5890569e131bc0eb6d45daee ]
During IP fragmentation we sanitize IP options. This means overwriting
options which should not be copied with NOPs. Only the first fragment
has the original, full options.
ip_fraglist_prepare() copies the IP header and options from previous
fragment to the next one. Commit 19c3401a917b ("net: ipv4: place control
buffer handling away from fragmentation iterators") moved sanitizing
options before ip_fraglist_prepare() which means options are sanitized
and then overwritten again with the old values.
Fixing this is not enough, however, nor did the sanitization work
prior to aforementioned commit.
ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
for the length of the options. ipcb->opt of fragments is not populated
(it's 0), only the head skb has the state properly built. So even when
called at the right time ip_options_fragment() does nothing. This seems
to date back all the way to v2.5.44 when the fast path for pre-fragmented
skbs had been introduced. Prior to that ip_options_build() would have been
called for every fragment (in fact ever since v2.5.44 the fragmentation
handing in ip_options_build() has been dead code, I'll clean it up in
-next).
In the original patch (see Link) caixf mentions fixing the handling
for fragments other than the second one, but I'm not sure how _any_
fragment could have had their options sanitized with the code
as it stood.
Tested with python (MTU on lo lowered to 1000 to force fragmentation):
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
bytearray([7,4,5,192, 20|0x80,4,1,0]))
s.sendto(b'1'*2000, ('127.0.0.1', 1234))
Before:
IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost.36500 > localhost.search-agent: UDP, length 2000
IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost > localhost: udp
IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost > localhost: udp
After:
IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
localhost > localhost: udp
IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
localhost > localhost: udp
RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".
Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
Fixes: 19c3401a917b ("net: ipv4: place control buffer handling away from fragmentation iterators")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: caixf <ooppublic@163.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1b9fbe813016b08e08b22ddba4ddbf9cb1b04b00 ]
Add a if statements to avoid the warning.
Dan Carpenter report:
The patch faf482ca196a: "net: ipv4: Move ip_options_fragment() out of
loop" from Aug 23, 2021, leads to the following Smatch complaint:
net/ipv4/ip_output.c:833 ip_do_fragment()
warn: variable dereferenced before check 'iter.frag' (see line 828)
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: faf482ca196a ("net: ipv4: Move ip_options_fragment() out of loop")
Link: https://lore.kernel.org/netdev/20210830073802.GR7722@kadam/T/#t
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit faf482ca196a5b16007190529b3b2dd32ab3f761 ]
The ip_options_fragment() only called when iter->offset is equal to zero,
so move it out of loop, and inline 'Copy the flags to each fragment.'
As also, remove the unused parameter in ip_frag_ipcb().
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 2afc3b5a31f9edf3ef0f374f5d70610c79c93a42 upstream.
When 'ping' changes to use PING socket instead of RAW socket by:
# sysctl -w net.ipv4.ping_group_range="0 100"
the selftests 'router_broadcast.sh' will fail, as such command
# ip vrf exec vrf-h1 ping -I veth0 198.51.100.255 -b
can't receive the response skb by the PING socket. It's caused by mismatch
of sk_bound_dev_if and dif in ping_rcv() when looking up the PING socket,
as dif is vrf-h1 if dif's master was set to vrf-h1.
This patch is to fix this regression by also checking the sk_bound_dev_if
against sdif so that the packets can stil be received even if the socket
is not bound to the vrf device but to the real iif.
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f7716b318568b22fbf0e3be99279a979e217cf71 upstream.
Mask the ECN bits before initialising ->flowi4_tos. The tunnel key may
have the last ECN bit set, which will interfere with the route lookup
process as ip_route_output_key_hash() interpretes this bit specially
(to restrict the route scope).
Found by code inspection, compile tested only.
Fixes: 962924fa2b7a ("ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 91341fa0003befd097e190ec2a4bf63ad957c49a upstream.
Both fields can be read/written without synchronization,
add proper accessors and documentation.
Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d07418afea8f1d9896aaf9dc5ae47ac4f45b220c upstream.
net/ipv4/fib_semantics.c uses an hash table of 256 slots,
keyed by device ifindexes: fib_info_devhash[DEVINDEX_HASHSIZE]
Problem is that with network namespaces, devices tend
to use the same ifindex.
lo device for instance has a fixed ifindex of one,
for all network namespaces.
This means that hosts with thousands of netns spend
a lot of time looking at some hash buckets with thousands
of elements, notably at netns dismantle.
Simply add a per netns perturbation (net_hash_mix())
to spread elements more uniformely.
Also change fib_devindex_hashfn() to use more entropy.
Fixes: aa79e66eee5d ("net: Make ifindex generation per-net namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0a6e6b3c7db6c34e3d149f09cd714972f8753e3f upstream.
In the past, free_fib_info() was supposed to be called
under RTNL protection.
This eventually was no longer the case.
Instead of enforcing RTNL it seems we simply can
move fib_info_cnt changes to occur when fib_info_lock
is held.
v2: David Laight suggested to update fib_info_cnt
only when an entry is added/deleted to/from the hash table,
as fib_info_cnt is used to make sure hash table size
is optimal.
BUG: KCSAN: data-race in fib_create_info / free_fib_info
write to 0xffffffff86e243a0 of 4 bytes by task 26429 on cpu 0:
fib_create_info+0xe78/0x3440 net/ipv4/fib_semantics.c:1428
fib_table_insert+0x148/0x10c0 net/ipv4/fib_trie.c:1224
fib_magic+0x195/0x1e0 net/ipv4/fib_frontend.c:1087
fib_add_ifaddr+0xd0/0x2e0 net/ipv4/fib_frontend.c:1109
fib_netdev_event+0x178/0x510 net/ipv4/fib_frontend.c:1466
notifier_call_chain kernel/notifier.c:83 [inline]
raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:391
__dev_notify_flags+0x1d3/0x3b0
dev_change_flags+0xa2/0xc0 net/core/dev.c:8872
do_setlink+0x810/0x2410 net/core/rtnetlink.c:2719
rtnl_group_changelink net/core/rtnetlink.c:3242 [inline]
__rtnl_newlink net/core/rtnetlink.c:3396 [inline]
rtnl_newlink+0xb10/0x13b0 net/core/rtnetlink.c:3506
rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5571
netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2496
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5589
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x5fc/0x6c0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x726/0x840 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg net/socket.c:724 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2409
___sys_sendmsg net/socket.c:2463 [inline]
__sys_sendmsg+0x195/0x230 net/socket.c:2492
__do_sys_sendmsg net/socket.c:2501 [inline]
__se_sys_sendmsg net/socket.c:2499 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2499
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffffffff86e243a0 of 4 bytes by task 31505 on cpu 1:
free_fib_info+0x35/0x80 net/ipv4/fib_semantics.c:252
fib_info_put include/net/ip_fib.h:575 [inline]
nsim_fib4_rt_destroy drivers/net/netdevsim/fib.c:294 [inline]
nsim_fib4_rt_replace drivers/net/netdevsim/fib.c:403 [inline]
nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:431 [inline]
nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
nsim_fib_event_work+0x15ca/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
process_scheduled_works kernel/workqueue.c:2361 [inline]
worker_thread+0x7df/0xa70 kernel/workqueue.c:2447
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x00000d2d -> 0x00000d2e
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 31505 Comm: kworker/1:21 Not tainted 5.16.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_fib_event_work
Fixes: 48bb9eb47b27 ("netdevsim: fib: Add dummy implementation for FIB offload")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Ido Schimmel <idosch@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d94a69cb2cfa77294921aae9afcfb866e723a2da ]
The issue takes place in one error path of clusterip_tg_check(). When
memcmp() returns nonzero, the function simply returns the error code,
forgetting to decrease the reference count of a clusterip_config
object, which is bumped earlier by clusterip_config_find_get(). This
may incur reference count leak.
Fix this issue by decrementing the refcount of the object in specific
error path.
Fixes: 06aa151ad1fc74 ("netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set")
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6c25449e1a32c594d743df8e8258e8ef870b6a77 ]
$ cat /pro/net/udp
before:
sl local_address rem_address st tx_queue rx_queue tr tm->when
26050: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000
26320: 0100007F:0143 00000000:0000 07 00000000:00000000 00:00000000
27135: 00000000:8472 00000000:0000 07 00000000:00000000 00:00000000
after:
sl local_address rem_address st tx_queue rx_queue tr tm->when
26050: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000
26320: 0100007F:0143 00000000:0000 07 00000000:00000000 00:00000000
27135: 00000000:8472 00000000:0000 07 00000000:00000000 00:00000000
Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8bda81a4d400cf8a72e554012f0d8c45e07a3904 upstream.
lwtunnel_valid_encap_type_attr is used to validate encap attributes
within a multipath route. Add length validation checking to the type.
lwtunnel_valid_encap_type_attr is called converting attributes to
fib{6,}_config struct which means it is used before fib_get_nhs,
ip6_route_multipath_add, and ip6_route_multipath_del - other
locations that use rtnh_ok and then nla_get_u16 on RTA_ENCAP_TYPE
attribute.
Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 664b9c4b7392ce723b013201843264bf95481ce5 upstream.
Make sure RTA_FLOW is at least 4B before using.
Fixes: 4e902c57417c ("[IPv4]: FIB configuration using struct fib_config")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7a3429bace0e08d94c39245631ea6bc109dafa49 upstream.
syzbot reported uninit-value:
============================================================
BUG: KMSAN: uninit-value in fib_get_nhs+0xac4/0x1f80
net/ipv4/fib_semantics.c:708
fib_get_nhs+0xac4/0x1f80 net/ipv4/fib_semantics.c:708
fib_create_info+0x2411/0x4870 net/ipv4/fib_semantics.c:1453
fib_table_insert+0x45c/0x3a10 net/ipv4/fib_trie.c:1224
inet_rtm_newroute+0x289/0x420 net/ipv4/fib_frontend.c:886
Add helper to validate RTA_GATEWAY length before using the attribute.
Fixes: 4e902c57417c ("[IPv4]: FIB configuration using struct fib_config")
Reported-by: syzbot+d4b9a2851cc3ce998741@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e22e45fc9e41bf9fcc1e92cfb78eb92786728ef0 upstream.
A real world panic issue was found as follow in Linux 5.4.
BUG: unable to handle page fault for address: ffffde49a863de28
PGD 7e6fe62067 P4D 7e6fe62067 PUD 7e6fe63067 PMD f51e064067 PTE 0
RIP: 0010:tw_timer_handler+0x20/0x40
Call Trace:
<IRQ>
call_timer_fn+0x2b/0x120
run_timer_softirq+0x1ef/0x450
__do_softirq+0x10d/0x2b8
irq_exit+0xc7/0xd0
smp_apic_timer_interrupt+0x68/0x120
apic_timer_interrupt+0xf/0x20
This issue was also reported since 2017 in the thread [1],
unfortunately, the issue was still can be reproduced after fixing
DCCP.
The ipv4_mib_exit_net is called before tcp_sk_exit_batch when a net
namespace is destroyed since tcp_sk_ops is registered befrore
ipv4_mib_ops, which means tcp_sk_ops is in the front of ipv4_mib_ops
in the list of pernet_list. There will be a use-after-free on
net->mib.net_statistics in tw_timer_handler after ipv4_mib_exit_net
if there are some inflight time-wait timers.
This bug is not introduced by commit f2bf415cfed7 ("mib: add net to
NET_ADD_STATS_BH") since the net_statistics is a global variable
instead of dynamic allocation and freeing. Actually, commit
61a7e26028b9 ("mib: put net statistics on struct net") introduces
the bug since it put net statistics on struct net and free it when
net namespace is destroyed.
Moving init_ipv4_mibs() to the front of tcp_init() to fix this bug
and replace pr_crit() with panic() since continuing is meaningless
when init_ipv4_mibs() fails.
[1] https://groups.google.com/g/syzkaller/c/p1tn-_Kc6l4/m/smuL_FMAAgAJ?pli=1
Fixes: 61a7e26028b9 ("mib: put net statistics on struct net")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211228104145.9426-1-songmuchun@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 71ddeac8cd1d217744a0e060ff520e147c9328d1 ]
KMSAN reported a kernel-infoleak [1], that can exploited
by unpriv users.
After analysis it turned out UDP was not initializing
r->idiag_expires. Other users of inet_sk_diag_fill()
might make the same mistake in the future, so fix this
in inet_sk_diag_fill().
[1]
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:121 [inline]
BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:156 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x69d/0x25c0 lib/iov_iter.c:670
instrument_copy_to_user include/linux/instrumented.h:121 [inline]
copyout lib/iov_iter.c:156 [inline]
_copy_to_iter+0x69d/0x25c0 lib/iov_iter.c:670
copy_to_iter include/linux/uio.h:155 [inline]
simple_copy_to_iter+0xf3/0x140 net/core/datagram.c:519
__skb_datagram_iter+0x2cb/0x1280 net/core/datagram.c:425
skb_copy_datagram_iter+0xdc/0x270 net/core/datagram.c:533
skb_copy_datagram_msg include/linux/skbuff.h:3657 [inline]
netlink_recvmsg+0x660/0x1c60 net/netlink/af_netlink.c:1974
sock_recvmsg_nosec net/socket.c:944 [inline]
sock_recvmsg net/socket.c:962 [inline]
sock_read_iter+0x5a9/0x630 net/socket.c:1035
call_read_iter include/linux/fs.h:2156 [inline]
new_sync_read fs/read_write.c:400 [inline]
vfs_read+0x1631/0x1980 fs/read_write.c:481
ksys_read+0x28c/0x520 fs/read_write.c:619
__do_sys_read fs/read_write.c:629 [inline]
__se_sys_read fs/read_write.c:627 [inline]
__x64_sys_read+0xdb/0x120 fs/read_write.c:627
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x44/0xae
Uninit was created at:
slab_post_alloc_hook mm/slab.h:524 [inline]
slab_alloc_node mm/slub.c:3251 [inline]
__kmalloc_node_track_caller+0xe0c/0x1510 mm/slub.c:4974
kmalloc_reserve net/core/skbuff.c:354 [inline]
__alloc_skb+0x545/0xf90 net/core/skbuff.c:426
alloc_skb include/linux/skbuff.h:1126 [inline]
netlink_dump+0x3d5/0x16a0 net/netlink/af_netlink.c:2245
__netlink_dump_start+0xd1c/0xee0 net/netlink/af_netlink.c:2370
netlink_dump_start include/linux/netlink.h:254 [inline]
inet_diag_handler_cmd+0x2e7/0x400 net/ipv4/inet_diag.c:1343
sock_diag_rcv_msg+0x24a/0x620
netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491
sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:276
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg net/socket.c:724 [inline]
sock_write_iter+0x594/0x690 net/socket.c:1057
do_iter_readv_writev+0xa7f/0xc70
do_iter_write+0x52c/0x1500 fs/read_write.c:851
vfs_writev fs/read_write.c:924 [inline]
do_writev+0x63f/0xe30 fs/read_write.c:967
__do_sys_writev fs/read_write.c:1040 [inline]
__se_sys_writev fs/read_write.c:1037 [inline]
__x64_sys_writev+0xe5/0x120 fs/read_write.c:1037
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x44/0xae
Bytes 68-71 of 312 are uninitialized
Memory access of size 312 starts at ffff88812ab54000
Data copied to user address 0000000020001440
CPU: 1 PID: 6365 Comm: syz-executor801 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 3c4d05c80567 ("inet_diag: Introduce the inet socket dumping routine")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20211209185058.53917-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 158390e45612ef0fde160af0826f1740c36daf21 upstream.
The max number of UDP gso segments is intended to cap to UDP_MAX_SEGMENTS,
this is checked in udp_send_skb():
if (skb->len > cork->gso_size * UDP_MAX_SEGMENTS) {
kfree_skb(skb);
return -EINVAL;
}
skb->len contains network and transport header len here, we should use
only data len instead.
Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/900742e5-81fb-30dc-6e0b-375c6cdd7982@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 213f5f8f31f10aa1e83187ae20fb7fa4e626b724 upstream.
Before commit faa041a40b9f ("ipv4: Create cleanup helper for fib_nh")
changes to net->ipv4.fib_num_tclassid_users were protected by RTNL.
After the change, this is no longer the case, as free_fib_info_rcu()
runs after rcu grace period, without rtnl being held.
Fixes: faa041a40b9f ("ipv4: Create cleanup helper for fib_nh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cdef485217d30382f3bf6448c54b4401648fe3f1 upstream.
The kernel leaks memory when a `fib` rule is present in IPv6 nftables
firewall rules and a suppress_prefix rule is present in the IPv6 routing
rules (used by certain tools such as wg-quick). In such scenarios, every
incoming packet will leak an allocation in `ip6_dst_cache` slab cache.
After some hours of `bpftrace`-ing and source code reading, I tracked
down the issue to ca7a03c41753 ("ipv6: do not free rt if
FIB_LOOKUP_NOREF is set on suppress rule").
The problem with that change is that the generic `args->flags` always have
`FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag
`RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not
decreasing the refcount when needed.
How to reproduce:
- Add the following nftables rule to a prerouting chain:
meta nfproto ipv6 fib saddr . mark . iif oif missing drop
This can be done with:
sudo nft create table inet test
sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }'
sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop
- Run:
sudo ip -6 rule add table main suppress_prefixlength 0
- Watch `sudo slabtop -o | grep ip6_dst_cache` to see memory usage increase
with every incoming ipv6 packet.
This patch exposes the protocol-specific flags to the protocol
specific `suppress` function, and check the protocol-specific `flags`
argument for RT6_LOOKUP_F_DST_NOREF instead of the generic
FIB_LOOKUP_NOREF when decreasing the refcount, like this.
[1]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L71
[2]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L99
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105
Fixes: ca7a03c41753 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6def480181f15f6d9ec812bca8cbc62451ba314c ]
When kmemdup called failed and register_net_sysctl return NULL, should
return ENOMEM instead of ENOBUFS
Signed-off-by: liuguoqiang <liuguoqiang@uniontech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e0fecb289ad3fd2245cdc50bf450b97fcca39884 ]
A prior patch increased the size of struct tcp_zerocopy_receive
but did not update do_tcp_getsockopt() handling to properly account
for this.
This patch simply reintroduces content erroneously cut from the
referenced prior patch that handles the new struct size.
Fixes: 18fb76ed5386 ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4e1fddc98d2585ddd4792b5e44433dcee7ece001 ]
While testing BIG TCP patch series, I was expecting that TCP_RR workloads
with 80KB requests/answers would send one 80KB TSO packet,
then being received as a single GRO packet.
It turns out this was not happening, and the root cause was that
cubic Hystart ACK train was triggering after a few (2 or 3) rounds of RPC.
Hystart was wrongly setting CWND/SSTHRESH to 30, while my RPC
needed a budget of ~20 segments.
Ideally these TCP_RR flows should not exit slow start.
Cubic Hystart should reset itself at each round, instead of assuming
every TCP flow is a bulk one.
Note that even after this patch, Hystart can still trigger, depending
on scheduling artifacts, but at a higher CWND/SSTHRESH threshold,
keeping optimal TSO packet sizes.
Tested:
ip link set dev eth0 gro_ipv6_max_size 131072 gso_ipv6_max_size 131072
nstat -n; netperf -H ... -t TCP_RR -l 5 -- -r 80000,80000 -K cubic; nstat|egrep "Ip6InReceives|Hystart|Ip6OutRequests"
Before:
8605
Ip6InReceives 87541 0.0
Ip6OutRequests 129496 0.0
TcpExtTCPHystartTrainDetect 1 0.0
TcpExtTCPHystartTrainCwnd 30 0.0
After:
8760
Ip6InReceives 88514 0.0
Ip6OutRequests 87975 0.0
Fixes: ae27e98a5152 ("[TCP] CUBIC v2.3")
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20211123202535.1843771-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1005f19b9357b81aa64e1decd08d6e332caaa284 ]
When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
the removed nexthop entries after an RCU grace period because they
contain references to the nexthop's net device and to the fib6 info.
With specific series of events[1] we can reach net device refcount
imbalance which is unrecoverable. IPv4 is not affected because dsts
don't take a refcount on the route.
[1]
$ ip nexthop list
id 200 via 2002:db8::2 dev bridge.10 scope link onlink
id 201 via 2002:db8::3 dev bridge scope link onlink
id 203 group 201/200
$ ip -6 route
2001:db8::10 nhid 203 metric 1024 pref medium
nexthop via 2002:db8::3 dev bridge weight 1 onlink
nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)
Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201
Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128
The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200
And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.
To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203
It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.
At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.
If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$
Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.
Message from syslogd@dev at Nov 19 14:04:10 ...
kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 1c743127cc54b112b155f434756bd4b5fa565a99 upstream.
When we try to add an IPv6 nexthop and IPv6 is not enabled
(!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path
of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug
has been present since the beginning of IPv6 nexthop gateway support.
Commit 1aefd3de7bc6 ("ipv6: Add fib6_nh_init and release to stubs") tells
us that only f |