Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 56440d7ec28d60f8da3bfa09062b3368ff9b16db ]
While running net selftests with CONFIG_PROVE_RCU_LIST=y I saw
one lockdep splat [1].
genlmsg_mcast() uses for_each_net_rcu(), and must therefore hold RCU.
Instead of letting all callers guard genlmsg_multicast_allns()
with a rcu_read_lock()/rcu_read_unlock() pair, do it in genlmsg_mcast().
This also means the @flags parameter is useless, we need to always use
GFP_ATOMIC.
[1]
[10882.424136] =============================
[10882.424166] WARNING: suspicious RCU usage
[10882.424309] 6.12.0-rc2-virtme #1156 Not tainted
[10882.424400] -----------------------------
[10882.424423] net/netlink/genetlink.c:1940 RCU-list traversed in non-reader section!!
[10882.424469]
other info that might help us debug this:
[10882.424500]
rcu_scheduler_active = 2, debug_locks = 1
[10882.424744] 2 locks held by ip/15677:
[10882.424791] #0: ffffffffb6b491b0 (cb_lock){++++}-{3:3}, at: genl_rcv (net/netlink/genetlink.c:1219)
[10882.426334] #1: ffffffffb6b49248 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg (net/netlink/genetlink.c:61 net/netlink/genetlink.c:57 net/netlink/genetlink.c:1209)
[10882.426465]
stack backtrace:
[10882.426805] CPU: 14 UID: 0 PID: 15677 Comm: ip Not tainted 6.12.0-rc2-virtme #1156
[10882.426919] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[10882.427046] Call Trace:
[10882.427131] <TASK>
[10882.427244] dump_stack_lvl (lib/dump_stack.c:123)
[10882.427335] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822)
[10882.427387] genlmsg_multicast_allns (net/netlink/genetlink.c:1940 (discriminator 7) net/netlink/genetlink.c:1977 (discriminator 7))
[10882.427436] l2tp_tunnel_notify.constprop.0 (net/l2tp/l2tp_netlink.c:119) l2tp_netlink
[10882.427683] l2tp_nl_cmd_tunnel_create (net/l2tp/l2tp_netlink.c:253) l2tp_netlink
[10882.427748] genl_family_rcv_msg_doit (net/netlink/genetlink.c:1115)
[10882.427834] genl_rcv_msg (net/netlink/genetlink.c:1195 net/netlink/genetlink.c:1210)
[10882.427877] ? __pfx_l2tp_nl_cmd_tunnel_create (net/l2tp/l2tp_netlink.c:186) l2tp_netlink
[10882.427927] ? __pfx_genl_rcv_msg (net/netlink/genetlink.c:1201)
[10882.427959] netlink_rcv_skb (net/netlink/af_netlink.c:2551)
[10882.428069] genl_rcv (net/netlink/genetlink.c:1220)
[10882.428095] netlink_unicast (net/netlink/af_netlink.c:1332 net/netlink/af_netlink.c:1357)
[10882.428140] netlink_sendmsg (net/netlink/af_netlink.c:1901)
[10882.428210] ____sys_sendmsg (net/socket.c:729 (discriminator 1) net/socket.c:744 (discriminator 1) net/socket.c:2607 (discriminator 1))
Fixes: 33f72e6f0c67 ("l2tp : multicast notification to the registered listeners")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Tom Parkin <tparkin@katalix.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20241011171217.3166614-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 89b768ec2dfefaeba5212de14fc71368e12d06ba ]
l2tp_v3_session_htable and tunnel->session_list are read by lockless
getters using RCU. Use rcu list variants when adding or removing list
items.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d17e89999574aca143dd4ede43e4382d32d98724 ]
l2tp sessions may be accessed under an rcu read lock. Have them freed
via rcu and remove the now unneeded synchronize_rcu when a session is
removed.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 24256415d18695b46da06c93135f5b51c548b950 ]
When a session is created, it sets a backpointer to its tunnel. When
the session refcount drops to 0, l2tp_session_free drops the tunnel
refcount if session->tunnel is non-NULL. However, session->tunnel is
set in l2tp_session_create, before the tunnel refcount is incremented
by l2tp_session_register, which leaves a small window where
session->tunnel is non-NULL when the tunnel refcount hasn't been
bumped.
Moving the assignment to l2tp_session_register is trivial but
l2tp_session_create calls l2tp_session_set_header_len which uses
session->tunnel to get the tunnel's encap. Add an encap arg to
l2tp_session_set_header_len to avoid using session->tunnel.
If l2tpv3 sessions have colliding IDs, it is possible for
l2tp_v3_session_get to race with l2tp_session_register and fetch a
session which doesn't yet have session->tunnel set. Add a check for
this case.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
When l2tp tunnels use a socket provided by userspace, we can hit
lockdep splats like the below when data is transmitted through another
(unrelated) userspace socket which then gets routed over l2tp.
This issue was previously discussed here:
https://lore.kernel.org/netdev/87sfialu2n.fsf@cloudflare.com/
The solution is to have lockdep treat socket locks of l2tp tunnel
sockets separately than those of standard INET sockets. To do so, use
a different lockdep subclass where lock nesting is possible.
============================================
WARNING: possible recursive locking detected
6.10.0+ #34 Not tainted
--------------------------------------------
iperf3/771 is trying to acquire lock:
ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0
but task is already holding lock:
ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(slock-AF_INET/1);
lock(slock-AF_INET/1);
*** DEADLOCK ***
May be due to missing lock nesting notation
10 locks held by iperf3/771:
#0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40
#1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
#2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
#3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0
#4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260
#5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
#6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
#7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
#8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450
#9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450
stack backtrace:
CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x69/0xa0
dump_stack+0xc/0x20
__lock_acquire+0x135d/0x2600
? srso_alias_return_thunk+0x5/0xfbef5
lock_acquire+0xc4/0x2a0
? l2tp_xmit_skb+0x243/0x9d0
? __skb_checksum+0xa3/0x540
_raw_spin_lock_nested+0x35/0x50
? l2tp_xmit_skb+0x243/0x9d0
l2tp_xmit_skb+0x243/0x9d0
l2tp_eth_dev_xmit+0x3c/0xc0
dev_hard_start_xmit+0x11e/0x420
sch_direct_xmit+0xc3/0x640
__dev_queue_xmit+0x61c/0x1450
? ip_finish_output2+0xf4c/0x1130
ip_finish_output2+0x6b6/0x1130
? srso_alias_return_thunk+0x5/0xfbef5
? __ip_finish_output+0x217/0x380
? srso_alias_return_thunk+0x5/0xfbef5
__ip_finish_output+0x217/0x380
ip_output+0x99/0x120
__ip_queue_xmit+0xae4/0xbc0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? tcp_options_write.constprop.0+0xcb/0x3e0
ip_queue_xmit+0x34/0x40
__tcp_transmit_skb+0x1625/0x1890
__tcp_send_ack+0x1b8/0x340
tcp_send_ack+0x23/0x30
__tcp_ack_snd_check+0xa8/0x530
? srso_alias_return_thunk+0x5/0xfbef5
tcp_rcv_established+0x412/0xd70
tcp_v4_do_rcv+0x299/0x420
tcp_v4_rcv+0x1991/0x1e10
ip_protocol_deliver_rcu+0x50/0x220
ip_local_deliver_finish+0x158/0x260
ip_local_deliver+0xc8/0xe0
ip_rcv+0xe5/0x1d0
? __pfx_ip_rcv+0x10/0x10
__netif_receive_skb_one_core+0xce/0xe0
? process_backlog+0x28b/0x9f0
__netif_receive_skb+0x34/0xd0
? process_backlog+0x28b/0x9f0
process_backlog+0x2cb/0x9f0
__napi_poll.constprop.0+0x61/0x280
net_rx_action+0x332/0x670
? srso_alias_return_thunk+0x5/0xfbef5
? find_held_lock+0x2b/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
handle_softirqs+0xda/0x480
? __dev_queue_xmit+0xa2c/0x1450
do_softirq+0xa1/0xd0
</IRQ>
<TASK>
__local_bh_enable_ip+0xc8/0xe0
? __dev_queue_xmit+0xa2c/0x1450
__dev_queue_xmit+0xa48/0x1450
? ip_finish_output2+0xf4c/0x1130
ip_finish_output2+0x6b6/0x1130
? srso_alias_return_thunk+0x5/0xfbef5
? __ip_finish_output+0x217/0x380
? srso_alias_return_thunk+0x5/0xfbef5
__ip_finish_output+0x217/0x380
ip_output+0x99/0x120
__ip_queue_xmit+0xae4/0xbc0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? tcp_options_write.constprop.0+0xcb/0x3e0
ip_queue_xmit+0x34/0x40
__tcp_transmit_skb+0x1625/0x1890
tcp_write_xmit+0x766/0x2fb0
? __entry_text_end+0x102ba9/0x102bad
? srso_alias_return_thunk+0x5/0xfbef5
? __might_fault+0x74/0xc0
? srso_alias_return_thunk+0x5/0xfbef5
__tcp_push_pending_frames+0x56/0x190
tcp_push+0x117/0x310
tcp_sendmsg_locked+0x14c1/0x1740
tcp_sendmsg+0x28/0x40
inet_sendmsg+0x5d/0x90
sock_write_iter+0x242/0x2b0
vfs_write+0x68d/0x800
? __pfx_sock_write_iter+0x10/0x10
ksys_write+0xc8/0xf0
__x64_sys_write+0x3d/0x50
x64_sys_call+0xfaf/0x1f50
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f4d143af992
Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0
RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992
RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005
RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc
R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0
</TASK>
Fixes: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+6acef9e0a4d1f46c83d4@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4
CC: gnault@redhat.com
CC: cong.wang@bytedance.com
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Link: https://patch.msgid.link/20240806160626.1248317-1-jchapman@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Modify l2tp_session_register and l2tp_session_unhash so that the
session IDR and tunnel session lists remain coherent. To do so, hold
the session IDR lock and the tunnel's session list lock when making
any changes to either list.
Without this change, a rare race condition could hit the WARN_ON_ONCE
in l2tp_session_unhash if a thread replaced the IDR entry while
another thread was registering the same ID.
[ 7126.151795][T17511] WARNING: CPU: 3 PID: 17511 at net/l2tp/l2tp_core.c:1282 l2tp_session_delete.part.0+0x87e/0xbc0
[ 7126.163754][T17511] ? show_regs+0x93/0xa0
[ 7126.164157][T17511] ? __warn+0xe5/0x3c0
[ 7126.164536][T17511] ? l2tp_session_delete.part.0+0x87e/0xbc0
[ 7126.165070][T17511] ? report_bug+0x2e1/0x500
[ 7126.165486][T17511] ? l2tp_session_delete.part.0+0x87e/0xbc0
[ 7126.166013][T17511] ? handle_bug+0x99/0x130
[ 7126.166428][T17511] ? exc_invalid_op+0x35/0x80
[ 7126.166890][T17511] ? asm_exc_invalid_op+0x1a/0x20
[ 7126.167372][T17511] ? l2tp_session_delete.part.0+0x87d/0xbc0
[ 7126.167900][T17511] ? l2tp_session_delete.part.0+0x87e/0xbc0
[ 7126.168429][T17511] ? __local_bh_enable_ip+0xa4/0x120
[ 7126.168917][T17511] l2tp_session_delete+0x40/0x50
[ 7126.169369][T17511] pppol2tp_release+0x1a1/0x3f0
[ 7126.169817][T17511] __sock_release+0xb3/0x270
[ 7126.170247][T17511] ? __pfx_sock_close+0x10/0x10
[ 7126.170697][T17511] sock_close+0x1c/0x30
[ 7126.171087][T17511] __fput+0x40b/0xb90
[ 7126.171470][T17511] task_work_run+0x16c/0x260
[ 7126.171897][T17511] ? __pfx_task_work_run+0x10/0x10
[ 7126.172362][T17511] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7126.172863][T17511] ? do_raw_spin_unlock+0x174/0x230
[ 7126.173348][T17511] do_exit+0xaae/0x2b40
[ 7126.173730][T17511] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7126.174235][T17511] ? __pfx_lock_release+0x10/0x10
[ 7126.174690][T17511] ? srso_alias_return_thunk+0x5/0xfbef5
[ 7126.175190][T17511] ? do_raw_spin_lock+0x12c/0x2b0
[ 7126.175650][T17511] ? __pfx_do_exit+0x10/0x10
[ 7126.176072][T17511] ? _raw_spin_unlock_irq+0x23/0x50
[ 7126.176543][T17511] do_group_exit+0xd3/0x2a0
[ 7126.176990][T17511] __x64_sys_exit_group+0x3e/0x50
[ 7126.177456][T17511] x64_sys_call+0x1821/0x1830
[ 7126.177895][T17511] do_syscall_64+0xcb/0x250
[ 7126.178317][T17511] entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: aa5e17e1f5ec ("l2tp: store l2tpv3 sessions in per-net IDR")
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240718134348.289865-1-jchapman@katalix.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When handling colliding L2TPv3 session IDs, we use the existing
session IDR entry and link the new session on that using
session->coll_list. However, when using an existing IDR entry, we must
not do the idr_replace step.
Fixes: aa5e17e1f5ec ("l2tp: store l2tpv3 sessions in per-net IDR")
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot reported a UAF caused by a race when the L2TP work queue closes a
tunnel at the same time as a userspace thread closes a session in that
tunnel.
Tunnel cleanup is handled by a work queue which iterates through the
sessions contained within a tunnel, and closes them in turn.
Meanwhile, a userspace thread may arbitrarily close a session via
either netlink command or by closing the pppox socket in the case of
l2tp_ppp.
The race condition may occur when l2tp_tunnel_closeall walks the list
of sessions in the tunnel and deletes each one. Currently this is
implemented using list_for_each_safe, but because the list spinlock is
dropped in the loop body it's possible for other threads to manipulate
the list during list_for_each_safe's list walk. This can lead to the
list iterator being corrupted, leading to list_for_each_safe spinning.
One sequence of events which may lead to this is as follows:
* A tunnel is created, containing two sessions A and B.
* A thread closes the tunnel, triggering tunnel cleanup via the work
queue.
* l2tp_tunnel_closeall runs in the context of the work queue. It
removes session A from the tunnel session list, then drops the list
lock. At this point the list_for_each_safe temporary variable is
pointing to the other session on the list, which is session B, and
the list can be manipulated by other threads since the list lock has
been released.
* Userspace closes session B, which removes the session from its parent
tunnel via l2tp_session_delete. Since l2tp_tunnel_closeall has
released the tunnel list lock, l2tp_session_delete is able to call
list_del_init on the session B list node.
* Back on the work queue, l2tp_tunnel_closeall resumes execution and
will now spin forever on the same list entry until the underlying
session structure is freed, at which point UAF occurs.
The solution is to iterate over the tunnel's session list using
list_first_entry_not_null to avoid the possibility of the list
iterator pointing at a list item which may be removed during the walk.
Also, have l2tp_tunnel_closeall ref each session while it processes it
to prevent another thread from freeing it.
cpu1 cpu2
--- ---
pppol2tp_release()
spin_lock_bh(&tunnel->list_lock);
for (;;) {
session = list_first_entry_or_null(&tunnel->session_list,
struct l2tp_session, list);
if (!session)
break;
list_del_init(&session->list);
spin_unlock_bh(&tunnel->list_lock);
l2tp_session_delete(session);
l2tp_session_delete(session);
spin_lock_bh(&tunnel->list_lock);
}
spin_unlock_bh(&tunnel->list_lock);
Calling l2tp_session_delete on the same session twice isn't a problem
per-se, but if cpu2 manages to destruct the socket and unref the
session to zero before cpu1 progresses then it would lead to UAF.
Reported-by: syzbot+b471b7c936301a59745b@syzkaller.appspotmail.com
Reported-by: syzbot+c041b4ce3a6dfd1e63e2@syzkaller.appspotmail.com
Fixes: d18d3f0a24fc ("l2tp: replace hlist with simple list for per-tunnel session list")
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Link: https://patch.msgid.link/20240704152508.1923908-1-jchapman@katalix.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Remove duplicate included header file trace.h and the following warning
reported by make includecheck:
trace.h is included more than once
Compile-tested only.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/20240703061147.691973-2-thorsten.blum@toblux.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This fixes a sparse warning.
Fixes: d18d3f0a24fc ("l2tp: replace hlist with simple list for per-tunnel session list")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406220754.evK8Hrjw-lkp@intel.com/
Signed-off-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/20240624082945.1925009-1-jchapman@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The per-tunnel session list is no longer used by the
datapath. However, we still need a list of sessions in the tunnel for
l2tp_session_get_nth, which is used by management code. (An
alternative might be to walk each session IDR list, matching only
sessions of a given tunnel.)
Replace the per-tunnel hlist with a per-tunnel list. In functions
which walk a list of sessions of a tunnel, walk this list instead.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All users of l2tp_tunnel_get_session are now gone so it can be
removed.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add generic session getter which uses IDR. Replace all users of
l2tp_tunnel_get_session which uses the per-tunnel session list to use
the generic getter.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If UDP sockets are aliased, sk might be the wrong socket. There's no
benefit to using sk_user_data to do some checks on the associated
tunnel context. Just report the error anyway, like udp core does.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Modify UDP decap to not use the tunnel pointer which comes from the
sock's sk_user_data when parsing the L2TP header. By looking up the
destination session using only the packet contents we avoid potential
UDP 5-tuple aliasing issues which arise from depending on the socket
that received the packet.
Drop the useless error messages on short packet or on failing to find
a session since the tunnel pointer might point to a different tunnel
if multiple sockets use the same 5-tuple.
Short packets (those not big enough to contain an L2TP header) are no
longer counted in the tunnel's invalid counter because we can't derive
the tunnel until we parse the l2tp header to lookup the session.
l2tp_udp_encap_recv was a small wrapper around l2tp_udp_recv_core which
used sk_user_data to derive a tunnel pointer in an RCU-safe way. But
we no longer need the tunnel pointer, so remove that code and combine
the two functions.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
L2TPv2 sessions are currently kept in a per-tunnel hashlist, keyed by
16-bit session_id. When handling received L2TPv2 packets, we need to
first derive the tunnel using the 16-bit tunnel_id or sock, then
lookup the session in a per-tunnel hlist using the 16-bit session_id.
We want to avoid using sk_user_data in the datapath and double lookups
on every packet. So instead, use a per-net IDR to hold L2TPv2
sessions, keyed by a 32-bit value derived from the 16-bit tunnel_id
and session_id. This will allow the L2TPv2 UDP receive datapath to
lookup a session with a single lookup without deriving the tunnel
first.
L2TPv2 sessions are held in their own IDR to avoid potential
key collisions with L2TPv3 sessions.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
L2TPv3 sessions are currently held in one of two fixed-size hash
lists: either a per-net hashlist (IP-encap), or a per-tunnel hashlist
(UDP-encap), keyed by the L2TPv3 32-bit session_id.
In order to lookup L2TPv3 sessions in UDP-encap tunnels efficiently
without finding the tunnel first via sk_user_data, UDP sessions are
now kept in a per-net session list, keyed by session ID. Convert the
existing per-net hashlist to use an IDR for better performance when
there are many sessions and have L2TPv3 UDP sessions use the same IDR.
Although the L2TPv3 RFC states that the session ID alone identifies
the session, our implementation has allowed the same session ID to be
used in different L2TP UDP tunnels. To retain support for this, a new
per-net session hashtable is used, keyed by the sock and session
ID. If on creating a new session, a session already exists with that
ID in the IDR, the colliding sessions are added to the new hashtable
and the existing IDR entry is flagged. When looking up sessions, the
approach is to first check the IDR and if no unflagged match is found,
check the new hashtable. The sock is made available to session getters
where session ID collisions are to be considered. In this way, the new
hashtable is used only for session ID collisions so can be kept small.
For managing session removal, we need a list of colliding sessions
matching a given ID in order to update or remove the IDR entry of the
ID. This is necessary to detect session ID collisions when future
sessions are created. The list head is allocated on first collision
of a given ID and refcounted.
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove an unused variable in struct l2tp_tunnel which was left behind
by commit c4d48a58f32c5 ("l2tp: convert l2tp_tunnel_list to idr").
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit a36e185e8c85
("udp: Handle ICMP errors for tunnels with same destination port on both endpoints")
UDP's handling of ICMP errors has allowed for UDP-encap tunnels to
determine socket associations in scenarios where the UDP hash lookup
could not.
Subsequently, commit d26796ae58940
("udp: check udp sock encap_type in __udp_lib_err")
subtly tweaked the approach such that UDP ICMP error handling would be
skipped for any UDP socket which has encapsulation enabled.
In the case of L2TP tunnel sockets using UDP-encap, this latter
modification effectively broke ICMP error reporting for the L2TP
control plane.
To a degree this isn't catastrophic inasmuch as the L2TP control
protocol defines a reliable transport on top of the underlying packet
switching network which will eventually detect errors and time out.
However, paying attention to the ICMP error reporting allows for more
timely detection of errors in L2TP userspace, and aids in debugging
connectivity issues.
Reinstate ICMP error handling for UDP encap L2TP tunnels:
* implement struct udp_tunnel_sock_cfg .encap_err_rcv in order to allow
the L2TP code to handle ICMP errors;
* only implement error-handling for tunnels which have a managed
socket: unmanaged tunnels using a kernel socket have no userspace to
report errors back to;
* flag the error on the socket, which allows for userspace to get an
error such as -ECONNREFUSED back from sendmsg/recvmsg;
* pass the error into ip[v6]_icmp_error() which allows for userspace to
get extended error information via. MSG_ERRQUEUE.
Fixes: d26796ae5894 ("udp: check udp sock encap_type in __udp_lib_err")
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Link: https://lore.kernel.org/r/20240513172248.623261-1-tparkin@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
628bc3e5a1be ("l2tp: Support several sockets with same IP/port quadruple")
added support for several L2TPv2 tunnels using the same IP/port quadruple,
but if an L2TPv3 socket exists it could eat all the trafic. We thus have to
first use the version from the packet to get the proper tunnel, and only
then check that the version matches.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://lore.kernel.org/r/20240509205812.4063198-1-samuel.thibault@ens-lyon.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some l2tp providers will use 1701 as origin port and open several
tunnels for the same origin and target. On the Linux side, this
may mean opening several sockets, but then trafic will go to only
one of them, losing the trafic for the tunnel of the other socket
(or leaving it up to userland, consuming a lot of cpu%).
This can also happen when the l2tp provider uses a cluster, and
load-balancing happens to migrate from one origin IP to another one,
for which a socket was already established. Managing reassigning
tunnels from one socket to another would be very hairy for userland.
Lastly, as documented in l2tpconfig(1), as client it may be necessary
to use 1701 as origin port for odd firewalls reasons, which could
prevent from establishing several tunnels to a l2tp server, for the
same reason: trafic would get only on one of the two sockets.
With the V2 protocol it is however easy to route trafic to the proper
tunnel, by looking up the tunnel number in the network namespace. This
fixes the three cases altogether.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Link: https://lore.kernel.org/r/20240506215336.1470009-1-samuel.thibault@ens-lyon.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
include/linux/filter.h
kernel/bpf/core.c
66e13b615a0c ("bpf: verifier: prevent userspace memory access")
d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I added dst_rt6_info() in commit
e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper")
This patch does a similar change for IPv4.
Instead of (struct rtable *)dst casts, we can use :
#define dst_rtable(_ptr) \
container_of_const(_ptr, struct rtable, dst)
Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Instead of (struct rt6_info *)dst casts, we can use :
#define dst_rt6_info(_ptr) \
container_of_const(_ptr, struct rt6_info, dst)
Some places needed missing const qualifiers :
ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()
v2: added missing parts (David Ahern)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Drop the flow-hash of the skb when forwarding to the L2TP netdev.
This avoids the L2TP qdisc from using the flow-hash from the outer
packet, which is identical for every flow within the tunnel.
This does not affect every platform but is specific for the ethernet
driver. It depends on the platform including L4 information in the
flow-hash.
One such example is the Mediatek Filogic MT798x family of networking
processors.
Fixes: d9e31d17ceba ("l2tp: Add L2TP ethernet pseudowire support")
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David Bauer <mail@david-bauer.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240424171110.13701-1-mail@david-bauer.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core")
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/ipv4/udp.c
f796feabb9f5 ("udp: add local "peek offset enabled" flag")
56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)")
Adjacent changes:
net/unix/garbage.c
aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.")
11498715f266 ("af_unix: Remove io_uring code for GC.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
l2tp_ip6_sendmsg needs to avoid accounting for the transport header
twice when splicing more data into an already partially-occupied skbuff.
To manage this, we check whether the skbuff contains data using
skb_queue_empty when deciding how much data to append using
ip6_append_data.
However, the code which performed the calculation was incorrect:
ulen = len + skb_queue_empty(&sk->sk_write_queue) ? transhdrlen : 0;
...due to C operator precedence, this ends up setting ulen to
transhdrlen for messages with a non-zero length, which results in
corrupted packets on the wire.
Add parentheses to correct the calculation in line with the original
intent.
Fixes: 9d4c75800f61 ("ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()")
Cc: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240220122156.43131-1-tparkin@katalix.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Since commit aed65af1cc2f ("drivers: make device_type const"), the driver
core can properly handle constant struct device_type. Move the l2tpeth_type
variable to be a constant structure as well, placing it into read-only
memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Set scope automatically in ip_route_output_ports() (using the socket
SOCK_LOCALROUTE flag). This way, callers don't have to overload the
tos with the RTO_ONLINK flag, like RT_CONN_FLAGS() does.
For callers that don't pass a struct sock, this doesn't change anything
as the scope is still set to RT_SCOPE_UNIVERSE when sk is NULL.
Callers that passed a struct sock and used RT_CONN_FLAGS(sk) or
RT_CONN_FLAGS_TOS(sk, tos) for the tos are modified to use
ip_sock_tos(sk) and RT_TOS(tos) respectively, as overloading tos with
the RTO_ONLINK flag now becomes unnecessary.
In drivers/net/amt.c, all ip_route_output_ports() calls use a 0 tos
parameter, ignoring the SOCK_LOCALROUTE flag of the socket. But the sk
parameter is a kernel socket, which doesn't have any configuration path
for setting SOCK_LOCALROUTE anyway. Therefore, ip_route_output_ports()
will continue to initialise scope with RT_SCOPE_UNIVERSE and amt.c
doesn't need to be modified.
Also, remove RT_CONN_FLAGS() and RT_CONN_FLAGS_TOS() from route.h as
these macros are now unused.
The objective is to eventually remove RTO_ONLINK entirely to allow
converting ->flowi4_tos to dscp_t. This will ensure proper isolation
between the DSCP and ECN bits, thus minimising the risk of introducing
bugs where TOS values interfere with ECN.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/dacfd2ab40685e20959ab7b53c427595ba229e7d.1707496938.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
np->ucast_oif is read locklessly in some contexts.
Make all accesses to this field lockless, adding appropriate
annotations.
This also makes setsockopt( IPV6_UNICAST_IF ) lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
np->mcast_oif is read locklessly in some contexts.
Make all accesses to this field lockless, adding appropriate
annotations.
This also makes setsockopt( IPV6_MULTICAST_IF ) lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts (or adjacent changes of note).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Including the transhdrlen in length is a problem when the packet is
partially filled (e.g. something like send(MSG_MORE) happened previously)
when appending to an IPv4 or IPv6 packet as we don't want to repeat the
transport header or account for it twice. This can happen under some
circumstances, such as splicing into an L2TP socket.
The symptom observed is a warning in __ip6_append_data():
WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
that occurs when MSG_SPLICE_PAGES is used to append more data to an already
partially occupied skbuff. The warning occurs when 'copy' is larger than
the amount of data in the message iterator. This is because the requested
length includes the transport header length when it shouldn't. This can be
triggered by, for example:
sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
bind(sfd, ...); // ::1
connect(sfd, ...); // ::1 port 7
send(sfd, buffer, 4100, MSG_MORE);
sendfile(sfd, dfd, NULL, 1024);
Fix this by only adding transhdrlen into the length if the write queue is
empty in l2tp_ip6_sendmsg(), analogously to how UDP does things.
l2tp_ip_sendmsg() looks like it won't suffer from this problem as it builds
the UDP packet itself.
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Reported-by: syzbot+62cbf263225ae13ff153@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@google.com/
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
cc: syzkaller-bugs@googlegroups.com
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Core networking has opt-in atomic variant of dev->stats,
simply use DEV_STATS_INC(), DEV_STATS_ADD() and DEV_STATS_READ().
v2: removed @priv local var in l2tp_eth_dev_recv() (Simon)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
np->sndflow reads are racy.
Use one bit ftom atomic inet->inet_flags instead,
IPV6_FLOWINFO_SEND setsockopt() can be lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move np->dontfrag flag to inet->inet_flags to fix data-races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot/KCSAN complained about UDP_ENCAP_L2TPINUDP setsockopt() racing.
Add READ_ONCE()/WRITE_ONCE() to document races on this lockless field.
syzbot report was:
BUG: KCSAN: data-race in udp_lib_setsockopt / udp_lib_setsockopt
read-write to 0xffff8881083603fa of 1 bytes by task 16557 on cpu 0:
udp_lib_setsockopt+0x682/0x6c0
udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2779
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3697
__sys_setsockopt+0x1c9/0x230 net/socket.c:2263
__do_sys_setsockopt net/socket.c:2274 [inline]
__se_sys_setsockopt net/socket.c:2271 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2271
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read-write to 0xffff8881083603fa of 1 bytes by task 16554 on cpu 1:
udp_lib_setsockopt+0x682/0x6c0
udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2779
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3697
__sys_setsockopt+0x1c9/0x230 net/socket.c:2263
__do_sys_setsockopt net/socket.c:2274 [inline]
__se_sys_setsockopt net/socket.c:2271 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2271
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x01 -> 0x05
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 16554 Comm: syz-executor.5 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Various inet fields are currently racy.
do_ip_setsockopt() and do_ip_getsockopt() are mostly holding
the socket lock, but some (fast) paths do not.
Use a new inet->inet_flags to hold atomic bits in the series.
Remove inet->cmsg_flags, and use instead 9 bits from inet_flags.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/dsa/port.c
9945c1fb03a3 ("net: dsa: fix older DSA drivers using phylink")
a88dd7538461 ("net: dsa: remove legacy_pre_march2020 detection")
https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/
net/xdp/xsk.c
3c5b4d69c358 ("net: annotate data-races around sk->sk_mark")
b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/
drivers/net/ethernet/broadcom/bnxt/bnxt.c
37b61cda9c16 ("bnxt: don't handle XDP in netpoll")
2b56b3d99241 ("eth: bnxt: handle invalid Tx completions more gracefully")
https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c
62da08331f1a ("net/mlx5e: Set proper IPsec source port in L4 selector")
fbd517549c32 ("net/mlx5e: Add function to get IPsec offload namespace")
drivers/net/ethernet/sfc/selftest.c
55c1528f9b97 ("sfc: fix field-spanning memcpy in selftest")
ae9d445cd41f ("sfc: Miscellaneous comment removals")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sk->sk_mark is often read while another thread could change the value.
Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IPv6 inet sockets are supposed to have a "struct ipv6_pinfo"
field at the end of their definition, so that inet6_sk_generic()
can derive from socket size the offset of the "struct ipv6_pinfo".
This is very fragile, and prevents adding bigger alignment
in sockets, because inet6_sk_generic() does not work
if the compiler adds padding after the ipv6_pinfo comp |