summaryrefslogtreecommitdiff
path: root/net/sctp
AgeCommit message (Collapse)AuthorFilesLines
2023-10-10sctp: update hb timer immediately after users change hb_intervalXin Long1-0/+1
[ Upstream commit 1f4e803cd9c9166eb8b6c8b0b8e4124f7499fc07 ] Currently, when hb_interval is changed by users, it won't take effect until the next expiry of hb timer. As the default value is 30s, users have to wait up to 30s to wait its hb_interval update to work. This becomes pretty bad in containers where a much smaller value is usually set on hb_interval. This patch improves it by resetting the hb timer immediately once the value of hb_interval is updated by users. Note that we don't address the already existing 'problem' when sending a heartbeat 'on demand' if one hb has just been sent(from the timer) mentioned in: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10sctp: update transport state when processing a dupcook packetXin Long1-2/+1
[ Upstream commit 2222a78075f0c19ca18db53fd6623afb4aff602d ] During the 4-way handshake, the transport's state is set to ACTIVE in sctp_process_init() when processing INIT_ACK chunk on client or COOKIE_ECHO chunk on server. In the collision scenario below: 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] 192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state, sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it creates a new association and sets its transport to ACTIVE then updates to the old association in sctp_assoc_update(). However, in sctp_assoc_update(), it will skip the transport update if it finds a transport with the same ipaddr already existing in the old asoc, and this causes the old asoc's transport state not to move to ACTIVE after the handshake. This means if DATA retransmission happens at this moment, it won't be able to enter PF state because of the check 'transport->state == SCTP_ACTIVE' in sctp_do_8_2_transport_strike(). This patch fixes it by updating the transport in sctp_assoc_update() with sctp_assoc_add_peer() where it updates the transport state if there is already a transport with the same ipaddr exists in the old asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-19sctp: annotate data-races around sk->sk_wmem_queuedEric Dumazet2-6/+6
[ Upstream commit dc9511dd6f37fe803f6b15b61b030728d7057417 ] sk->sk_wmem_queued can be read locklessly from sctp_poll() Use sk_wmem_queued_add() when the field is changed, and add READ_ONCE() annotations in sctp_writeable() and sctp_assocs_seq_show() syzbot reported: BUG: KCSAN: data-race in sctp_poll / sctp_wfree read-write to 0xffff888149d77810 of 4 bytes by interrupt on cpu 0: sctp_wfree+0x170/0x4a0 net/sctp/socket.c:9147 skb_release_head_state+0xb7/0x1a0 net/core/skbuff.c:988 skb_release_all net/core/skbuff.c:1000 [inline] __kfree_skb+0x16/0x140 net/core/skbuff.c:1016 consume_skb+0x57/0x180 net/core/skbuff.c:1232 sctp_chunk_destroy net/sctp/sm_make_chunk.c:1503 [inline] sctp_chunk_put+0xcd/0x130 net/sctp/sm_make_chunk.c:1530 sctp_datamsg_put+0x29a/0x300 net/sctp/chunk.c:128 sctp_chunk_free+0x34/0x50 net/sctp/sm_make_chunk.c:1515 sctp_outq_sack+0xafa/0xd70 net/sctp/outqueue.c:1381 sctp_cmd_process_sack net/sctp/sm_sideeffect.c:834 [inline] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1366 [inline] sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x12c7/0x31b0 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x2b2/0x430 net/sctp/associola.c:1051 sctp_inq_push+0x108/0x120 net/sctp/inqueue.c:80 sctp_rcv+0x116e/0x1340 net/sctp/input.c:243 sctp6_rcv+0x25/0x40 net/sctp/ipv6.c:1120 ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437 ip6_input_finish net/ipv6/ip6_input.c:482 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:468 [inline] ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79 NF_HOOK include/linux/netfilter.h:303 [inline] ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core net/core/dev.c:5452 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566 process_backlog+0x21f/0x380 net/core/dev.c:5894 __napi_poll+0x60/0x3b0 net/core/dev.c:6460 napi_poll net/core/dev.c:6527 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6660 __do_softirq+0xc1/0x265 kernel/softirq.c:553 run_ksoftirqd+0x17/0x20 kernel/softirq.c:921 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:389 ret_from_fork+0x2e/0x40 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 read to 0xffff888149d77810 of 4 bytes by task 17828 on cpu 1: sctp_writeable net/sctp/socket.c:9304 [inline] sctp_poll+0x265/0x410 net/sctp/socket.c:8671 sock_poll+0x253/0x270 net/socket.c:1374 vfs_poll include/linux/poll.h:88 [inline] do_pollfd fs/select.c:873 [inline] do_poll fs/select.c:921 [inline] do_sys_poll+0x636/0xc00 fs/select.c:1015 __do_sys_ppoll fs/select.c:1121 [inline] __se_sys_ppoll+0x1af/0x1f0 fs/select.c:1101 __x64_sys_ppoll+0x67/0x80 fs/select.c:1101 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00019e80 -> 0x0000cc80 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17828 Comm: syz-executor.1 Not tainted 6.5.0-rc7-syzkaller-00185-g28f20a19294d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20230830094519.950007-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13sctp: handle invalid error codes without calling BUG()Dan Carpenter1-1/+4
[ Upstream commit a0067dfcd9418fd3b0632bc59210d120d038a9c6 ] The sctp_sf_eat_auth() function is supposed to return enum sctp_disposition values but if the call to sctp_ulpevent_make_authkey() fails, it returns -ENOMEM. This results in calling BUG() inside the sctp_side_effects() function. Calling BUG() is an over reaction and not helpful. Call WARN_ON_ONCE() instead. This code predates git. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30ipv4: fix data-races around inet->inet_idEric Dumazet1-1/+1
[ Upstream commit f866fbc842de5976e41ba874b76ce31710b634b5 ] UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1] Convert inet->inet_id to an atomic_t, but implement a dedicated path for TCP, avoiding cost of a locked instruction (atomic_add_return()) Note that this patch will cause a trivial merge conflict because we added inet->flags in net-next tree. v2: added missing change in drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c (David Ahern) [1] BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1: ip_select_ident_segs include/net/ip.h:542 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0: ip_select_ident_segs include/net/ip.h:541 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x184d -> 0x184e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 ================================================================== Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30sock: annotate data-races around prot->memory_pressureEric Dumazet1-1/+1
[ Upstream commit 76f33296d2e09f63118db78125c95ef56df438e9 ] *prot->memory_pressure is read/writen locklessly, we need to add proper annotations. A recent commit added a new race, it is time to audit all accesses. Fixes: 2d0c88e84e48 ("sock: Fix misuse of sk_under_memory_pressure()") Fixes: 4d93df0abd50 ("[SCTP]: Rewrite of sctp buffer management code") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Abel Wu <wuyun.abel@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230818015132.2699348-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19sctp: fix potential deadlock on &net->sctp.addr_wq_lockChengfeng Ye1-2/+2
[ Upstream commit 6feb37b3b06e9049e20dcf7e23998f92c9c5be9a ] As &net->sctp.addr_wq_lock is also acquired by the timer sctp_addr_wq_timeout_handler() in protocal.c, the same lock acquisition at sctp_auto_asconf_init() seems should disable irq since it is called from sctp_accept() under process context. Possible deadlock scenario: sctp_accept() -> sctp_sock_migrate() -> sctp_auto_asconf_init() -> spin_lock(&net->sctp.addr_wq_lock) <timer interrupt> -> sctp_addr_wq_timeout_handler() -> spin_lock_bh(&net->sctp.addr_wq_lock); (deadlock here) This flaw was found using an experimental static analysis tool we are developing for irq-related deadlock. The tentative patch fix the potential deadlock by spin_lock_bh(). Signed-off-by: Chengfeng Ye <dg573847474@gmail.com> Fixes: 34e5b0118685 ("sctp: delay auto_asconf init until binding the first addr") Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20230627120340.19432-1-dg573847474@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19sctp: add bpf_bypass_getsockopt proto callbackAlexander Mikhalitsyn1-0/+18
[ Upstream commit 2598619e012cee5273a2821441b9a051ad931249 ] Implement ->bpf_bypass_getsockopt proto callback and filter out SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3 socket options from running eBPF hook on them. SCTP_SOCKOPT_PEELOFF and SCTP_SOCKOPT_PEELOFF_FLAGS options do fd_install(), and if BPF_CGROUP_RUN_PROG_GETSOCKOPT hook returns an error after success of the original handler sctp_getsockopt(...), userspace will receive an error from getsockopt syscall and will be not aware that fd was successfully installed into a fdtable. As pointed by Marcelo Ricardo Leitner it seems reasonable to skip bpf getsockopt hook for SCTP_SOCKOPT_CONNECTX3 sockopt too. Because internaly, it triggers connect() and if error is masked then userspace will be confused. This patch was born as a result of discussion around a new SCM_PIDFD interface: https://lore.kernel.org/all/20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com/ Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Christian Brauner <brauner@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: linux-sctp@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Suggested-by: Stanislav Fomichev <sdf@google.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21sctp: fix an error code in sctp_sf_eat_auth()Dan Carpenter1-1/+1
[ Upstream commit 75e6def3b26736e7ff80639810098c9074229737 ] The sctp_sf_eat_auth() function is supposed to enum sctp_disposition values and returning a kernel error code will cause issues in the caller. Change -ENOMEM to SCTP_DISPOSITION_NOMEM. Fixes: 65b07e5d0d09 ("[SCTP]: API updates to suport SCTP-AUTH extensions.") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05inet: Add IP_LOCAL_PORT_RANGE socket optionJakub Sitnicki1-1/+1
[ Upstream commit 91d0b78c5177f3e42a4d8738af8ac19c3a90d002 ] Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30sctp: fix an issue that plpmtu can never go to complete stateXin Long1-4/+7
commit 6ca328e985cd995dfd1d5de44046e6074f853fbb upstream. When doing plpmtu probe, the probe size is growing every time when it receives the ACK during the Search state until the probe fails. When the failure occurs, pl.probe_high is set and it goes to the Complete state. However, if the link pmtu is huge, like 65535 in loopback_dev, the probe eventually keeps using SCTP_MAX_PLPMTU as the probe size and never fails. Because of that, pl.probe_high can not be set, and the plpmtu probe can never go to the Complete state. Fix it by setting pl.probe_high to SCTP_MAX_PLPMTU when the probe size grows to SCTP_MAX_PLPMTU in sctp_transport_pl_recv(). Also, not allow the probe size greater than SCTP_MAX_PLPMTU in the Complete state. Fixes: b87641aff9e7 ("sctp: do state transition when a probe succeeds on HB ACK recv path") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26sctp: Call inet6_destroy_sock() via sk->sk_destruct().Kuniyuki Iwashima1-8/+21
commit 6431b0f6ff1633ae598667e4cdd93830074a03e8 upstream. After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct() by setting inet6_sock_destruct() to it to make sure we do not leak inet6-specific resources. SCTP sets its own sk->sk_destruct() in the sctp_init_sock(), and SCTPv6 socket reuses it as the init function. To call inet6_sock_destruct() from SCTPv6 sk->sk_destruct(), we set sctp_v6_destruct_sock() in a new init function. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20sctp: fix a potential overflow in sctp_ifwdtsn_skipXin Long1-1/+2
[ Upstream commit 32832a2caf82663870126c5186cf8f86c8b2a649 ] Currently, when traversing ifwdtsn skips with _sctp_walk_ifwdtsn, it only checks the pos against the end of the chunk. However, the data left for the last pos may be < sizeof(struct sctp_ifwdtsn_skip), and dereference it as struct sctp_ifwdtsn_skip may cause coverflow. This patch fixes it by checking the pos against "the end of the chunk - sizeof(struct sctp_ifwdtsn_skip)" in sctp_ifwdtsn_skip, similar to sctp_fwdtsn_skip. Fixes: 0fc2ea922c8a ("sctp: implement validate_ftsn for sctp_stream_interleave") Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/2a71bffcd80b4f2c61fac6d344bb2f11c8fd74f7.1681155810.git.lucien.xin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-13sctp: check send stream number after wait_for_sndbufXin Long1-0/+4
[ Upstream commit 2584024b23552c00d95b50255e47bd18d306d31a ] This patch fixes a corner case where the asoc out stream count may change after wait_for_sndbuf. When the main thread in the client starts a connection, if its out stream count is set to N while the in stream count in the server is set to N - 2, another thread in the client keeps sending the msgs with stream number N - 1, and waits for sndbuf before processing INIT_ACK. However, after processing INIT_ACK, the out stream count in the client is shrunk to N - 2, the same to the in stream count in the server. The crash occurs when the thread waiting for sndbuf is awake and sends the msg in a non-existing stream(N - 1), the call trace is as below: KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] Call Trace: <TASK> sctp_cmd_send_msg net/sctp/sm_sideeffect.c:1114 [inline] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1777 [inline] sctp_side_effects net/sctp/sm_sideeffect.c:1199 [inline] sctp_do_sm+0x197d/0x5310 net/sctp/sm_sideeffect.c:1170 sctp_primitive_SEND+0x9f/0xc0 net/sctp/primitive.c:163 sctp_sendmsg_to_asoc+0x10eb/0x1a30 net/sctp/socket.c:1868 sctp_sendmsg+0x8d4/0x1d90 net/sctp/socket.c:2026 inet_sendmsg+0x9d/0xe0 net/ipv4/af_inet.c:825 sock_sendmsg_nosec net/socket.c:722 [inline] sock_sendmsg+0xde/0x190 net/socket.c:745 The fix is to add an unlikely check for the send stream number after the thread wakes up from the wait_for_sndbuf. Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: syzbot+47c24ca20a2fa01f082e@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11sctp: add a refcnt in sctp_stream_priorities to avoid a nested loopXin Long1-31/+21
[ Upstream commit 68ba44639537de6f91fe32783766322d41848127 ] With this refcnt added in sctp_stream_priorities, we don't need to traverse all streams to check if the prio is used by other streams when freeing one stream's prio in sctp_sched_prio_free_sid(). This can avoid a nested loop (up to 65535 * 65535), which may cause a stuck as Ying reported: watchdog: BUG: soft lockup - CPU#23 stuck for 26s! [ksoftirqd/23:136] Call Trace: <TASK> sctp_sched_prio_free_sid+0xab/0x100 [sctp] sctp_stream_free_ext+0x64/0xa0 [sctp] sctp_stream_free+0x31/0x50 [sctp] sctp_association_free+0xa5/0x200 [sctp] Note that it doesn't need to use refcount_t type for this counter, as its accessing is always protected under the sock lock. v1->v2: - add a check in sctp_sched_prio_set to avoid the possible prio_head refcnt overflow. Fixes: 9ed7bfc79542 ("sctp: fix memory leak in sctp_stream_outq_migrate()") Reported-by: Ying Xu <yinxu@redhat.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/825eb0c905cb864991eba335f4a2b780e543f06b.1677085641.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22sctp: sctp_sock_filter(): avoid list_entry() on possibly empty listPietro Borrello1-3/+1
commit a1221703a0f75a9d81748c516457e0fc76951496 upstream. Use list_is_first() to check whether tsp->asoc matches the first element of ep->asocs, as the list is not guaranteed to have an entry. Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20230208-sctp-filter-v2-1-6e1f4017f326@diag.uniroma1.it Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09sctp: do not check hb_timer.expires when resetting hb_timerXin Long1-3/+1
[ Upstream commit 8f35ae17ef565a605de5f409e04bcd49a55d7646 ] It tries to avoid the frequently hb_timer refresh in commit ba6f5e33bdbb ("sctp: avoid refreshing heartbeat timer too often"), and it only allows mod_timer when the new expires is after hb_timer.expires. It means even a much shorter interval for hb timer gets applied, it will have to wait until the current hb timer to time out. In sctp_do_8_2_transport_strike(), when a transport enters PF state, it expects to update the hb timer to resend a heartbeat every rto after calling sctp_transport_reset_hb_timer(), which will not work as the change mentioned above. The frequently hb_timer refresh was caused by sctp_transport_reset_timers() called in sctp_outq_flush() and it was already removed in the commit above. So we don't have to check hb_timer.expires when resetting hb_timer as it is now not called very often. Fixes: ba6f5e33bdbb ("sctp: avoid refreshing heartbeat timer too often") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/d958c06985713ec84049a2d5664879802710179a.1675095933.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01sctp: fail if no bound addresses can be used for a given scopeMarcelo Ricardo Leitner1-0/+6
[ Upstream commit 458e279f861d3f61796894cd158b780765a1569f ] Currently, if you bind the socket to something like: servaddr.sin6_family = AF_INET6; servaddr.sin6_port = htons(0); servaddr.sin6_scope_id = 0; inet_pton(AF_INET6, "::1", &servaddr.sin6_addr); And then request a connect to: connaddr.sin6_family = AF_INET6; connaddr.sin6_port = htons(20000); connaddr.sin6_scope_id = if_nametoindex("lo"); inet_pton(AF_INET6, "fe88::1", &connaddr.sin6_addr); What the stack does is: - bind the socket - create a new asoc - to handle the connect - copy the addresses that can be used for the given scope - try to connect But the copy returns 0 addresses, and the effect is that it ends up trying to connect as if the socket wasn't bound, which is not the desired behavior. This unexpected behavior also allows KASLR leaks through SCTP diag interface. The fix here then is, if when trying to copy the addresses that can be used for the scope used in connect() it returns 0 addresses, bail out. This is what TCP does with a similar reproducer. Reported-by: Pietro Borrello <borrello@diag.uniroma1.it> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/9fcd182f1099f86c6661f3717f63712ddd1c676c.1674496737.git.marcelo.leitner@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31sctp: sysctl: make extra pointers netns awareFiro Yang1-29/+44
[ Upstream commit da05cecc4939c0410d56c29e252998b192756318 ] Recently, a customer reported that from their container whose net namespace is different to the host's init_net, they can't set the container's net.sctp.rto_max to any value smaller than init_net.sctp.rto_min. For instance, Host: sudo sysctl net.sctp.rto_min net.sctp.rto_min = 1000 Container: echo 100 > /mnt/proc-net/sctp/rto_min echo 400 > /mnt/proc-net/sctp/rto_max echo: write error: Invalid argument This is caused by the check made from this'commit 4f3fdf3bc59c ("sctp: add check rto_min and rto_max in sysctl")' When validating the input value, it's always referring the boundary value set for the init_net namespace. Having container's rto_max smaller than host's init_net.sctp.rto_min does make sense. Consider that the rto between two containers on the same host is very likely smaller than it for two hosts. So to fix this problem, as suggested by Marcelo, this patch makes the extra pointers of rto_min, rto_max, pf_retrans, and ps_retrans point to the corresponding variables from the newly created net namespace while the new net namespace is being registered in sctp_sysctl_net_register. Fixes: 4f3fdf3bc59c ("sctp: add check rto_min and rto_max in sysctl") Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Firo Yang <firo.yang@suse.com> Link: https://lore.kernel.org/r/20221209054854.23889-1-firo.yang@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-29sctp: fix memory leak in sctp_stream_outq_migrate()Zhengchao Shao4-7/+47
When sctp_stream_outq_migrate() is called to release stream out resources, the memory pointed to by prio_head in stream out is not released. The memory leak information is as follows: unreferenced object 0xffff88801fe79f80 (size 64): comm "sctp_repo", pid 7957, jiffies 4294951704 (age 36.480s) hex dump (first 32 bytes): 80 9f e7 1f 80 88 ff ff 80 9f e7 1f 80 88 ff ff ................ 90 9f e7 1f 80 88 ff ff 90 9f e7 1f 80 88 ff ff ................ backtrace: [<ffffffff81b215c6>] kmalloc_trace+0x26/0x60 [<ffffffff88ae517c>] sctp_sched_prio_set+0x4cc/0x770 [<ffffffff88ad64f2>] sctp_stream_init_ext+0xd2/0x1b0 [<ffffffff88aa2604>] sctp_sendmsg_to_asoc+0x1614/0x1a30 [<ffffffff88ab7ff1>] sctp_sendmsg+0xda1/0x1ef0 [<ffffffff87f765ed>] inet_sendmsg+0x9d/0xe0 [<ffffffff8754b5b3>] sock_sendmsg+0xd3/0x120 [<ffffffff8755446a>] __sys_sendto+0x23a/0x340 [<ffffffff87554651>] __x64_sys_sendto+0xe1/0x1b0 [<ffffffff89978b49>] do_syscall_64+0x39/0xb0 [<ffffffff89a0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Link: https://syzkaller.appspot.com/bug?exrid=29c402e56c4760763cc0 Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler") Reported-by: syzbot+29c402e56c4760763cc0@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20221126031720.378562-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-07sctp: clear out_curr if all frag chunks of current msg are prunedXin Long1-0/+5
A crash was reported by Zhen Chen: list_del corruption, ffffa035ddf01c18->next is NULL WARNING: CPU: 1 PID: 250682 at lib/list_debug.c:49 __list_del_entry_valid+0x59/0xe0 RIP: 0010:__list_del_entry_valid+0x59/0xe0 Call Trace: sctp_sched_dequeue_common+0x17/0x70 [sctp] sctp_sched_fcfs_dequeue+0x37/0x50 [sctp] sctp_outq_flush_data+0x85/0x360 [sctp] sctp_outq_uncork+0x77/0xa0 [sctp] sctp_cmd_interpreter.constprop.0+0x164/0x1450 [sctp] sctp_side_effects+0x37/0xe0 [sctp] sctp_do_sm+0xd0/0x230 [sctp] sctp_primitive_SEND+0x2f/0x40 [sctp] sctp_sendmsg_to_asoc+0x3fa/0x5c0 [sctp] sctp_sendmsg+0x3d5/0x440 [sctp] sock_sendmsg+0x5b/0x70 and in sctp_sched_fcfs_dequeue() it dequeued a chunk from stream out_curr outq while this outq was empty. Normally stream->out_curr must be set to NULL once all frag chunks of current msg are dequeued, as we can see in sctp_sched_dequeue_done(). However, in sctp_prsctp_prune_unsent() as it is not a proper dequeue, sctp_sched_dequeue_done() is not called to do this. This patch is to fix it by simply setting out_curr to NULL when the last frag chunk of current msg is dequeued from out_curr stream in sctp_prsctp_prune_unsent(). Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Zhen Chen <chenzhen126@huawei.com> Tested-by: Caowangbao <caowangbao@huawei.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-07sctp: remove the unnecessary sinfo_stream check in sctp_prsctp_prune_unsentXin Long1-5/+3
Since commit 5bbbbe32a431 ("sctp: introduce stream scheduler foundations"), sctp_stream_outq_migrate() has been called in sctp_stream_init/update to removes those chunks to streams higher than the new max. There is no longer need to do such check in sctp_prsctp_prune_unsent(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-11treewide: use get_random_{u8,u16}() when possible, part 1Jason A. Donenfeld1-1/+1
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value, simply use the get_random_{u8,u16}() functions, which are faster than wasting the additional bytes from a 32-bit value. This was done mechanically with this coccinelle script: @@ expression E; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; typedef __be16; typedef __le16; typedef u8; @@ ( - (get_random_u32() & 0xffff) + get_random_u16() | - (get_random_u32() & 0xff) + get_random_u8() | - (get_random_u32() % 65536) + get_random_u16() | - (get_random_u32() % 256) + get_random_u8() | - (get_random_u32() >> 16) + get_random_u16() | - (get_random_u32() >> 24) + get_random_u8() | - (u16)get_random_u32() + get_random_u16() | - (u8)get_random_u32() + get_random_u8() | - (__be16)get_random_u32() + (__be16)get_random_u16() | - (__le16)get_random_u32() + (__le16)get_random_u16() | - prandom_u32_max(65536) + get_random_u16() | - prandom_u32_max(256) + get_random_u8() | - E->inet_id = get_random_u32() + E->inet_id = get_random_u16() ) @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; identifier v; @@ - u16 v = get_random_u32(); + u16 v = get_random_u16(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u8; identifier v; @@ - u8 v = get_random_u32(); + u8 v = get_random_u8(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; u16 v; @@ - v = get_random_u32(); + v = get_random_u16(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u8; u8 v; @@ - v = get_random_u32(); + v = get_random_u8(); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Examine limits @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value < 256: coccinelle.RESULT = cocci.make_ident("get_random_u8") elif value < 65536: coccinelle.RESULT = cocci.make_ident("get_random_u16") else: print("Skipping large mask of %s" % (literal)) cocci.include_match(False) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; identifier add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + (RESULT() & LITERAL) Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld1-1/+1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-30sctp: handle the error returned from sctp_auth_asoc_init_active_keyXin Long1-4/+14
When it returns an error from sctp_auth_asoc_init_active_key(), the active_key is actually not updated. The old sh_key will be freeed while it's still used as active key in asoc. Then an use-after-free will be triggered when sending patckets, as found by syzbot: sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112 sctp_set_owner_w net/sctp/socket.c:132 [inline] sctp_sendmsg_to_asoc+0xbd5/0x1a20 net/sctp/socket.c:1863 sctp_sendmsg+0x1053/0x1d50 net/sctp/socket.c:2025 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 This patch is to fix it by not replacing the sh_key when it returns errors from sctp_auth_asoc_init_active_key() in sctp_auth_set_key(). For sctp_auth_set_active_key(), old active_key_id will be set back to asoc->active_key_id when the same thing happens. Fixes: 58acd1009226 ("sctp: update active_key for asoc when old key is being replaced") Reported-by: syzbot+a236dd8e9622ed8954a3@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-20/+6
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-27sctp: leave the err path free in sctp_stream_init to sctp_stream_freeXin Long2-19/+5
A NULL pointer dereference was reported by Wei Chen: BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:__list_del_entry_valid+0x26/0x80 Call Trace: <TASK> sctp_sched_dequeue_common+0x1c/0x90 sctp_sched_prio_dequeue+0x67/0x80 __sctp_outq_teardown+0x299/0x380 sctp_outq_free+0x15/0x20 sctp_association_free+0xc3/0x440 sctp_do_sm+0x1ca7/0x2210 sctp_assoc_bh_rcv+0x1f6/0x340 This happens when calling sctp_sendmsg without connecting to server first. In this case, a data chunk already queues up in send queue of client side when processing the INIT_ACK from server in sctp_process_init() where it calls sctp_stream_init() to alloc stream_in. If it fails to alloc stream_in all stream_out will be freed in sctp_stream_init's err path. Then in the asoc freeing it will crash when dequeuing this data chunk as stream_out is missing. As we can't free stream out before dequeuing all data from send queue, and this patch is to fix it by moving the err path stream_out/in freeing in sctp_stream_init() to sctp_stream_free() which is eventually called when freeing the asoc in sctp_association_free(). This fix also makes the code in sctp_process_init() more clear. Note that in sctp_association_init() when it fails in sctp_stream_init(), sctp_association_free() will not be called, and in that case it should go to 'stream_free' err path to free stream instead of 'fail_init'. Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/831a3dc100c4908ff76e5bcc363be97f2778bc0b.1658787066.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-25sctp: fix sleep in atomic context bug in timer handlersDuoming Zhou1-1/+1
There are sleep in atomic context bugs in timer handlers of sctp such as sctp_generate_t3_rtx_event(), sctp_generate_probe_event(), sctp_generate_t1_init_event(), sctp_generate_timeout_event(), sctp_generate_t3_rtx_event() and so on. The root cause is sctp_sched_prio_init_sid() with GFP_KERNEL parameter that may sleep could be called by different timer handlers which is in interrupt context. One of the call paths that could trigger bug is shown below: (interrupt context) sctp_generate_probe_event sctp_do_sm sctp_side_effects sctp_cmd_interpreter sctp_outq_teardown sctp_outq_init sctp_sched_set_sched n->init_sid(..,GFP_KERNEL) sctp_sched_prio_init_sid //may sleep This patch changes gfp_t parameter of init_sid in sctp_sched_set_sched() from GFP_KERNEL to GFP_ATOMIC in order to prevent sleep in atomic context bugs. Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/20220723015809.11553-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-15ip: Fix data-races around sysctl_ip_nonlocal_bind.Kuniyuki Iwashima1-1/+1
While reading sysctl_ip_nonlocal_bind, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-10net: keep sk->sk_forward_alloc as small as possibleEric Dumazet4-13/+0
Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast. Each TCP socket can forward allocate up to 2 MB of memory, even after flow became less active. 10,000 sockets can have reserved 20 GB of memory, and we have no shrinker in place to reclaim that. Instead of trying to reclaim the extra allocations in some places, just keep sk->sk_forward_alloc values as small as possible. This should not impact performance too much now we have per-cpu reserves: Changes to tcp_memory_allocated should not be too frequent. For sockets not using SO_RESERVE_MEM: - idle sockets (no packets in tx/rx queues) have zero forward alloc. - non idle sockets have a forward alloc smaller than one page. Note: - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD is left to MPTCP maintainers as a follow up. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10net: add per_cpu_fw_alloc field to struct protoEric Dumazet1-0/+7
Each protocol having a ->memory_allocated pointer gets a corresponding per-cpu reserve, that following patches will use. Instead of having reserved bytes per socket, we want to have per-cpu reserves. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFTEric Dumazet1-2/+2
Due to memcg interface, SK_MEM_QUANTUM is e