summaryrefslogtreecommitdiff
path: root/net/tipc/core.c
AgeCommit message (Collapse)AuthorFilesLines
2022-06-17tipc: fix use-after-free Read in tipc_named_reinitHoang Le1-2/+1
syzbot found the following issue on: ================================================================== BUG: KASAN: use-after-free in tipc_named_reinit+0x94f/0x9b0 net/tipc/name_distr.c:413 Read of size 8 at addr ffff88805299a000 by task kworker/1:9/23764 CPU: 1 PID: 23764 Comm: kworker/1:9 Not tainted 5.18.0-rc4-syzkaller-00878-g17d49e6e8012 #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events tipc_net_finalize_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0xeb/0x495 mm/kasan/report.c:313 print_report mm/kasan/report.c:429 [inline] kasan_report.cold+0xf4/0x1c6 mm/kasan/report.c:491 tipc_named_reinit+0x94f/0x9b0 net/tipc/name_distr.c:413 tipc_net_finalize+0x234/0x3d0 net/tipc/net.c:138 process_one_work+0x996/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298 </TASK> [...] ================================================================== In the commit d966ddcc3821 ("tipc: fix a deadlock when flushing scheduled work"), the cancel_work_sync() function just to make sure ONLY the work tipc_net_finalize_work() is executing/pending on any CPU completed before tipc namespace is destroyed through tipc_exit_net(). But this function is not guaranteed the work is the last queued. So, the destroyed instance may be accessed in the work which will try to enqueue later. In order to completely fix, we re-order the calling of cancel_work_sync() to make sure the work tipc_net_finalize_work() was last queued and it must be completed by calling cancel_work_sync(). Reported-by: syzbot+47af19f3307fc9c5c82e@syzkaller.appspotmail.com Fixes: d966ddcc3821 ("tipc: fix a deadlock when flushing scheduled work") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18tipc: simplify the finalize work queueXin Long1-2/+2
This patch is to use "struct work_struct" for the finalize work queue instead of "struct tipc_net_work", as it can get the "net" and "addr" from tipc_net's other members and there is no need to add extra net and addr in tipc_net by defining "struct tipc_net_work". Note that it's safe to get net from tn->bcl as bcl is always released after the finalize work queue is done. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17tipc: wait and exit until all work queues are doneXin Long1-0/+2
On some host, a crash could be triggered simply by repeating these commands several times: # modprobe tipc # tipc bearer enable media udp name UDP1 localip 127.0.0.1 # rmmod tipc [] BUG: unable to handle kernel paging request at ffffffffc096bb00 [] Workqueue: events 0xffffffffc096bb00 [] Call Trace: [] ? process_one_work+0x1a7/0x360 [] ? worker_thread+0x30/0x390 [] ? create_worker+0x1a0/0x1a0 [] ? kthread+0x116/0x130 [] ? kthread_flush_work_fn+0x10/0x10 [] ? ret_from_fork+0x35/0x40 When removing the TIPC module, the UDP tunnel sock will be delayed to release in a work queue as sock_release() can't be done in rtnl_lock(). If the work queue is schedule to run after the TIPC module is removed, kernel will crash as the work queue function cleanup_beareri() code no longer exists when trying to invoke it. To fix it, this patch introduce a member wq_count in tipc_net to track the numbers of work queues in schedule, and wait and exit until all work queues are done in tipc_exit_net(). Fixes: d0f91938bede ("tipc: add ip/udp media type") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-30tipc: remove dead code in tipc_net and relativesHoang Huu Le1-2/+0
dist_queue is no longer used since commit 37922ea4a310 ("tipc: permit overlapping service ranges in name table") Acked-by: Jon Maloy <jmaloy@redhat.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Link: https://lore.kernel.org/r/20201028032712.31009-1-hoang.h.le@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07tipc: fix a deadlock when flushing scheduled workHoang Huu Le1-4/+5
In the commit fdeba99b1e58 ("tipc: fix use-after-free in tipc_bcast_get_mode"), we're trying to make sure the tipc_net_finalize_work work item finished if it enqueued. But calling flush_scheduled_work() is not just affecting above work item but either any scheduled work. This has turned out to be overkill and caused to deadlock as syzbot reported: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc2-next-20200828-syzkaller #0 Not tainted ------------------------------------------------------ kworker/u4:6/349 is trying to acquire lock: ffff8880aa063d38 ((wq_completion)events){+.+.}-{0:0}, at: flush_workqueue+0xe1/0x13e0 kernel/workqueue.c:2777 but task is already holding lock: ffffffff8a879430 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0x9b/0xb10 net/core/net_namespace.c:565 [...] Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pernet_ops_rwsem); lock(&sb->s_type->i_mutex_key#13); lock(pernet_ops_rwsem); lock((wq_completion)events); *** DEADLOCK *** [...] v1: To fix the original issue, we replace above calling by introducing a bit flag. When a namespace cleaned-up, bit flag is set to zero and: - tipc_net_finalize functionial just does return immediately. - tipc_net_finalize_work does not enqueue into the scheduled work queue. v2: Use cancel_work_sync() helper to make sure ONLY the tipc_net_finalize_work() stopped before releasing bcbase object. Reported-by: syzbot+d5aa7e0385f6a5d0f4fd@syzkaller.appspotmail.com Fixes: fdeba99b1e58 ("tipc: fix use-after-free in tipc_bcast_get_mode") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-08-27tipc: fix use-after-free in tipc_bcast_get_modeHoang Huu Le1-0/+5
Syzbot has reported those issues as: ================================================================== BUG: KASAN: use-after-free in tipc_bcast_get_mode+0x3ab/0x400 net/tipc/bcast.c:759 Read of size 1 at addr ffff88805e6b3571 by task kworker/0:6/3850 CPU: 0 PID: 3850 Comm: kworker/0:6 Not tainted 5.8.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events tipc_net_finalize_work Thread 1's call trace: [...] kfree+0x103/0x2c0 mm/slab.c:3757 <- bcbase releasing tipc_bcast_stop+0x1b0/0x2f0 net/tipc/bcast.c:721 tipc_exit_net+0x24/0x270 net/tipc/core.c:112 [...] Thread 2's call trace: [...] tipc_bcast_get_mode+0x3ab/0x400 net/tipc/bcast.c:759 <- bcbase has already been freed by Thread 1 tipc_node_broadcast+0x9e/0xcc0 net/tipc/node.c:1744 tipc_nametbl_publish+0x60b/0x970 net/tipc/name_table.c:752 tipc_net_finalize net/tipc/net.c:141 [inline] tipc_net_finalize+0x1fa/0x310 net/tipc/net.c:131 tipc_net_finalize_work+0x55/0x80 net/tipc/net.c:150 [...] ================================================================== BUG: KASAN: use-after-free in tipc_named_reinit+0xef/0x290 net/tipc/name_distr.c:344 Read of size 8 at addr ffff888052ab2000 by task kworker/0:13/30628 CPU: 0 PID: 30628 Comm: kworker/0:13 Not tainted 5.8.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events tipc_net_finalize_work Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1f0/0x31e lib/dump_stack.c:118 print_address_description+0x66/0x5a0 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 tipc_named_reinit+0xef/0x290 net/tipc/name_distr.c:344 tipc_net_finalize+0x85/0xe0 net/tipc/net.c:138 tipc_net_finalize_work+0x50/0x70 net/tipc/net.c:150 process_one_work+0x789/0xfc0 kernel/workqueue.c:2269 worker_thread+0xaa4/0x1460 kernel/workqueue.c:2415 kthread+0x37e/0x3a0 drivers/block/aoe/aoecmd.c:1234 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 [...] Freed by task 14058: save_stack mm/kasan/common.c:48 [inline] set_track mm/kasan/common.c:56 [inline] kasan_set_free_info mm/kasan/common.c:316 [inline] __kasan_slab_free+0x114/0x170 mm/kasan/common.c:455 __cache_free mm/slab.c:3426 [inline] kfree+0x10a/0x220 mm/slab.c:3757 tipc_exit_net+0x29/0x50 net/tipc/core.c:113 ops_exit_list net/core/net_namespace.c:186 [inline] cleanup_net+0x708/0xba0 net/core/net_namespace.c:603 process_one_work+0x789/0xfc0 kernel/workqueue.c:2269 worker_thread+0xaa4/0x1460 kernel/workqueue.c:2415 kthread+0x37e/0x3a0 drivers/block/aoe/aoecmd.c:1234 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 Fix it by calling flush_scheduled_work() to make sure the tipc_net_finalize_work() stopped before releasing bcbase object. Reported-by: syzbot+6ea1f7a8df64596ef4d7@syzkaller.appspotmail.com Reported-by: syzbot+e9cc557752ab126c1b99@syzkaller.appspotmail.com Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-06tipc: fix ordering of tipc module init and exit routineTaehee Yoo1-14/+15
In order to set/get/dump, the tipc uses the generic netlink infrastructure. So, when tipc module is inserted, init function calls genl_register_family(). After genl_register_family(), set/get/dump commands are immediately allowed and these callbacks internally use the net_generic. net_generic is allocated by register_pernet_device() but this is called after genl_register_family() in the __init function. So, these callbacks would use un-initialized net_generic. Test commands: #SHELL1 while : do modprobe tipc modprobe -rv tipc done #SHELL2 while : do tipc link list done Splat looks like: [ 59.616322][ T2788] kasan: CONFIG_KASAN_INLINE enabled [ 59.617234][ T2788] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 59.618398][ T2788] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 59.619389][ T2788] CPU: 3 PID: 2788 Comm: tipc Not tainted 5.4.0+ #194 [ 59.620231][ T2788] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 59.621428][ T2788] RIP: 0010:tipc_bcast_get_broadcast_mode+0x131/0x310 [tipc] [ 59.622379][ T2788] Code: c7 c6 ef 8b 38 c0 65 ff 0d 84 83 c9 3f e8 d7 a5 f2 e3 48 8d bb 38 11 00 00 48 b8 00 00 00 00 [ 59.622550][ T2780] NET: Registered protocol family 30 [ 59.624627][ T2788] RSP: 0018:ffff88804b09f578 EFLAGS: 00010202 [ 59.624630][ T2788] RAX: dffffc0000000000 RBX: 0000000000000011 RCX: 000000008bc66907 [ 59.624631][ T2788] RDX: 0000000000000229 RSI: 000000004b3cf4cc RDI: 0000000000001149 [ 59.624633][ T2788] RBP: ffff88804b09f588 R08: 0000000000000003 R09: fffffbfff4fb3df1 [ 59.624635][ T2788] R10: fffffbfff50318f8 R11: ffff888066cadc18 R12: ffffffffa6cc2f40 [ 59.624637][ T2788] R13: 1ffff11009613eba R14: ffff8880662e9328 R15: ffff8880662e9328 [ 59.624639][ T2788] FS: 00007f57d8f7b740(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000 [ 59.624645][ T2788] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 59.625875][ T2780] tipc: Started in single node mode [ 59.626128][ T2788] CR2: 00007f57d887a8c0 CR3: 000000004b140002 CR4: 00000000000606e0 [ 59.633991][ T2788] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 59.635195][ T2788] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 59.636478][ T2788] Call Trace: [ 59.637025][ T2788] tipc_nl_add_bc_link+0x179/0x1470 [tipc] [ 59.638219][ T2788] ? lock_downgrade+0x6e0/0x6e0 [ 59.638923][ T2788] ? __tipc_nl_add_link+0xf90/0xf90 [tipc] [ 59.639533][ T2788] ? tipc_nl_node_dump_link+0x318/0xa50 [tipc] [ 59.640160][ T2788] ? mutex_lock_io_nested+0x1380/0x1380 [ 59.640746][ T2788] tipc_nl_node_dump_link+0x4fd/0xa50 [tipc] [ 59.641356][ T2788] ? tipc_nl_node_reset_link_stats+0x340/0x340 [tipc] [ 59.642088][ T2788] ? __skb_ext_del+0x270/0x270 [ 59.642594][ T2788] genl_lock_dumpit+0x85/0xb0 [ 59.643050][ T2788] netlink_dump+0x49c/0xed0 [ 59.643529][ T2788] ? __netlink_sendskb+0xc0/0xc0 [ 59.644044][ T2788] ? __netlink_dump_start+0x190/0x800 [ 59.644617][ T2788] ? __mutex_unlock_slowpath+0xd0/0x670 [ 59.645177][ T2788] __netlink_dump_start+0x5a0/0x800 [ 59.645692][ T2788] genl_rcv_msg+0xa75/0xe90 [ 59.646144][ T2788] ? __lock_acquire+0xdfe/0x3de0 [ 59.646692][ T2788] ? genl_family_rcv_msg_attrs_parse+0x320/0x320 [ 59.647340][ T2788] ? genl_lock_dumpit+0xb0/0xb0 [ 59.647821][ T2788] ? genl_unlock+0x20/0x20 [ 59.648290][ T2788] ? genl_parallel_done+0xe0/0xe0 [ 59.648787][ T2788] ? find_held_lock+0x39/0x1d0 [ 59.649276][ T2788] ? genl_rcv+0x15/0x40 [ 59.649722][ T2788] ? lock_contended+0xcd0/0xcd0 [ 59.650296][ T2788] netlink_rcv_skb+0x121/0x350 [ 59.650828][ T2788] ? genl_family_rcv_msg_attrs_parse+0x320/0x320 [ 59.651491][ T2788] ? netlink_ack+0x940/0x940 [ 59.651953][ T2788] ? lock_acquire+0x164/0x3b0 [ 59.652449][ T2788] genl_rcv+0x24/0x40 [ 59.652841][ T2788] netlink_unicast+0x421/0x600 [ ... ] Fixes: 7e4369057806 ("tipc: fix a slab object leak") Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-2/+0
Lots of overlapping changes and parallel additions, stuff like that. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14tipc: add back tipc prefix to log messagesMatt Bennett1-2/+0
The tipc prefix for log messages generated by tipc was removed in commit 07f6c4bc048a ("tipc: convert tipc reference table to use generic rhashtable"). This is still a useful prefix so add it back. Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: introduce TIPC encryption & authenticationTuong Lien1-0/+14
This commit offers an option to encrypt and authenticate all messaging, including the neighbor discovery messages. The currently most advanced algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All encryption/decryption is done at the bearer layer, just before leaving or after entering TIPC. Supported features: - Encryption & authentication of all TIPC messages (header + data); - Two symmetric-key modes: Cluster and Per-node; - Automatic key switching; - Key-expired revoking (sequence number wrapped); - Lock-free encryption/decryption (RCU); - Asynchronous crypto, Intel AES-NI supported; - Multiple cipher transforms; - Logs & statistics; Two key modes: - Cluster key mode: One single key is used for both TX & RX in all nodes in the cluster. - Per-node key mode: Each nodes in the cluster has one specific TX key. For RX, a node requires its peers' TX key to be able to decrypt the messages from those peers. Key setting from user-space is performed via netlink by a user program (e.g. the iproute2 'tipc' tool). Internal key state machine: Attach Align(RX) +-+ +-+ | V | V +---------+ Attach +---------+ | IDLE |---------------->| PENDING |(user = 0) +---------+ +---------+ A A Switch| A | | | | | | Free(switch/revoked) | | (Free)| +----------------------+ | |Timeout | (TX) | | |(RX) | | | | | | v | +---------+ Switch +---------+ | PASSIVE |<----------------| ACTIVE | +---------+ (RX) +---------+ (user = 1) (user >= 1) The number of TFMs is 10 by default and can be changed via the procfs 'net/tipc/max_tfms'. At this moment, as for simplicity, this file is also used to print the crypto statistics at runtime: echo 0xfff1 > /proc/sys/net/tipc/max_tfms The patch defines a new TIPC version (v7) for the encryption message (- backward compatibility as well). The message is basically encapsulated as follows: +----------------------------------------------------------+ | TIPCv7 encryption | Original TIPCv2 | Authentication | | header | packet (encrypted) | Tag | +----------------------------------------------------------+ The throughput is about ~40% for small messages (compared with non- encryption) and ~9% for large messages. With the support from hardware crypto i.e. the Intel AES-NI CPU instructions, the throughput increases upto ~85% for small messages and ~55% for large messages. By default, the new feature is inactive (i.e. no encryption) until user sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO" in the kernel configuration to enable/disable the new code when needed. MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29tipc: improve throughput between nodes in netnsHoang Le1-0/+16
Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network name space pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with the receiving node/namespace's ditto, and follow the regular socket receive patch though the receiving node. This technique gives us a throughput similar to the node internal throughput, several times larger than if we let the traffic go though the full network stacks. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted. - All nodes joining a cluster are supposed to have been be certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel local node has no other way to join a cluster than any other node, and have to obey to policies set in the IP or device layers of the stack. - Only when a sender has established with 100% certainty that the peer node is located in a kernel local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace. - If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. - To ensure the "100% certainty" criteria, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local namespaces'hash_mix:es. If it finds a match, that, along with a matching node id and cluster id, this is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened. - We should also consider that TIPC is intended to be a cluster local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it can justified to allow it to shortcut the lower protocol layers. Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-name space messaging can easily be tracked. v2: - update 'net' pointer when node left/rejoined v3: - grab read/write lock when using node ref obj v4: - clone traffics between netns to loopback Suggested-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08tipc: add loopback device trackingJohn Rutherford1-0/+5
Since node internal messages are passed directly to the socket, it is not possible to observe those messages via tcpdump or wireshark. We now remedy this by making it possible to clone such messages and send the clones to the loopback interface. The clones are dropped at reception and have no functional role except making the traffic visible. The feature is enabled if network taps are active for the loopback device. pcap filtering restrictions require the messages to be presented to the receiving side of the loopback device. v3 - Function dev_nit_active used to check for network taps. - Procedure netif_rx_ni used to send cloned messages to loopback device. Signed-off-by: John Rutherford <john.rutherford@dektech.com.au> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22tipc: change to use register_pernet_deviceXin Long1-6/+6
This patch is to fix a dst defcnt leak, which can be reproduced by doing: # ip net a c; ip net a s; modprobe tipc # ip net e s ip l a n eth1 type veth peer n eth1 netns c # ip net e c ip l s lo up; ip net e c ip l s eth1 up # ip net e s ip l s lo up; ip net e s ip l s eth1 up # ip net e c ip a a 1.1.1.2/8 dev eth1 # ip net e s ip a a 1.1.1.1/8 dev eth1 # ip net e c tipc b e m udp n u1 localip 1.1.1.2 # ip net e s tipc b e m udp n u1 localip 1.1.1.1 # ip net d c; ip net d s; rmmod tipc and it will get stuck and keep logging the error: unregister_netdevice: waiting for lo to become free. Usage count = 1 The cause is that a dst is held by the udp sock's sk_rx_dst set on udp rx path with udp_early_demux == 1, and this dst (eventually holding lo dev) can't be released as bearer's removal in tipc pernet .exit happens after lo dev's removal, default_device pernet .exit. "There are two distinct types of pernet_operations recognized: subsys and device. At creation all subsys init functions are called before device init functions, and at destruction all device exit functions are called before subsys exit function." So by calling register_pernet_device instead to register tipc_net_ops, the pernet .exit() will be invoked earlier than loopback dev's removal when a netns is being destroyed, as fou/gue does. Note that vxlan and geneve udp tunnels don't have this issue, as the udp sock is released in their device ndo_stop(). This fix is also necessary for tipc dst_cache, which will hold dsts on tx path and I will introduce in my next patch. Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-20tipc: fix modprobe tipc failed after switch order of device registrationJunwei Hu1-6/+12
Error message printed: modprobe: ERROR: could not insert 'tipc': Address family not supported by protocol. when modprobe tipc after the following patch: switch order of device registration, commit 7e27e8d6130c ("tipc: switch order of device registration to fix a crash") Because sock_create_kern(net, AF_TIPC, ...) called by tipc_topsrv_create_listener() in the initialization process of tipc_init_net(), so tipc_socket_init() must be execute before that. Meanwhile, tipc_net_id need to be initialized when sock_create() called, and tipc_socket_init() is no need to be called for each namespace. I add a variable tipc_topsrv_net_ops, and split the register_pernet_subsys() of tipc into two parts, and split tipc_socket_init() with initialization of pernet params. By the way, I fixed resources rollback error when tipc_bcast_init() failed in tipc_init_net(). Fixes: 7e27e8d6130c ("tipc: switch order of device registration to fix a crash") Signed-off-by: Junwei Hu <hujunwei4@huawei.com> Reported-by: Wang Wang <wangwang2@huawei.com> Reported-by: syzbot+1e8114b61079bfe9cbc5@syzkaller.appspotmail.com Reviewed-by: Kang Zhou <zhoukang7@huawei.com> Reviewed-by: Suanming Mou <mousuanming@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-17Revert "tipc: fix modprobe tipc failed after switch order of device ↵David S. Miller1-7/+7
registration" This reverts commit 532b0f7ece4cb2ffd24dc723ddf55242d1188e5e. More revisions coming up. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-17tipc: fix modprobe tipc failed after switch order of device registrationJunwei Hu1-7/+7
Error message printed: modprobe: ERROR: could not insert 'tipc': Address family not supported by protocol. when modprobe tipc after the following patch: switch order of device registration, commit 7e27e8d6130c ("tipc: switch order of device registration to fix a crash") Because sock_create_kern(net, AF_TIPC, ...) is called by tipc_topsrv_create_listener() in the initialization process of tipc_net_ops, tipc_socket_init() must be execute before that. I move tipc_socket_init() into function tipc_init_net(). Fixes: 7e27e8d6130c ("tipc: switch order of device registration to fix a crash") Signed-off-by: Junwei Hu <hujunwei4@huawei.com> Reported-by: Wang Wang <wangwang2@huawei.com> Reviewed-by: Kang Zhou <zhoukang7@huawei.com> Reviewed-by: Suanming Mou <mousuanming@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16tipc: switch order of device registration to fix a crashJunwei Hu1-7/+7
When tipc is loaded while many processes try to create a TIPC socket, a crash occurs: PANIC: Unable to handle kernel paging request at virtual address "dfff20000000021d" pc : tipc_sk_create+0x374/0x1180 [tipc] lr : tipc_sk_create+0x374/0x1180 [tipc] Exception class = DABT (current EL), IL = 32 bits Call trace: tipc_sk_create+0x374/0x1180 [tipc] __sock_create+0x1cc/0x408 __sys_socket+0xec/0x1f0 __arm64_sys_socket+0x74/0xa8 ... This is due to race between sock_create and unfinished register_pernet_device. tipc_sk_insert tries to do "net_generic(net, tipc_net_id)". but tipc_net_id is not initialized yet. So switch the order of the two to close the race. This can be reproduced with multiple processes doing socket(AF_TIPC, ...) and one process doing module removal. Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace") Signed-off-by: Junwei Hu <hujunwei4@huawei.com> Reported-by: Wang Wang <wangwang2@huawei.com> Reviewed-by: Xiaogang Wang <wangxiaogang3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tipc: introduce new capability flag for clusterHoang Le1-0/+2
As a preparation for introducing a smooth switching between replicast and broadcast method for multicast message, We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature. During a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field. The new feature is only applicable if all node in the cluster supports this new capability. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27net: Drop pernet_operations::asyncKirill Tkhai1-1/+0
Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23tipc: handle collisions of 32-bit node address hash valuesJon Maloy1-0/+2
When a 32-bit node address is generated from a 128-bit identifier, there is a risk of collisions which must be discovered and handled. We do this as follows: - We don't apply the generated address immediately to the node, but do instead initiate a 1 sec trial period to allow other cluster members to discover and handle such collisions. - During the trial period the node periodically sends out a new type of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast, to all the other nodes in the cluster. - When a node is receiving such a message, it must check that the presented 32-bit identifier either is unused, or was used by the very same peer in a previous session. In both cases it accepts the request by not responding to it. - If it finds that the same node has been up before using a different address, it responds with a DSC_TRIAL_FAIL_MSG containing that address. - If it finds that the address has already been taken by some other node, it generates a new, unused address and returns it to the requester. - During the trial period the requesting node must always be prepared to accept a failure message, i.e., a message where a peer suggests a different (or equal) address to the one tried. In those cases it must apply the suggested value as trial address and restart the trial period. This algorithm ensures that in the vast majority of cases a node will have the same address before and after a reboot. If a legacy user configures the address explicitly, there will be no trial period and messages, so this protocol addition is completely backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23tipc: add 128-bit node identifierJon Maloy1-1/+3
We add a 128-bit node identity, as an alternative to the currently used 32-bit node address. For the sake of compatibility and to minimize message header changes we retain the existing 32-bit address field. When not set explicitly by the user, this field will be filled with a hash value generated from the much longer node identity, and be used as a shorthand value for the latter. We permit either the address or the identity to be set by configuration, but not both, so when the address value is set by a legacy user the corresponding 128-bit node identity is generated based on the that value. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-13net: Convert tipc_net_opsKirill Tkhai1-0/+1
TIPC looks concentrated in itself, and other pernet_operations seem not touching its entities. tipc_net_ops look pernet-divided, and they should be safe to be executed in parallel for several net the same time. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18netns: make struct pernet_operations::id unsigned intAlexey Dobriyan1-1/+1
Make struct pernet_operations::id unsigned. There are 2 reasons to do so: 1) This field is really an index into an zero based array and thus is unsigned entity. Using negative value is out-of-bound access by definition. 2) On x86_64 unsigned 32-bit data which are mixed with pointers via array indexing or offsets added or subtracted to pointers are preffered to signed 32-bit data. "int" being used as an array index needs to be sign-extended to 64-bit before being used. void f(long *p, int i) { g(p[i]); } roughly translates to movsx rsi, esi mov rdi, [rsi+...] call g MOVSX is 3 byte instruction which isn't necessary if the variable is unsigned because x86_64 is zero extending by default. Now, there is net_generic() function which, you guessed it right, uses "int" as an array index: static inline void *net_generic(const struct net *net, int id) { ... ptr = ng->ptr[id - 1]; ... } And this function is used a lot, so those sign extensions add up. Patch snipes ~1730 bytes on allyesconfig kernel (without all junk messing with code generation): add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) Unfortunately some functions actually grow bigger. This is a semmingly random artefact of code generation with register allocator being used differently. gcc decides that some variable needs to live in new r8+ registers and every access now requires REX prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be used which is longer than [r8] However, overall balance is in negative direction: add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) function old new delta nfsd4_lock 3886 3959 +73 tipc_link_build_proto_msg 1096 1140 +44 mac80211_hwsim_new_radio 2776 2808 +32 tipc_mon_rcv 1032 1058 +26 svcauth_gss_legacy_init 1413 1429 +16 tipc_bcbase_select_primary 379 392 +13 nfsd4_exchange_id 1247 1260 +13 nfsd4_setclientid_confirm 782 793 +11 ... put_client_renew_locked 494 480 -14 ip_set_sockfn_get 730 716 -14 geneve_sock_add 829 813 -16 nfsd4_sequence_done 721 703 -18 nlmclnt_lookup_host 708 686 -22 nfsd4_lockt 1085 1063 -22 nfs_get_client 1077 1050 -27 tcf_bpf_init 1106 1076 -30 nfsd4_encode_fattr 5997 5930 -67 Total: Before=154856051, After=154854321, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15tipc: add neighbor monitoring frameworkJon Paul Maloy1-0/+1
TIPC based clusters are by default set up with full-mesh link connectivity between all nodes. Those links are expected to provide a short failure detection time, by default set to 1500 ms. Because of this, the background load for neighbor monitoring in an N-node cluster increases with a factor N on each node, while the overall monitoring traffic through the network infrastructure increases at a ~(N * (N - 1)) rate. Experience has shown that such clusters don't scale well beyond ~100 nodes unless we significantly increase failure discovery tolerance. This commit introduces a framework and an algorithm that drastically reduces this background load, while basically maintaining the original failure detection times across the whole cluster. Using this algorithm, background load will now grow at a rate of ~(2 * sqrt(N)) per node, and at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will now have to actively monitor 38 neighbors in a 400-node cluster, instead of as before 399. This "Overlapping Ring Supervision Algorithm" is completely distributed and employs no centralized or coordinated state. It goes as follows: - Each node makes up a linearly ascending, circular list of all its N known neighbors, based on their TIPC node identity. This algorithm must be the same on all nodes. - The node then selects the next M = sqrt(N) - 1 nodes downstream from itself in the list, and chooses to actively monitor those. This is called its "local monitoring domain". - It creates a domain record describing the monitoring domain, and piggy-backs this in the data area of all neighbor monitoring messages (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in the cluster eventually (default within 400 ms) will learn about its monitoring domain. - Whenever a node discovers a change in its local domain, e.g., a node has been added or has gone down, it creates and sends out a new version of its node record to inform all neighbors about the change. - A node receiving a domain record from anybody outside its local domain matches this against its own list (which may not look the same), and chooses to not actively monitor those members of the received domain record that are also present in its own list. Instead, it relies on indications from the direct monitoring nodes if an indirectly monitored node has gone up or down. If a node is indicated lost, the receiving node temporarily activates its own direct monitoring towards that node in order to confirm, or not, that it is actually gone. - Since each node is actively monitoring sqrt(N) downstream neighbors, each node is also actively monitored by the same number of upstream neighbors. This means that all non-direct monitoring nodes normally will receive sqrt(N) indications that a node is gone. - A major drawback with ring monitoring is how it handles failures that cause massive network partitionings. If both a lost node and all its direct monitoring neighbors are inside the lost partition, the nodes in the remaining partition will never receive indications about the loss. To overcome this, each node also chooses to actively monitor some nodes outside its local domain. Those nodes are called remote domain "heads", and are selected in such a way that no node in the cluster will be more than two direct monitoring hops away. Because of this, each node, apart from monitoring the member of its local domain, will also typically monitor sqrt(N) remote head nodes. - As an optimization, local list status, domain status and domain records are marked with a generation number. This saves senders from unnecessarily conveying unaltered domain records, and receivers from performing unneeded re-adaptations of their node monitoring list, such as re-assigning domain heads. - As a measure of caution we have added the possibility to disable the new algorithm through configuration. We do this by keeping a threshold value for the cluster size; a cluster that grows beyond this value will switch from full-mesh to ring monitoring, and vice versa when it shrinks below the value. This means that if the threshold is set to a value larger than any anticipated cluster size (default size is 32) the new algorithm is effectively disabled. A patch set for altering the threshold value and for listing the table contents will follow shortly. - This change is fully backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: redesign connection-level flow controlJon Paul Maloy1-5/+3
There are two flow control mechanisms in TIPC; one at link level that handles network congestion, burst control, and retransmission, and one at connection level which' only remaining task is to prevent overflow in the receiving socket buffer. In TIPC, the latter task has to be solved end-to-end because messages can not be thrown away once they have been accepted and delivered upwards from the link layer, i.e, we can never permit the receive buffer to overflow. Currently, this algorithm is message based. A counter in the receiving socket keeps track of number of consumed messages, and sends a dedicated acknowledge message back to the sender for each 256 consumed message. A counter at the sending end keeps track of the sent, not yet acknowledged messages, and blocks the sender if this number ever reaches 512 unacknowledged messages. When the missing acknowledge arrives, the socket is then woken up for renewed transmission. This works well for keeping the message flow running, as it almost never happens that a sender socket is blocked this way. A problem with the current mechanism is that it potentially is very memory consuming. Since we don't distinguish between small and large messages, we have to dimension the socket receive buffer according to a worst-case of both. I.e., the window size must be chosen large enough to sustain a reasonable throughput even for the smallest messages, while we must still consider a scenario where all messages are of maximum size. Hence, the current fix window size of 512 messages and a maximum message size of 66k results in a receive buffer of 66 MB when truesize(66k) = 131k is taken into account. It is possible to do much better. This commit introduces an algorithm where we instead use 1024-byte blocks as base unit. This unit, always rounded upwards from the actual message size, is used when we advertise windows as well as when we count and acknowledge transmitted data. The advertised window is based on the configured receive buffer size in such a way that even the worst-case truesize/msgsize ratio always is covered. Since the smallest possible message size (from a flow control viewpoint) now is 1024 bytes, we can safely assume this ratio to be less than four, which is the value we are now using. This way, we have been able to reduce the default receive buffer size from 66 MB to 2 MB with maintained performance. In order to keep this solution backwards compatible, we introduce a new capability bit in the discovery protocol, and use this throughout the message sending/reception path to always select the right unit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11tipc: make dist queue pernetErik Hugne1-0/+1
Nametable updates received from the network that cannot be applied immediately are placed on a defer queue. This queue is global to the TIPC module, which might cause problems when using TIPC in containers. To prevent nametable updates from escaping into the wrong namespace, we make the queue pernet instead. Signed-off-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24tipc: create broadcast transmission link at namespace initJon Paul Maloy1-0/+9
The broadcast transmission link is currently instantiated when the network subsystem is started, i.e., on order from user space via netlink. This forces the broadcast transmission code to do unnecessary tests for the existence of the transmission link, as well in single mode node as in network mode. In this commit, we do instead create the link during initialization of the name space, and remove it when it is stopped. The fact that the transmission link now has a guaranteed longer life cycle than any of its potential clients paves the way for further code simplifcations and optimizations. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04tipc: rename functions defined in subscr.cYing Xue1-2/+2
When a topology server accepts a connection request from its client, it allocates a connection instance and a tipc_subscriber structure object. The former is used to communicate with client, and the latter is often treated as a subscriber which manages all subscription events requested from a same client. When a topology server receives a request of subscribing name services from a client through the connection, it creates a tipc_subscription structure instance which is seen as a subscription recording what name services are subscribed. In order to manage all subscriptions from a same client, topology server links them into the subscrp_list of the subscriber. So subscriber and subscription completely represents different meanings respectively, but function names associated with them make us so confused that we are unable to easily tell which function is against subscriber and which is to subscription. So we want to eliminate the confusion by renaming them. Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31tipc: fix a slab object leakYing Xue1-1/+1
When remove TIPC module, there is a warning to remind us that a slab object is leaked like: root@localhost:~# rmmod tipc [ 19.056226] ============================================================================= [ 19.057549] BUG TIPC (Not tainted): Objects remaining in TIPC on kmem_cache_close() [ 19.058736] ----------------------------------------------------------------------------- [ 19.058736] [ 19.060287] INFO: Slab 0xffffea0000519a00 objects=23 used=1 fp=0xffff880014668b00 flags=0x100000000004080 [ 19.061915] INFO: Object 0xffff880014668000 @offset=0 [ 19.062717] kmem_cache_destroy TIPC: Slab cache still has objects This is because the listening socket of TIPC topology server is not closed before TIPC proto handler is unregistered with proto_unregister(). However, as the socket is closed in tipc_exit_net() which is called by unregister_pernet_subsys() during unregistering TIPC namespace operation, the warning can be eliminated if calling unregister_pernet_subsys() is moved before calling proto_unregister(). Fixes: e05b31f4bf89 ("tipc: make tipc socket support net namespace") Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: nl compat add noop and remove legacy nl frameworkRichard Alpe1-1/+2
Add TIPC_CMD_NOOP to compat layer and remove the old framework. All legacy nl commands are now converted to the compat layer in netlink_compat.c. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: move and rename the legacy nl api to "nl compat"Richard Alpe1-0/+7
The new netlink API is no longer "v2" but rather the standard API and the legacy API is now "nl compat". We split them into separate start/stop and put them in different files in order to further distinguish them. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12tipc: make tipc random value aware of net namespaceYing Xue1-5/+1
After namespace is s