path: root/net

2021-03-17  tipc: simplify call signatures for publication creation  [Jon Maloy; 3 files, -40/+34]
We simplify the call signatures for tipc_nametbl_insert_publ() and tipc_publ_create() so that fewer parameters are passed around.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-17  tipc: simplify signature of tipc_namtbl_publish()  [Jon Maloy; 5 files, -58/+68]
Using the new address structure tipc_uaddr, we simplify the signatures of tipc_sk_publish() and tipc_namtbl_publish() so that fewer parameters need to be passed around.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-17  tipc: introduce new unified address type for internal use  [Jon Maloy; 2 files, -1/+46]
We introduce a simplified version of struct sockaddr_tipc, using anonymous unions and structures. Apart from being nicer to work with, this struct will come in handy when we add another address type in a later commit.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

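As a rough illustration of the idea, such a unified address might take the shape sketched below. This is not the layout from the patch; every field and struct name here is an assumption based on the description above.

    #include <linux/types.h>

    /* Illustrative only: one address type that can carry a service
     * address, a service range, or a socket address via an anonymous
     * union.
     */
    struct example_tipc_uaddr {
            unsigned short family;
            unsigned char addrtype;
            signed char scope;
            union {
                    struct {
                            u32 type;      /* service type */
                            u32 instance;  /* service instance */
                    } sa;
                    struct {
                            u32 type;
                            u32 lower;     /* range lower bound */
                            u32 upper;     /* range upper bound */
                    } sr;
                    struct {
                            u32 ref;       /* socket reference */
                            u32 node;      /* node address */
                    } sk;
            };
    };
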
2021-03-17  tipc: move creation of publication item one level up in call chain  [Jon Maloy; 1 file, -33/+32]
We instantiate struct publication in tipc_nametbl_insert_publ() instead of, as currently, in tipc_service_insert_publ(). This has the advantage that we can pass a pointer to the publication struct to the next call levels, instead of the numerous individual parameters we pass on now. It also gives us a location to keep the contents of the additional fields we will introduce in a later commit.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-17  tipc: re-organize members of struct publication  [Jon Maloy; 4 files, -97/+92]
In a future commit we will introduce more members to struct publication. In order to keep this structure comprehensible, we now group some of its current fields into the sub-structures where they really belong:
- a struct tipc_service_range for the functional address the publication is representing;
- a struct tipc_socket_addr for the socket bound to that service range.
We also rename the stack variable 'publ' to just 'p' in a few places. This is just as easy to understand in the given context, and keeps the number of wrapped code lines to a minimum. There are no functional changes in this commit.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-17  ethtool: Add common function for filling out strings  [Alexander Duyck; 1 file, -0/+12]
Add a function to handle the common pattern of printing a string into the ethtool strings interface and incrementing the string pointer by ETH_GSTRING_LEN. Most drivers end up doing this, and several have implemented their own versions of this function, so it makes sense to consolidate on one implementation.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

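The consolidated pattern plausibly reduces to a vsnprintf()-based helper like the sketch below; the name example_ethtool_sprintf and the exact signature are assumptions, not the verbatim patch.

    #include <linux/ethtool.h>
    #include <linux/kernel.h>
    #include <stdarg.h>

    /* Print one string into the ethtool strings buffer and advance
     * the cursor by ETH_GSTRING_LEN.
     */
    static void example_ethtool_sprintf(u8 **data, const char *fmt, ...)
    {
            va_list args;

            va_start(args, fmt);
            vsnprintf(*data, ETH_GSTRING_LEN, fmt, args);
            va_end(args);
            *data += ETH_GSTRING_LEN;
    }

A driver's get_strings() callback would then call it in a loop, e.g. example_ethtool_sprintf(&data, "rx%u_packets", i), instead of open-coding the snprintf-and-advance pair.
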
2021-03-16  openvswitch: Warn over-mtu packets only if iface is UP.  [Flavio Leitner; 1 file, -3/+5]
It is not unusual to have the bridge port down. Sometimes it has an old MTU, which is fine since it is not being used. However, the kernel spams the log with a warning message when a packet is going to be sent over such a port. Fix that by warning only if the interface is UP.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-16  net: ocelot: Remove ocelot_xfh_get_cpuq  [Horatiu Vultur; 1 file, -2/+0]
The cpuq is no longer used when extracting frames from the CPU, so remove it.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-16  net: ocelot: Extend MRP  [Horatiu Vultur; 1 file, -6/+0]
This patch extends MRP support for Ocelot. It allows multiple rings, and when the node has the MRC role it forwards MRP Test frames in HW. For MRM there is no change.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-16  net: ipv4: route.c: simplify procfs code  [Yejune Deng; 1 file, -30/+4]
Use proc_create_seq(), which directly takes a struct seq_operations and deals with network namespaces in ->open.
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

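For illustration, a hedged sketch of what the simplified registration looks like; rt_cache_seq_ops stands in for the existing seq_operations in route.c, and the exact registration details are an assumption.

    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    #include <net/net_namespace.h>

    extern const struct seq_operations rt_cache_seq_ops;

    static int __net_init example_route_proc_init(struct net *net)
    {
            /* one call replaces an open-coded file_operations with a
             * namespace-aware ->open handler
             */
            if (!proc_create_seq("rt_cache", 0444, net->proc_net,
                                 &rt_cache_seq_ops))
                    return -ENOMEM;
            return 0;
    }
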
2021-03-16  net: bridge: mcast: factor out common allow/block EHT handling  [Nikolay Aleksandrov; 1 file, -71/+27]
We handle EHT state change for ALLOW messages in INCLUDE mode and for BLOCK messages in EXCLUDE mode similarly: create the new set entries with the proper filter mode. We also handle EHT state change for ALLOW messages in EXCLUDE mode and for BLOCK messages in INCLUDE mode in a similar way: delete the common entries (current set and new set). Factor out all the common code as follows:
- ALLOW/INCLUDE, BLOCK/EXCLUDE: call __eht_create_set_entries()
- ALLOW/EXCLUDE, BLOCK/INCLUDE: call __eht_del_common_set_entries()
The set entries creation can be reused in __eht_inc_exc() as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-16  net: bridge: mcast: remove unreachable EHT code  [Nikolay Aleksandrov; 1 file, -42/+15]
In the initial EHT versions there were common functions which handled allow/block messages for both INCLUDE and EXCLUDE modes, but later they were separated. It seems some common code was left behind which cannot be reached, because the filter mode is checked before calling the respective functions: the host filter is always in EXCLUDE mode when __eht_allow_excl() and __eht_block_excl() are used, so we can drop the host_excl checks inside and simplify the code a bit.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-16  net: dsa: mt7530: support MDB and bridge flag operations  [DENG Qingfang; 1 file, -13/+1]
Support port MDB and bridge flag operations. As the hardware can manage multicast forwarding itself, offload_fwd_mark can be unconditionally set to true.
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-15  net: export dev_set_threaded symbol  [Lorenzo Bianconi; 1 file, -0/+1]
For wireless devices (e.g. the mt76 driver), multiple net_devices belong to the same wireless phy, and the napi object is registered in a dummy netdevice related to the wireless phy. Export dev_set_threaded so it can be reused by device drivers enabling threaded NAPI.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

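Usage in a driver is then a single call on the (dummy) netdevice that owns the NAPI contexts; a minimal sketch:

    #include <linux/netdevice.h>

    static int example_enable_threaded_napi(struct net_device *napi_dev)
    {
            /* switch all NAPI instances of this device to threaded
             * mode; returns 0 on success
             */
            return dev_set_threaded(napi_dev, true);
    }
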
2021-03-14  psample: Add additional metadata attributes  [Ido Schimmel; 1 file, -1/+38]
Extend psample to report the following attributes when available:
* Output traffic class as a 16-bit value
* Output traffic class occupancy in bytes as a 64-bit value
* End-to-end latency of the packet in nanoseconds resolution
* Software timestamp in nanoseconds resolution (always available)
* Packet's protocol. Needed for packet dissection in user space (always available)
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-14  psample: Encapsulate packet metadata in a struct  [Ido Schimmel; 2 files, -12/+10]
Currently, callers of psample_sample_packet() pass three metadata attributes: ingress port, egress port and truncated size. Subsequent patches are going to add more attributes (e.g., egress queue occupancy), which also need an indication whether they are valid or not. Encapsulate packet metadata in a struct in order to keep the number of arguments reasonable.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

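A sketch of what such a metadata struct plausibly looks like once the attributes from the previous patch are folded in; field names are assumptions based on the two descriptions, not copied from the patch.

    #include <linux/types.h>

    struct example_psample_metadata {
            u32 trunc_size;      /* truncated packet size */
            int in_ifindex;      /* ingress port */
            int out_ifindex;     /* egress port */
            u16 out_tc;          /* egress traffic class */
            u64 out_tc_occ;      /* egress TC occupancy, bytes */
            u64 latency;         /* end-to-end latency, ns */
            /* validity bits for the optional attributes */
            u8 out_tc_valid:1,
               out_tc_occ_valid:1,
               latency_valid:1;
    };
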
2021-03-14  ethernet: constify eth_get_headlen()'s data argument  [Alexander Lobakin; 1 file, -1/+1]
It's used only for flow dissection, which now takes constant data pointers.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-14  flow_dissector: constify raw input data argument  [Alexander Lobakin; 1 file, -19/+22]
Flow Dissector code never modifies the input buffer, neither skb nor raw data. Make the 'data' argument const for all of the Flow Dissector's functions.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-14  gro: give 'hash' variable in dev_gro_receive() a less confusing name  [Alexander Lobakin; 1 file, -6/+6]
'hash' stores not the flow hash, but the index of the GRO bucket corresponding to it. Change its name to 'bucket' to avoid confusion while reading lines like '__set_bit(hash, &napi->gro_bitmask)'.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-14  gro: consistentify napi->gro_hash[x] access in dev_gro_receive()  [Alexander Lobakin; 1 file, -11/+11]
The GRO bucket index doesn't change through the entire function. Store a pointer to the corresponding bucket instead of its member and use it consistently through the function. It is performance-safe since &gro_list->list == gro_list.
Misc: remove superfluous braces around single-line branches.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-14  gro: simplify gro_list_prepare()  [Alexander Lobakin; 1 file, -8/+4]
gro_list_prepare() always returns &napi->gro_hash[bucket].list, without any variations. Moreover, it uses the 'napi' argument only to have access to this list, and calculates the bucket index for a second time (it is first computed at the beginning of dev_gro_receive()) to do that. Given that dev_gro_receive() already has an index to the needed list, just pass it as the first argument to eliminate redundant calculations, and make gro_list_prepare() return void. Also, both arguments of gro_list_prepare() can be constified since this function can only modify the skbs from the bucket list.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

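A minimal sketch of the simplified helper under the assumptions above; the flow comparison is elided down to a device check, so only the shape of the change is shown.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* The caller (dev_gro_receive()) already computed the bucket, so
     * it passes the bucket list directly; nothing needs returning.
     */
    static void example_gro_list_prepare(const struct list_head *head,
                                         const struct sk_buff *skb)
    {
            struct sk_buff *p;

            list_for_each_entry(p, head, list) {
                    /* real code compares full flow keys; this sketch
                     * only checks the device
                     */
                    NAPI_GRO_CB(p)->same_flow = (p->dev == skb->dev);
            }
    }
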
2021-03-13  Merge tag 'batadv-next-pullrequest-20210312' of git://git.open-mesh.org/linux-merge  [David S. Miller; 1 file, -4/+1]
Simon Wunderlich says:
====================
There is only a single patch this time:
- Use netif_rx_any_context(), by Sebastian Andrzej Siewior
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-13  net/sched: act_police: add support for packet-per-second policing  [Baowen Zheng; 2 files, -32/+102]
Allow a policer action to enforce a rate-limit based on packets-per-second, configurable using packet-per-second rate and burst parameters, e.g.:

    tc filter add dev tap1 parent ffff: u32 match \
        u32 0 0 police pkts_rate 3000 pkts_burst 1000

Testing was unable to uncover a performance impact of this change on existing features.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-13  flow_offload: add support for packet-per-second policing  [Xingfeng Hu; 1 file, -0/+3]
Allow the flow_offload API to configure packet-per-second policing using rate and burst parameters. Dummy implementations of tcf_police_rate_pkt_ps() and tcf_police_burst_pkt() are supplied which return 0, the unconfigured state. This is to facilitate splitting the offload, driver, and TC code portions of this feature into separate patches, with the aim of providing a logical flow for review. The implementation of these helpers will be filled out by a follow-up patch.
Signed-off-by: Xingfeng Hu <xingfeng.hu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

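The dummy helpers described above plausibly reduce to returning the unconfigured state; a sketch (the helper names come from the commit message, but the exact signatures are assumptions):

    #include <net/act_api.h>

    /* Return 0 (unconfigured) until a follow-up patch wires in the
     * real packet-per-second rate and burst values.
     */
    static inline u64 tcf_police_rate_pkt_ps(const struct tc_action *act)
    {
            return 0;
    }

    static inline u32 tcf_police_burst_pkt(const struct tc_action *act)
    {
            return 0;
    }
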
2021-03-12  mptcp: remove a list of addrs when flushing  [Geliang Tang; 1 file, -4/+4]
Invoke mptcp_nl_remove_addrs_list() to remove a list of addresses when the netlink flushes addresses, instead of using mptcp_nl_remove_subflow_and_signal_addr() to remove them one by one. Also drop the unused parameter 'net' in __flush_addrs().
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: remove multi addresses and subflows in PM  [Geliang Tang; 1 file, -0/+48]
Implement a function to remove a list of addresses and subflows, named mptcp_nl_remove_addrs_list(), which takes an input parameter rm_list as the list of addresses to remove. In mptcp_nl_remove_addrs_list(), traverse all the existing msk sockets and invoke mptcp_pm_remove_addrs_and_subflows() to remove the list of addresses for each msk socket. In mptcp_pm_remove_addrs_and_subflows(), traverse all the addresses in the removal list to find whether each address is in the conn_list or anno_list. If it is, put the address ID into the removing address list or the removing subflow list, and pass the two lists to mptcp_pm_remove_addr() and mptcp_pm_remove_subflow().
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: remove multi subflows in PM  [Geliang Tang; 3 files, -22/+31]
Remove multiple subflows in the PM: in mptcp_pm_remove_subflow(), change the input parameter local_id to a list of address ids to remove, and pass the list to mptcp_pm_nl_rm_subflow_received(). In mptcp_pm_nl_rm_subflow_received(), iterate over each address id from the received ids list, then shut down and close each address id's subsocket. In mptcp_nl_remove_subflow_and_signal_addr(), put the single address id into an ids list and pass it to mptcp_pm_remove_subflow().
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: remove multi addresses in PM  [Geliang Tang; 2 files, -17/+20]
Drop the member rm_id of struct mptcp_pm_data and use rm_list_rx in mptcp_pm_nl_rm_addr_received() instead. In mptcp_pm_nl_rm_addr_received(), iterate over each address id from pm.rm_list_rx, then shut down and close each address id's subsocket.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: add rm_list_rx in mptcp_pm_data  [Geliang Tang; 2 files, -1/+3]
Add a new member rm_list_rx to struct mptcp_pm_data as a list of the address ids to remove on the incoming direction. Initialize its nr field to zero in mptcp_pm_data_init(). In mptcp_pm_rm_addr_received(), set it to the input rm_list.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: add rm_list in mptcp_options_received  [Geliang Tang; 3 files, -10/+18]
Change the member rm_id in struct mptcp_options_received to a list of the address ids to remove, and rename it to rm_list. In mptcp_parse_option(), parse the RM_ADDR suboption and fill the ids into the rm_list in struct mptcp_options_received. In mptcp_incoming_options(), pass this rm_list to mptcp_pm_rm_addr_received(), whose parameter type changes accordingly.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: add rm_list_tx in mptcp_pm_data  [Geliang Tang; 3 files, -10/+18]
Add a new member rm_list_tx to struct mptcp_pm_data as the list of addresses to remove on the outgoing direction. Initialize its nr field to zero in mptcp_pm_data_init(). In mptcp_pm_remove_anno_addr(), put the single address id into a removal list and pass it to mptcp_pm_remove_addr(). In mptcp_pm_remove_addr(), save the input rm_list to rm_list_tx in struct mptcp_pm_data.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-12  mptcp: add rm_list in mptcp_out_options  [Geliang Tang; 3 files, -12/+40]
Define a new struct mptcp_rm_list: the ids field is an array of the address ids to remove, and the nr field is the number of valid ids in the array. The array size is defined by a new macro, MPTCP_RM_IDS_MAX. Change the member rm_id of struct mptcp_out_options to rm_list. In mptcp_established_options_rm_addr(), invoke mptcp_pm_rm_addr_signal() to get the rm_list, calculate the padded RM_ADDR suboption length according to the number of addresses in it, and save the ids array in struct mptcp_out_options's rm_list member. In mptcp_write_options(), iterate over each address id from struct mptcp_out_options's rm_list member, set the invalid ones as TCPOPT_NOP, then fill them into the RM_ADDR suboption. Change TCPOLEN_MPTCP_RM_ADDR_BASE from 4 to 3.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

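Taken together, the new list type is small; a sketch based on the names given above (the exact value of MPTCP_RM_IDS_MAX is an assumption here):

    #include <linux/types.h>

    #define MPTCP_RM_IDS_MAX    8       /* assumed bound */

    struct mptcp_rm_list {
            u8 ids[MPTCP_RM_IDS_MAX];   /* address ids to remove */
            u8 nr;                      /* number of valid ids */
    };
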
2021-03-12  net: ipv4: route.c: Fix indentation of multi line comment.  [Shubhankar Kuranagatti; 1 file, -48/+49]
All comment lines inside the comment block have been aligned. Every line of comment starts with a * (uniformity in code).
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  tcp: remove obsolete check in __tcp_retransmit_skb()  [Eric Dumazet; 1 file, -8/+0]
TSQ provides a nice way to avoid bufferbloat on an individual socket, including for retransmit packets. We can get rid of the old heuristic:

    /* Do not sent more than we queued. 1/4 is reserved for possible
     * copying overhead: fragmentation, tunneling, mangling etc.
     */
    if (refcount_read(&sk->sk_wmem_alloc) >
        min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
              sk->sk_sndbuf))
            return -EAGAIN;

This heuristic was giving false positives according to Jakub, whenever TX completions are delayed above RTT. (Ack packets are processed by the TCP stack before clones are orphaned/freed.)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()  [Eric Dumazet; 1 file, -6/+4]
Jakub reported that data included in a Fastopen SYN that had to be retransmitted would have to wait for an RTO if TX completions are slow, even with the prior fix. This is because tcp_rcv_fastopen_synack() does not use standard rtx logic, meaning the TSQ handler exits early in tcp_tsq_write() because tp->lost_out == tp->retrans_out. Let's make tcp_rcv_fastopen_synack() use standard rtx logic, by using tcp_mark_skb_lost() on the skb that needs to be sent again. Note this raised a warning in tcp_fastretrans_alert() during my tests, since we consider that data not being acknowledged by the receiver does not mean the packet was lost on the network.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  tcp: plug skb_still_in_host_queue() to TSQ  [Eric Dumazet; 1 file, -4/+8]
Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters). The main issue is that the TCP stack has logic preventing a packet from being retransmitted if the prior clone has not yet been orphaned or freed. This logic came with commit 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues"). Thankfully, in the case skb_still_in_host_queue() detects the initial clone is still in flight, it can use TSQ logic that will eventually retry later, at the moment the clone is freed or orphaned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neil Spring <ntspring@fb.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  tipc: clean up warnings detected by sparse  [Hoang Huu Le; 3 files, -22/+58]
This patch fixes the following warnings from sparse:

    net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types)
    net/tipc/monitor.c:263:35:    expected unsigned int
    net/tipc/monitor.c:263:35:    got restricted __be32 [usertype]
    [...]
    net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit
    net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock
    net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit
    net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock
    net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock
    [...]
    net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces)
    net/tipc/crypto.c:1201:9:    expected struct tipc_aead [noderef] __rcu *__tmp
    net/tipc/crypto.c:1201:9:    got struct tipc_aead *
    [...]

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

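Context-imbalance warnings like the node.c ones above are typically silenced with sparse lock annotations that tell the checker a function acquires or releases a lock; a minimal sketch of the pattern (not the actual tipc functions):

    #include <linux/spinlock.h>

    struct example_node {
            rwlock_t lock;
    };

    static void example_node_read_lock(struct example_node *n)
            __acquires(n->lock)
    {
            read_lock_bh(&n->lock);
    }

    static void example_node_read_unlock(struct example_node *n)
            __releases(n->lock)
    {
            read_unlock_bh(&n->lock);
    }
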
2021-03-11  tipc: convert dest node's address to network order  [Hoang Le; 1 file, -1/+1]
(struct tipc_link_info)->dest is in network order (__be32), so we must convert the value to network order before assigning. The problem was detected by sparse:

    net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
    net/tipc/netlink_compat.c:699:24:    expected restricted __be32 [usertype] dest
    net/tipc/netlink_compat.c:699:24:    got int

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Enable resilient next-hop groups  [Petr Machata; 1 file, -4/+0]
Now that all the code is in place, stop rejecting requests to create resilient next-hop groups.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Notify userspace about bucket migrations  [Petr Machata; 1 file, -6/+39]
Nexthop replacements et al. are notified through netlink, but if a delayed work migrates buckets in the background, userspace will stay oblivious. Notify these changes as RTM_NEWNEXTHOPBUCKET events.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Add netlink handlers for bucket get  [Petr Machata; 1 file, -1/+109]
Allow getting (but not setting) individual buckets to inspect the next hop mapped therein, idle time, and flags.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Add netlink handlers for bucket dump  [Petr Machata; 1 file, -0/+283]
Add a dump handler for resilient next hop buckets. When a next-hop group ID is given, it walks buckets of that group, otherwise it walks buckets of all groups. It then dumps the buckets whose next hops match the given filtering criteria.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Add netlink handlers for resilient nexthop groups  [Petr Machata; 1 file, -5/+145]
Implement the netlink messages that allow creation and dumping of resilient nexthop groups.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Allow reporting activity of nexthop buckets  [Ido Schimmel; 1 file, -0/+35]
The kernel periodically checks the idle time of nexthop buckets to determine if they are idle and can be re-populated with a new nexthop. When the resilient nexthop group is offloaded to hardware, the kernel will not see activity on nexthop buckets unless it is reported from hardware. Add a function that can be periodically called by device drivers to report activity on nexthop buckets after querying it from the underlying device.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Allow setting "offload" and "trap" indication of nexthop buckets  [Ido Schimmel; 1 file, -0/+34]
Add a function that can be called by device drivers to set "offload" or "trap" indication on nexthop buckets following nexthop notifications and other changes such as a neighbour becoming invalid.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Implement notifiers for resilient nexthop groups  [Petr Machata; 1 file, -12/+308]
Implement the following notifications towards drivers:
- NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
- NEXTHOP_EVENT_BUCKET_REPLACE, any time there is a change in assignment of next hops to hash table buckets. That includes replacements, deletions, and delayed upkeep cycles. Some bucket notifications can be vetoed by the driver, to make it possible to propagate bucket busy-ness flags from the HW back to the algorithm. Some are however forced, e.g. if a next hop is deleted, all buckets that use this next hop simply must be migrated, whether the HW wishes so or not.
- NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is replaced. Usually the driver will get the bucket notifications as well, and could veto those. But in some cases, a bucket may not be migrated immediately, but during delayed upkeep, and that is too late to roll the transaction back. This notification allows the driver to take a look and veto the new proposed group up front, before anything is committed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Add implementation of resilient next-hop groups  [Petr Machata; 1 file, -13/+504]
At this moment, there is only one type of next-hop group: an mpath group, which implements the hash-threshold algorithm. To select a next hop, hash-threshold first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of hash space from one next hop to another. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to. That causes problems e.g. as established TCP connections are reset, because the traffic is forwarded to a server that is not familiar with the connection.

Resilient hashing is a technique to address the above problem. A resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation to choose a hash bucket, and then reads the next hop that this bucket contains, and forwards traffic there. This indirection brings an important feature: in the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous, whereas with a hash table, the mapping between hash table buckets and individual next hops is arbitrary. Therefore when a next hop is deleted, the buckets that held it are simply reassigned to other next hops. When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows are ideally kept being forwarded to the same endpoints through the same paths as before the next-hop group change.

In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that has fewer buckets than it wants (it is "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled to a future time. Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle buckets.

There are three users of the resilient data structures:
- The forwarding code, which accesses them under RCU and does not modify them except for updating the time a selected bucket was last used.
- Netlink code, running under RTNL, which may modify the data.
- The delayed upkeep code, which may modify the data. This runs unlocked, and mutual exclusion between the RTNL code and the delayed upkeep is maintained by canceling the delayed work synchronously before the RTNL code touches anything. Later it restarts the delayed work if necessary.

The RTNL code has to implement next-hop group replacement, next hop removal, etc. For removal, the mpath code uses a neat trick of having a backup next hop group structure, doing the necessary changes offline, and then RCU-swapping them in. However, the hash tables for resilient hashing are about an order of magnitude larger than the groups themselves (the size might be e.g. 4K entries), and it was felt that keeping two of them is an overkill. Both the primary next-hop group and the spare therefore use the same resilient table, and writers are careful to keep all references valid for the forwarding code. The hash table references next-hop group entries from the next-hop group that is currently in the primary role (i.e. not spare). During the transition from primary to spare, the table references a mix of both the primary group and the spare. When a next hop is deleted, the corresponding buckets are not set to NULL, but instead marked as empty, so that the pointer is valid and can be used by the forwarding code. The buckets are then migrated to a new next-hop group entry during upkeep. The only times that the hash table is invalid are the very beginning and very end of its lifetime. Between those points, it is always kept valid.

This patch introduces the core support code itself. It does not handle notifications towards drivers, which are kept as if the group were an mpath one. It does not handle netlink either. The only bit currently exposed to user space is the new next-hop group type, and that is currently bounced. There is therefore no way to actually access this code.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

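To make the indirection concrete, here is a sketch of the selection step under RCU as described above; all structure and field names are illustrative, not the patch's.

    #include <linux/jiffies.h>
    #include <linux/rcupdate.h>
    #include <linux/types.h>

    struct nexthop;

    struct example_res_bucket {
            struct nexthop __rcu *nh;  /* next hop mapped to the bucket */
            unsigned long used;        /* when the bucket last forwarded */
    };

    struct example_res_table {
            u32 num_buckets;
            struct example_res_bucket buckets[];
    };

    static struct nexthop *example_select_nh(struct example_res_table *tbl,
                                             u32 hash)
    {
            struct example_res_bucket *b;

            /* modulo, not hash-threshold ranges, picks the bucket */
            b = &tbl->buckets[hash % tbl->num_buckets];
            /* record activity so upkeep can tell busy buckets from idle */
            WRITE_ONCE(b->used, jiffies);
            return rcu_dereference(b->nh);
    }
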
2021-03-11  nexthop: Add netlink defines and enumerators for resilient NH groups  [Ido Schimmel; 1 file, -0/+2]
- RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.
- RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will currently serve only for dumping of individual buckets of resilient next hop groups. For nexthop group buckets, these messages will carry a nested attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*. There are several reasons why a new suite of messages is created for nexthop buckets instead of overloading the information on the existing RTM_{NEW,DEL,GET}NEXTHOP messages. First, a nexthop group can contain a large number of nexthop buckets (4k is not unheard of). This imposes limits on the amount of information that can be encoded for each nexthop bucket, given that a netlink message is limited to 64k bytes. Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this point, in the future it can be extended to provide user space with control over nexthop bucket configuration.
- The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is adjusted to bounce groups with that type for now.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: Add a dedicated flag for multipath next-hop groups  [Petr Machata; 1 file, -1/+4]
With the introduction of resilient nexthop groups, there will be two types of multipath groups: the current hash-threshold "mpath" ones, and resilient groups. Both are multipath, but to determine the fact, the system needs to consider two flags. This might prove costly in the datapath. Therefore, introduce a new flag that should be set for next-hop groups that have more than one nexthop and should be considered multipath.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2021-03-11  nexthop: __nh_notifier_single_info_init(): Make nh_info an argument  [Petr Machata; 1 file, -5/+7]
The cited function currently uses rtnl_dereference() to get nh_info from a handed-in nexthop. However, under the resilient hashing scheme, this function will not always be called under RTNL; sometimes the mutual exclusion will be achieved differently. Therefore move the nh_info extraction from the function to its callers to make it possible to use a different synchronization guarantee.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>