| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 859ae7018316daa4adbc496012dcbbb458d7e510 upstream.
There are two problems with the current code that have been highlighted
with the AQL feature that is now enbaled by default.
First problem is in ieee80211_rx_h_mesh_fwding(),
ieee80211_select_queue_80211() is used on received packets to choose
the sending AC queue of the forwarding packet although this function
should only be called on TX packet (it uses ieee80211_tx_info).
This ends with forwarded mesh packets been sent on unrelated random AC
queue. To fix that, AC queue can directly be infered from skb->priority
which has been extracted from QOS info (see ieee80211_parse_qos()).
Second problem is the value of queue_mapping set on forwarded mesh
frames via skb_set_queue_mapping() is not the AC of the packet but a
hardware queue index. This may or may not work depending on AC to HW
queue mapping which is driver specific.
Both of these issues lead to improper AC selection while forwarding
mesh packets but more importantly due to improper airtime accounting
(which is done on a per STA, per AC basis) caused traffic stall with
the introduction of AQL.
Fixes: cf44012810cc ("mac80211: fix unnecessary frame drops in mesh fwding")
Fixes: d3c1597b8d1b ("mac80211: fix forwarded mesh frame queue mapping")
Co-developed-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Link: https://lore.kernel.org/r/20220214173214.368862-1-nico.escande@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b22a8f7b4bde4e4ab73b64908ffd5d90ecdcdbfd upstream.
John Paul reported a warning about bogus NUMA distance values spurred by
commit:
620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for the deduplicating sort")
In this case, the afflicted machine comes up with a reported 256 possible
nodes, all of which are 0 distance away from one another. This was
previously silently ignored, but is now caught by the aforementioned
commit.
The culprit is ia64's node_possible_map which remains unchanged from its
initialization value of NODE_MASK_ALL. In John's case, the machine
doesn't have any SRAT nor SLIT table, but AIUI the possible map remains
untouched regardless of what ACPI tables end up being parsed. Thus,
!online && possible nodes remain with a bogus distance of 0 (distances \in
[0, 9] are "reserved and have no meaning" as per the ACPI spec).
Follow x86 / drivers/base/arch_numa's example and set the possible map to
the parsed map, which in this case seems to be the online map.
Link: http://lore.kernel.org/r/255d6b5d-194e-eb0e-ecdd-97477a534441@physik.fu-berlin.de
Link: https://lkml.kernel.org/r/20210318130617.896309-1-valentin.schneider@arm.com
Fixes: 620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for the deduplicating sort")
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reported-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 71e5f6644fb2f3304fcb310145ded234a37e7cc1 upstream.
Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.
This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
sched_init_domains
build_sched_domains()
__free_domain_allocs()
__sdt_free() {
...
for_each_sd_topology(tl)
...
sd = *per_cpu_ptr(sdd->sd, j); <--
...
}
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 620a6dc40754dc218f5b6389b5d335e9a107fd29 upstream.
The deduplicating sort in sched_init_numa() assumes that the first line in
the distance table contains all unique values in the entire table. I've
been trying to pen what this exactly means for the topology, but it's not
straightforward. For instance, topology.c uses this example:
node 0 1 2 3
0: 10 20 20 30
1: 20 10 20 20
2: 20 20 10 20
3: 30 20 20 10
0 ----- 1
| / |
| / |
| / |
2 ----- 3
Which works out just fine. However, if we swap nodes 0 and 1:
1 ----- 0
| / |
| / |
| / |
2 ----- 3
we get this distance table:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 30
2: 20 20 10 20
3: 20 30 20 10
Which breaks the deduplicating sort (non-representative first line). In
this case this would just be a renumbering exercise, but it so happens that
we can have a deduplicating sort that goes through the whole table in O(n²)
at the extra cost of a temporary memory allocation (i.e. any form of set).
The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
this, implement the set as a 256-bits bitmap. Should this not be
satisfactory (i.e. we want to support 32-bit values), then we'll have to go
for some other sparse set implementation.
This has the added benefit of letting us allocate just the right amount of
memory for sched_domains_numa_distance[], rather than an arbitrary
(nr_node_ids + 1).
Note: DT binding equivalent (distance-map) decodes distances as 32-bit
values.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com
Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fadead80fe4c033b5e514fcbadd20b55c4494112 upstream.
Commit c503e63200c6 ("ice: Stop processing VF messages during teardown")
introduced a driver state flag, ICE_VF_DEINIT_IN_PROGRESS, which is
intended to prevent some issues with concurrently handling messages from
VFs while tearing down the VFs.
This change was motivated by crashes caused while tearing down and
bringing up VFs in rapid succession.
It turns out that the fix actually introduces issues with the VF driver
caused because the PF no longer responds to any messages sent by the VF
during its .remove routine. This results in the VF potentially removing
its DMA memory before the PF has shut down the device queues.
Additionally, the fix doesn't actually resolve concurrency issues within
the ice driver. It is possible for a VF to initiate a reset just prior
to the ice driver removing VFs. This can result in the remove task
concurrently operating while the VF is being reset. This results in
similar memory corruption and panics purportedly fixed by that commit.
Fix this concurrency at its root by protecting both the reset and
removal flows using the existing VF cfg_lock. This ensures that we
cannot remove the VF while any outstanding critical tasks such as a
virtchnl message or a reset are occurring.
This locking change also fixes the root cause originally fixed by commit
c503e63200c6 ("ice: Stop processing VF messages during teardown"), so we
can simply revert it.
Note that I kept these two changes together because simply reverting the
original commit alone would leave the driver vulnerable to worse race
conditions.
Fixes: c503e63200c6 ("ice: Stop processing VF messages during teardown")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e6ba5273d4ede03d075d7a116b8edad1f6115f4d upstream.
The VF can be configured via the PF's ndo ops at the same time the PF is
receiving/handling virtchnl messages. This has many issues, with
one of them being the ndo op could be actively resetting a VF (i.e.
resetting it to the default state and deleting/re-adding the VF's VSI)
while a virtchnl message is being handled. The following error was seen
because a VF ndo op was used to change a VF's trust setting while the
VIRTCHNL_OP_CONFIG_VSI_QUEUES was ongoing:
[35274.192484] ice 0000:88:00.0: Failed to set LAN Tx queue context, error: ICE_ERR_PARAM
[35274.193074] ice 0000:88:00.0: VF 0 failed opcode 6, retval: -5
[35274.193640] iavf 0000:88:01.0: PF returned error -5 (IAVF_ERR_PARAM) to our request 6
Fix this by making sure the virtchnl handling and VF ndo ops that
trigger VF resets cannot run concurrently. This is done by adding a
struct mutex cfg_lock to each VF structure. For VF ndo ops, the mutex
will be locked around the critical operations and VFR. Since the ndo ops
will trigger a VFR, the virtchnl thread will use mutex_trylock(). This
is done because if any other thread (i.e. VF ndo op) has the mutex, then
that means the current VF message being handled is no longer valid, so
just ignore it.
This issue can be seen using the following commands:
for i in {0..50}; do
rmmod ice
modprobe ice
sleep 1
echo 1 > /sys/class/net/ens785f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens785f1/device/sriov_numvfs
ip link set ens785f1 vf 0 trust on
ip link set ens785f0 vf 0 trust on
sleep 2
echo 0 > /sys/class/net/ens785f0/device/sriov_numvfs
echo 0 > /sys/class/net/ens785f1/device/sriov_numvfs
sleep 1
echo 1 > /sys/class/net/ens785f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens785f1/device/sriov_numvfs
ip link set ens785f1 vf 0 trust on
ip link set ens785f0 vf 0 trust on
done
Fixes: 7c710869d64e ("ice: Add handlers for VF netdevice operations")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b2fcf2102049f6e56981e0ab3d9b633b8e2741da upstream.
This sequence of events can lead to a failure to requeue a CPU's
->nocb_timer:
1. There are no callbacks queued for any CPU covered by CPU 0-2's
->nocb_gp_kthread. Note that ->nocb_gp_kthread is associated
with CPU 0.
2. CPU 1 enqueues its first callback with interrupts disabled, and
thus must defer awakening its ->nocb_gp_kthread. It therefore
queues its rcu_data structure's ->nocb_timer. At this point,
CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.
3. CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a
callback, but with interrupts enabled, allowing it to directly
awaken the ->nocb_gp_kthread.
4. The newly awakened ->nocb_gp_kthread associates both CPU 1's
and CPU 2's callbacks with a future grace period and arranges
for that grace period to be started.
5. This ->nocb_gp_kthread goes to sleep waiting for the end of this
future grace period.
6. This grace period elapses before the CPU 1's timer fires.
This is normally improbably given that the timer is set for only
one jiffy, but timers can be delayed. Besides, it is possible
that kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
7. The grace period ends, so rcu_gp_kthread awakens the
->nocb_gp_kthread, which in turn awakens both CPU 1's and
CPU 2's ->nocb_cb_kthread. Then ->nocb_gb_kthread sleeps
waiting for more newly queued callbacks.
8. CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps
waiting for more invocable callbacks.
9. Note that neither kthread updated any ->nocb_timer state,
so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.
10. CPU 1 enqueues its second callback, this time with interrupts
enabled so it can wake directly ->nocb_gp_kthread.
It does so with calling wake_nocb_gp() which also cancels the
pending timer that got queued in step 2. But that doesn't reset
CPU 1's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE.
So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now
desynchronized.
11. ->nocb_gp_kthread associates the callback queued in 10 with a new
grace period, arranges for that grace period to start and sleeps
waiting for it to complete.
12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread,
which in turn wakes up CPU 1's ->nocb_cb_kthread which then
invokes the callback queued in 10.
13. CPU 1 enqueues its third callback, this time with interrupts
disabled so it must queue a timer for a deferred wakeup. However
the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which
incorrectly indicates that a timer is already queued. Instead,
CPU 1's ->nocb_timer was cancelled in 10. CPU 1 therefore fails
to queue the ->nocb_timer.
14. CPU 1 has its pending callback and it may go unnoticed until
some other CPU ever wakes up ->nocb_gp_kthread or CPU 1 ever
calls an explicit deferred wakeup, for example, during idle entry.
This commit fixes this bug by resetting rdp->nocb_defer_wakeup everytime
we delete the ->nocb_timer.
It is quite possible that there is a similar scenario involving
->nocb_bypass_timer and ->nocb_defer_wakeup. However, despite some
effort from several people, a failure scenario has not yet been located.
However, that by no means guarantees that no such scenario exists.
Finding a failure scenario is left as an exercise for the reader, and the
"Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.
Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing)
Cc: <stable@vger.kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4940a1fdf31c39f0806ac831cde333134862030b upstream.
The problem of SMC_CLC_DECL_ERR_REGRMB on the server is very clear.
Based on the fact that whether a new SMC connection can be accepted or
not depends on not only the limit of conn nums, but also the available
entries of rtoken. Since the rtoken release is trigger by peer, while
the conn nums is decrease by local, tons of thing can happen in this
time difference.
This only thing that needs to be mentioned is that now all connection
creations are completely protected by smc_server_lgr_pending lock, it's
enough to check only the available entries in rtokens_used_mask.
Fixes: cd6851f30386 ("smc: remote memory buffers (RMBs)")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0537f0a2151375dcf90c1bbfda6a0aaf57164e89 upstream.
The main reason for this unexpected SMC_CLC_DECL_ERR_REGRMB in client
dues to following execution sequence:
Server Conn A: Server Conn B: Client Conn B:
smc_lgr_unregister_conn
smc_lgr_register_conn
smc_clc_send_accept ->
smc_rtoken_add
smcr_buf_unuse
-> Client Conn A:
smc_rtoken_delete
smc_lgr_unregister_conn() makes current link available to assigned to new
incoming connection, while smcr_buf_unuse() has not executed yet, which
means that smc_rtoken_add may fail because of insufficient rtoken_entry,
reversing their execution order will avoid this problem.
Fixes: 3e034725c0d8 ("net/smc: common functions for RMBs and send buffers")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9f1c50cf39167ff71dc5953a3234f3f6eeb8fcb5 upstream.
There's a potential leak issue under following execution sequence :
smc_release smc_connect_work
if (sk->sk_state == SMC_INIT)
send_clc_confirim
tcp_abort();
...
sk.sk_state = SMC_ACTIVE
smc_close_active
switch(sk->sk_state) {
...
case SMC_ACTIVE:
smc_close_final()
// then wait peer closed
Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are
still in the tcp send buffer, in which case our connection token cannot
be delivered to the server side, which means that we cannot get a
passive close message at all. Therefore, it is impossible for the to be
disconnected at all.
This patch tries a very simple way to avoid this issue, once the state
has changed to SMC_ACTIVE after tcp_abort(), we can actively abort the
smc connection, considering that the state is SMC_INIT before
tcp_abort(), abandoning the complete disconnection process should not
cause too much problem.
In fact, this problem may exist as long as the CLC CONFIRM message is
not received by the server. Whether a timer should be added after
smc_close_final() needs to be discussed in the future. But even so, this
patch provides a faster release for connection in above case, it should
also be valuable.
Fixes: 39f41f367b08 ("net/smc: common release code for non-accepted sockets")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 91b0383fef06f20b847fa9e4f0e3054ead0b1a1b upstream.
If I'm not mistaken (and I don't think I am), the way in which the
dcbnl_ops work is that drivers call dcb_ieee_setapp() and this populates
the application table with dynamically allocated struct dcb_app_type
entries that are kept in the module-global dcb_app_list.
However, nobody keeps exact track of these entries, and although
dcb_ieee_delapp() is supposed to remove them, nobody does so when the
interface goes away (example: driver unbinds from device). So the
dcb_app_list will contain lingering entries with an ifindex that no
longer matches any device in dcb_app_lookup().
Reclaim the lost memory by listening for the NETDEV_UNREGISTER event and
flushing the app table entries of interfaces that are now gone.
In fact something like this used to be done as part of the initial
commit (blamed below), but it was done in dcbnl_exit() -> dcb_flushapp(),
essentially at module_exit time. That became dead code after commit
7a6b6f515f77 ("DCB: fix kconfig option") which essentially merged
"tristate config DCB" and "bool config DCBNL" into a single "bool config
DCB", so net/dcb/dcbnl.c could not be built as a module anymore.
Commit 36b9ad8084bd ("net/dcb: make dcbnl.c explicitly non-modular")
recognized this and deleted dcbnl_exit() and dcb_flushapp() altogether,
leaving us with the version we have today.
Since flushing application table entries can and should be done as soon
as the netdevice disappears, fundamentally the commit that is to blame
is the one that introduced the design of this API.
Fixes: 9ab933ab2cc8 ("dcbnl: add appliction tlv handlers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9995b408f17ff8c7f11bc725c8aa225ba3a63b1c upstream.
There are two reasons for addrconf_notify() to be called with NETDEV_DOWN:
either the network device is actually going down, or IPv6 was disabled
on the interface.
If either of them stays down while the other is toggled, we repeatedly
call the code for NETDEV_DOWN, including ipv6_mc_down(), while never
calling the corresponding ipv6_mc_up() in between. This will cause a
new entry in idev->mc_tomb to be allocated for each multicast group
the interface is subscribed to, which in turn leaks one struct ifmcaddr6
per nontrivial multicast group the interface is subscribed to.
The following reproducer will leak at least $n objects:
ip addr add ff2e::4242/32 dev eth0 autojoin
sysctl -w net.ipv6.conf.eth0.disable_ipv6=1
for i in $(seq 1 $n); do
ip link set up eth0; ip link set down eth0
done
Joining groups with IPV6_ADD_MEMBERSHIP (unprivileged) or setting the
sysctl net.ipv6.conf.eth0.forwarding to 1 (=> subscribing to ff02::2)
can also be used to create a nontrivial idev->mc_list, which will the
leak objects with the right up-down-sequence.
Based on both sources for NETDEV_DOWN events the interface IPv6 state
should be considered:
- not ready if the network interface is not ready OR IPv6 is disabled
for it
- ready if the network interface is ready AND IPv6 is enabled for it
The functions ipv6_mc_up() and ipv6_down() should only be run when this
state changes.
Implement this by remembering when the IPv6 state is ready, and only
run ipv6_mc_down() if it actually changed from ready to not ready.
The other direction (not ready -> ready) already works correctly, as:
- the interface notification triggered codepath for NETDEV_UP /
NETDEV_CHANGE returns early if ipv6 is disabled, and
- the disable_ipv6=0 triggered codepath skips fully initializing the
interface as long as addrconf_link_ready(dev) returns false
- calling ipv6_mc_up() repeatedly does not leak anything
Fixes: 3ce62a84d53c ("ipv6: exit early in addrconf_notify() if IPv6 is disabled")
Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6c1f41afc1dbe59d9d3c8bb0d80b749c119aa334 upstream.
The ifindex doesn't have to be unique for multiple network namespaces on
the same machine.
$ ip netns add test1
$ ip -net test1 link add dummy1 type dummy
$ ip netns add test2
$ ip -net test2 link add dummy2 type dummy
$ ip -net test1 link show dev dummy1
6: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 96:81:55:1e:dd:85 brd ff:ff:ff:ff:ff:ff
$ ip -net test2 link show dev dummy2
6: dummy2: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:3c:af:35:07:c3 brd ff:ff:ff:ff:ff:ff
But the batman-adv code to walk through the various layers of virtual
interfaces uses this assumption because dev_get_iflink handles it
internally and doesn't return the actual netns of the iflink. And
dev_get_iflink only documents the situation where ifindex == iflink for
physical devices.
But only checking for dev->netdev_ops->ndo_get_iflink is also not an option
because ipoib_get_iflink implements it even when it sometimes returns an
iflink != ifindex and sometimes iflink == ifindex. The caller must
therefore make sure itself to check both netns and iflink + ifindex for
equality. Only when they are equal, a "physical" interface was detected
which should stop the traversal. On the other hand, vxcan_get_iflink can
also return 0 in case there was currently no valid peer. In this case, it
is still necessary to stop.
Fixes: b7eddd0b3950 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface")
Fixes: 5ed4a460a1d3 ("batman-adv: additional checks for virtual interfaces on top of WiFi")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6116ba09423f7d140f0460be6a1644dceaad00da upstream.
There is no need to call dev_get_iflink multiple times for the same
net_device in batadv_get_real_netdevice. And since some of the
ndo_get_iflink callbacks are dynamic (for example via RCUs like in
vxcan_get_iflink), it could easily happen that the returned values are not
stable. The pre-checks before __dev_get_by_index are then of course bogus.
Fixes: 5ed4a460a1d3 ("batman-adv: additional checks for virtual interfaces on top of WiFi")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 690bb6fb64f5dc7437317153902573ecad67593d upstream.
There is no need to call dev_get_iflink multiple times for the same
net_device in batadv_is_on_batman_iface. And since some of the
.ndo_get_iflink callbacks are dynamic (for example via RCUs like in
vxcan_get_iflink), it could easily happen that the returned values are not
stable. The pre-checks before __dev_get_by_index are then of course bogus.
Fixes: b7eddd0b3950 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3b836da4081fa585cf6c392f62557496f2cb0efe upstream.
In case someone combines bpf socket assign and nf_queue, then we will
queue an skb who references a struct sock that did not have its
reference count incremented.
As we leave rcu protection, there is no guarantee that skb->sk is still
valid.
For refcount-less skb->sk case, try to increment the reference count
and then override the destructor.
In case of failure we have two choices: orphan the skb and 'delete'
preselect or let nf_queue() drop the packet.
Do the latter, it should not happen during normal operation.
Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
Acked-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c3873070247d9e3c7a6b0cf9bf9b45e8018427b1 upstream.
Eric Dumazet says:
The sock_hold() side seems suspect, because there is no guarantee
that sk_refcnt is not already 0.
On failure, we cannot queue the packet and need to indicate an
error. The packet will be dropped by the caller.
v2: split skb prefetch hunk into separate change
Fixes: 271b72c7fa82c ("udp: RCU handling for Unicast packets.")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 747670fd9a2d1b7774030dba65ca022ba442ce71 upstream.
There is no guarantee that state->sk refers to a full socket.
If refcount transitions to 0, sock_put calls sk_free which then ends up
with garbage fields.
I'd like to thank Oleksandr Natalenko and Jiri Benc for considerable
debug work and pointing out state->sk oddities.
Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 224102de2ff105a2c05695e66a08f4b5b6b2d19c upstream.
The truesize for a UDP GRO packet is added by main skb and skbs in main
skb's frag_list:
skb_gro_receive_list
p->truesize += skb->truesize;
The commit 53475c5dd856 ("net: fix use-after-free when UDP GRO with
shared fraglist") introduced a truesize increase for frag_list skbs.
When uncloning skb, it will call pskb_expand_head and trusesize for
frag_list skbs may increase. This can occur when allocators uses
__netdev_alloc_skb and not jump into __alloc_skb. This flow does not
use ksize(len) to calculate truesize while pskb_expand_head uses.
skb_segment_list
err = skb_unclone(nskb, GFP_ATOMIC);
pskb_expand_head
if (!skb->sk || skb->destructor == sock_edemux)
skb->truesize += size - osize;
If we uses increased truesize adding as delta_truesize, it will be
larger than before and even larger than previous total truesize value
if skbs in frag_list are abundant. The main skb truesize will become
smaller and even a minus value or a huge value for an unsigned int
parameter. Then the following memory check will drop this abnormal skb.
To avoid this error we should use the original truesize to segment the
main skb.
Fixes: 53475c5dd856 ("net: fix use-after-free when UDP GRO with shared fraglist")
Signed-off-by: lena wang <lena.wang@mediatek.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1646133431-8948-1-git-send-email-lena.wang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ffd24fa2fcc76ecb2e61e7a4ef8588177bcb42a6 upstream.
Update MAC type check e1000_pch_tgp because for e1000_pch_cnp,
NVM checksum update is still possible.
Emit a more detailed warning message.
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1191663
Fixes: 4051f68318ca ("e1000e: Do not take care about recovery NVM checksum")
Reported-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7c76ecd9c99b6e9a771d813ab1aa7fa428b3ade1 upstream.
struct xfrm_user_offload has flags variable that received user input,
but kernel didn't check if valid bits were provided. It caused a situation
where not sanitized input was forwarded directly to the drivers.
For example, XFRM_OFFLOAD_IPV6 define that was exposed, was used by
strongswan, but not implemented in the kernel at all.
As a solution, check and sanitize input flags to forward
XFRM_OFFLOAD_INBOUND to the drivers.
Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6d0d95a1c2b07270870e7be16575c513c29af3f1 upstream.
if_id will be always 0, because it was not yet initialized.
Fixes: 8dce43919566 ("xfrm: interface with if_id 0 should return error")
Reported-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 60ce37b03917e593d8e5d8bcc7ec820773daf81d upstream.
Currently, sk_psock_verdict_recv() returns skb->len
This is problematic because tcp_read_sock() might have
passed orig_len < skb->len, due to the presence of TCP urgent data.
This causes an infinite loop from tcp_read_sock()
Followup patch will make tcp_read_sock() more robust vs bad actors.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20220302161723.3910001-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 56763f12b0f02706576a088e85ef856deacc98a0 upstream.
We must not dereference @new_hooks after nf_hook_mutex has been released,
because other threads might have freed our allocated hooks already.
BUG: KASAN: use-after-free in nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline]
BUG: KASAN: use-after-free in hooks_validate net/netfilter/core.c:171 [inline]
BUG: KASAN: use-after-free in __nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438
Read of size 2 at addr ffff88801c1a8000 by task syz-executor237/4430
CPU: 1 PID: 4430 Comm: syz-executor237 Not tainted 5.17.0-rc5-syzkaller-00306-g2293be58d6a1 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline]
hooks_validate net/netfilter/core.c:171 [inline]
__nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438
nf_register_net_hook+0x114/0x170 net/netfilter/core.c:571
nf_register_net_hooks+0x59/0xc0 net/netfilter/core.c:587
nf_synproxy_ipv6_init+0x85/0xe0 net/netfilter/nf_synproxy_core.c:1218
synproxy_tg6_check+0x30d/0x560 net/ipv6/netfilter/ip6t_SYNPROXY.c:81
xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1038
check_target net/ipv6/netfilter/ip6_tables.c:530 [inline]
find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:573
translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:735
do_replace net/ipv6/netfilter/ip6_tables.c:1153 [inline]
do_ip6t_set_ctl+0x56e/0xb90 net/ipv6/netfilter/ip6_tables.c:1639
nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101
ipv6_setsockopt+0x122/0x180 net/ipv6/ipv6_sockglue.c:1024
rawv6_setsockopt+0xd3/0x6a0 net/ipv6/raw.c:1084
__sys_setsockopt+0x2db/0x610 net/socket.c:2180
__do_sys_setsockopt net/socket.c:2191 [inline]
__se_sys_setsockopt net/socket.c:2188 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2188
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f65a1ace7d9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 71 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f65a1a7f308 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f65a1ace7d9
RDX: 0000000000000040 RSI: 0000000000000029 RDI: 0000000000000003
RBP: 00007f65a1b574c8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000020000000 R11: 0000000000000246 R12: 00007f65a1b55130
R13: 00007f65a1b574c0 R14: 00007f65a1b24090 R15: 0000000000022000
</TASK>
The buggy address belongs to the page:
page:ffffea0000706a00 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1c1a8
flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000000 ffffea0001c1b108 ffffea000046dd08 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 2, migratetype Unmovable, gfp_mask 0x52dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO), pid 4430, ts 1061781545818, free_ts 1061791488993
prep_new_page mm/page_alloc.c:2434 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
__alloc_pages_node include/linux/gfp.h:572 [inline]
alloc_pages_node include/linux/gfp.h:595 [inline]
kmalloc_large_node+0x62/0x130 mm/slub.c:4438
__kmalloc_node+0x35a/0x4a0 mm/slub.c:4454
kmalloc_node include/linux/slab.h:604 [inline]
kvmalloc_node+0x97/0x100 mm/util.c:580
kvmalloc include/linux/slab.h:731 [inline]
kvzalloc include/linux/slab.h:739 [inline]
allocate_hook_entries_size net/netfilter/core.c:61 [inline]
nf_hook_entries_grow+0x140/0x780 net/netfilter/core.c:128
__nf_register_net_hook+0x144/0x820 net/netfilter/core.c:429
nf_register_net_hook+0x114/0x170 net/netfilter/core.c:571
nf_register_net_hooks+0x59/0xc0 net/netfilter/core.c:587
nf_synproxy_ipv6_init+0x85/0xe0 net/netfilter/nf_synproxy_core.c:1218
synproxy_tg6_check+0x30d/0x560 net/ipv6/netfilter/ip6t_SYNPROXY.c:81
xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1038
check_target net/ipv6/netfilter/ip6_tables.c:530 [inline]
find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:573
translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:735
do_replace net/ipv6/netfilter/ip6_tables.c:1153 [inline]
do_ip6t_set_ctl+0x56e/0xb90 net/ipv6/netfilter/ip6_tables.c:1639
nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1352 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
free_unref_page_prepare mm/page_alloc.c:3325 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3404
kvfree+0x42/0x50 mm/util.c:613
rcu_do_batch kernel/rcu/tree.c:2527 [inline]
rcu_core+0x7b1/0x1820 kernel/rcu/tree.c:2778
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Memory state around the buggy address:
ffff88801c1a7f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88801c1a7f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88801c1a8000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88801c1a8080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88801c1a8100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Fixes: 2420b79f8c18 ("netfilter: debug: check for sorted array")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6596a0229541270fb8d38d989f91b78838e5e9da upstream.
Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
sendmsg crash caused by too small MTU") breaks PMTU for xfrm.
A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.
E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".
Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.
We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).
Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm:
xfrm_state_mtu should return at least 1280 for ipv6") attempted
to fix this but caused another regression in TCP MSS calculations
and had to be reverted.
The patch below fixes the situation by dropping the MTU
check and instead checking for the underflows described in the
749439bf commit message.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: 749439bfac6e ("ipv6: fix udpv6 sendmsg crash caused by too small MTU")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168 upstream.
syzkaller was recently triggering an oversized kvmalloc() warning via
xdp_umem_create().
The triggered warning was added back in 7661809d493b ("mm: don't allow
oversized kvmalloc() calls"). The rationale for the warning for huge
kvmalloc sizes was as a reaction to a security bug where the size was
more than UINT_MAX but not everything was prepared to handle unsigned
long sizes.
Anyway, the AF_XDP related call trace from this syzkaller report was:
kvmalloc include/linux/mm.h:806 [inline]
kvmalloc_array include/linux/mm.h:824 [inline]
kvcalloc include/linux/mm.h:829 [inline]
xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
__sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
__do_sys_setsockopt net/socket.c:2187 [inline]
__se_sys_setsockopt net/socket.c:2184 [inline]
__x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Björn mentioned that requests for >2GB allocation can still be valid:
The structure that is being allocated is the page-pinning accounting.
AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/
PAGE_SIZE on 64 bit systems). [...]
I could just change from U32_MAX to INT_MAX, but as I stated earlier
that has a hacky feeling to it. [...] From my perspective, the code
isn't broken, with the memcg limits in consideration. [...]
Linus says:
[...] Pretty much every time this has come up, the kernel warning has
shown that yes, the code was broken and there really wasn't a reason
for doing allocations that big.
Of course, some people would be perfectly fine with the allocation
failing, they just don't want the warning. I didn't want __GFP_NOWARN
to shut it up originally because I wanted people to see all those
cases, but these days I think we can just say "yeah, people can shut
it up explicitly by saying 'go ahead and fail this allocation, don't
warn about it'".
So enough time has passed that by now I'd certainly be ok with [it].
Thus allow call-sites to silence such userspace triggered splats if the
allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
to kvcalloc() this is already the case, so nothing else needed there.
Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Ackd-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d5081bf5dcfb1cb83fb538708b0ac07a10a79cc4 upstream.
The field offset for port configuration status on SPR has been changed to
bit 14 from ICX where it resides at bit 12. By chance link status detection
continued to work on SPR. This is due to bit 12 being a configuration bit
which is in sync with the status bit. Fix this by checking for a SPR device
and checking correct status bit.
Fixes: 26bfe3d0b227 ("ntb: intel: Add Icelake (gen4) support for Intel NTB")
Tested-by: Jerry Dai <jerry.dai@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5838a14832d447990827d85e90afe17e6fb9c175 upstream.
Do not call get_trip_hyst() from thermal_genl_cmd_tz_get_trip() if
the thermal zone does not define one.
Fixes: 1ce50e7d408e ("thermal: core: genetlink support for events/cmd/sampling")
Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dcf4ff7a48e7598e6b10126cc02177abb8ae4f3f upstream.
xennet_destroy_queues() relies on info->netdev->real_num_tx_queues to
delete queues. Since d7dac083414eb5bb99a6d2ed53dc2c1b405224e5
("net-sysfs: update the queue counts in the unregistration path"),
unregister_netdev() indirectly sets real_num_tx_queues to 0. Those two
facts together means, that xennet_destroy_queues() called from
xennet_remove() cannot do its job, because it's called after
unregister_netdev(). This results in kfree-ing queues that are still
linked in napi, which ultimately crashes:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 52 Comm: xenwatch Tainted: G W 5.16.10-1.32.fc32.qubes.x86_64+ #226
RIP: 0010:free_netdev+0xa3/0x1a0
Code: ff 48 89 df e8 2e e9 00 00 48 8b 43 50 48 8b 08 48 8d b8 a0 fe ff ff 48 8d a9 a0 fe ff ff 49 39 c4 75 26 eb 47 e8 ed c1 66 ff <48> 8b 85 60 01 00 00 48 8d 95 60 01 00 00 48 89 ef 48 2d 60 01 00
RSP: 0000:ffffc90000bcfd00 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88800edad000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffc90000bcfc30 RDI: 00000000ffffffff
RBP: fffffffffffffea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88800edad050
R13: ffff8880065f8f88 R14: 0000000000000000 R15: ffff8880066c6680
FS: 0000000000000000(0000) GS:ffff8880f3300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000e998c006 CR4: 00000000003706e0
Call Trace:
<TASK>
xennet_remove+0x13d/0x300 [xen_netfront]
xenbus_dev_remove+0x6d/0xf0
__device_release_driver+0x17a/0x240
device_release_driver+0x24/0x30
bus_remove_device+0xd8/0x140
device_del+0x18b/0x410
? _raw_spin_unlock+0x16/0x30
? klist_iter_exit+0x14/0x20
? xenbus_dev_request_and_reply+0x80/0x80
device_unregister+0x13/0x60
xenbus_dev_changed+0x18e/0x1f0
xenwatch_thread+0xc0/0x1a0
? do_wait_intr_irq+0xa0/0xa0
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Fix this by calling xennet_destroy_queues() from xennet_uninit(),
when real_num_tx_queues is still available. This ensures that queues are
destroyed when real_num_tx_queues is set to 0, regardless of how
unregister_netdev() was called.
Originally reported at
https://github.com/QubesOS/qubes-issues/issues/7257
Fixes: d7dac083414eb5bb9 ("net-sysfs: update the queue counts in the unregistration path")
Cc: stable@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|