linux.git/include/net, branch v4.14.145

ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-09-16T06:20:44+00:00

commit 47d3d7fdb10a21c223036b58bd70ffdc24a472c4 upstream.

Since ip6frag_expire_frag_queue() now pulls the head skb
from frag queue, we should no longer use skb_get(), since
this leads to an skb leak.

Stefan Bader initially reported a problem in 4.4.stable [1] caused
by the skb_get(), so this patch should also fix this issue.

296583.091021] kernel BUG at /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
[296583.091734] Call Trace:
[296583.091749]  [] __pskb_pull_tail+0x50/0x350
[296583.091764]  [] _decode_session6+0x26a/0x400
[296583.091779]  [] __xfrm_decode_session+0x39/0x50
[296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
[296583.091809]  [] icmp6_send+0x5e1/0x940
[296583.091823]  [] ? __netif_receive_skb+0x18/0x60
[296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
[296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
[296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[296583.091893]  [] icmpv6_send+0x21/0x30
[296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
[296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
[296583.091938]  [] call_timer_fn+0x37/0x140
[296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[296583.091968]  [] run_timer_softirq+0x234/0x330
[296583.091982]  [] __do_softirq+0x109/0x2b0

Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
Signed-off-by: Eric Dumazet 
Reported-by: Stefan Bader 
Cc: Peter Oskolkov 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
Signed-off-by: Baolin Wang 
Signed-off-by: Greg Kroah-Hartman

xfrm: clean up xfrm protocol checks

2019-09-16T06:20:44+00:00

commit dbb2483b2a46fbaf833cfb5deb5ed9cace9c7399 upstream.

In commit 6a53b7593233 ("xfrm: check id proto in validate_tmpl()")
I introduced a check for xfrm protocol, but according to Herbert
IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so
it should be removed from validate_tmpl().

And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific
protocols, this is why xfrm_state_flush() could still miss
IPPROTO_ROUTING, which leads that those entries are left in
net->xfrm.state_all before exit net. Fix this by replacing
IPSEC_PROTO_ANY with zero.

This patch also extracts the check from validate_tmpl() to
xfrm_id_proto_valid() and uses it in parse_ipsecrequest().
With this, no other protocols should be added into xfrm.

Fixes: 6a53b7593233 ("xfrm: check id proto in validate_tmpl()")
Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com
Cc: Steffen Klassert 
Cc: Herbert Xu 
Signed-off-by: Cong Wang 
Acked-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
Signed-off-by: Zubin Mithra 
Signed-off-by: Greg Kroah-Hartman

net: sched: act_sample: fix psample group handling on overwrite

2019-09-10T09:32:21+00:00

[ Upstream commit dbf47a2a094edf58983265e323ca4bdcdb58b5ee ]

Action sample doesn't properly handle psample_group pointer in overwrite
case. Following issues need to be fixed:

- In tcf_sample_init() function RCU_INIT_POINTER() is used to set
  s->psample_group, even though we neither setting the pointer to NULL, nor
  preventing concurrent readers from accessing the pointer in some way.
  Use rcu_swap_protected() instead to safely reset the pointer.

- Old value of s->psample_group is not released or deallocated in any way,
  which results resource leak. Use psample_group_put() on non-NULL value
  obtained with rcu_swap_protected().

- The function psample_group_put() that released reference to struct
  psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
  grace period when deallocating it. Extend struct psample_group with rcu
  head and use kfree_rcu when freeing it.

Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
Signed-off-by: Vlad Buslov 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net_sched: fix a NULL pointer deref in ipt action

2019-09-10T09:32:20+00:00

[ Upstream commit 981471bd3abf4d572097645d765391533aac327d ]

The net pointer in struct xt_tgdtor_param is not explicitly
initialized therefore is still NULL when dereferencing it.
So we have to find a way to pass the correct net pointer to
ipt_destroy_target().

The best way I find is just saving the net pointer inside the per
netns struct tcf_idrinfo, which could make this patch smaller.

Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Reported-and-tested-by: itugrok@yahoo.com
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: fix tcp_rtx_queue_tail in case of empty retransmit queue

2019-09-06T08:20:49+00:00

Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
triggers following stack trace:

[25244.848046] kernel BUG at ./include/linux/skbuff.h:1406!
[25244.859335] RIP: 0010:skb_queue_prev+0x9/0xc
[25244.888167] Call Trace:
[25244.889182]  
[25244.890001]  tcp_fragment+0x9c/0x2cf
[25244.891295]  tcp_write_xmit+0x68f/0x988
[25244.892732]  __tcp_push_pending_frames+0x3b/0xa0
[25244.894347]  tcp_data_snd_check+0x2a/0xc8
[25244.895775]  tcp_rcv_established+0x2a8/0x30d
[25244.897282]  tcp_v4_do_rcv+0xb2/0x158
[25244.898666]  tcp_v4_rcv+0x692/0x956
[25244.899959]  ip_local_deliver_finish+0xeb/0x169
[25244.901547]  __netif_receive_skb_core+0x51c/0x582
[25244.903193]  ? inet_gro_receive+0x239/0x247
[25244.904756]  netif_receive_skb_internal+0xab/0xc6
[25244.906395]  napi_gro_receive+0x8a/0xc0
[25244.907760]  receive_buf+0x9a1/0x9cd
[25244.909160]  ? load_balance+0x17a/0x7b7
[25244.910536]  ? vring_unmap_one+0x18/0x61
[25244.911932]  ? detach_buf+0x60/0xfa
[25244.913234]  virtnet_poll+0x128/0x1e1
[25244.914607]  net_rx_action+0x12a/0x2b1
[25244.915953]  __do_softirq+0x11c/0x26b
[25244.917269]  ? handle_irq_event+0x44/0x56
[25244.918695]  irq_exit+0x61/0xa0
[25244.919947]  do_IRQ+0x9d/0xbb
[25244.921065]  common_interrupt+0x85/0x85
[25244.922479]  

tcp_rtx_queue_tail() (called by tcp_fragment()) can call
tcp_write_queue_prev() on the first packet in the queue, which will trigger
the BUG in tcp_write_queue_prev(), because there is no previous packet.

This happens when the retransmit queue is empty, for example in case of a
zero window.

Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") was not a
simple cherry-pick of the original one from master (b617158dc096)
because there is a specific TCP rtx queue only since v4.15. For more
details, please see the commit message of b617158dc096 ("tcp: be more
careful in tcp_fragment()").

The BUG() is hit due to the specific code added to versions older than
v4.15. The comment in skb_queue_prev() (include/linux/skbuff.h:1406),
just before the BUG_ON() somehow suggests to add a check before using
it, what Tim did.

In master, this code path causing the issue will not be taken because
the implementation of tcp_rtx_queue_tail() is different:

    tcp_fragment() → tcp_rtx_queue_tail() → tcp_write_queue_prev() →
skb_queue_prev() → BUG_ON()

Fixes: 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
Signed-off-by: Tim Froidcoeur 
Signed-off-by: Matthieu Baerts 
Reviewed-by: Christoph Paasch 
Signed-off-by: Sasha Levin

xfrm: policy: remove pcpu policy cache

2019-08-25T08:50:29+00:00

commit e4db5b61c572475bbbcf63e3c8a2606bfccf2c9d upstream.

Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen 
Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
Signed-off-by: Greg Kroah-Hartman

Revert "tcp: Clear sk_send_head after purging the write queue"

2019-08-25T08:50:23+00:00

This reverts commit e99e7745d03fc50ba7c5b7c91c17294fee2d5991.

Ben Hutchings writes:

>Sorry, this is the same issue that was already fixed by "tcp: reset
>sk_send_head in tcp_write_queue_purge".  You can drop my version from
>the queue for 4.4 and 4.9 and revert it for 4.14.

Signed-off-by: Sasha Levin

tcp: Clear sk_send_head after purging the write queue

2019-08-16T08:13:47+00:00

Denis Andzakovic discovered a potential use-after-free in older kernel
versions, using syzkaller.  tcp_write_queue_purge() frees all skbs in
the TCP write queue and can leave sk->sk_send_head pointing to freed
memory.  tcp_disconnect() clears that pointer after calling
tcp_write_queue_purge(), but tcp_connect() does not.  It is
(surprisingly) possible to add to the write queue between
disconnection and reconnection, so this needs to be done in both
places.

This bug was introduced by backports of commit 7f582b248d0a ("tcp:
purge write queue in tcp_connect_init()") and does not exist upstream
because of earlier changes in commit 75c119afe14f ("tcp: implement
rb-tree based retransmit queue").  The latter is a major change that's
not suitable for stable.

Reported-by: Denis Andzakovic 
Bisected-by: Salvatore Bonaccorso 
Fixes: 7f582b248d0a ("tcp: purge write queue in tcp_connect_init()")
Cc:  # before 4.15
Cc: Eric Dumazet 
Signed-off-by: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman

tcp: be more careful in tcp_fragment()

2019-08-09T15:53:32+00:00

commit b617158dc096709d8600c53b6052144d12b89fab upstream.

Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds the some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15

Note for < 4.15 backports :
 tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet 
Reported-by: Andrew Prout 
Tested-by: Andrew Prout 
Tested-by: Jonathan Lemon 
Tested-by: Michal Kubecek 
Acked-by: Neal Cardwell 
Acked-by: Yuchung Cheng 
Acked-by: Christoph Paasch 
Cc: Jonathan Looney 
Signed-off-by: David S. Miller 
Signed-off-by: Matthieu Baerts 
Signed-off-by: Sasha Levin

VSOCK: use TCP state constants for sk_state

2019-08-04T07:31:59+00:00

commit 3b4477d2dcf2709d0be89e2a8dced3d0f4a017f2 upstream.

There are two state fields: socket->state and sock->sk_state.  The
socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
sock->sk_state typically uses values that match TCP state constants
(TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
and instead uses SS_* constants for both fields.

The sk_state field will be exposed to userspace through the vsock_diag
interface for ss(8), netstat(8), and other programs.

This patch switches sk_state to TCP state constants so that the meaning
of this field is consistent with other address families.  Not just
AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.

The following mapping was used to convert the code:

  SS_FREE -> TCP_CLOSE
  SS_UNCONNECTED -> TCP_CLOSE
  SS_CONNECTING -> TCP_SYN_SENT
  SS_CONNECTED -> TCP_ESTABLISHED
  SS_DISCONNECTING -> TCP_CLOSING
  VSOCK_SS_LISTEN -> TCP_LISTEN

In __vsock_create() the sk_state initialization was dropped because
sock_init_data() already initializes sk_state to TCP_CLOSE.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: David S. Miller 
[Adjusted net/vmw_vsock/hyperv_transport.c since the commit
b4562ca7925a ("hv_sock: add locking in the open/close/release code paths")
and the commit
c9d3fe9da094 ("VSOCK: fix outdated sk_state value in hvs_release()")
were backported before 3b4477d2dcf2.]
Signed-off-by: Dexuan Cui 
Signed-off-by: Greg Kroah-Hartman