linux.git/include/net, branch v3.10.25

netfilter: push reasm skb through instead of original frag skbs

2013-12-08T15:29:25+00:00

[ Upstream commit 6aafeef03b9d9ecf255f3a80ed85ee070260e1ae ]

Pushing original fragments through causes several problems. For example
for matching, frags may not be matched correctly. Take the following
example:


On HOSTA do:
ip6tables -I INPUT -p icmpv6 -j DROP
ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT

and on HOSTB you do:
ping6 HOSTA -s2000    (MTU is 1500)

Incoming echo requests will be filtered out on HOSTA. This issue does
not occur with packets smaller than the MTU (where fragmentation does not happen).


As was discussed previously, the only correct solution seems to be to use
the reassembled skb instead of separate frags. Doing this has positive side
effects in reducing sk_buff by one pointer (nfct_reasm) and also the reasm
dances in ipvs and conntrack can be removed.

Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
entirely and use code in net/ipv6/reassembly.c instead.

Signed-off-by: Jiri Pirko 
Acked-by: Julian Anastasov 
Signed-off-by: Marcelo Ricardo Leitner 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions

2013-12-08T15:29:25+00:00

[ Upstream commit 85fbaa75037d0b6b786ff18658ddf0b4014ce2a4 ]

Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.

As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding
sockaddr.

This broke traceroute and such.
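
A minimal userspace sketch of the pattern described above, not the kernel
code itself: the error-queue helper receives addr_len and sets it whenever
it writes msg_name. The function name model_recv_error and its parameters
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical stand-in for a recv_error-style helper. If the caller
 * supplied a msg_name buffer, copy the offender address into it and
 * report the size of the address that was written through addr_len.
 * If msg_name is NULL, addr_len is left untouched.
 */
static void model_recv_error(struct msghdr *msg, int *addr_len,
			     const struct sockaddr_in *offender)
{
	assert(msg != NULL && addr_len != NULL && offender != NULL);

	if (msg->msg_name) {
		memcpy(msg->msg_name, offender, sizeof(*offender));
		*addr_len = sizeof(*offender);
	}
}
```

Without the assignment to *addr_len, a caller that no longer presets the
length would report an address length of zero to user space, which is the
kind of breakage tools like traceroute would notice.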

Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler 
Reported-by: Tom Labanowski
Cc: mpb 
Cc: David S. Miller 
Cc: Eric Dumazet 
Signed-off-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

netfilter: nf_conntrack: use RCU safe kfree for conntrack extensions

2013-12-04T18:57:35+00:00

commit c13a84a830a208fb3443628773c8ca0557773cc7 upstream.

Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced
RCU protection for freeing extension data when reallocation
moves them to a new location. We need the same protection when
freeing them in nf_ct_ext_free() in order to prevent a
use-after-free by other threads referencing NAT extension data
via the bysource list.

Signed-off-by: Michal Kubecek 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Greg Kroah-Hartman

ipv6: reset dst.expires value when clearing expire flag

2013-11-20T20:27:46+00:00

[ Upstream commit 01ba16d6ec85a1ec4669c75513a76b61ec53ee50 ]

On receiving a packet too big icmp error we update the expire value by
calling rt6_update_expires. This function uses dst_set_expires, which is
implemented such that it can only reduce the expiration value of the dst entry.

If we insert new non-expiring routing information into the ipv6 fib where
we already have a matching rt6_info we only clear the RTF_EXPIRES flag
in rt6i_flags and leave the dst.expires value as is.

When new mtu information arrives for that cached dst_entry we again
call dst_set_expires. This time it won't update the dst.expires value
because we left the dst.expires value intact from the last update. So
dst_set_expires won't touch dst.expires.

Fix this by resetting dst.expires when clearing the RTF_EXPIRES flag.
dst_set_expires checks for a zero expiration and updates the
dst.expires.

In the past this (not updating dst.expires) was necessary because
dst.expires was placed in a union with the dst_entry *from reference
and rt6_clean_expires did assign NULL to it. This split happened in
ecd9883724b78cc72ed92c98bcb1a46c764fff21 ("ipv6: fix race condition
regarding dst->expires and dst->from").

Reported-by: Steinar H. Gunderson 
Reported-by: Valentijn Sessink 
Cc: YOSHIFUJI Hideaki 
Acked-by: Eric Dumazet 
Tested-by: Valentijn Sessink 
Signed-off-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ip_gre: Fix WCCPv2 header parsing.

2013-11-20T20:27:46+00:00

[ No applicable upstream commit, the upstream implementation is
  now completely different and doesn't have this bug. ]

In the case of WCCPv2, the GRE header has four extra bytes. The following
patch pulls those four extra bytes so that skb offsets are set
correctly.
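
As a rough illustration of why the offset matters, the sketch below
decides how many extra bytes to skip before the inner IPv4 header. It is
a userspace model, not the patched code: the function name, the constant
name and the "first nibble is not 4" check are assumptions made for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed EtherType value carried in the GRE protocol field for WCCP. */
#define MODEL_ETH_P_WCCP 0x883E

/*
 * Returns the number of extra header bytes to skip after the GRE header.
 * payload points at the first byte following the GRE header and proto is
 * the GRE protocol field in host byte order. An inner IPv4 header starts
 * with version nibble 4; anything else is assumed here to be the four
 * byte WCCPv2 redirect header.
 */
static size_t model_wccp_extra_hdr_len(uint16_t proto,
				       const unsigned char *payload,
				       size_t len)
{
	assert(payload != NULL);

	if (proto != MODEL_ETH_P_WCCP || len == 0)
		return 0;
	return ((payload[0] & 0xF0) != 0x40) ? 4 : 0;
}
```

If the extra bytes are not skipped, every later offset computed from the
start of the inner packet is off by four.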

CC: Eric Dumazet 
Reported-by: Peter Schmitt 
Tested-by: Peter Schmitt 
Signed-off-by: Pravin B Shelar 
Signed-off-by: Greg Kroah-Hartman

ipv6: fill rt6i_gateway with nexthop address

2013-11-04T12:31:05+00:00

[ Upstream commit 550bab42f83308c9d6ab04a980cc4333cef1c8fa ]

Make sure rt6i_gateway contains nexthop information in
all routes returned from lookup or when routes are directly
attached to skb for generated ICMP packets.

The effect of this patch should be a faster version of
rt6_nexthop() and the consideration of local addresses as
nexthop.

Signed-off-by: Julian Anastasov 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv6: always prefer rt6i_gateway if present

2013-11-04T12:31:05+00:00

[ Upstream commit 96dc809514fb2328605198a0602b67554d8cce7b ]

In v3.9, commit 6fd6ce2056de2709 ("ipv6: Do not depend on rt->n in
ip6_finish_output2().") changed the behaviour of ip6_finish_output2()
such that the recently introduced rt6_nexthop() is used
instead of an assigned neighbor.

As rt6_nexthop() prefers rt6i_gateway only for gatewayed
routes this causes a problem for users like IPVS, xt_TEE and
RAW(hdrincl) if they want to use a different address for routing
compared to the destination address.

Another case is when a redirect creates an RTF_DYNAMIC
route without the RTF_GATEWAY flag: we ignore the rt6i_gateway
in rt6_nexthop().

Fix the above problems by considering the rt6i_gateway if
present, so that traffic routed to an address on the local subnet is
not wrongly diverted to the destination address.

Thanks to Simon Horman and Phil Oester for spotting the
problematic commit.

Thanks to Hannes Frederic Sowa for his review and help in testing.

Reported-by: Phil Oester 
Reported-by: Mark Brooks 
Signed-off-by: Julian Anastasov 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: fix cipso packet validation when !NETLABEL

2013-11-04T12:31:04+00:00

[ Upstream commit f2e5ddcc0d12f9c4c7b254358ad245c9dddce13b ]

When CONFIG_NETLABEL is disabled, the cipso_v4_validate() function could loop
forever in the main loop if opt[opt_iter + 1] == 0. This causes a kernel
crash on an SMP system, since the CPU executing this function will
stall and not respond to IPIs.

This problem can be reproduced by running the IP Stack Integrity Checker
(http://isic.sourceforge.net) using the following command on a Linux machine
connected to DUT:

"icmpsic -s rand -d  -r 123456"
wait (1-2 min)
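
To make the failure mode concrete, here is a userspace sketch of a tag
walk that rejects a zero tag length. It is a simplified model rather than
the kernel function: the name model_cipso_walk_tags, the return
convention and the extra bounds check before reading the length byte are
assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walks the tags of a CIPSO-style option. opt[0] is the option type,
 * opt[1] the option length, bytes 2..5 the DOI, and tags start at
 * offset 6 with a type byte followed by a length byte.
 * Returns 0 when every tag is consistent, otherwise the offset of the
 * offending length byte.
 */
static int model_cipso_walk_tags(const unsigned char *opt, size_t opt_len)
{
	size_t opt_iter;

	assert(opt != NULL);
	for (opt_iter = 6; opt_iter < opt_len;) {
		unsigned char tag_len;

		/* Added for the model: the length byte must be readable. */
		if (opt_iter + 1 >= opt_len)
			return (int)(opt_iter + 1);

		tag_len = opt[opt_iter + 1];
		/*
		 * A zero length would leave opt_iter unchanged and spin
		 * forever, so it is rejected together with lengths that
		 * run past the end of the option.
		 */
		if (tag_len == 0 || tag_len > opt_len - opt_iter)
			return (int)(opt_iter + 1);
		opt_iter += tag_len;
	}
	return 0;
}
```

Without the tag_len == 0 check, the loop increment adds zero on every
pass and the loop never terminates.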

Signed-off-by: Seif Mazareeb 
Acked-by: Paul Moore 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: dst: provide accessor function to dst->xfrm

2013-11-04T12:31:03+00:00

[ Upstream commit e87b3998d795123b4139bc3f25490dd236f68212 ]

dst->xfrm is conditionally defined.  Provide accessor function that
is always available.
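
The general shape of such an accessor can be sketched as below. This is a
standalone model, not the kernel header: MODEL_CONFIG_XFRM and the
model_ types are hypothetical stand-ins for the real config option and
structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the transform state type. */
struct model_xfrm_state {
	int id;
};

/* The field exists only when the feature is compiled in. */
struct model_dst_entry {
#ifdef MODEL_CONFIG_XFRM
	struct model_xfrm_state *xfrm;
#else
	void *pad;
#endif
};

/*
 * The accessor is defined in both configurations, so callers never need
 * their own #ifdef around the field access.
 */
#ifdef MODEL_CONFIG_XFRM
static inline struct model_xfrm_state *
model_dst_xfrm(const struct model_dst_entry *dst)
{
	assert(dst != NULL);
	return dst->xfrm;
}
#else
static inline struct model_xfrm_state *
model_dst_xfrm(const struct model_dst_entry *dst)
{
	assert(dst != NULL);
	return NULL;
}
#endif
```

Callers can then test the returned pointer against NULL in either
configuration.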

Signed-off-by: Vlad Yasevich 
Acked-by: Neil Horman 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: TSO packets automatic sizing

2013-11-04T12:30:59+00:00

[ Upstream commits 6d36824e730f247b602c90e8715a792003e3c5a7,
  02cf4ebd82ff0ac7254b88e466820a290ed8289a, and parts of
  7eec4174ff29cd42f2acfae8112f51c228545d40 ]

After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt
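
A worked sketch of the formula above and of the resulting TSO sizing, as
a userspace model. The units (srtt in microseconds, rate in bytes per
second), the max_segs cap and both function names are assumptions of
this example, and the arithmetic is simplified compared to the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * sk_pacing_rate = 2 * cwnd * mss / srtt, here in bytes per second with
 * srtt given in microseconds. Intended for realistic inputs; extreme
 * values could overflow the 64-bit intermediate product.
 */
static uint64_t model_pacing_rate(uint32_t cwnd, uint32_t mss,
				  uint32_t srtt_us)
{
	assert(srtt_us > 0);
	return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/*
 * Number of segments per TSO packet so that roughly one packet is sent
 * every millisecond at the current rate (half the pacing rate), bounded
 * below by min_tso_segs and above by max_segs.
 */
static uint32_t model_tso_segs(uint64_t pacing_rate, uint32_t mss,
			       uint32_t min_tso_segs, uint32_t max_segs)
{
	uint64_t bytes_per_ms;
	uint64_t segs;

	assert(mss > 0);
	bytes_per_ms = pacing_rate / 2 / 1000;
	segs = bytes_per_ms / mss;
	if (segs < min_tso_segs)
		segs = min_tso_segs;
	if (segs > max_segs)
		segs = max_segs;
	return (uint32_t)segs;
}
```

For example, with cwnd = 100, mss = 1000 bytes and srtt = 10 ms the model
gives a pacing rate of 20,000,000 bytes per second and a TSO packet of 10
segments, while a slow flow falls back to the minimum of 2 segments.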

v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs

Signed-off-by: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Yuchung Cheng 
Cc: Van Jacobson 
Cc: Tom Herbert 
Acked-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman