<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/include/net/af_unix.h, branch v4.8.11</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>af_unix: split 'u-&gt;readlock' into two: 'iolock' and 'bindlock'</title>
<updated>2016-09-04T20:29:29+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-09-01T21:43:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=6e1ce3c3451291142a57c4f3f6f999a29fb5b3bc'/>
<id>6e1ce3c3451291142a57c4f3f6f999a29fb5b3bc</id>
<content type='text'>
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Acked-by: Rainer Weikusat &lt;rweikusat@cyberadapt.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Acked-by: Rainer Weikusat &lt;rweikusat@cyberadapt.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>unix: correctly track in-flight fds in sending process user_struct</title>
<updated>2016-02-08T15:30:42+00:00</updated>
<author>
<name>Hannes Frederic Sowa</name>
<email>hannes@stressinduktion.org</email>
</author>
<published>2016-02-03T01:11:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=415e3d3e90ce9e18727e8843ae343eda5a58fad6'/>
<id>415e3d3e90ce9e18727e8843ae343eda5a58fad6</id>
<content type='text'>
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.

To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.

Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: Willy Tarreau &lt;w@1wt.eu&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.

To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.

Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: Willy Tarreau &lt;w@1wt.eu&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>unix: avoid use-after-free in ep_remove_wait_queue</title>
<updated>2015-11-23T17:29:58+00:00</updated>
<author>
<name>Rainer Weikusat</name>
<email>rweikusat@mobileactivedefense.com</email>
</author>
<published>2015-11-20T22:07:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=7d267278a9ece963d77eefec61630223fce08c6c'/>
<id>7d267278a9ece963d77eefec61630223fce08c6c</id>
<content type='text'>
Rainer Weikusat &lt;rweikusat@mobileactivedefense.com&gt; writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat &lt;rweikusat@mobileactivedefense.com&gt;
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron &lt;jbaron@akamai.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Rainer Weikusat &lt;rweikusat@mobileactivedefense.com&gt; writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat &lt;rweikusat@mobileactivedefense.com&gt;
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron &lt;jbaron@akamai.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: constify the sock parameter in unix_sk()</title>
<updated>2015-10-08T11:05:18+00:00</updated>
<author>
<name>Paul Moore</name>
<email>pmoore@redhat.com</email>
</author>
<published>2015-10-06T19:03:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=c72eda0608a30ca6aa8722d4afdbd2557b3c9345'/>
<id>c72eda0608a30ca6aa8722d4afdbd2557b3c9345</id>
<content type='text'>
Make unix_sk() just like inet[6]_sk() by constify'ing the sock
parameter.

Signed-off-by: Paul Moore &lt;pmoore@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Make unix_sk() just like inet[6]_sk() by constify'ing the sock
parameter.

Signed-off-by: Paul Moore &lt;pmoore@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: Convert the unix_sk macro to an inline function for type safety</title>
<updated>2015-09-29T20:47:07+00:00</updated>
<author>
<name>Aaron Conole</name>
<email>aconole@bytheb.org</email>
</author>
<published>2015-09-26T22:50:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=4613012db1d911f80897f9446a49de817b2c4c47'/>
<id>4613012db1d911f80897f9446a49de817b2c4c47</id>
<content type='text'>
As suggested by Eric Dumazet this change replaces the
#define with a static inline function to enjoy
complaints by the compiler when misusing the API.

Signed-off-by: Aaron Conole &lt;aconole@bytheb.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
As suggested by Eric Dumazet this change replaces the
#define with a static inline function to enjoy
complaints by the compiler when misusing the API.

Signed-off-by: Aaron Conole &lt;aconole@bytheb.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net/unix: support SCM_SECURITY for stream sockets</title>
<updated>2015-06-11T05:49:20+00:00</updated>
<author>
<name>Stephen Smalley</name>
<email>sds@tycho.nsa.gov</email>
</author>
<published>2015-06-10T12:44:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=37a9a8df8ce9de6ea73349c9ac8bdf6ba4ec4f70'/>
<id>37a9a8df8ce9de6ea73349c9ac8bdf6ba4ec4f70</id>
<content type='text'>
SCM_SECURITY was originally only implemented for datagram sockets,
not for stream sockets.  However, SCM_CREDENTIALS is supported on
Unix stream sockets.  For consistency, implement Unix stream support
for SCM_SECURITY as well.  Also clean up the existing code and get
rid of the superfluous UNIXSID macro.

Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1224211,
where systemd was using SCM_CREDENTIALS and assumed wrongly that
SCM_SECURITY was also supported on Unix stream sockets.

Signed-off-by: Stephen Smalley &lt;sds@tycho.nsa.gov&gt;
Acked-by: Paul Moore &lt;paul@paul-moore.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
SCM_SECURITY was originally only implemented for datagram sockets,
not for stream sockets.  However, SCM_CREDENTIALS is supported on
Unix stream sockets.  For consistency, implement Unix stream support
for SCM_SECURITY as well.  Also clean up the existing code and get
rid of the superfluous UNIXSID macro.

Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1224211,
where systemd was using SCM_CREDENTIALS and assumed wrongly that
SCM_SECURITY was also supported on Unix stream sockets.

Signed-off-by: Stephen Smalley &lt;sds@tycho.nsa.gov&gt;
Acked-by: Paul Moore &lt;paul@paul-moore.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: improve STREAM behavior with fragmented memory</title>
<updated>2013-08-10T08:16:44+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-08-08T21:37:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=e370a7236321773245c5522d8bb299380830d3b2'/>
<id>e370a7236321773245c5522d8bb299380830d3b2</id>
<content type='text'>
unix_stream_sendmsg() currently uses order-2 allocations,
and we had numerous reports this can fail.

The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
not helping.

This patch extends the work done in commit eb6a24816b247c
("af_unix: reduce high order page allocations) for
datagram sockets.

This opens the possibility of zero copy IO (splice() and
friends)

The trick is to not use skb_pull() anymore in recvmsg() path,
and instead add a @consumed field in UNIXCB() to track amount
of already read payload in the skb.

There is a performance regression for large sends
because of extra page allocations that will be addressed
in a follow-up patch, allowing sock_alloc_send_pskb()
to attempt high order page allocations.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
unix_stream_sendmsg() currently uses order-2 allocations,
and we had numerous reports this can fail.

The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
not helping.

This patch extends the work done in commit eb6a24816b247c
("af_unix: reduce high order page allocations) for
datagram sockets.

This opens the possibility of zero copy IO (splice() and
friends)

The trick is to not use skb_pull() anymore in recvmsg() path,
and instead add a @consumed field in UNIXCB() to track amount
of already read payload in the skb.

There is a performance regression for large sends
because of extra page allocations that will be addressed
in a follow-up patch, allowing sock_alloc_send_pskb()
to attempt high order page allocations.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix.h: Remove extern from function prototypes</title>
<updated>2013-08-01T00:50:01+00:00</updated>
<author>
<name>Joe Perches</name>
<email>joe@perches.com</email>
</author>
<published>2013-08-01T00:31:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=b60a8280ba2a34351dd8b98664df3785c27528dd'/>
<id>b60a8280ba2a34351dd8b98664df3785c27528dd</id>
<content type='text'>
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: fix a fatal race with bit fields</title>
<updated>2013-05-01T19:13:49+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>eric.dumazet@gmail.com</email>
</author>
<published>2013-05-01T05:24:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=60bc851ae59bfe99be6ee89d6bc50008c85ec75d'/>
<id>60bc851ae59bfe99be6ee89d6bc50008c85ec75d</id>
<content type='text'>
Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
uses 64bit instructions to manipulate them.
If the 64bit word includes any atomic_t or spinlock_t, we can lose
critical concurrent changes.

This is happening in af_unix, where unix_sk(sk)-&gt;gc_candidate/
gc_maybe_cycle/lock share the same 64bit word.

This leads to fatal deadlock, as one/several cpus spin forever
on a spinlock that will never be available again.

A safer way would be to use a long to store flags.
This way we are sure compiler/arch wont do bad things.

As we own unix_gc_lock spinlock when clearing or setting bits,
we can use the non atomic __set_bit()/__clear_bit().

recursion_level can share the same 64bit location with the spinlock,
as it is set only with this spinlock held.

[1] bug fixed in gcc-4.8.0 :
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

Reported-by: Ambrose Feinstein &lt;ambrose@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
uses 64bit instructions to manipulate them.
If the 64bit word includes any atomic_t or spinlock_t, we can lose
critical concurrent changes.

This is happening in af_unix, where unix_sk(sk)-&gt;gc_candidate/
gc_maybe_cycle/lock share the same 64bit word.

This leads to fatal deadlock, as one/several cpus spin forever
on a spinlock that will never be available again.

A safer way would be to use a long to store flags.
This way we are sure compiler/arch wont do bad things.

As we own unix_gc_lock spinlock when clearing or setting bits,
we can use the non atomic __set_bit()/__clear_bit().

recursion_level can share the same 64bit location with the spinlock,
as it is set only with this spinlock held.

[1] bug fixed in gcc-4.8.0 :
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

Reported-by: Ambrose Feinstein &lt;ambrose@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>scm: Stop passing struct cred</title>
<updated>2013-04-07T22:58:55+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2013-04-03T17:28:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=6b0ee8c036ecb3ac92e18e6ca0dca7bff88beaf0'/>
<id>6b0ee8c036ecb3ac92e18e6ca0dca7bff88beaf0</id>
<content type='text'>
Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.

Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.

Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
