linux.git/kernel/exit.c, branch v5.4.63

exit: Move preemption fixup up, move blocking operations down

2020-06-22T07:31:01+00:00

[ Upstream commit 586b58cac8b4683eb58a1446fbc399de18974e40 ]

With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in
non-preemptible context look untidy; after the main oops, the kernel prints
a "sleeping function called from invalid context" report because
exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read()
can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED)
fixup.

It looks like the same thing applies to profile_task_exit() and
kcov_task_exit().

Fix it by moving the preemption fixup up and the calls to
profile_task_exit() and kcov_task_exit() down.

Fixes: 1dc0fffc48af ("sched/core: Robustify preemption leak checks")
Signed-off-by: Jann Horn 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200305220657.46800-1-jannh@google.com
Signed-off-by: Sasha Levin

exit: panic before exit_mm() on global init exit

2020-01-09T09:20:01+00:00

commit 43cf75d96409a20ef06b756877a2e72b10a026fc upstream.

Currently, when global init and all threads in its thread-group have exited
we panic via:
do_exit()
-> exit_notify()
   -> forget_original_parent()
      -> find_child_reaper()
This makes it hard to extract a useable coredump for global init from a
kernel crashdump because by the time we panic exit_mm() will have already
released global init's mm.
This patch moves the panic futher up before exit_mm() is called. As was the
case previously, we only panic when global init and all its threads in the
thread-group have exited.

Signed-off-by: chenqiwu 
Acked-by: Christian Brauner 
Acked-by: Oleg Nesterov 
[christian.brauner@ubuntu.com: fix typo, rewrite commit message]
Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Christian Brauner 
Signed-off-by: Greg Kroah-Hartman

futex: Mark the begin of futex exit explicitly

2019-11-29T09:10:11+00:00

commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream.

Instead of relying on PF_EXITING use an explicit state for the futex exit
and set it in the futex exit function. This moves the smp barrier and the
lock/unlock serialization into the futex code.

As with the DEAD state this is restricted to the exit path as exec
continues to use the same task struct.

This allows to simplify that logic in a next step.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
Signed-off-by: Greg Kroah-Hartman

futex: Set task::futex_state to DEAD right after handling futex exit

2019-11-29T09:10:10+00:00

commit f24f22435dcc11389acc87e5586239c1819d217c upstream.

Setting task::futex_state in do_exit() is rather arbitrarily placed for no
reason. Move it into the futex code.

Note, this is only done for the exit cleanup as the exec cleanup cannot set
the state to FUTEX_STATE_DEAD because the task struct is still in active
use.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
Signed-off-by: Greg Kroah-Hartman

exit/exec: Seperate mm_release()

2019-11-29T09:10:10+00:00

commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream.

mm_release() contains the futex exit handling. mm_release() is called from
do_exit()->exit_mm() and from exec()->exec_mm().

In the exit_mm() case PF_EXITING and the futex state is updated. In the
exec_mm() case these states are not touched.

As the futex exit code needs further protections against exit races, this
needs to be split into two functions.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
Signed-off-by: Greg Kroah-Hartman

futex: Replace PF_EXITPIDONE with a state

2019-11-29T09:10:09+00:00

commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream.

The futex exit handling relies on PF_ flags. That's suboptimal as it
requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in
the middle of do_exit() to enforce the observability of PF_EXITING in the
futex code.

Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
over to the new state. The PF_EXITING dependency will be cleaned up in a
later step.

This prepares for handling various futex exit issues later.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
Signed-off-by: Greg Kroah-Hartman

tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code

2019-09-25T15:42:29+00:00

Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Davidlohr Bueso 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Russell King - ARM Linux admin 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar

tasks: Add a count of task RCU users

2019-09-25T15:42:29+00:00

Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds 
Inspired-by: Oleg Nesterov 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Davidlohr Bueso 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Russell King - ARM Linux admin 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar

Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

2019-09-16T16:28:19+00:00

Pull pidfd/waitid updates from Christian Brauner:
 "This contains two features and various tests.

  First, it adds support for waiting on process through pidfds by adding
  the P_PIDFD type to the waitid() syscall. This completes the basic
  functionality of the pidfd api (cf. [1]). In the meantime we also have
  a new adition to the userspace projects that make use of the pidfd
  api. The qt project was nice enough to send a mail pointing out that
  they have a pr up to switch to the pidfd api (cf. [2]).

  Second, this tag contains an extension to the waitid() syscall to make
  it possible to wait on the current process group in a race free manner
  (even though the actual problem is very unlikely) by specifing 0
  together with the P_PGID type. This extension traces back to a
  discussion on the glibc development mailing list.

  There are also a range of tests for the features above. Additionally,
  the test-suite which detected the pidfd-polling race we fixed in [3]
  is included in this tag"

[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

* tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  waitid: Add support for waiting for the current process group
  tests: add pidfd poll tests
  tests: move common definitions and functions into pidfd.h
  pidfd: add pidfd_wait tests
  pidfd: add P_PIDFD to waitid()

waitid: Add support for waiting for the current process group

2019-09-10T15:05:46+00:00

It was recently discovered that the linux version of waitid is not a
superset of the other wait functions because it does not include support
for waiting for the current process group. This has two downsides:
1. An extra system call is needed to get the current process group.
2. After the current process group is received and before it is passed
   to waitid a signal could arrive causing the current process group to change.
   Inherent race-conditions as these make it impossible for userspace to
   emulate this functionaly and thus violate async-signal safety
   requirements for waitpid.

Arguments can be made for using a different choice of idtype and id
for this case but the BSDs already use this P_PGID and 0 to indicate
waiting for the current process's process group.  So be nice to user
space programmers and don't introduce an unnecessary incompatibility.

Some people have noted that the posix description is that
waitpid will wait for the current process group, and that in
the presence of pthreads that process group can change.  To get
clarity on this issue I looked at XNU, FreeBSD, and Luminos.  All of
those flavors of unix waited for the current process group at the
time of call and as written could not adapt to the process group
changing after the call.

At one point Linux did adapt to the current process group changing but
that stopped in 161550d74c07 ("pid: sys_wait... fixes").  It has been
over 11 years since Linux has that behavior, no programs that fail
with the change in behavior have been reported, and I could not
find any other unix that does this.  So I think it is safe to clarify
the definition of current process group, to current process group
at the time of the wait function.

Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Christian Brauner 
Reviewed-by: Oleg Nesterov 
Cc: "H. Peter Anvin" 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
Cc: Rich Felker 
Cc: Alistair Francis 
Cc: Zong Li 
Cc: Andrew Morton 
Cc: Oleg Nesterov 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Florian Weimer 
Cc: Adhemerval Zanella 
Cc: GNU C Library 
Link: https://lore.kernel.org/r/20190814154400.6371-2-christian.brauner@ubuntu.com