linux.git/kernel/fork.c, branch v3.16.81

sched/fair: Don't free p->numa_faults with concurrent readers

2019-11-22T15:57:07+00:00

commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream.

When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16: adjust filename, context]
Signed-off-by: Ben Hutchings

fork: record start_time late

2019-05-11T14:22:48+00:00

commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream.

This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.

Technically, we could record the start_time anytime during fork(2).  But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space.  For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).

By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.

Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned.  Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded.  This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.

Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.

Reported-by: Jann Horn 
Signed-off-by: Tom Gundersen 
Signed-off-by: David Herrmann 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16: start_time initialisation code is different]
Signed-off-by: Ben Hutchings

rmap: drop support of non-linear mappings

2018-10-03T03:09:51+00:00

commit 27ba0644ea9dfe6e7693abc85837b60e40583b96 upstream.

We don't create non-linear mappings anymore.  Let's drop code which
handles them in rmap.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Deleted code is slightly different
 - Adjust context]
Signed-off-by: Ben Hutchings

KAISER: Kernel Address Isolation

2018-01-09T00:35:13+00:00

This patch introduces our implementation of KAISER (Kernel Address Isolation to
have Side-channels Efficiently Removed), a kernel isolation technique to close
hardware side channels on kernel address information.

More information about the patch can be found on:

        https://github.com/IAIK/KAISER

From: Richard Fellner 
From: Daniel Gruss 
Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
Date: Thu, 4 May 2017 14:26:50 +0200
Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5

To: 
To: 
Cc: 
Cc: 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Ingo Molnar 
Cc: 
Cc: 

After several recent works [1,2,3] KASLR on x86_64 was basically
considered dead by many researchers. We have been working on an
efficient but effective fix for this problem and found that not mapping
the kernel space when running in user mode is the solution to this
problem [4] (the corresponding paper [5] will be presented at ESSoS17).

With this RFC patch we allow anybody to configure their kernel with the
flag CONFIG_KAISER to add our defense mechanism.

If there are any questions we would love to answer them.
We also appreciate any comments!

Cheers,
Daniel (+ the KAISER team from Graz University of Technology)

[1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
[2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
[3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
[4] https://github.com/IAIK/KAISER
[5] https://gruss.cc/files/kaiser.pdf

(cherry picked from Change-Id: I0eb000c33290af01fc4454ca0c701d00f1d30b1d)

Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
arch/x86/entry/entry_64_compat.S (not in this tree)
arch/x86/ia32/ia32entry.S (patched instead of that)
arch/x86/include/asm/hw_irq.h
arch/x86/include/asm/pgtable_types.h
arch/x86/include/asm/processor.h
arch/x86/kernel/irqinit.c
arch/x86/kernel/process.c
arch/x86/mm/Makefile
arch/x86/mm/pgtable.c
init/main.c

Signed-off-by: Hugh Dickins 
[bwh: Folded in the follow-up patches from Hugh:
 - kaiser: merged update
 - kaiser: do not set _PAGE_NX on pgd_none
 - kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
 - kaiser: fix build and FIXME in alloc_ldt_struct()
 - kaiser: KAISER depends on SMP
 - kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
 - kaiser: fix perf crashes
 - kaiser: ENOMEM if kaiser_pagetable_walk() NULL
 - kaiser: tidied up asm/kaiser.h somewhat
 - kaiser: tidied up kaiser_add/remove_mapping slightly
 - kaiser: kaiser_remove_mapping() move along the pgd
 - kaiser: align addition to x86/mm/Makefile
 - kaiser: cleanups while trying for gold link
 - kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
 - kaiser: delete KAISER_REAL_SWITCH option
 - kaiser: vmstat show NR_KAISERTABLE as nr_overhead
 - kaiser: enhanced by kernel and user PCIDs
 - kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
 - kaiser: PCID 0 for kernel and 128 for user
 - kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
 - kaiser: paranoid_entry pass cr3 need to paranoid_exit
 - kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls
 - kaiser: fix unlikely error in alloc_ldt_struct()
 - kaiser: drop is_atomic arg to kaiser_pagetable_walk()
 Backported to 3.16:
 - Add missing #include in arch/x86/mm/kaiser.c
 - Use variable PEBS buffer size since we have "perf/x86/intel: Use PAGE_SIZE
   for PEBS buffer size on Core2"
 - Renumber X86_FEATURE_INVPCID_SINGLE to avoid collision
 - Adjust context]
Signed-off-by: Ben Hutchings

mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

2018-01-01T20:52:10+00:00

commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as or descendent of mm->user_ns.  Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc//stat

Acked-by: Kees Cook 
Tested-by: Cyrill Gorcunov 
Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

mm: migrate: prevent racy access to tlb_flush_pending

2017-11-11T13:33:19+00:00

commit 16af97dc5a8975371a83d9e30a64038b48f40a2d upstream.

Patch series "fixes of TLB batching races", v6.

It turns out that Linux TLB batching mechanism suffers from various
races.  Races that are caused due to batching during reclamation were
recently handled by Mel and this patch-set deals with others.  The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables.  This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.

This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.

Finally, there is also one memory barrier missing, which may affect
architectures with weak memory model.

This patch (of 7):

Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work().  If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.

This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending.  The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.

An actual data corruption was not observed, yet the race was was
confirmed by adding assertion to check tlb_flush_pending is not set by
two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")
Signed-off-by: Nadav Amit 
Acked-by: Mel Gorman 
Acked-by: Rik van Riel 
Acked-by: Minchan Kim 
Cc: Andy Lutomirski 
Cc: Hugh Dickins 
Cc: "David S. Miller" 
Cc: Andrea Arcangeli 
Cc: Heiko Carstens 
Cc: Ingo Molnar 
Cc: Jeff Dike 
Cc: Martin Schwidefsky 
Cc: Mel Gorman 
Cc: Russell King 
Cc: Sergey Senozhatsky 
Cc: Tony Luck 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Drop change to dump_mm()
 - Adjust context]
Signed-off-by: Ben Hutchings

pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()

2017-09-15T17:29:35+00:00

commit 3fd37226216620c1a468afa999739d5016fbc349 upstream.

Imagine we have a pid namespace and a task from its parent's pid_ns,
which made setns() to the pid namespace. The task is doing fork(),
while the pid namespace's child reaper is dying. We have the race
between them:

Task from parent pid_ns             Child reaper
copy_process()                      ..
  alloc_pid()                       ..
  ..                                zap_pid_ns_processes()
  ..                                  disable_pid_allocation()
  ..                                  read_lock(&tasklist_lock)
  ..                                  iterate over pids in pid_ns
  ..                                    kill tasks linked to pids
  ..                                  read_unlock(&tasklist_lock)
  write_lock_irq(&tasklist_lock);   ..
  attach_pid(p, PIDTYPE_PID);       ..
  ..                                ..

So, just created task p won't receive SIGKILL signal,
and the pid namespace will be in contradictory state.
Only manual kill will help there, but does the userspace
care about this? I suppose, the most users just inject
a task into a pid namespace and wait a SIGCHLD from it.

The patch fixes the problem. It simply checks for
(pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
We do it under the tasklist_lock, and can't skip
PIDNS_HASH_ADDING as noted by Oleg:

"zap_pid_ns_processes() does disable_pid_allocation()
and then takes tasklist_lock to kill the whole namespace.
Given that copy_process() checks PIDNS_HASH_ADDING
under write_lock(tasklist) they can't race;
if copy_process() takes this lock first, the new child will
be killed, otherwise copy_process() can't miss
the change in ->nr_hashed."

If allocation is disabled, we just return -ENOMEM
like it's made for such cases in alloc_pid().

v2: Do not move disable_pid_allocation(), do not
introduce a new variable in copy_process() and simplify
the patch as suggested by Oleg Nesterov.
Account the problem with double irq enabling
found by Eric W. Biederman.

Fixes: c876ad768215 ("pidns: Stop pid allocation when init dies")
Signed-off-by: Kirill Tkhai 
CC: Andrew Morton 
CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Oleg Nesterov 
CC: Mike Rapoport 
CC: Michal Hocko 
CC: Andy Lutomirski 
CC: "Eric W. Biederman" 
CC: Andrei Vagin 
CC: Cyrill Gorcunov 
CC: Serge Hallyn 
Acked-by: Oleg Nesterov 
Signed-off-by: Eric W. Biederman 
[bwh: Backported to 3.16: the proper cleanup label is bad_fork_free_pid, not
 bad_fork_cancel_cgroup]
Signed-off-by: Ben Hutchings

kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd

2016-11-20T01:17:21+00:00

commit 735f2770a770156100f534646158cb58cb8b2939 upstream.

Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted.  Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).

The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):

: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls.  This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping.  This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.

The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for.  It
seems that limitting the scope of the check to core dumping should work
for SIGSEGV issue describe above.

[Changelog partly based on Andreas' description]
Fixes: fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Tested-by: William Preston 
Acked-by: Oleg Nesterov 
Cc: Roland McGrath 
Cc: Andreas Schwab 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

unshare: Unsharing a thread does not require unsharing a vm

2015-09-28T09:21:39+00:00

commit 12c641ab8270f787dfcce08b5f20ce8b65008096 upstream.

In the logic in the initial commit of unshare made creating a new
thread group for a process, contingent upon creating a new memory
address space for that process.  That is wrong.  Two separate
processes in different thread groups can share a memory address space
and clone allows creation of such proceses.

This is significant because it was observed that mm_users > 1 does not
mean that a process is multi-threaded, as reading /proc/PID/maps
temporarily increments mm_users, which allows other processes to
(accidentally) interfere with unshare() calls.

Correct the check in check_unshare_flags() to test for
!thread_group_empty() for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM.
For sighand->count > 1 for CLONE_SIGHAND and CLONE_VM.
For !current_is_single_threaded instead of mm_users > 1 for CLONE_VM.

By using the correct checks in unshare this removes the possibility of
an accidental denial of service attack.

Additionally using the correct checks in unshare ensures that only an
explicit unshare(CLONE_VM) can possibly trigger the slow path of
current_is_single_threaded().  As an explict unshare(CLONE_VM) is
pointless it is not expected there are many applications that make
that call.

Fixes: b2e0d98705e60e45bbb3c0032c48824ad7ae0704 userns: Implement unshare of the user namespace
Reported-by: Ricky Zhou 
Reported-by: Kees Cook 
Reviewed-by: Kees Cook 
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Luis Henriques

perf: fix perf bug in fork()

2014-10-09T19:23:47+00:00

commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.

Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when failing sched_fork() we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try and
'free' the inherited contexts, which are still in use by the parent
process.  This is bad..

Suggested-by: Oleg Nesterov 
Reported-by: Oleg Nesterov 
Reported-by: Sylvain 'ythier' Hitier 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman