| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit f991af3daabaecff34684fd51fac80319d1baad1 upstream.
The retry logic for netlink_attachskb() inside sys_mq_notify()
is nasty and vulnerable:
1) The sock refcnt is already released when retry is needed
2) The fd is controllable by user-space because we already
release the file refcnt
so we when retry but the fd has been just closed by user-space
during this small window, we end up calling netlink_detachskb()
on the error path which releases the sock again, later when
the user-space closes this socket a use-after-free could be
triggered.
Setting 'sock' to NULL here should be sufficient to fix it.
Reported-by: GeneBlue <geneblue.mail@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9b24fef9f0410fb5364245d6cc2bd044cc064007 ]
Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
to receive rcu freeing function but used generic ipc_rcu_free() instead
of msg_rcu_free() which does security cleaning.
Running LTP msgsnd06 with kmemleak gives the following:
cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88003c0a11f8 (size 8):
comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
hex dump (first 8 bytes):
1b 00 00 00 01 00 00 00 ........
backtrace:
kmemleak_alloc+0x23/0x40
kmem_cache_alloc_trace+0xe1/0x180
selinux_msg_queue_alloc_security+0x3f/0xd0
security_msg_queue_alloc+0x2e/0x40
newque+0x4e/0x150
ipcget+0x159/0x1b0
SyS_msgget+0x39/0x40
entry_SYSCALL_64_fastpath+0x13/0x8f
Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to
only use ipc_rcu_free in case of security allocation failure in newary()
Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
|
|
[ Upstream commit b9a532277938798b53178d5a66af6e2915cb27cf ]
As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
having initialized the IPC object state. Yes, we initialize the IPC
object in a locked state, but with all the lockless RCU lookup work,
that IPC object lock no longer means that the state cannot be seen.
We already did this for the IPC semaphore code (see commit e8577d1f0329:
"ipc/sem.c: fully initialize sem_array before making it visible") but we
clearly forgot about msg and shm.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 3ed1f8a99d70ea1cd1508910eb107d0edcae5009 ]
sem_lock() did not properly pair memory barriers:
!spin_is_locked() and spin_unlock_wait() are both only control barriers.
The code needs an acquire barrier, otherwise the cpu might perform read
operations before the lock test.
As no primitive exists inside <include/spinlock.h> and since it seems
noone wants another primitive, the code creates a local primitive within
ipc/sem.c.
With regards to -stable:
The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
nop (just a preprocessor redefinition to improve the readability). The
bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
starting from 3.10).
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
set exits
[ Upstream commit 602b8593d2b4138c10e922eeaafe306f6b51817b ]
The current semaphore code allows a potential use after free: in
exit_sem we may free the task's sem_undo_list while there is still
another task looping through the same semaphore set and cleaning the
sem_undo list at freeary function (the task called IPC_RMID for the same
semaphore set).
For example, with a test program [1] running which keeps forking a lot
of processes (which then do a semop call with SEM_UNDO flag), and with
the parent right after removing the semaphore set with IPC_RMID, and a
kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
in the kernel log:
Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
Prev obj: start=ffff88003b45c180, len=64
000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
Next obj: start=ffff88003b45c200, len=64
000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<....
BUG: spinlock wrong CPU on CPU#2, test/18028
general protection fault: 0000 [#1] SMP
Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
RIP: spin_dump+0x53/0xc0
Call Trace:
spin_bug+0x30/0x40
do_raw_spin_unlock+0x71/0xa0
_raw_spin_unlock+0xe/0x10
freeary+0x82/0x2a0
? _raw_spin_lock+0xe/0x10
semctl_down.clone.0+0xce/0x160
? __do_page_fault+0x19a/0x430
? __audit_syscall_entry+0xa8/0x100
SyS_semctl+0x236/0x2c0
? syscall_trace_leave+0xde/0x130
entry_SYSCALL_64_fastpath+0x12/0x71
Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0
RSP <ffff88003750fd68>
---[ end trace 783ebb76612867a0 ]---
NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
RIP: native_read_tsc+0x0/0x20
Call Trace:
? delay_tsc+0x40/0x70
__delay+0xf/0x20
do_raw_spin_lock+0x96/0x140
_raw_spin_lock+0xe/0x10
sem_lock_and_putref+0x11/0x70
SYSC_semtimedop+0x7bf/0x960
? handle_mm_fault+0xbf6/0x1880
? dequeue_task_fair+0x79/0x4a0
? __do_page_fault+0x19a/0x430
? kfree_debugcheck+0x16/0x40
? __do_page_fault+0x19a/0x430
? __audit_syscall_entry+0xa8/0x100
? do_audit_syscall_entry+0x66/0x70
? syscall_trace_enter_phase1+0x139/0x160
SyS_semtimedop+0xe/0x10
SyS_semop+0x10/0x20
entry_SYSCALL_64_fastpath+0x12/0x71
Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
Kernel panic - not syncing: softlockup: hung tasks
I wasn't able to trigger any badness on a recent kernel without the
proper config debugs enabled, however I have softlockup reports on some
kernel versions, in the semaphore code, which are similar as above (the
scenario is seen on some servers running IBM DB2 which uses semaphore
syscalls).
The patch here fixes the race against freeary, by acquiring or waiting
on the sem_undo_list lock as necessary (exit_sem can race with freeary,
while freeary sets un->semid to -1 and removes the same sem_undo from
list_proc or when it removes the last sem_undo).
After the patch I'm unable to reproduce the problem using the test case
[1].
[1] Test case used below:
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>
#define NSEM 1
#define NSET 5
int sid[NSET];
void thread()
{
struct sembuf op;
int s;
uid_t pid = getuid();
s = rand() % NSET;
op.sem_num = pid % NSEM;
op.sem_op = 1;
op.sem_flg = SEM_UNDO;
semop(sid[s], &op, 1);
exit(EXIT_SUCCESS);
}
void create_set()
{
int i, j;
pid_t p;
union {
int val;
struct semid_ds *buf;
unsigned short int *array;
struct seminfo *__buf;
} un;
/* Create and initialize semaphore set */
for (i = 0; i < NSET; i++) {
sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
if (sid[i] < 0) {
perror("semget");
exit(EXIT_FAILURE);
}
}
un.val = 0;
for (i = 0; i < NSET; i++) {
for (j = 0; j < NSEM; j++) {
if (semctl(sid[i], j, SETVAL, un) < 0)
perror("semctl");
}
}
/* Launch threads that operate on semaphore set */
for (i = 0; i < NSEM * NSET * NSET; i++) {
p = fork();
if (p < 0)
perror("fork");
if (p == 0)
thread();
}
/* Free semaphore set */
for (i = 0; i < NSET; i++) {
if (semctl(sid[i], NSEM, IPC_RMID))
perror("IPC_RMID");
}
/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}
int main(int argc, char **argv)
{
pid_t p;
srand(time(NULL));
while (1) {
p = fork();
if (p < 0) {
perror("fork");
exit(EXIT_FAILURE);
}
if (p == 0) {
create_set();
goto end;
}
/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}
end:
return 0;
}
[akpm@linux-foundation.org: use normal comment layout]
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
CC: Aristeu Rozanski <aris@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
account
[ Upstream commit de54b9ac253787c366bbfb28d901a31954eb3511 ]
A while back, the message queue implementation in the kernel was
improved to use btrees to speed up retrieval of messages, in commit
d6629859b36d ("ipc/mqueue: improve performance of send/recv").
That patch introducing the improved kernel handling of message queues
(using btrees) has, as a by-product, changed the meaning of the QSIZE
field in the pseudo-file created for the queue. Before, this field
reflected the size of the user-data in the queue. Since, it also takes
kernel data structures into account. For example, if 13 bytes of user
data are in the queue, on my machine the file reports a size of 61
bytes.
There was some discussion on this topic before (for example
https://lkml.org/lkml/2014/10/1/115). Commenting on a th lkml, Michael
Kerrisk gave the following background
(https://lkml.org/lkml/2015/6/16/74):
The pseudofiles in the mqueue filesystem (usually mounted at
/dev/mqueue) expose fields with metadata describing a message
queue. One of these fields, QSIZE, as originally implemented,
showed the total number of bytes of user data in all messages in
the message queue, and this feature was documented from the
beginning in the mq_overview(7) page. In 3.5, some other (useful)
work happened to break the user-space API in a couple of places,
including the value exposed via QSIZE, which now includes a measure
of kernel overhead bytes for the queue, a figure that renders QSIZE
useless for its original purpose, since there's no way to deduce
the number of overhead bytes consumed by the implementation.
(The other user-space breakage was subsequently fixed.)
This patch removes the accounting of kernel data structures in the
queue. Reporting the size of these data-structures in the QSIZE field
was a breaking change (see Michael's comment above). Without the QSIZE
field reporting the total size of user-data in the queue, there is no
way to deduce this number.
It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
against the worst-case size of the queue (in both the old and the new
implementation). Therefore, the kernel overhead accounting in QSIZE is
not necessary to help the user understand the limitations RLIMIT imposes
on the processes.
Signed-off-by: Marcus Gelderie <redmnic@gmail.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: John Duffy <jb_duffy@btinternet.com>
Cc: Arto Bendiken <arto@bendiken.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
ipc_addid() makes a new ipc identifier visible to everyone. New objects
start as locked, so that the caller can complete the initialization
after the call. Within struct sem_array, at least sma->sem_base and
sma->sem_nsems are accessed without any locks, therefore this approach
doesn't work.
Thus: Move the ipc_addid() to the end of the initialization.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Rik van Riel <riel@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Resolve some shadow warnings produced in W=2 builds by changing the name
of some parameters and local variables. Change instances of "s64"
because that clashes with the well-known typedef. Also change a local
variable with the name "up" because that clashes with the name of of the
"up" function for semaphores. These are hazards so eliminate the
hazards by renaming them.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Using __seq_open_private() removes boilerplate code from
sysvipc_proc_open().
The resultant code is shorter and easier to follow.
However, please note that __seq_open_private() call kzalloc() rather than
kmalloc() which may affect timing due to the memory initialisation
overhead.
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
do_shmat() is the only user of ->start_stack (proc just reports its
value), and this check looks ugly and wrong.
The reason for this check is not clear at all, and it wrongly assumes that
the stack can only grow down.
But the main problem is that in general mm->start_stack has nothing to do
with stack_vma->vm_start. Not only the application can switch to another
stack and even unmap this area, setup_arg_pages() expands the stack
without updating mm->start_stack during exec(). This means that in the
likely case "addr > start_stack - size - PAGE_SIZE * 5" is simply
impossible after find_vma_intersection() == F, or the stack can't grow
anyway because of RLIMIT_STACK.
Many thanks to Hugh for his explanations.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.
Below you can find two examples. In the new value has not been handled
properly.
$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace
$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0
$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax
Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull "trivial tree" updates from Jiri Kosina:
"Usual pile from trivial tree everyone is so eagerly waiting for"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
Remove MN10300_PROC_MN2WS0038
mei: fix comments
treewide: Fix typos in Kconfig
kprobes: update jprobe_example.c for do_fork() change
Documentation: change "&" to "and" in Documentation/applying-patches.txt
Documentation: remove obsolete pcmcia-cs from Changes
Documentation: update links in Changes
Documentation: Docbook: Fix generated DocBook/kernel-api.xml
score: Remove GENERIC_HAS_IOMAP
gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
tty: doc: Fix grammar in serial/tty
dma-debug: modify check_for_stack output
treewide: fix errors in printk
genirq: fix reference in devm_request_threaded_irq comment
treewide: fix synchronize_rcu() in comments
checkstack.pl: port to AArch64
doc: queue-sysfs: minor fixes
init/do_mounts: better syntax description
MIPS: fix comment spelling
powerpc/simpleboot: fix comment
...
|
|
This patch fix spelling typo found in DocBook/kernel-api.xml.
It is because the file is generated from the source comments,
I have to fix the comments in source codes.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
"This is a bunch of small changes built against 3.16-rc6. The most
significant change for users is the first patch which makes setns
drmatically faster by removing unneded rcu handling.
The next chunk of changes are so that "mount -o remount,.." will not
allow the user namespace root to drop flags on a mount set by the
system wide root. Aks this forces read-only mounts to stay read-only,
no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
mounts to stay no exec and it prevents unprivileged users from messing
with a mounts atime settings. I have included my test case as the
last patch in this series so people performing backports can verify
this change works correctly.
The next change fixes a bug in NFS that was discovered while auditing
nsproxy users for the first optimization. Today you can oops the
kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
with pid namespaces. I rebased and fixed the build of the
!CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
that no one to my knowledge bases anything on my tree fixing the typo
in place seems more responsible that requiring a typo-fix to be
backported as well.
The last change is a small semantic cleanup introducing
/proc/thread-self and pointing /proc/mounts and /proc/net at it. This
prevents several kinds of problemantic corner cases. It is a
user-visible change so it has a minute chance of causing regressions
so the change to /proc/mounts and /proc/net are individual one line
commits that can be trivially reverted. Unfortunately I lost and
could not find the email of the original reporter so he is not
credited. From at least one perspective this change to /proc/net is a
refgression fix to allow pthread /proc/net uses that were broken by
the introduction of the network namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
proc: Implement /proc/thread-self to point at the directory of the current thread
proc: Have net show up under /proc/<tgid>/task/<tid>
NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
namespaces: Use task_lock and not rcu to protect nsproxy
|
|
If shm_rmid_force (the default state) is not set then the shmids are only
marked as orphaned and does not require any add, delete, or locking of the
tree structure.
Seperate the sysctl on and off case, and only obtain the read lock. The
newly added list head can be deleted under the read lock because we are
only called with current and will only change the semids allocated by this
task and not manipulate the list.
This commit assumes that up_read includes a sufficient memory barrier for
the writes to be seen my others that later obtain a write lock.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jack Miller <millerjo@us.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is small set of patches our team has had kicking around for a few
versions internally that fixes tasks getting hung on shm_exit when there
are many threads hammering it at once.
Anton wrote a simple test to cause the issue:
http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
Before applying this patchset, this test code will cause either hanging
tracebacks or pthread out of memory errors.
After this patchset, it will still produce output like:
root@somehost:~# ./bust_shm_exit 1024 160
...
INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
INFO: Stall ended before state dump start
...
But the task will continue to run along happily, so we consider this an
improvement over hanging, even if it's a bit noisy.
This patch (of 3):
exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
walks every shared memory segment in the namespace. Thus the amount of
work is related to the number of shm segments in the namespace not the
number of segments that might need to be cleaned.
In addition, this occurs after the task has been notified the thread has
exited, so the number of tasks waiting for the ns shm rwsem can grow
without bound until memory is exausted.
Add a list to the task struct of all shmids allocated by this task. Init
the list head in copy_process. Use the ns->rwsem for locking. Add
segments after id is added, remove before removing from id.
On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
to handling of semaphore undo.
I chose a define for the init sequence since its a simple list init,
otherwise it would require a function call to avoid include loops between
the semaphore code and the task struct. Converting the list_del to
list_del_init for the unshare cases would remove the exit followed by
init, but I left it blow up if not inited.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jack Miller <millerjo@us.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.
Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare. So optimize for same process reads and write
by switching using rask_lock instead.
This yields a simpler to understand lock, and a faster setns system call.
In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.
This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
from 2007. The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
always (since 0.99.10) reported a thread as sleeping on all semaphores
that are listed in the semop() call.
The documented behavior (both in the Linux man page and in the Single
Unix Specification) is that a task should be reported on exactly one
semaphore: The semaphore that caused the thread to got to sleep.
This patch adds a pr_info_once() that is triggered if a thread hits the
relevant case.
The code triggers slightly too often, otherwise it would be necessary to
replicate the old code. As there are no known users of GETNCNT or
GETZCNT, this is done to prevent unnecessary bloat.
The task that triggered is reported with name (tsk->comm) and pid.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
waits on exactly one semaphore: The semaphore from the first operation
in the sop array that cannot proceed.
The Linux implementation never followed the standard, it tried to count
all semaphores that might be the reason why a task sleeps.
This patch fixes that.
Note:
a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
therefore the code counts them only on demand.
(If they wouldn't be rare, then the non-compliance would have
been found earlier)
b) compared to the initial version of the patch, the BUG_ONs were removed
and it was clarified that the new behavior conforms to SUS.
Back-compatibility concerns:
Manfred:
: - there is no application in Fedora that uses GETNCNT or GETZCNT.
:
: - application that use only single-sop semop() are also safe, the
: difference only affects complex apps.
:
: - portable application are also safe, the new behavior is standard
: compliant.
:
: But that's it. The old behavior existed in Linux from 0.99.something
: until now.
Michael:
: * These operations seem to be very little used. Grepping the public
: source that is contained Fedora 20 source DVD, there appear to be no
: uses. Of course, this says nothing about uses in private /
: non-mainstream FOSS code, but it seems likely that the same pattern
: is followed there.
:
: * The existing behavior is hard enough to understand that I suspect
: that no one understood it well enough to rely on it anyway
: (especially as that behavior contradicted both man page and POSIX).
:
: So, there's a chance of breakage, but I estimate that it's minute.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Preparation for the next patch:
In the slow-path of perform_atomic_semop(), store a pointer to the
operation that caused the operation to block.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Right now, perform_atomic_semop gets the content of sem_queue as
individual fields. Changes that, instead pass a pointer to sem_queue.
This is a preparation for the next patch: it uses sem_queue to store the
reason why a task must sleep.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
count_semzcnt and count_semncnt are more of less identical. The patch
creates a single function that either counts the number of tasks waiting
for zero or waiting due to a decrease operation.
Compared to the initial version, the BUG_ONs were removed.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
GETZCNT is supposed to return the number of threads that wait until a
semaphore value becomes 0.
The current implementation overlooks complex operations that contain
both wait-for-zero operation and operations that alter at least one
semaphore.
The patch fixes that. It's intentionally copy&paste, this will be
cleaned up in the next patch.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The need for volatile is not obvious, document it.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Nothing big and no logical changes, just get rid of some redundant
function declarations. Move msg_[init/exit]_ns down the end of the
file.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Call __set_current_state() instead of assigning the new state directly.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Manfred Spraul <manfred@colorfullif.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SHMMAX is the upper limit for the size of a shared memory segment, counted
in bytes. The actual allocation is that size, rounded up to the next full
page.
Add a check that prevents the creation of segments where the rounded up
size causes an integer overflow.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
shm_tot counts the total number of pages used by shm segments.
If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can
overflow. Subsequent calls to shmctl(,SHM_INFO,) would return wrong
values for shm_tot.
The patch adds a detection for overflows.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The increase of SHMMAX/SHMALL is a 4 patch series.
The change itself is trivial, the only problem are interger overflows.
The overflows are not new, but if we make huge values the default, then
the code should be free from overflows.
SHMMAX:
- shmmem_file_setup places a hard limit on the segment size:
MAX_LFS_FILESIZE.
On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are
possible. Rounded up to full pages the actual allocated size
is 0. --> must be fixed, patch 3
- shmat:
- find_vma_intersection does not handle overflows properly.
--> must be fixed, patch 1
- the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE
and checks for overflows (i.e.: map 2 GB, starting from
addr=2.5GB fails).
SHMALL:
- after creating 8192 segments size (1L<<63)-1, shm_tot overflows and
returns 0. --> must be fixed, patch 2.
Userspace:
- Obviously, there could be overflows in userspace. There is nothing
we can do, only use values smaller than ULONG_MAX.
I ended with "ULONG_MAX - 1L<<24":
- TASK_SIZE cannot be used because it is the size of the current
task. Could be 4G if it's a 32-bit task on a 64-bit kernel.
- The maximum size is not standardized across archs:
I found TASK_MAX_SIZE, TASK_SIZE_MAX and TASK_SIZE_64.
- Just in case some arch revives a 4G/4G split, nearly
ULONG_MAX is a valid segment size.
- Using "0" as a magic value for infinity is even worse, because
right now 0 means 0, i.e. fail all allocations.
This patch (of 4):
find_vma_intersection() does not work as intended if addr+size overflows.
The patch adds a manual check before the call to find_vma_intersection.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
trailing whitespace
Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>
Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is no need to recreate the very same ipc_ops structure on every
kernel entry for msgget/semget/shmget. Just declare it static and be
done with it. While at it, constify it as we don't modify the structure
at runtime.
Found in the PaX patch, written by the PaX Team.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
... since __initcall is now deprecated.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This macro appears to have been introduced back in the 2.5 era for
semtimedop32 backward compatibility on ia32:
https://lkml.org/lkml/2003/4/28/78
Nowadays, this syscall in compat just defaults back to the code found in
sem.c, so it is no longer used and can thus be removed:
long compat_sys_semtimedop(int semid, struct sembuf __user *tsems,
unsigned nsops, const struct compat_timespec __user *timeout)
{
struct timespec __user *ts64;
if (compat_convert_timespec(&ts64, timeout))
return -EFAULT;
return sys_semtimedop(semid, tsems, nsops, ts64);
}
Furthermore, there are no users in compat.c. After this change, kernel
builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull compat time conversion changes from Peter Anvin:
"Despite the branch name this is really neither an x86 nor an
x32-specific patchset, although it the implementation of the
discussions that followed the x32 security hole a few months ago.
This removes get/put_compat_timespec/val() and replaces them with
compat_get/put_timespec/val() which are savvy as to the current status
of COMPAT_USE_64BIT_TIME.
It removes several unused and/or incorrect/misleading functions (like
compat_put_timeval_convert which doesn't in fact do any conversion)
and also replaces several open-coded implementations what is now
called compat_convert_timespec() with that function"
* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
compat: Fix sparse address space warnings
compat: Get rid of (get|put)_compat_time(val|spec)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 compat wrapper rework from Heiko Carstens:
"S390 compat system call wrapper simplification work.
The intention of this work is to get rid of all hand written assembly
compat system call wrappers on s390, which perform proper sign or zero
extension, or pointer conversion of compat system call parameters.
Instead all of this should be done with C code eg by using Al's
COMPAT_SYSCALL_DEFINEx() macro.
Therefore all common code and s390 specific compat system calls have
been converted to the COMPAT_SYSCALL_DEFINEx() macro.
In order to generate correct code all compat system calls may only
have eg compat_ulong_t parameters, but no unsigned long parameters.
Those patches which change parameter types from unsigned long to
compat_ulong_t parameters are separate in this series, but shouldn't
cause any harm.
The only compat system calls which intentionally have 64 bit
parameters (preadv64 and pwritev64) in support of the x86/32 ABI
haven't been changed, but are now only available if an architecture
defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.
System calls which do not have a compat variant but still need proper
zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
get a proper wrapper function with the new s390 specific
COMPAT_SYSCALL_WRAPx() macro:
COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);
which generates the following code (simplified):
asmlinkage long sys_brk(unsigned long brk);
asmlinkage long compat_sys_brk(long brk)
{
return sys_brk((u32)brk);
}
Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
includes both linux/syscall.h and linux/compat.h, it will generate
build errors, if the declaration of sys_brk() doesn't match, or if
there exists a non-matching compat_sys_brk() declaration.
In addition this will intentionally result in a link error if
som |