linux.git/include/linux, branch v3.15.3

genirq: Sanitize spurious interrupt detection of threaded irqs

2014-07-01T03:14:01+00:00

commit 1e77d0a1ed7417d2a5a52a7b8d32aea1833faa6c upstream.

Till reported that the spurious interrupt detection of threaded
interrupts is broken in two ways:

- note_interrupt() is called for each action thread of a shared
  interrupt line. That's wrong as we are only interested in whether none
  of the device drivers felt responsible for the interrupt, but by
  calling multiple times for a single interrupt line we account
  IRQ_NONE even if one of the drivers felt responsible.

- note_interrupt() when called from the thread handler is not
  serialized. That leaves the members of irq_desc which are used for
  the spurious detection unprotected.

To solve this we need to defer the spurious detection of a threaded
interrupt to the next hardware interrupt context where we have
implicit serialization.

If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
check whether the previous interrupt requested a deferred check. If
not, we request a deferred check for the next hardware interrupt and
return.

If a deferred check was requested, we check whether one of the
interrupt threads signaled success. Depending on this information we
feed the result into the spurious detector.

If one primary handler of a shared interrupt returns IRQ_HANDLED we
disable the deferred check of irq threads on the same line, as we have
found at least one device driver who cared.
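
The deferral described above can be sketched as a small userspace model.
This is a simplified illustration under stated assumptions, not the kernel
code: every model_* name, the enum values and the verdict type are
hypothetical, and keeping the "deferred" flag in the top bit of the
last-seen counter is a choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Handler results, modeled loosely on irqreturn_t (values are illustrative). */
enum model_ret { MODEL_NONE = 0, MODEL_HANDLED = 1, MODEL_WAKE_THREAD = 2 };

/* What the model feeds into the spurious detector for one hardware interrupt. */
enum model_verdict { VERDICT_DEFERRED, VERDICT_HANDLED, VERDICT_UNHANDLED };

#define MODEL_DEFERRED_FLAG 0x80000000u

struct model_line {
    uint32_t threads_handled;      /* bumped by a thread that handled its device */
    uint32_t threads_handled_last; /* last snapshot, plus the deferred flag */
};

/* Called by an interrupt thread that signaled success. */
static void model_thread_handled(struct model_line *l)
{
    l->threads_handled++;
}

/* Called once per hardware interrupt with the OR of all primary handler results. */
static enum model_verdict model_note_interrupt(struct model_line *l, unsigned ret)
{
    uint32_t seen;

    assert(ret <= 3u);

    /* No thread was woken: the primary handlers alone decide. */
    if (!(ret & MODEL_WAKE_THREAD))
        return (ret & MODEL_HANDLED) ? VERDICT_HANDLED : VERDICT_UNHANDLED;

    /* A primary handler cared: drop any pending deferred check. */
    if (ret & MODEL_HANDLED) {
        l->threads_handled_last &= ~MODEL_DEFERRED_FLAG;
        return VERDICT_HANDLED;
    }

    /* Only threads were woken and no check is pending: request one and return. */
    if (!(l->threads_handled_last & MODEL_DEFERRED_FLAG)) {
        l->threads_handled_last |= MODEL_DEFERRED_FLAG;
        return VERDICT_DEFERRED;
    }

    /* A check is pending: did any thread report success since the last look? */
    seen = l->threads_handled | MODEL_DEFERRED_FLAG;
    if (seen != l->threads_handled_last) {
        l->threads_handled_last = seen;
        return VERDICT_HANDLED;
    }
    return VERDICT_UNHANDLED;
}
```

In this model the verdict for a threaded interrupt always lags one hardware
interrupt behind, which is the price of doing the accounting in a context
that is implicitly serialized.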

Reported-by: Till Straumann 
Signed-off-by: Thomas Gleixner 
Tested-by: Austin Schuh 
Cc: Oliver Hartkopp 
Cc: Wolfgang Grandegger 
Cc: Pavel Pisa 
Cc: Marc Kleine-Budde 
Cc: linux-can@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
Signed-off-by: Greg Kroah-Hartman

ACPI: add dynamic_debug support

2014-07-01T03:13:58+00:00

commit 45fef5b88d1f2f47ecdefae6354372d440ca5c84 upstream.

Commit 1a699476e258 ("ACPI / hotplug / PCI: Hotplug notifications
from acpi_bus_notify()") added debug messages for a few common
events. These debug messages are unconditionally enabled if
CONFIG_DYNAMIC_DEBUG is defined, contrary to the documented
meaning, making the ACPI system spew lots of unwanted noise on
any kernel with dynamic debugging.

The bug was introduced by commit fbfddae69657 ("ACPI: Add
acpi_handle_<level>() interfaces"), which added the
CONFIG_DYNAMIC_DEBUG dependency without respecting its meaning.

Fix by adding real support for dynamic_debug.

Fixes: fbfddae69657 ("ACPI: Add acpi_handle_<level>() interfaces")
Signed-off-by: Bjørn Mork 
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Greg Kroah-Hartman

ext4: fix data integrity sync in ordered mode

2014-07-01T03:13:56+00:00

commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream.

When we perform a data integrity sync we tag all the dirty pages with
PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.  Later we
check for this tag in write_cache_pages_da and create a struct
mpage_da_data containing contiguously indexed pages tagged with this
tag and sync these pages with a call to mpage_da_map_and_submit.  This
process is done in a while loop until all the PAGECACHE_TAG_TOWRITE
pages are synced. We also do journal start and stop in each iteration.
journal_stop could initiate a journal commit which would call
ext4_writepage which in turn will call ext4_bio_write_page even for
delayed or unwritten buffers. When ext4_bio_write_page is called for
such buffers, it does not sync them, but it clears the
PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages
are also not synced by the currently running data integrity sync. We
will end up with dirty pages although sync is completed.

This could cause a potential data loss when the sync call is followed
by a truncate_pagecache call, which is exactly the case in
collapse_range.  (It will cause generic/127 failure in xfstests)

To avoid this issue, we can use set_page_writeback_keepwrite, which
doesn't clear the TOWRITE tag, instead of set_page_writeback.
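
The difference between the two calls can be illustrated with a small
userspace model. This is only a sketch: struct model_page, the MODEL_TAG_*
values and both function names are hypothetical stand-ins, and the real
page cache keeps these tags in the radix tree rather than in the page.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_TAG_DIRTY     (1u << 0)
#define MODEL_TAG_TOWRITE   (1u << 1)
#define MODEL_TAG_WRITEBACK (1u << 2)

struct model_page {
    unsigned tags; /* page cache tags currently set for this page */
    bool dirty;    /* whether the page itself is still dirty */
};

/*
 * Mark a page as under writeback. With keep_write == false the TOWRITE
 * tag is dropped, so a data integrity sync that is still running will no
 * longer find the page; with keep_write == true the tag survives.
 */
static void model_set_writeback(struct model_page *p, bool keep_write)
{
    assert(p);
    p->tags |= MODEL_TAG_WRITEBACK;
    if (!p->dirty)
        p->tags &= ~MODEL_TAG_DIRTY;
    if (!keep_write)
        p->tags &= ~MODEL_TAG_TOWRITE;
}

/* A data integrity sync only visits pages that still carry TOWRITE. */
static bool model_sync_will_visit(const struct model_page *p)
{
    return (p->tags & MODEL_TAG_TOWRITE) != 0;
}
```

A page whose buffers are delayed or unwritten stays dirty after the call,
so losing the TOWRITE tag in the first variant is what lets the running
sync skip it.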

Signed-off-by: Namjae Jeon 
Signed-off-by: Ashish Sangwan 
Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

ptrace: fix fork event messages across pid namespaces

2014-07-01T03:13:55+00:00

commit 4e52365f279564cef0ddd41db5237f0471381093 upstream.

When tracing a process in another pid namespace, it's important for fork
event messages to contain the child's pid as seen from the tracer's pid
namespace, not the parent's.  Otherwise, the tracer won't be able to
correlate the fork event with later SIGTRAP signals it receives from the
child.

We still risk a race condition if a ptracer from a different pid
namespace attaches after we compute the pid_t value.  However, sending a
bogus fork event message in this unlikely scenario is still a vast
improvement over the status quo where we always send bogus fork event
messages to debuggers in a different pid namespace than the forking
process.

Signed-off-by: Matthew Dempsky 
Acked-by: Oleg Nesterov 
Cc: Kees Cook 
Cc: Julien Tinnes 
Cc: Roland McGrath 
Cc: Jan Kratochvil 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: page_alloc: use word-based accesses for get/set pageblock bitmaps

2014-07-01T03:13:54+00:00

commit e58469bafd0524e848c3733bc3918d854595e20f upstream.

The test_bit operations in get/set pageblock flags are expensive.  This
patch reads the bitmap on a word basis and uses shifts and masks to
isolate the bits of interest.  Similarly, masks are used to set a local
copy of the bitmap, and cmpxchg is then used to update the bitmap if
there have been no other changes made in parallel.
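
The word-based approach can be sketched in userspace with C11 atomics.
This is an illustration under assumptions, not the kernel code: the
get_field()/set_field() names, the 4-bit field width and the low-to-high
bit layout are all made up for the example, and
atomic_compare_exchange_weak() stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define FIELD_BITS      4ul /* assumed number of flag bits per block */
#define FIELD_MASK      ((1ul << FIELD_BITS) - 1)
#define FIELDS_PER_WORD (sizeof(unsigned long) * CHAR_BIT / FIELD_BITS)

/* Read all bits of one block from a single word load, so they are consistent. */
static unsigned long get_field(_Atomic unsigned long *bitmap, unsigned long block)
{
    unsigned long word = atomic_load(&bitmap[block / FIELDS_PER_WORD]);
    unsigned long shift = (block % FIELDS_PER_WORD) * FIELD_BITS;

    return (word >> shift) & FIELD_MASK;
}

/* Update one block's bits without losing concurrent updates to its neighbours. */
static void set_field(_Atomic unsigned long *bitmap, unsigned long block,
                      unsigned long value)
{
    _Atomic unsigned long *slot = &bitmap[block / FIELDS_PER_WORD];
    unsigned long shift = (block % FIELDS_PER_WORD) * FIELD_BITS;
    unsigned long mask = FIELD_MASK << shift;
    unsigned long old = atomic_load(slot);

    assert(value <= FIELD_MASK);

    /*
     * On failure the compare-exchange stores the current word into 'old',
     * so the replacement word is recomputed from fresh data on each retry.
     */
    while (!atomic_compare_exchange_weak(slot, &old,
                                         (old & ~mask) | (value << shift)))
        ;
}
```

Because the reader takes one snapshot of the word and the writer publishes
the whole word at once, neither a half-updated field nor a lost update to
a neighbouring field can be observed in this model.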

In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.

In addition to the performance benefits, this patch closes races that are
possible between:

a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
   reads part of the bits before and other part of the bits after
   set_pageblock_migratetype() has updated them.

b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
   read-modify-update set bit operation in set_pageblock_skip() will cause
   lost updates to some bits changed in the set_pageblock_migratetype().

Joonsoo Kim first reported the case a) via code inspection.  Vlastimil
Babka's testing with a debug patch showed that either a) or b) occurs
roughly once per mmtests' stress-highalloc benchmark (although not
necessarily in the same pageblock).  Furthermore during development of
unrelated compaction patches, it was observed that with frequent calls
to {start,undo}_isolate_page_range() the race occurs several thousand
times and has resulted in NULL pointer dereferences in move_freepages()
and free_one_page() in places where free_list[migratetype] is
manipulated by e.g.  list_move().  Further debugging confirmed that
migratetype had an invalid value of 6, causing out of bounds access to the
free_list array.

That confirmed that the race exists, although it may be extremely rare,
and currently only fatal where page isolation is performed due to
memory hot remove.  Races on pageblocks being updated by
set_pageblock_migratetype(), where both old and new migratetype are
lower than MIGRATE_RESERVE, currently cannot result in an invalid value
being observed, although theoretically they may still lead to
unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
Furthermore, things could get suddenly worse when memory isolation is
used more, or when new migratetypes are added.

After this patch, the race has no longer been observed in testing.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Reported-by: Joonsoo Kim 
Reported-and-tested-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

hugetlb: restrict hugepage_migration_support() to x86_64

2014-07-01T03:13:54+00:00

commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream.

Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs.  So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.

Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().

Signed-off-by: Naoya Horiguchi 
Reported-by: Michael Ellerman 
Tested-by: Michael Ellerman 
Acked-by: Hugh Dickins 
Cc: Benjamin Herrenschmidt 
Cc: Tony Luck 
Cc: Russell King 
Cc: Martin Schwidefsky 
Cc: James Hogan 
Cc: Ralf Baechle 
Cc: David Miller 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

fs,userns: Change inode_capable to capable_wrt_inode_uidgid

2014-06-16T20:44:09+00:00

commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.

The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces.  For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.
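
The renamed check can be pictured with the following userspace model. It
is a sketch under assumptions, not the kernel implementation: the model_*
types, the single contiguous id range per namespace and the capability
bitmask are simplifications introduced for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a user namespace with one mapped range per id type. */
struct model_userns {
    unsigned uid_first, uid_count; /* kernel uids [first, first + count) are mapped */
    unsigned gid_first, gid_count; /* kernel gids [first, first + count) are mapped */
    unsigned long caps;            /* capabilities the caller holds in this namespace */
};

/* Hypothetical stand-in for the ownership fields of an inode. */
struct model_inode {
    unsigned kuid, kgid;
};

static bool model_id_mapped(unsigned first, unsigned count, unsigned id)
{
    return id >= first && id - first < count;
}

/*
 * The capability alone is not enough: the inode's owner and group must
 * both have a mapping in the caller's namespace.
 */
static bool model_capable_wrt_inode_uidgid(const struct model_userns *ns,
                                           const struct model_inode *inode,
                                           unsigned cap)
{
    assert(cap < sizeof(ns->caps) * CHAR_BIT);
    return ((ns->caps >> cap) & 1ul) &&
           model_id_mapped(ns->uid_first, ns->uid_count, inode->kuid) &&
           model_id_mapped(ns->gid_first, ns->gid_count, inode->kgid);
}
```

In this model a caller that holds a capability inside its own namespace
still fails the check for an inode owned by an id outside the mapped
range, for example one owned by uid 0 of the parent namespace.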

Fixes CVE-2014-4014.

Cc: Theodore Ts'o 
Cc: Serge Hallyn 
Cc: "Eric W. Biederman" 
Cc: Dave Chinner 
Signed-off-by: Andy Lutomirski 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu

2014-06-04T16:56:03+00:00

Pull percpu fix from Tejun Heo:
 "It is very late but this is an important percpu-refcount fix from
  Sebastian Ott.

  The problem is that percpu_ref_*() used __this_cpu_*() instead of
  this_cpu_*().  The difference between the two is that the latter is
  atomic on the local cpu while the former is not.  this_cpu_inc() is
  guaranteed to increment the percpu counter on the cpu that the
  operation is executed on without any synchronization; however,
  __this_cpu_inc() doesn't and if the local cpu invokes the function
  from different contexts (e.g.  process and irq) of the same CPU, it's
  not guaranteed to actually increment as it may be implemented as rmw.

  This bug existed from the get-go but it hasn't been noticed earlier
  probably because on x86 __this_cpu_inc() is equivalent to
  this_cpu_inc() as both get translated into single instruction;
  however, s390 uses the generic rmw implementation and gets affected by
  the bug.  Kudos to Sebastian and Heiko for diagnosing it.

  The change is very low risk and fixes a critical issue on the affected
  architectures, so I think it's a good candidate for inclusion although
  it's very late in the devel cycle.  On the other hand, this has been
  broken since v3.11, so backporting it through -stable post -rc1 won't
  be the end of the world.

  I'll ping Christoph whether __this_cpu_*() ops can be better annotated
  so that it can trigger lockdep warning when used from multiple
  contexts"

* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu-refcount: fix usage of this_cpu_ops

percpu-refcount: fix usage of this_cpu_ops

2014-06-04T16:12:29+00:00

The percpu-refcount infrastructure uses the underscore variants of
this_cpu_ops (e.g. __this_cpu_inc()) in order to modify percpu
reference counters.

However the underscore variants do not atomically update the percpu
variable, instead they may be implemented using read-modify-write
semantics (more than one instruction).  Therefore it is only safe to
use the underscore variant if the context is always the same (process,
softirq, or hardirq). Otherwise it is possible to lose updates.

This problem is something that Sebastian has seen within the aio
subsystem which uses percpu refcounters both in process and softirq
context, leading to reference counts that never dropped to zero even
though the number of "get" and "put" calls matched.

Fix this by using the non-underscore this_cpu_ops variant which
provides correct per cpu atomic semantics and fixes the corrupted
reference counts.
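
The lost update can be reproduced deterministically in a userspace model.
This is a sketch only: the model_* functions are hypothetical, and an
explicit callback stands in for an interrupt that arrives between the
read and the write of a read-modify-write sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by a handler that interrupts the increment on the same CPU. */
typedef void (*model_irq_fn)(long *counter);

/*
 * Models an increment implemented as separate read, modify and write
 * steps. The interrupt lands after the read, so its update is
 * overwritten by the stale value.
 */
static void model_rmw_inc(long *counter, model_irq_fn irq)
{
    long tmp;

    assert(counter);
    tmp = *counter;     /* read */
    if (irq)
        irq(counter);   /* interrupt modifies the counter in between */
    *counter = tmp + 1; /* write back stale value plus one */
}

/*
 * Models an increment that a local interrupt cannot split: the interrupt
 * can only run before or after the whole update (here: after).
 */
static void model_atomic_inc(long *counter, model_irq_fn irq)
{
    assert(counter);
    *counter += 1;
    if (irq)
        irq(counter);
}

/* A "put" issued from interrupt context. */
static void model_irq_put(long *counter)
{
    *counter -= 1;
}
```

Starting from a count of 1, one "get" interrupted by one "put" should
leave the count at 1. The split variant ends at 2 because the "put" is
lost, which matches the reported symptom of counts that never reach zero.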

Cc: Kent Overstreet 
Cc: <stable@vger.kernel.org> # v3.11+
Reported-by: Sebastian Ott 
Signed-off-by: Heiko Carstens 
Signed-off-by: Tejun Heo 
References: http://lkml.kernel.org/g/alpine.LFD.2.11.1406041540520.21183@denkbrett

kernfs: move the last knowledge of sysfs out from kernfs

2014-06-03T15:11:18+00:00

There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However, this should be kernfs user specific,
so this patch moves it out. Kernfs users should specify their
magic number while mounting.

Signed-off-by: Jianyu Zhan 
Acked-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Linus Torvalds