linux.git/kernel, branch v3.15-rc7

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-05-23T17:04:04+00:00

Pull scheduler fixes from Ingo Molnar:
 "The biggest commit is an irqtime accounting loop latency fix, the rest
  are misc fixes all over the place: deadline scheduling, docs, numa,
  balancer and a bad to-idle latency fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Initialize newidle balance stats in sd_numa_init()
  sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
  sched: Skip double execution of pick_next_task_fair()
  sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
  sched/deadline: Fix memory leak
  sched/deadline: Fix sched_yield() behavior
  sched: Sanitize irq accounting madness
  sched/docbook: Fix 'make htmldocs' warnings caused by missing description

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-05-23T17:02:34+00:00

Pull perf fixes from Ingo Molnar:
 "The biggest changes are fixes for races that kept triggering Trinity
  crashes, plus liblockdep build fixes and smaller misc fixes.

  The liblockdep bits in perf/urgent are a pull mistake - they should
  have been in locking/urgent - but by the time I noticed other commits
  were added and testing was done :-/ Sorry about that"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
  perf: Prevent false warning in perf_swevent_add
  perf: Limit perf_event_attr::sample_period to 63 bits
  tools/liblockdep: Remove all build files when doing make clean
  tools/liblockdep: Build liblockdep from tools/Makefile
  perf/x86/intel: Fix Silvermont's event constraints
  perf: Fix perf_event_init_context()
  perf: Fix race in removing an event

Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2014-05-21T09:36:40+00:00

Pull more cgroup fixes from Tejun Heo:
 "Three more patches to fix cgroup_freezer breakage due to the recent
  cgroup internal locking changes - an operation cgroup_freezer was
  using now requires sleepable context and cgroup_freezer was invoking
  that while holding a spin lock.  cgroup_freezer was using an overly
  elaborate hierarchical locking scheme.

  While it's possible to convert the hierarchical spinlocks directly to
  mutexes, this patch simplifies the overall locking so that it uses a
  global mutex.  This has the added benefit of avoiding iterating
  potentially huge number of tasks under a spinlock.  While the patch is
  on the larger side in the devel cycle, the changes made are mostly
  straight-forward and the locking logic is a lot simpler afterwards"

* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix rcu_read_lock() leak in update_if_frozen()
  cgroup_freezer: replace freezer->lock with freezer_mutex
  cgroup: introduce task_css_is_root()

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-05-20T05:19:10+00:00

Pull timer fix from Thomas Gleixner:
 "A single bug fix for a long standing issue:

   - Updating the expiry value of a relative timer _after_ letting the
     idle logic select a target cpu for the timer based on its stale
     expiry value is outright stupid.  Thanks to Viresh for spotting the
     brainfart"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Set expiry time before switch_hrtimer_base()

perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()

2014-05-19T12:44:56+00:00

Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.

Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.

Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.

This patch should, I think, do the right thing under all
circumstances, the 'normal' cases all should never see the extra grace
period, but the two cases:

 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output() one could think an unconditional
    synchronize_rcu() would be sufficient; however...

 2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
      ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
      ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.

Cc: Paul Mackerras 
Cc: Stephane Eranian 
Cc: Andi Kleen 
Cc: "Paul E. McKenney" 
Cc: Ingo Molnar 
Cc: Frederic Weisbecker 
Cc: Mike Galbraith 
Reported-by: Alexander Shishkin 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner

perf: Prevent false warning in perf_swevent_add

2014-05-19T12:44:55+00:00

The perf cpu offline callback takes down all cpu context
events and releases swhash->swevent_hlist.

This could race with task context software event being just
scheduled on this cpu via perf_swevent_add while cpu hotplug
code already cleaned up event's data.

The race happens in the gap between the cpu notifier code
and the cpu being actually taken down. Note that only cpu
ctx events are terminated in the perf cpu hotplug code.

It's easily reproduced with:
  $ perf record -e faults perf bench sched pipe

while putting one of the cpus offline:
  # echo 0 > /sys/devices/system/cpu/cpu1/online

Console emits following warning:
  WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
  Modules linked in:
  CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G        W    3.14.0+ #256
  Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
   0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
   0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
   ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
  Call Trace:
   [] dump_stack+0x4f/0x7c
   [] warn_slowpath_common+0x8c/0xc0
   [] warn_slowpath_null+0x1a/0x20
   [] perf_swevent_add+0x18d/0x1a0
   [] event_sched_in.isra.75+0x9e/0x1f0
   [] group_sched_in+0x6a/0x1f0
   [] ? sched_clock_local+0x25/0xa0
   [] ctx_sched_in+0x1f6/0x450
   [] perf_event_sched_in+0x6b/0xa0
   [] perf_event_context_sched_in+0x7b/0xc0
   [] __perf_event_task_sched_in+0x43e/0x460
   [] ? put_lock_stats.isra.18+0xe/0x30
   [] finish_task_switch+0xb8/0x100
   [] __schedule+0x30e/0xad0
   [] ? pipe_read+0x3e2/0x560
   [] ? preempt_schedule_irq+0x3e/0x70
   [] ? preempt_schedule_irq+0x3e/0x70
   [] preempt_schedule_irq+0x44/0x70
   [] retint_kernel+0x20/0x30
   [] ? lockdep_sys_exit+0x1a/0x90
   [] lockdep_sys_exit_thunk+0x35/0x67
   [] ? sysret_check+0x5/0x56

Fixing this by tracking the cpu hotplug state and displaying
the WARN only if current cpu is initialized properly.

Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 
Cc: stable@vger.kernel.org
Reported-by: Fengguang Wu 
Signed-off-by: Jiri Olsa 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
Signed-off-by: Thomas Gleixner

perf: Limit perf_event_attr::sample_period to 63 bits

2014-05-19T12:44:55+00:00

Vince reported that using a large sample_period (one with bit 63 set)
results in wreckage since while the sample_period is fundamentally
unsigned (negative periods don't make sense) the way we implement
things very much rely on signed logic.

So limit sample_period to 63 bits to avoid tripping over this.

Reported-by: Vince Weaver 
Signed-off-by: Peter Zijlstra 
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
Signed-off-by: Thomas Gleixner

cgroup: fix rcu_read_lock() leak in update_if_frozen()

2014-05-13T15:28:30+00:00

While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
replace freezer->lock with freezer_mutex") introduced a bug in
update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
by adding rcu_read_unlock() before returning.

Signed-off-by: Tejun Heo 
Reported-by: kbuild test robot

cgroup_freezer: replace freezer->lock with freezer_mutex

2014-05-13T15:26:31+00:00

After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators requires sleepable context as
it may block on css_set_rwsem.  I missed that cgroup_freezer was
iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
errors like the following on freezer state reads and transitions.

  BUG: sleeping function called from invalid context at /work
 /os/work/kernel/locking/rwsem.c:20
  in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
  5 locks held by bash/462:
   #0:  (sb_writers#7){.+.+.+}, at: [] vfs_write+0x1a3/0x1c0
   #1:  (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0xbb/0x170
   #2:  (s_active#70){.+.+.+}, at: [] kernfs_fop_write+0xc3/0x170
   #3:  (freezer_mutex){+.+...}, at: [] freezer_write+0x61/0x1e0
   #4:  (rcu_read_lock){......}, at: [] freezer_write+0x53/0x1e0
  Preemption disabled at:[] console_unlock+0x1e4/0x460

  CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
   ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
   0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
  Call Trace:
   [] dump_stack+0x4e/0x7a
   [] __might_sleep+0x162/0x260
   [] down_read+0x24/0x60
   [] css_task_iter_start+0x27/0x70
   [] freezer_apply_state+0x5d/0x130
   [] freezer_write+0xf6/0x1e0
   [] cgroup_file_write+0xd8/0x230
   [] kernfs_fop_write+0xe7/0x170
   [] vfs_write+0xb6/0x1c0
   [] SyS_write+0x4d/0xc0
   [] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be IRQ-safe spinlock or even
per-cgroup.  In fact, given the fact that a cgroup may contain large
number of tasks, it's not a good idea to iterate over them while
holding IRQ-safe spinlock.

Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex.  This also makes the comments explaining the
intricacies of policy inheritance and the locking around it as the
states are protected by a common mutex.

The conversion is mostly straight-forward.  The followings are worth
mentioning.

* freezer_css_online() no longer needs double locking.

* freezer_attach() now performs propagation simply while holding
  freezer_mutex.  update_if_frozen() race no longer exists and the
  comment is removed.

* freezer_fork() now tests whether the task is in root cgroup using
  the new task_css_is_root() without doing rcu_read_lock/unlock().  If
  not, it grabs freezer_mutex and performs the operation.

* freezer_read() and freezer_change_state() grab freezer_mutex across
  the whole operation and pin the css while iterating so that each
  descendant processing happens in sleepable context.

Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo 
Acked-by: Li Zefan

cgroup: introduce task_css_is_root()

2014-05-13T15:26:27+00:00

Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.

Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.

v2: cgroup no longer supports modular controllers.  No need to export
    init_css_set.  Pointed out by Li.

Signed-off-by: Tejun Heo 
Acked-by: Li Zefan