summaryrefslogtreecommitdiff
path: root/kernel/workqueue.c
AgeCommit message (Collapse)AuthorFilesLines
2018-06-16workqueue: Allow retrieval of current task's work structLukas Wunner1-0/+16
commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream. Introduce a helper to retrieve the current task's work struct if it is a workqueue worker. This allows us to fix a long-standing deadlock in several DRM drivers wherein the ->runtime_suspend callback waits for a specific worker to finish and that worker in turn calls a function which waits for runtime suspend to finish. That function is invoked from multiple call sites and waiting for runtime suspend to finish is the correct thing to do except if it's executing in the context of the worker. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-01-01workqueue: replace pool->manager_arb mutex with a flagTejun Heo1-22/+15
commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream. Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep: [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted [ 1270.473240] ----------------------------------------------------- [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280 [ 1270.474994] [ 1270.474994] and this task is already holding: [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0 [ 1270.476046] which would create a new lock dependency: [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.} [ 1270.476949] [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock: [ 1270.477553] (&pool->lock/1){-.-.} ... [ 1270.488900] to a HARDIRQ-irq-unsafe lock: [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.} ... [ 1270.494735] Possible interrupt unsafe locking scenario: [ 1270.494735] [ 1270.495250] CPU0 CPU1 [ 1270.495600] ---- ---- [ 1270.495947] lock(&(&lock->wait_lock)->rlock); [ 1270.496295] local_irq_disable(); [ 1270.496753] lock(&pool->lock/1); [ 1270.497205] lock(&(&lock->wait_lock)->rlock); [ 1270.497744] <Interrupt> [ 1270.497948] lock(&pool->lock/1); , which will cause a irq inversion deadlock if the above lock scenario happens. The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking mutex while holding an irq spinlock was never safe and this problem has been around forever but it never got noticed because the only time the mutex is usually trylocked while holding irqlock making actual failures very unlikely and lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail"). Using mutex for pool->manager_arb has always been a bit of stretch. It primarily is an mechanism to arbitrate managership between workers which can easily be done with a pool flag. The only reason it became a mutex is that pool destruction path wants to exclude parallel managing operations. This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and make the destruction path wait for the current manager on a wait queue. v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2017-11-11workqueue: implicit ordered attribute should be overridableTejun Heo1-4/+9
commit 0a94efb5acbb6980d7c9ab604372d93cd507e4d8 upstream. 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automatically enabled ordered attribute for unbound workqueues w/ max_active == 1. Because ordered workqueues reject max_active and some attribute changes, this implicit ordered mode broke cases where the user creates an unbound workqueue w/ max_active == 1 and later explicitly changes the related attributes. This patch distinguishes explicit and implicit ordered setting and overrides from attribute changes if implict. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2017-11-11workqueue: restore WQ_UNBOUND/max_active==1 to be orderedTejun Heo1-0/+10
commit 5c0338c68706be53b3dc472e4308961c36e4ece1 upstream. The combination of WQ_UNBOUND and max_active == 1 used to imply ordered execution. After NUMA affinity 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues"), this is no longer true due to per-node worker pools. While the right way to create an ordered workqueue is alloc_ordered_workqueue(), the documentation has been misleading for a long time and people do use WQ_UNBOUND and max_active == 1 for ordered workqueues which can lead to subtle bugs which are very difficult to trigger. It's unlikely that we'd see noticeable performance impact by enforcing ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's automatically set __WQ_ORDERED for those workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: Alexei Potashnik <alexei@purestorage.com> Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues") Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-06-15workqueue: fix ghost PENDING flag while doing MQ IORoman Pen1-0/+29
commit 346c09f80459a3ad97df1816d6d606169a51001a upstream. The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list with the following backtrace: [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6 [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000 [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000 [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0 [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38 [ 601.350965] Call Trace: [ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 [ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80 [ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230 [ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220 [ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0 [ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 [ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110 [ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10 [ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70 [ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90 [ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0 [ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40 [ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140 [ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30 [ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70 [ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50 [ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0 [ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100 [ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0 [ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0 [ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50 The underlying device is a null_blk, with default parameters: queue_mode = MQ submit_queues = 1 Verification that nullb0 has something inflight: root@pserver8:~# cat /sys/block/nullb0/inflight 0 1 root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \; ... /sys/block/nullb0/mq/0/cpu2/rq_list CTX pending: ffff8838038e2400 ... During debug it became clear that stalled request is always inserted in the rq_list from the following path: save_stack_trace_tsk + 34 blk_mq_insert_requests + 231 blk_mq_flush_plug_list + 281 blk_flush_plug_list + 199 wait_on_page_bit + 192 __filemap_fdatawait_range + 228 filemap_fdatawait_range + 20 filemap_write_and_wait_range + 63 blkdev_fsync + 27 vfs_fsync_range + 73 blkdev_write_iter + 202 __vfs_write + 170 vfs_write + 169 kernel_write + 56 So blk_flush_plug_list() was called with from_schedule == true. If from_schedule is true, that means that finally blk_mq_insert_requests() offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue, i.e. it calls kblockd_schedule_delayed_work_on(). That means, that we race with another CPU, which is about to execute __blk_mq_run_hw_queue() work. Further debugging shows the following traces from different CPUs: CPU#0 CPU#1 ---------------------------------- ------------------------------- reqeust A inserted STORE hctx->ctx_map[0] bit marked kblockd_schedule...() returns 1 <schedule to kblockd workqueue> request B inserted STORE hctx->ctx_map[1] bit marked kblockd_schedule...() returns 0 *** WORK PENDING bit is cleared *** flush_busy_ctxs() is executed, but bit 1, set by CPU#1, is not observed As a result request B pended forever. This behaviour can be explained by speculative LOAD of hctx->ctx_map on CPU#0, which is reordered with clear of PENDING bit and executed _before_ actual STORE of bit 1 on CPU#1. The proper fix is an explicit full barrier <mfence>, which guarantees that clear of PENDING bit is to be executed before all possible speculative LOADS or STORES inside actual work function. Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Michael Wang <yun.wang@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-02-25workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookupTejun Heo1-0/+10
commit d6e022f1d207a161cd88e08ef0371554680ffc46 upstream. When looking up the pool_workqueue to use for an unbound workqueue, workqueue assumes that the target CPU is always bound to a valid NUMA node. However, currently, when a CPU goes offline, the mapping is destroyed and cpu_to_node() returns NUMA_NO_NODE. This has always been broken but hasn't triggered often enough before 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu"). After the commit, workqueue forcifully assigns the local CPU for delayed work items without explicit target CPU to fix a different issue. This widens the window where CPU can go offline while a delayed work item is pending causing delayed work items dispatched with target CPU set to an already offlined CPU. The resulting NUMA_NO_NODE mapping makes workqueue try to queue the work item on a NULL pool_workqueue and thus crash. While 874bbfe600a6 has been reverted for a different reason making the bug less visible again, it can still happen. Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node(). This is a temporary workaround. The long term solution is keeping CPU -> NODE mapping stable across CPU off/online cycles which is being worked on. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com [ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2016-02-24Revert "workqueue: make sure delayed work run in local cpu"Tejun Heo1-4/+4
commit 041bd12e272c53a35c54c13875839bcb98c999ce upstream. This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76. Workqueue used to implicity guarantee that work items queued without explicit CPU specified are put on the local CPU. Recent changes in timer broke the guarantee and led to vmstat breakage which was fixed by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU we need it to run on"). vmstat is the most likely to expose the issue and it's quite possible that there are other similar problems which are a lot more difficult to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") was applied to restore the local CPU guarnatee. Unfortunately, the change exposed a bug in timer code which got fixed by 22b886dd1018 ("timers: Use proper base migration in add_timer_on()"). Due to code restructuring, the commit couldn't be backported beyond certain point and stable kernels which only had 874bbfe600a6 started crashing. The local CPU guarantee was accidental more than anything else and we want to get rid of it anyway. As, with the vmstat case fixed, 874bbfe600a6 is causing more problems than it's fixing, it has been decided to take the chance and officially break the guarantee by reverting the commit. A debug feature will be added to force foreign CPU assignment to expose cases relying on the guarantee and fixes for the individual cases will be backported to stable as necessary. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Daniel Bilik <daniel.bilik@neosystem.cz> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shli@fb.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Daniel Bilik <daniel.bilik@neosystem.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-10-30workqueue: make sure delayed work run in local cpuShaohua Li1-4/+4
commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream. My system keeps crashing with below message. vmstat_update() schedules a delayed work in current cpu and expects the work runs in the cpu. schedule_delayed_work() is expected to make delayed work run in local cpu. The problem is timer can be migrated with NO_HZ. __queue_work() queues work in timer handler, which could run in a different cpu other than where the delayed work is scheduled. The end result is the delayed work runs in different cpu. The patch makes __queue_delayed_work records local cpu earlier. Where the timer runs doesn't change where the work runs with the change. [ 28.010131] ------------[ cut here ]------------ [ 28.010609] kernel BUG at ../mm/vmstat.c:1392! [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [ 28.011860] Modules linked in: [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W4.3.0-rc3+ #634 [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014 [ 28.014160] Workqueue: events vmstat_update [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000 [ 28.015445] RIP: 0010:[<ffffffff8115f921>] [<ffffffff8115f921>]vmstat_update+0x31/0x80 [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297 [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000 [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000 [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640 [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000 [ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000 [ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0 [ 28.020071] Stack: [ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88 [ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8 [ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340 [ 28.020071] Call Trace: [ 28.020071] [<ffffffff8106bd88>] process_one_work+0x1c8/0x540 [ 28.020071] [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540 [ 28.020071] [<ffffffff8106c214>] worker_thread+0x114/0x460 [ 28.020071] [<ffffffff8106c100>] ? process_one_work+0x540/0x540 [ 28.020071] [<ffffffff81071bf8>] kthread+0xf8/0x110 [ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200 [ 28.020071] [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70 [ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-03-23workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for ↵Tejun Heo1-4/+52
PREEMPT_NONE commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream. cancel[_delayed]_work_sync() are implemented using __cancel_work_timer() which grabs the PENDING bit using try_to_grab_pending() and then flushes the work item with PENDING set to prevent the on-going execution of the work item from requeueing itself. try_to_grab_pending() can always grab PENDING bit without blocking except when someone else is doing the above flushing during cancelation. In that case, try_to_grab_pending() returns -ENOENT. In this case, __cancel_work_timer() currently invokes flush_work(). The assumption is that the completion of the work item is what the other canceling task would be waiting for too and thus waiting for the same condition and retrying should allow forward progress without excessive busy looping Unfortunately, this doesn't work if preemption is disabled or the latter task has real time priority. Let's say task A just got woken up from flush_work() by the completion of the target work item. If, before task A starts executing, task B gets scheduled and invokes __cancel_work_timer() on the same work item, its try_to_grab_pending() will return -ENOENT as the work item is still being canceled by task A and flush_work() will also immediately return false as the work item is no longer executing. This puts task B in a busy loop possibly preventing task A from executing and clearing the canceling state on the work item leading to a hang. task A task B worker executing work __cancel_work_timer() try_to_grab_pending() set work CANCELING flush_work() block for work completion completion, wakes up A __cancel_work_timer() while (forever) { try_to_grab_pending() -ENOENT as work is being canceled flush_work() false as work is no longer executing } This patch removes the possible hang by updating __cancel_work_timer() to explicitly wait for clearing of CANCELING rather than invoking flush_work() after try_to_grab_pending() fails with -ENOENT. Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com v3: bit_waitqueue() can't be used for work items defined in vmalloc area. Switched to custom wake function which matches the target work item and exclusive wait and wakeup. v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if the target bit waitqueue has wait_bit_queue's on it. Use DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu Vizoso. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rabin Vincent <rabin.vincent@axis.com> Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com> Tested-by: Jesper Nilsson <jesper.nilsson@axis.com> Tested-by: Rabin Vincent <rabin.vincent@axis.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-02-04workqueue: fix subtle pool management issue which can stall whole worker_poolTejun Heo1-18/+9
commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream. A worker_pool's forward progress is guaranteed by the fact that the last idle worker assumes the manager role to create more workers and summon the rescuers if creating workers doesn't succeed in timely manner before proceeding to execute work items. This manager role is implemented in manage_workers(), which indicates whether the worker may proceed to work item execution with its return value. This is necessary because multiple workers may contend for the manager role, and, if there already is a manager, others should proceed to work item execution. Unfortunately, the function also indicates that the worker may proceed to work item execution if need_to_create_worker() is false at the head of the function. need_to_create_worker() tests the following conditions. pending work items && !nr_running && !nr_idle The first and third conditions are protected by pool->lock and thus won't change while holding pool->lock; however, nr_running can change asynchronously as other workers block and resume and while it's likely to be zero, as someone woke this worker up in the first place, some other workers could have become runnable inbetween making it non-zero. If this happens, manage_worker() could return false even with zero nr_idle making the worker, the last idle one, proceed to execute work items. If then all workers of the pool end up blocking on a resource which can only be released by a work item which is pending on that pool, the whole pool can deadlock as there's no one to create more workers or summon the rescuers. This patch fixes the problem by removing the early exit condition from maybe_create_worker() and making manage_workers() return false iff there's already another manager, which ensures that the last worker doesn't start executing work items. We can leave the early exit condition alone and just ignore the return value but the only reason it was put there is because the manage_workers() used to perform both creations and destructions of workers and thus the function may be invoked while the pool is trying to reduce the number of workers. Now that manage_workers() is called only when more workers are needed, the only case this early exit condition is triggered is rare race conditions rendering it pointless. Tested with simulated workload and modified workqueue code which trigger the pool deadlock reliably without this patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Eric Sandeen <sandeen@sandeen.net> Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net Cc: Dave Chinner <david@fromorbit.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> [ luis: backported to 3.16: - maybe_create_worker() is now void, 'return' instead of 'return true' ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2014-07-07workqueue: zero cpumask of wq_numa_possible_cpumask on initYasuaki Ishimatsu1-1/+1
When hot-adding and onlining CPU, kernel panic occurs, showing following call trace. BUG: unable to handle kernel paging request at 0000000000001d08 IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10 PGD 0 Oops: 0000 [#1] SMP ... Call Trace: [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50 [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0 [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0 [<ffffffff811926f1>] new_slab+0x91/0x300 [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482 [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0 [<ffffffff810a3c78>] ? load_balance+0x218/0x890 [<ffffffff8101a679>] ? sched_clock+0x9/0x10 [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10 [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200 [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0 [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60 [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140 [<ffffffff8105d0ec>] do_fork+0xbc/0x360 [<ffffffff8105d3b6>] kernel_thread+0x26/0x30 [<ffffffff81086652>] kthreadd+0x2c2/0x300 [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60 [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0 [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60 In my investigation, I found the root cause is wq_numa_possible_cpumask. All entries of wq_numa_possible_cpumask is allocated by alloc_cpumask_var_node(). And these entries are used without initializing. So these entries have wrong value. When hot-adding and onlining CPU, wq_update_unbound_numa() is called. wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq() calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set as follow: 3592 /* if cpumask is contained inside a NUMA node, we belong to that node */ 3593 if (wq_numa_enabled) { 3594 for_each_node(node) { 3595 if (cpumask_subset(pool->attrs->cpumask, 3596 wq_numa_possible_cpumask[node])) { 3597 pool->node = node; 3598 break; 3599 } 3600 } 3601 } But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong node is selected. As a result, kernel panic occurs. By this patch, all entries of wq_numa_possible_cpumask are allocated by zalloc_cpumask_var_node to initialize them. And the panic disappeared. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
2014-06-23workqueue: fix dev_set_uevent_suppress() imbalanceMaxime Bizon1-0/+1
Uevents are suppressed during attributes registration, but never restored, so kobject_uevent() does nothing. Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
2014-06-09Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-301/+147
Pull workqueue updates from Tejun Heo: "Lai simplified worker destruction path and internal workqueue locking and there are some other minor changes. Except for the removal of some long-deprecated interfaces which haven't had any in-kernel user for quite a while, there shouldn't be any difference to workqueue users" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info workqueue: remove the confusing POOL_FREEZING workqueue: rename first_worker() to first_idle_worker() workqueue: remove unused work_clear_pending() workqueue: remove unused WORK_CPU_END workqueue: declare system_highpri_wq workqueue: use generic attach/detach routine for rescuers workqueue: separate pool-attaching code out from create_worker() workqueue: rename manager_mutex to attach_mutex workqueue: narrow the protection range of manager_mutex workqueue: convert worker_idr to worker_ida workqueue: separate iteration role from worker_idr workqueue: destroy worker directly in the idle timeout handler workqueue: async worker destruction workqueue: destroy_worker() should destroy idle workers only workqueue: use manager lock only to protect worker_idr workqueue: Remove deprecated system_nrt[_freezable]_wq workqueue: Remove deprecated flush[_delayed]_work_sync() kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
2014-05-28kernel/workqueue.c: pr_warning/pr_warn & printk/pr_infoValdis Kletnieks1-1/+1
This commit did an incorrect printk->pr_info conversion. If we were converting to pr_info() we should lose the log_level parameter. The problem is that this is called (indirectly) by show_regs_print_info(), which is called with various log_levels (from _INFO clear to _EMERG). So we leave it as a printk() call so the desired log_level is applied. Not a full revert, as the other half of the patch is correct. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-22workqueue: remove the confusing POOL_FREEZINGLai Jiangshan1-25/+7
Currently, the global freezing state is propagated to worker_pools via POOL_FREEZING and then to each workqueue; however, the middle step - propagation through worker_pools - can be skipped as long as one or more max_active adjustments happens for each workqueue after the update to the global state is visible. The global workqueue freezing state and the max_active adjustments during workqueue creation and [un]freezing are serialized with wq_pool_mutex, so it's trivial to guarantee that max_actives stay in sync with global freezing state. POOL_FREEZING is unnecessary and makes the code more confusing and complicates freeze_workqueues_begin() and thaw_workqueues() by requiring them to walk through all pools. Remove POOL_FREEZING and use workqueue_freezing directly instead. tj: Description and comment updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-22workqueue: rename first_worker() to first_idle_worker()Lai Jiangshan1-5/+5
first_worker() actually returns the first idle workers, the name first_idle_worker() which is self-commnet will be better. All the callers of first_worker() expect it returns an idle worker, the name first_idle_worker() with "idle" notation makes reviewers happier. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-22Merge tag 'v3.15-rc6' into sched/core, to pick up the latest fixesIngo Molnar1-6/+30
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-20workqueue: use generic attach/detach routine for rescuersLai Jiangshan1-66/+8
There are several problems with the code that rescuers use to bind themselve to the target pool's cpumask. 1) It is very different from how the normal workers bind to cpumask, increasing code complexity and maintenance overhead. 2) The code of cpu-binding for rescuers is complicated. 3) If one or more cpu hotplugs happen while a rescuer is processing its scheduled work items, the rescuer may not stay bound to the cpumask of the pool. This is an allowed behavior, but is still hairy. It will be better if the cpumask of the rescuer is always kept synchronized with the pool across cpu hotplugs. Using generic attach/detach routine will solve the above problems and results in much simpler code. tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: separate pool-attaching code out from create_worker()Lai Jiangshan1-23/+38
Currently, the code to attach a new worker to its pool is embedded in create_worker(). Separating this code out will make the codes clearer and will allow rescuers to share the code path later. tj: Description and comment updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: rename manager_mutex to attach_mutexLai Jiangshan1-22/+22
manager_mutex is only used to protect the attaching for the pool and the pool->workers list. It protects the pool->workers and operations based on this list, such as: cpu-binding for the workers in the pool->workers the operations to set/clear WORKER_UNBOUND So let's rename manager_mutex to attach_mutex to better reflect its role. This patch is a pure rename. tj: Minor command and description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: narrow the protection range of manager_mutexLai Jiangshan1-30/+5
In create_worker(), as pool->worker_ida now uses ida_simple_get()/ida_simple_put() and doesn't require external synchronization, it doesn't need manager_mutex. struct worker allocation and kthread allocation are not visible by any one before attached, so they don't need manager_mutex either. The above operations are before the attaching operation which attaches the worker to the pool. Between attaching and starting the worker, the worker is already attached to the pool, so the cpu hotplug will handle cpu-binding for the worker correctly and we don't need the manager_mutex after attaching. The conclusion is that only the attaching operation needs manager_mutex, so we narrow the protection section of manager_mutex in create_worker(). Some comments about manager_mutex are removed, because we will rename it to attach_mutex and add worker_attach_to_pool() later which will be self-explanatory. tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: convert worker_idr to worker_idaLai Jiangshan1-12/+8
We no longer iterate workers via worker_idr and worker_idr is used only for allocating/freeing ID, so we can convert it to worker_ida. By using ida_simple_get/remove(), worker_ida doesn't require external synchronization, so we don't need manager_mutex to protect it and the ID-removal code is allowed to be moved out from worker_detach_from_pool(). In a later patch, worker_detach_from_pool() will be used in rescuers which don't have IDs, so we move the ID-removal code out from worker_detach_from_pool() into worker_thread(). tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: separate iteration role from worker_idrLai Jiangshan1-13/+15
worker_idr has the iteration (iterating for attached workers) and worker ID duties. These two duties don't have to be tied together. We can separate them and use a list for tracking attached workers and iteration. Before this separation, it wasn't possible to add rescuer workers to worker_idr due to rescuer workers couldn't allocate ID dynamically because ID-allocation depends on memory-allocation, which rescuer can't depend on. After separation, we can easily add the rescuer workers to the list for iteration without any memory-allocation. It is required when we attach the rescuer worker to the pool in later patch. tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: destroy worker directly in the idle timeout handlerLai Jiangshan1-62/+5
Since destroy_worker() doesn't need to sleep nor require manager_mutex, destroy_worker() can be directly called in the idle timeout handler, it helps us remove POOL_MANAGE_WORKERS and maybe_destroy_worker() and simplify the manage_workers() After POOL_MANAGE_WORKERS is removed, worker_thread() doesn't need to test whether it needs to manage after processed works. So we can remove the test branch. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-05-20workqueue: async worker destructionLai Jiangshan1-21/+41
worker destruction includes these parts of code: adjust pool's stats remove the worker from idle list detach the worker from the pool kthread_stop() to wait for the worker's task exit free the worker struct We can find out that there is no essential work to do after kthread_stop(), which means destroy_worker() doesn't need to wait for the worker's task exit, so we can remove kthread_stop() and free the worker struct in the worker exiting path. However, put_unbound_pool() still needs to sync the all the workers' destruction before destroying the pool; otherwise, the workers may access to the invalid pool when they are exiting. So we also move the code of "detach the worker" to the exiting path and let put_unbound_pool() to sync with this code via detach_completion. The code of "detach the worker" is wrapped in a new function "worker_detach_from_pool()" although worker_detach_from_pool() is only called once (in worker_thread()) after this patch, but we need to wrap it for these reasons: 1) The code of "detach the worker" is not short enough to unfold them in worker_thread(). 2) the name of "worker_detach_from_pool()" is self-comment, and we add some comments above the function. 3) it will be shared by rescuer in later patch which allows rescuer and normal thread use the same attach/detach frameworks. The worker id is freed when detaching which happens before the worker is fully dead, but this id of the dying worker may be re-used for a new worker, so the dying worker's task name is changed to "worker/dying" to avoid two or several workers having the same name. Since "detach the worker" is moved out from destroy_worker(), destroy_worker() doesn't require manager_mutex, so the "lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is removed, and destroy_worker() is not protected by manager_mutex in put_unbound_pool(). tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: destroy_worker() should destroy idle workers onlyLai Jiangshan1-11/+8
We used to have the CPU online failure path where a worker is created and then destroyed without being started. A worker was created for the CPU coming online and if the online operation failed the created worker was shut down without being started. But this behavior was changed. The first worker is created and started at the same time for the CPU coming online. It means that we had already ensured in the code that destroy_worker() destroys only idle workers and we don't want to allow it to destroy any non-idle worker in the future. Otherwise, it may be buggy and it may be extremely hard to check. We should force destroy_worker() to destroy only idle workers explicitly. Since destroy_worker() destroys only idle workers, this patch does not change any functionality. We just need to update the comments and the sanity check code. In the sanity check code, we will refuse to destroy the worker if !(worker->flags & WORKER_IDLE). If the worker entered idle which means it is already started, so we remove the check of "worker->flags & WORKER_STARTED", after this removal, WORKER_STARTED is totally unneeded, so we remove WORKER_STARTED too. In the comments for create_worker(), "Create a new worker which is bound..." is changed to "... which is attached..." due to we change the name of this behavior to attaching. tj: Minor description / comment updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20workqueue: use manager lock only to protect worker_idrLai Jiangshan1-28/+6
worker_idr is highly bound to managers and is always/only accessed in manager lock context. So we don't need pool->lock for it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-12kernel/workqueue.c: pr_warning/pr_warn & printk/pr_infoFabian Frederick1-3/+3
tj: Refreshed on top of wq/for-3.16. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-18workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if ↵Daeseok Youn1-6/+2
the target cpumask equals wq's wq_update_unbound_numa(), when it's decided that the newly updated cpumask equals the default, looks at whether the current pwq is already the default one and skips setting pwq to the default one. This extra step is unnecessary and we can always jump to use_dfl_pwq instead. Simplify the code by removing the conditional. This doesn't make any functional difference. Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-18workqueue: fix a possible race condition between rescuer and pwq-releaseLai Jiangshan1-0/+12
There is a race condition between rescuer_thread() and pwq_unbound_release_workfn(). Even after a pwq is scheduled for rescue, the associated work items may be consumed by any worker. If all of them are consumed before the rescuer gets to them and the pwq's base ref was put due to attribute change, the pwq may be released while still being linked on @wq->maydays list making the rescuer dereference already freed pwq later. Make send_mayday() pin the target pwq until the rescuer is done with it. tj: Updated comment and patch description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v3.10+
2014-04-18workqueue: make rescuer_thread() empty wq->maydays list before exitingLai Jiangshan1-5/+16
After a @pwq is scheduled for emergency execution, other workers may consume the affectd work items before the rescuer gets to them. This means that a workqueue many have pwqs queued on @wq->maydays list while not having any work item pending or in-flight. If destroy_workqueue() executes in such condition, the rescuer may exit without emptying @wq->maydays. This currently doesn't cause any actual harm. destroy_workqueue() can safely destroy all the involved data structures whether @wq->maydays is populated or not as nobody access the list once the rescuer exits. However, this is nasty and makes future development difficult. Let's update rescuer_thread() so that it empties @wq->maydays after seeing should_stop to guarantee that the list is empty on rescuer exit. tj: Updated comment and patch description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v3.10+
2014-04-18sched, treewide: Replace hardcoded nice values with MIN_NICE/MAX_NICEDongsheng Yang1-3/+3
Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com Cc: devel@driverdev.osuosl.org Cc: devicetree@vger.kernel.org Cc: fcoe-devel@open-fcoe.org Cc: linux390@de.ibm.com Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-s390@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: nbd-general@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: openipmi-developer@lists.sourceforge.net Cc: qla2xxx-upstream@qlogic.com Cc: linux-arch@vger.kernel.org [ Consolidated the patches, twiddled the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-16workqueue: fix bugs in wq_update_unbound_numa() failure pathDaeseok Youn1-1/+2
wq_update_unbound_numa() failure path has the following two bugs. - alloc_unbound_pwq() is called without holding wq->mutex; however, if the allocation fails, it jumps to out_unlock which tries to unlock wq->mutex. - The function should switch to dfl_pwq on failure but didn't do so after alloc_unbound_pwq() failure. Fix it by regrabbing wq->mutex and jumping to use_dfl_pwq on alloc_unbound_pwq() failure. Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues")
2014-04-01Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer changes from Thomas Gleixner: "This assorted collection provides: - A new timer based timer broadcast feature for systems which do not provide a global accessible timer device. That allows those systems to put CPUs into deep idle states where the per cpu timer device stops. - A few NOHZ_FULL related improvements to the timer wheel - The usual updates to timer devices found in ARM SoCs - Small improvements and updates all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) tick: Remove code duplication in tick_handle_periodic() tick: Fix spelling mistake in tick_handle_periodic() x86: hpet: Use proper destructor for delayed work workqueue: Provide destroy_delayed_work_on_stack() clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS timer: Remove code redundancy while calling get_nohz_timer_target() hrtimer: Rearrange comments in the order struct members are declared timer: Use variable head instead of &work_list in __run_timers() clocksource: exynos_mct: silence a static checker warning arm: zynq: Add support for cpufreq arm: zynq: Don't use arm_global_timer with cpufreq clocksource/cadence_ttc: Overhaul clocksource frequency adjustment clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI sh: Remove Kconfig entries for TMU, CMT and MTU2 ARM: shmobile: Remove CMT, TMU and S