summaryrefslogtreecommitdiff
path: root/kernel/workqueue.c
AgeCommit message (Collapse)AuthorFilesLines
2021-08-04workqueue: fix UAF in pwq_unbound_release_workfn()Yang Yingliang1-7/+13
commit b42b0bddcbc87b4c66f6497f66fc72d52b712aa7 upstream. I got a UAF report when doing fuzz test: [ 152.880091][ T8030] ================================================================== [ 152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190 [ 152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030 [ 152.883578][ T8030] [ 152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249 [ 152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn [ 152.887358][ T8030] Call Trace: [ 152.887837][ T8030] dump_stack_lvl+0x75/0x9b [ 152.888525][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.889371][ T8030] print_address_description.constprop.10+0x48/0x70 [ 152.890326][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.891163][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.891999][ T8030] kasan_report.cold.15+0x82/0xdb [ 152.892740][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.893594][ T8030] __asan_load4+0x69/0x90 [ 152.894243][ T8030] pwq_unbound_release_workfn+0x50/0x190 [ 152.895057][ T8030] process_one_work+0x47b/0x890 [ 152.895778][ T8030] worker_thread+0x5c/0x790 [ 152.896439][ T8030] ? process_one_work+0x890/0x890 [ 152.897163][ T8030] kthread+0x223/0x250 [ 152.897747][ T8030] ? set_kthread_struct+0xb0/0xb0 [ 152.898471][ T8030] ret_from_fork+0x1f/0x30 [ 152.899114][ T8030] [ 152.899446][ T8030] Allocated by task 8884: [ 152.900084][ T8030] kasan_save_stack+0x21/0x50 [ 152.900769][ T8030] __kasan_kmalloc+0x88/0xb0 [ 152.901416][ T8030] __kmalloc+0x29c/0x460 [ 152.902014][ T8030] alloc_workqueue+0x111/0x8e0 [ 152.902690][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0 [ 152.903459][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0 [ 152.904198][ T8030] scrub_workers_get+0x1e8/0x490 [ 152.904929][ T8030] btrfs_scrub_dev+0x1b9/0x9c0 [ 152.905599][ T8030] btrfs_ioctl+0x122c/0x4e50 [ 152.906247][ T8030] __x64_sys_ioctl+0x137/0x190 [ 152.906916][ T8030] do_syscall_64+0x34/0xb0 [ 152.907535][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 152.908365][ T8030] [ 152.908688][ T8030] Freed by task 8884: [ 152.909243][ T8030] kasan_save_stack+0x21/0x50 [ 152.909893][ T8030] kasan_set_track+0x20/0x30 [ 152.910541][ T8030] kasan_set_free_info+0x24/0x40 [ 152.911265][ T8030] __kasan_slab_free+0xf7/0x140 [ 152.911964][ T8030] kfree+0x9e/0x3d0 [ 152.912501][ T8030] alloc_workqueue+0x7d7/0x8e0 [ 152.913182][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0 [ 152.913949][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0 [ 152.914703][ T8030] scrub_workers_get+0x1e8/0x490 [ 152.915402][ T8030] btrfs_scrub_dev+0x1b9/0x9c0 [ 152.916077][ T8030] btrfs_ioctl+0x122c/0x4e50 [ 152.916729][ T8030] __x64_sys_ioctl+0x137/0x190 [ 152.917414][ T8030] do_syscall_64+0x34/0xb0 [ 152.918034][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 152.918872][ T8030] [ 152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00 [ 152.919203][ T8030] which belongs to the cache kmalloc-512 of size 512 [ 152.921155][ T8030] The buggy address is located 256 bytes inside of [ 152.921155][ T8030] 512-byte region [ffff88810d31bc00, ffff88810d31be00) [ 152.922993][ T8030] The buggy address belongs to the page: [ 152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318 [ 152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0 [ 152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff) [ 152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80 [ 152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 [ 152.929890][ T8030] page dumped because: kasan: bad access detected [ 152.930759][ T8030] [ 152.931076][ T8030] Memory state around the buggy address: [ 152.931851][ T8030] ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.932967][ T8030] ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.935189][ T8030] ^ [ 152.935763][ T8030] ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.936847][ T8030] ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 152.937940][ T8030] ================================================================== If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq() which invoke a work queue to call pwq_unbound_release_workfn() and use the 'wq'. The 'wq' allocated in alloc_workqueue() will be freed in error path when apply_wqattrs_prepare() fails. So it will lead a UAF. CPU0 CPU1 alloc_workqueue() alloc_and_link_pwqs() apply_wqattrs_prepare() fails apply_wqattrs_cleanup() schedule_work(&pwq->unbound_release_work) kfree(wq) worker_thread() pwq_unbound_release_workfn() <- trigger uaf here If apply_wqattrs_prepare() fails, the new pwq are not linked, it doesn't hold any reference to the 'wq', 'wq' is invalid to access in the worker, so add check pwq if linked to fix this. Fixes: 2d5f0764b526 ("workqueue: split apply_workqueue_attrs() into 3 stages") Cc: stable@vger.kernel.org # v4.2+ Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-16wq: handle VM suspension in stall detectionSergey Senozhatsky1-2/+10
[ Upstream commit 940d71c6462e8151c78f28e4919aa8882ff2054e ] If VCPU is suspended (VM suspend) in wq_watchdog_timer_fn() then once this VCPU resumes it will see the new jiffies value, while it may take a while before IRQ detects PVCLOCK_GUEST_STOPPED on this VCPU and updates all the watchdogs via pvclock_touch_watchdogs(). There is a small chance of misreported WQ stalls in the meantime, because new jiffies is time_after() old 'ts + thresh'. wq_watchdog_timer_fn() { for_each_pool(pool, pi) { if (time_after(jiffies, ts + thresh)) { pr_emerg("BUG: workqueue lockup - pool"); } } } Save jiffies at the beginning of this function and use that value for stall detection. If VM gets suspended then we continue using "old" jiffies value and old WQ touch timestamps. If IRQ at some point restarts the stall detection cycle (pvclock_touch_watchdogs()) then old jiffies will always be before new 'ts + thresh'. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-16workqueue: Move the position of debug_work_activate() in __queue_work()Zqiang1-1/+1
[ Upstream commit 0687c66b5f666b5ad433f4e94251590d9bc9d10e ] The debug_work_activate() is called on the premise that the work can be inserted, because if wq be in WQ_DRAINING status, insert work may be failed. Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability") Signed-off-by: Zqiang <qiang.zhang@windriver.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12workqueue: Kick a worker based on the actual activation of delayed worksYunfeng Ye1-3/+10
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ] In realtime scenario, We do not want to have interference on the isolated cpu cores. but when invoking alloc_workqueue() for percpu wq on the housekeeping cpu, it kick a kworker on the isolated cpu. alloc_workqueue pwq_adjust_max_active wake_up_worker The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped" So it is unnecessary to kick a kworker for percpu's wq when invoking alloc_workqueue(). this patch only kick a worker based on the actual activation of delayed works. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-20workqueue: don't use wq_select_unbound_cpu() for bound worksHillf Danton1-6/+8
commit aa202f1f56960c60e7befaa0f49c72b8fa11b0a8 upstream. wq_select_unbound_cpu() is designed for unbound workqueues only, but it's wrongly called when using a bound workqueue too. Fixing this ensures work queued to a bound workqueue with cpu=WORK_CPU_UNBOUND always runs on the local CPU. Before, that would happen only if wq_unbound_cpumask happened to include it (likely almost always the case), or was empty, or we got lucky with forced round-robin placement. So restricting /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's CPUs would cause some bound work items to run unexpectedly there. Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Hillf Danton <hdanton@sina.com> [dj: massage changelog] Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17workqueue: Fix missing kfree(rescuer) in destroy_workqueue()Tejun Heo1-0/+1
commit 8efe1223d73c218ce7e8b2e0e9aadb974b582d7f upstream. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Qian Cai <cai@lca.pw> Fixes: def98c84b6cd ("workqueue: Fix spurious sanity check failures in destroy_workqueue()") Cc: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17workqueue: Fix pwq ref leak in rescuer_thread()Tejun Heo1-3/+10
commit e66b39af00f426b3356b96433d620cb3367ba1ff upstream. 008847f66c3 ("workqueue: allow rescuer thread to do more work.") made the rescuer worker requeue the pwq immediately if there may be more work items which need rescuing instead of waiting for the next mayday timer expiration. Unfortunately, it doesn't check whether the pwq is already on the mayday list and unconditionally gets the ref and moves it onto the list. This doesn't corrupt the list but creates an additional reference to the pwq. It got queued twice but will only be removed once. This leak later can trigger pwq refcnt warning on workqueue destruction and prevent freeing of the workqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Williams, Gerald S" <gerald.s.williams@intel.com> Cc: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org # v3.19+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17workqueue: Fix spurious sanity check failures in destroy_workqueue()Tejun Heo1-5/+19
commit def98c84b6cdf2eeea19ec5736e90e316df5206b upstream. Before actually destrying a workqueue, destroy_workqueue() checks whether it's actually idle. If it isn't, it prints out a bunch of warning messages and leaves the workqueue dangling. It unfortunately has a couple issues. * Mayday list queueing increments pwq's refcnts which gets detected as busy and fails the sanity checks. However, because mayday list queueing is asynchronous, this condition can happen without any actual work items left in the workqueue. * Sanity check failure leaves the sysfs interface behind too which can lead to init failure of newer instances of the workqueue. This patch fixes the above two by * If a workqueue has a rescuer, disable and kill the rescuer before sanity checks. Disabling and killing is guaranteed to flush the existing mayday list. * Remove sysfs interface before sanity checks. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Marcin Pawlowski <mpawlowski@fb.com> Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05watchdog: Mark watchdog touch functions as notraceVincent Whitchurch1-1/+1
commit cb9d7fd51d9fbb329d182423bd7b92d0f8cb0e01 upstream. Some architectures need to use stop_machine() to patch functions for ftrace, and the assumption is that the stopped CPUs do not make function calls to traceable functions when they are in the stopped state. Commit ce4f06dcbb5d ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE") added calls to the watchdog touch functions from the stopped CPUs and those functions lack notrace annotations. This leads to crashes when enabling/disabling ftrace on ARM kernels built with the Thumb-2 instruction set. Fix it by adding the necessary notrace annotations. Fixes: ce4f06dcbb5d ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: oleg@redhat.com Cc: tj@kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180821152507.18313-1-vincent.whitchurch@axis.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30workqueue: use put_device() instead of kfree()Arvind Yadav1-1/+1
[ Upstream commit 537f4146c53c95aac977852b371bafb9c6755ee1 ] Never directly free @dev after calling device_register(), even if it returned an error! Always use put_device() to give up the reference initialized in this function instead. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15workqueue: Allow retrieval of current task's work structLukas Wunner1-0/+16
commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream. Introduce a helper to retrieve the current task's work struct if it is a workqueue worker. This allows us to fix a long-standing deadlock in several DRM drivers wherein the ->runtime_suspend callback waits for a specific worker to finish and that worker in turn calls a function which waits for runtime suspend to finish. That function is invoked from multiple call sites and waiting for runtime suspend to finish is the correct thing to do except if it's executing in the context of the worker. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23workqueue: avoid hard lockups in show_workqueue_state()Sergey Senozhatsky1-0/+13
commit 62635ea8c18f0f62df4cc58379e4f1d33afd5801 upstream. show_workqueue_state() can print out a lot of messages while being in atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console device is slow it may end up triggering NMI hard lockup watchdog. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-10workqueue: replace pool->manager_arb mutex with a flagTejun Heo1-22/+15
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep: [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted [ 1270.473240] ----------------------------------------------------- [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280 [ 1270.474994] [ 1270.474994] and this task is already holding: [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0 [ 1270.476046] which would create a new lock dependency: [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.} [ 1270.476949] [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock: [ 1270.477553] (&pool->lock/1){-.-.} ... [ 1270.488900] to a HARDIRQ-irq-unsafe lock: [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.} ... [ 1270.494735] Possible interrupt unsafe locking scenario: [ 1270.494735] [ 1270.495250] CPU0 CPU1 [ 1270.495600] ---- ---- [ 1270.495947] lock(&(&lock->wait_lock)->rlock); [ 1270.496295] local_irq_disable(); [ 1270.496753] lock(&pool->lock/1); [ 1270.497205] lock(&(&lock->wait_lock)->rlock); [ 1270.497744] <Interrupt> [ 1270.497948] lock(&pool->lock/1); , which will cause a irq inversion deadlock if the above lock scenario happens. The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking mutex while holding an irq spinlock was never safe and this problem has been around forever but it never got noticed because the only time the mutex is usually trylocked while holding irqlock making actual failures very unlikely and lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail"). Using mutex for pool->manager_arb has always been a bit of stretch. It primarily is an mechanism to arbitrate managership between workers which can easily be done with a pool flag. The only reason it became a mutex is that pool destruction path wants to exclude parallel managing operations. This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and make the destruction path wait for the current manager on a wait queue. v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: stable@vger.kernel.org
2017-09-06Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-3/+3
Pull workqueue updates from Tejun Heo: "Nothing major. I introduced a flag collsion bug during v4.13 cycle which is fixed in this pull request. Fortunately, the flag is for debugging / verification and the bug isn't critical" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix flag collision workqueue: Use TASK_IDLE workqueue: fix path to documentation workqueue: doc change for ST behavior on NUMA systems
2017-09-05Merge branch 'for-4.13-fixes' into for-4.14Tejun Heo1-4/+26
2017-08-29locking/lockdep: Untangle xhlock history save/restore from task independencePeter Zijlstra1-5/+4
Where XHLOCK_{SOFT,HARD} are save/restore points in the xhlocks[] to ensure the temporal IRQ events don't interact with task state, the XHLOCK_PROC is a fundament different beast that just happens to share the interface. The purpose of XHLOCK_PROC is to annotate independent execution inside one task. For example workqueues, each work should appear to run in its own 'pristine' 'task'. Remove XHLOCK_PROC in favour of its own interface to avoid confusion. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: david@fromorbit.com Cc: johannes@sipsolutions.net Cc: kernel-team@lge.com Cc: oleg@redhat.com Cc: tj@kernel.org Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25locking/lockdep: Fix workqueue crossrelease annotationPeter Zijlstra1-1/+22
The new completion/crossrelease annotations interact unfavourable with the extant flush_work()/flush_workqueue() annotations. The problem is that when a single work class does: wait_for_completion(&C) and complete(&C) in different executions, we'll build dependencies like: lock_map_acquire(W) complete_acquire(C) and lock_map_acquire(W) complete_release(C) which results in the dependency chain: W->C->W, which lockdep thinks spells deadlock, even though there is no deadlock potential since works are ran concurrently. One possibility would be to change the work 'lock' to recursive-read, but that would mean hitting a lockdep limitation on recursive locks. Also, unconditinoally switching to recursive-read here would fail to detect the actual deadlock on single-threaded workqueues, which do have a problem with this. For now, forcefully disregard these locks for crossrelease. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: byungchul.park@lge.com Cc: david@fromorbit.com Cc: johannes@sipsolutions.net Cc: oleg@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25workqueue/lockdep: 'Fix' flush_work() annotationPeter Zijlstra1-9/+11
The flush_work() annotation as introduced by commit: e159489baa71 ("workqueue: relax lockdep annotation on flush_work()") hits on the lockdep problem with recursive read locks. The situation as described is: Work W1: Work W2: Task: ARR(Q) ARR(Q) flush_workqueue(Q) A(W1) A(W2) A(Q) flush_work(W2) R(Q) A(W2) R(W2) if (special) A(Q) else ARR(Q) R(Q) where: A - acquire, ARR - acquire-read-recursive, R - release. Where under 'special' conditions we want to trigger a lock recursion deadlock, but otherwise allow the flush_work(). The allowing is done by using recursive read locks (ARR), but lockdep is broken for recursive stuff. However, there appears to be no need to acquire the lock if we're not 'special', so if we remove the 'else' clause things become much simpler and no longer need the recursion thing at all. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: byungchul.park@lge.com Cc: david@fromorbit.com Cc: johannes@sipsolutions.net Cc: oleg@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-23workqueue: Use TASK_IDLEPeter Zijlstra1-2/+2
Workqueues don't use signals, it (ab)uses TASK_INTERRUPTIBLE to avoid increasing the loadavg numbers. We've 'recently' introduced TASK_IDLE for this case: 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE") use it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-08-17locking/lockdep: Explicitly initialize wq_barrier::done::mapBoqun Feng1-1/+10
With the new lockdep crossrelease feature, which checks completions usage, a false positive is reported in the workqueue code: > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released > Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released > Task C : acquired of lock#3 -> wait for completion of barr->done > (Task C is in lru_add_drain_all_cpuslocked()) > Worker D : wait for wfc.work to be released -> will complete barr->done Such a dead lock can not happen because Task C's barr->done and Worker D's barr->done can not be the same instance. The reason of this false positive is we initialize all wq_barrier::done at insert_wq_barrier() via init_completion(), which makes them belong to the same lock class, therefore, impossible circles are reported. To fix this, explicitly initialize the lockdep map for wq_barrier::done in insert_wq_barrier(), so that the lock class key of wq_barrier::done is a subkey of the corresponding work_struct, as a result we won't build a dependency between a wq_barrier with a unrelated work, and we can differ wq barriers based on the related works, so the false positive above is avoided. Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE to make the code simple and away from unnecessary #ifdefs. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/lockdep: Implement the 'crossrelease' featureByungchul Park1-0/+2
Lockdep is a runtime locking correctness validator that detects and reports a deadlock or its possibility by checking dependencies between locks. It's useful since it does not report just an actual deadlock but also the possibility of a deadlock that has not actually happened yet. That enables problems to be fixed before they affect real systems. However, this facility is only applicable to typical locks, such as spinlocks and mutexes, which are normally released within the context in which they were acquired. However, synchronization primitives like page locks or completions, which are allowed to be released in any context, also create dependencies and can cause a deadlock. So lockdep should track these locks to do a better job. The 'crossrelease' implementation makes these primitives also be tracked. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Cc: willy@infradead.org Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-07workqueue: fix path to documentationBenjamin Peterson1-1/+1
Signed-off-by: Benjamin Peterson <bp@benjamin.pe> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-28workqueue: Work around edge cases for calc of pool's cpumaskMichael Bringmann1-0/+7
There is an underlying assumption/trade-off in many layers of the Linux system that CPU <-> node mapping is static. This is despite the presence of features like NUMA and 'hotplug' that support the dynamic addition/ removal of fundamental system resources like CPUs and memory. PowerPC systems, however, do provide extensive features for the dynamic change of resources available to a system. Currently, there is little or no synchronization protection around the updating of the CPU <-> node mapping, and the export/update of this information for other layers / modules. In systems which can change this mapping during 'hotplug', like PowerPC, the information is changing underneath all layers that might reference it. This patch attempts to ensure that a valid, usable cpumask attribute is used by the workqueue infrastructure when setting up new resource pools. It prevents a crash that has been observed when an 'empty' cpumask is passed along to the worker/task scheduling code. It is intended as a temporary workaround until a more fundamental review and correction of the issue can be done. [With additions to the patch provided by Tejun Hao <tj@kernel.org>] Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-25workqueue: implicit ordered attribute should be overridableTejun Heo1-4/+9
5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automatically enabled ordered attribute for unbound workqueues w/ max_active == 1. Because ordered workqueues reject max_active and some attribute changes, this implicit ordered mode broke cases where the user creates an unbound workqueue w/ max_active == 1 and later explicitly changes the related attributes. This patch distinguishes explicit and implicit ordered setting and overrides from attribute changes if implict. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
2017-07-19workqueue: restore WQ_UNBOUND/max_active==1 to be orderedTejun Heo1-0/+10
The combination of WQ_UNBOUND and max_active == 1 used to imply ordered execution. After NUMA affinity 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues"), this is no longer true due to per-node worker pools. While the right way to create an ordered workqueue is alloc_ordered_workqueue(), the documentation has been misleading for a long time and people do use WQ_UNBOUND and max_active == 1 for ordered workqueues which can lead to subtle bugs which are very difficult to trigger. It's unlikely that we'd see noticeable performance impact by enforcing ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's automatically set __WQ_ORDERED for those workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: Alexei Potashnik <alexei@purestorage.com> Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues") Cc: stable@vger.kernel.org # v3.10+
2017-06-20sched/wait: Rename wait_queue_t => wait_queue_entry_tIngo Molnar1-2/+2
Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-01Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-0/+23
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - another round of rq-clock handling debugging, robustization and fixes - PELT accounting improvements - CPU hotplug related ->cpus_allowed affinity handling fixes all around the tree - ... plus misc fixes, cleanups and updates" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) sched/x86: Update reschedule warning text crypto: N2 - Replace racy task affinity logic cpufreq/sparc-us2e: Replace racy task affinity logic cpufreq/sparc-us3: Replace racy task affinity logic cpufreq/sh: Replace racy task affinity logic cpufreq/ia64: Replace racy task affinity logic ACPI/processor: Replace racy task affinity logic ACPI/processor: Fix error handling in __acpi_processor_start() sparc/sysfs: Replace racy task affinity logic powerpc/smp: Replace open coded task affinity logic ia64/sn/hwperf: Replace racy task affinity logic ia64/salinfo: Replace racy task affinity logic workqueue: Provide work_on_cpu_safe() ia64/topology: Remove cpus_allowed manipulation sched/fair: Move the PELT constants into a generated header sched/fair: Increase PELT accuracy for small tasks sched/fair: Fix comments sched/Documentation: Add 'sched-pelt' tool sched/fair: Fix corner case in __accumulate_sum() sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags() ...
2017-05-01Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-3/+2
Pull workqueue update from Tejun Heo: "One trivial patch to use setup_deferrable_timer() instead of open-coding the initialization" * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: use setup_deferrable_timer
2017-04-15workqueue: Provide work_on_cpu_safe()Thomas Gleixner1-0/+23
work_on_cpu() is not protected against CPU hotplug. For code which requires to be either executed on an online CPU or to fail if the CPU is not available the callsite would have to protect against CPU hotplug. Provide a function which does get/put_online_cpus() around the call to work_on_cpu() and fails the call with -ENODEV if the target CPU is not online. Preparatory patch to convert several racy task affinity manipulations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/20170412201042.262610721@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-06workqueue: use setup_deferrable_timerGeliang Tang1-3/+2
Use setup_deferrable_timer() instead of init_timer_deferrable() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06workqueue: trigger WARN if queue_delayed_work() is called with NULL @wqTejun Heo1-0/+1
If queue_delayed_work() gets called with NULL @wq, the kernel will oops asynchronuosly on timer expiration which isn't too helpful in tracking down the offender. This actually happened with smc. __queue_delayed_work() already does several input sanity checks synchronously. Add NULL @wq check. Reported-by: Dave Jones <davej@codemonkey.org.uk> Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk Signed-off-by: Tejun Heo <tj@kernel.org>
2017-02-10time: Remove CONFIG_TIMER_STATSKees Cook1-2/+0
Currently CONFIG_TIMER_STATS exposes process information across namespaces: kernel/time/timer_list.c print_timer(): SEQ_printf(m, ", %s/%d", tmp, timer->start_pid); /proc/timer_list: #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570 Given that the tracer can give the same information, this patch entirely removes CONFIG_TIMER_STATS. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: linux-doc@vger.kernel.org Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Xing Gao <xgao01@email.wm.edu> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jessica Frazelle <me@jessfraz.com> Cc: kernel-hardening@lists.openwall.com Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Marek <mmarek@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Olof Johansson <olof@lixom.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-api@vger.kernel.org Cc: Arjan van de Ven <arjan@linux.intel.com> Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-19Merge branch 'for-4.9' into for-4.10Tejun Heo1-18/+85
2016-10-19workqueue: move wq_numa_init() to workqueue_init()Tejun Heo1-2/+23
While splitting up workqueue initialization into two parts, ac8f73400782 ("workqueue: make workqueue available early during boot") put wq_numa_init() into workqueue_init_early(). Unfortunately, on some archs including power and arm64, cpu to node mapping isn't yet established by the time the early init is called leading to incorrect NUMA initialization and subsequently the following oops due to zero cpumask on node-specific unbound pools. Unable to handle kernel paging request for data at address 0x00000038 Faulting instruction address: 0xc0000000000fc0cc Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA PowerNV Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94 task: c0000007f5400000 task.stack: c000001ffc084000 NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0 REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005) MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000 CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0 GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600 GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000 GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001 GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001 GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000 NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0 LR [c0000000000ed928] activate_task+0x78/0xe0 Call Trace: [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable) [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0 [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0 [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0 [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250 [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150 [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360 [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160 [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc Instruction dump: 62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000 60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58 ---[ end trace 0000000000000000 ]--- Fix it by moving wq_numa_init() to workqueue_init(). As this means that the early intialization may not have full NUMA info for per-cpu pools and ignores NUMA affinity for unbound pools, fix them up from workqueue_init() after wq_numa_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot") Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-11kthread: rename probe_kthread_data() to kthread_probe_data()Petr Mladek1-1/+1
Patch series "kthread: Kthread worker API improvements" The intention of this patchset is to make it easier to manipulate and maintain kthreads. Especially, I want to replace all the custom main cycles with a generic one. Also I want to make the kthreads sleep in a consistent state in a common place when there is no work. This patch (of 11): A good practice is to prefix the names of functions by the name of the subsystem. This patch fixes the name of probe_kthread_data(). The other wrong functions names are part of the kthread worker API and will be fixed separately. Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-17workqueue: remove keventd_up()Tejun Heo1-1/+1
keventd_up() no longer has in-kernel users. Remove it and make wq_online static. Signed-off-by: Tejun Heo <tj@kernel.org>
2016-09-17workqueue: make workqueue available early during bootTejun Heo1-16/+60
Workqueue is currently initialized in an early init call; however, there are cases where early boot code has to be split and reordered to come after workqueue initialization or the same code path which makes use of workqueues is used both before workqueue initailization and after. The latter cases have to gate workqueue usages with keventd_up() tests, which is nasty and easy to get wrong. Workqueue usages have become widespread and it'd be a lot more convenient if it can be used very early from boot. This patch splits workqueue initialization into two steps. workqueue_init_early() which sets up the basic data structures so that workqueues can be created and work items queued, and workqueue_init() which actually brings up workqueues online and starts executing queued work items. The former step can be done very early during boot once memory allocation, cpumasks and idr are initialized. The latter right after kthreads become available. This allows work item queueing and canceling from very early boot which is what most of these use cases want. * As systemd_wq being initialized doesn't indicate that workqueue is fully online anymore, update keventd_up() to test wq_online instead. The follow-up patches will get rid of all its usages and the function itself. * Flushing doesn't make sense before workqueue is fully initialized. The flush functions trigger WARN and return immediately before fully online. * Work items are never in-flight before fully online. Canceling can always succeed by skipping the flush step. * Some code paths can no longer assume to be called with irq enabled as irq is disabled during early boot. Use irqsave/restore operations instead. v2: Watchdog init, which requires timer to be running, moved from workqueue_init_early() to workqueue_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
2016-09-16workqueue: dump workqueue state on sanity check failures in destroy_workqueue()Tejun Heo1-0/+2
destroy_workqueue() performs a number of sanity checks to ensure that the workqueue is empty before proceeding with destruction. However, it's not always easy to tell what's going on just from the warning message. Let's dump workqueue state after sanity check failures to help debugging. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com Cc: Dmitry Vyukov <dvyukov@google.com>
2016-08-29workqueue: add cancel_work()Jens Axboe1