path: root/include/linux/workqueue.h
2024-01-10  Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs  (Linus Torvalds, 1 file changed, -15/+1)
Pull header cleanups from Kent Overstreet: "The goal is to get sched.h down to a type-only header, so the main thing happening in this patchset is splitting out various _types.h headers and dependency fixups, as well as moving some things out of sched.h to better locations. This is prep work for the memory allocation profiling patchset which adds new sched.h interdependencies."

* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
  Kill sched.h dependency on rcupdate.h
  kill unnecessary thread_info.h include
  Kill unnecessary kernel.h include
  preempt.h: Kill dependency on list.h
  rseq: Split out rseq.h from sched.h
  LoongArch: signal.c: add header file to fix build error
  restart_block: Trim includes
  lockdep: move held_lock to lockdep_types.h
  sem: Split out sem_types.h
  uidgid: Split out uidgid_types.h
  seccomp: Split out seccomp_types.h
  refcount: Split out refcount_types.h
  uapi/linux/resource.h: fix include
  x86/signal: kill dependency on time.h
  syscall_user_dispatch.h: split out *_types.h
  mm_types_task.h: Trim dependencies
  Split out irqflags_types.h
  ipc: Kill bogus dependency on spinlock.h
  shm: Slim down dependencies
  workqueue: Split out workqueue_types.h
  ...
2023-12-20  workqueue: Split out workqueue_types.h  (Kent Overstreet, 1 file changed, -15/+1)
More sched.h dependency culling - this lets us kill a rhashtable-types.h dependency on workqueue.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-12  workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask  (Waiman Long, 1 file changed, -1/+1)
When the "isolcpus" boot command line option is used to add a set of isolated CPUs, those CPUs are excluded automatically from wq_unbound_cpumask to avoid running work functions from unbound workqueues.

Recently cpuset has been extended to allow the creation of partitions of isolated CPUs dynamically. To bring this closer to "isolcpus" in functionality, the CPUs in those isolated cpuset partitions should be excluded from wq_unbound_cpumask as well. This can currently be done by explicitly writing to the workqueue's cpumask sysfs file after creating the isolated partitions, but that process is error-prone. Ideally, the cpuset code should be able to request that the workqueue code exclude those isolated CPUs from wq_unbound_cpumask, so that this operation happens automatically and the isolated CPUs are returned to wq_unbound_cpumask after the destruction of the isolated cpuset partitions.

This patch adds a new workqueue_unbound_exclude_cpumask() function to enable that. The new function excludes the specified isolated CPUs from wq_unbound_cpumask. To be able to restore those isolated CPUs after the destruction of isolated cpuset partitions, a new wq_requested_unbound_cpumask is added to store the user-provided unbound cpumask, either from the boot command line options or from writing to the cpumask sysfs file. This new cpumask provides the basis for CPU exclusion.

To let users understand how wq_unbound_cpumask is being modified internally, this patch also exposes the newly introduced wq_requested_unbound_cpumask, as well as a wq_isolated_cpumask that stores the cpumask to be excluded from wq_unbound_cpumask, as read-only sysfs files.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
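As an editorial aside (not part of the commit), here is a minimal sketch of how a caller such as cpuset could use the new hook, assuming the prototype implied by the log: a cpumask to exclude, an errno-style return, and an empty mask restoring the user-requested cpumask.

```c
/* Hypothetical caller-side sketch; update_isolated_cpus() and its calling
 * context are assumptions for illustration, not the actual cpuset code. */
static int update_isolated_cpus(struct cpumask *isolated_cpus)
{
	/*
	 * Hand the aggregate isolated set to the workqueue code so it can be
	 * removed from wq_unbound_cpumask; an empty mask would let
	 * wq_requested_unbound_cpumask be restored.
	 */
	return workqueue_unbound_exclude_cpumask(isolated_cpus);
}
```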
2023-10-17  workqueue: Provide one lock class key per work_on_cpu() callsite  (Frederic Weisbecker, 1 file changed, -7/+39)
All callers of work_on_cpu() share the same lock class key for all the functions queued. As a result the workqueue related locking scenario for a function A may be spuriously accounted as an inversion against the locking scenario of function B such as in the following model: long A(void *arg) { mutex_lock(&mutex); mutex_unlock(&mutex); } long B(void *arg) { } void launchA(void) { work_on_cpu(0, A, NULL); } void launchB(void) { mutex_lock(&mutex); work_on_cpu(1, B, NULL); mutex_unlock(&mutex); } launchA and launchB running concurrently have no chance to deadlock. However the above can be reported by lockdep as a possible locking inversion because the works containing A() and B() are treated as belonging to the same locking class. The following shows an existing example of such a spurious lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted ------------------------------------------------------ kworker/0:1/9 is trying to acquire lock: ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0 but task is already holding lock: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __flush_work+0x83/0x4e0 work_on_cpu+0x97/0xc0 rcu_nocb_cpu_offload+0x62/0xb0 rcu_nocb_toggle+0xd0/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}: __mutex_lock+0x81/0xc80 rcu_nocb_cpu_deoffload+0x38/0xb0 rcu_nocb_toggle+0x144/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 percpu_down_write+0x31/0x200 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 other info that might help us debug this: Chain exists of: cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&wfc.work)); lock(rcu_state.barrier_mutex); lock((work_completion)(&wfc.work)); lock(cpu_hotplug_lock); *** DEADLOCK *** 2 locks held by kworker/0:1/9: #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500 #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 stack backtrace: CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: events work_for_cpu_fn Call Trace: rcu-torture: rcu_torture_read_exit: Start of episode <TASK> dump_stack_lvl+0x4a/0x80 check_noncircular+0x132/0x150 __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 ? _cpu_down+0x57/0x2b0 percpu_down_write+0x31/0x200 ? _cpu_down+0x57/0x2b0 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 ? __pfx_worker_thread+0x10/0x10 kthread+0xe6/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x40 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Fix this by providing one lock class key per work_on_cpu() caller. Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
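For context, an editorial sketch (not quoted from the patch) of the usual way to give each callsite its own lock class key: turn work_on_cpu() into a macro that defines a static key and hands it to a _key() variant. work_on_cpu_key() is assumed here as the name of that underlying helper.

```c
/* Sketch of a per-callsite lock class key; names are assumptions. */
long work_on_cpu_key(int cpu, long (*fn)(void *), void *arg,
		     struct lock_class_key *key);

#define work_on_cpu(_cpu, _fn, _arg)				\
({								\
	static struct lock_class_key __key;			\
								\
	work_on_cpu_key(_cpu, _fn, _arg, &__key);		\
})
```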
2023-08-07  workqueue: Make default affinity_scope dynamically updatable  (Tejun Heo, 1 file changed, -2/+1)
While workqueue.default_affinity_scope is writable, it only affects workqueues which are created afterwards and isn't very useful. Instead, let's introduce explicit "default" scope and update the effective scope dynamically when workqueue.default_affinity_scope is changed. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07  workqueue: Implement non-strict affinity scope for unbound workqueues  (Tejun Heo, 1 file changed, -0/+11)
An unbound workqueue can be served by multiple worker_pools to improve locality. The segmentation is achieved by grouping CPUs into pods. By default, the cache boundaries according to cpus_share_cache() define how the CPUs are grouped. Say a workqueue is allowed to run on all CPUs and the system has two L3 caches; the workqueue would then be mapped to two worker_pools, each serving one L3 cache domain.

While this improves locality, the strict pod boundaries limit the total bandwidth a given issuer can consume. For example, say there is a thread pinned to a CPU issuing enough work items to saturate the whole machine. With the machine segmented into two pods, no matter how many work items it issues, it can only use half of the CPUs on the system. While this limitation has existed for a very long time, it wasn't very pronounced because the affinity grouping used to always be by NUMA nodes. With cache boundaries as the default and support for even finer grained scopes (smt and cpu), the problem is now a lot more pressing.

This patch implements non-strict affinity scope, where the pod boundaries aren't enforced strictly. Going back to the previous example, the workqueue would still be mapped to two worker_pools; however, the affinity enforcement would be soft. The workers in both pools would have their cpus_allowed set to the whole machine, thus allowing the scheduler to migrate them anywhere on the machine. However, whenever an idle worker is woken up, the workqueue code asks the scheduler to bring the task back within the pod if the worker is outside it. That is, work items start executing within their affinity scope but can be migrated outside as the scheduler sees fit. This removes the hard cap on utilization while maintaining the benefits of affinity scopes.

After the earlier ->__pod_cpumask changes, the implementation is pretty simple. When non-strict, which is the new default:

* pool_allowed_cpus() returns @pool->attrs->cpumask instead of ->__pod_cpumask so that the workers are allowed to run on any CPU that the associated workqueues allow.

* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets the field to a CPU within the pod.

This would be the first use of task_struct->wake_cpu outside the scheduler proper, so it isn't clear whether this would be acceptable. However, other methods of migrating tasks are significantly more expensive and are likely prohibitively so if we want to do this on every work item. This needs discussion with scheduler folks.

There is also a race window where setting ->wake_cpu wouldn't be effective as the target task is still on CPU. However, the window is pretty small and, this being a best-effort optimization, it doesn't seem to warrant more complexity at the moment.

While the non-strict cache affinity scopes seem to be the best option, the performance picture interacts with the affinity scope and is a bit complicated to fully discuss in this patch, so the behavior is made easily selectable through wqattrs and sysfs, and the next patch will add documentation to discuss performance implications.

v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
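An editorial sketch of the ->wake_cpu nudge described above; the surrounding helper names are assumptions and the real kick_pool() carries more bookkeeping.

```c
/* Sketch only: wake an idle worker and, for non-strict scopes, nudge it
 * back into its pod so the work item starts within its affinity scope. */
static bool kick_pool(struct worker_pool *pool)
{
	struct worker *worker = first_idle_worker(pool);	/* assumed helper */
	struct task_struct *p;

	if (!worker)
		return false;
	p = worker->task;

	if (!pool->attrs->affn_strict &&
	    !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask))
		p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);

	wake_up_process(p);
	return true;
}
```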
2023-08-07  workqueue: Add workqueue_attrs->__pod_cpumask  (Tejun Heo, 1 file changed, -0/+16)
workqueue_attrs has two uses:

* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code

For example, if the user wants to restrict a workqueue to run only on CPUs 0 and 2, and the two CPUs are on different affinity scopes, the workqueue's attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be associated with two worker_pools, one with attrs->cpumask containing just CPU 0 and the other CPU 2.

Workqueue wants to support non-strict affinity scopes where work items are started in their matching affinity scopes but the scheduler is free to migrate them outside the starting scopes, which can enable utilizing the whole machine while maintaining most of the locality benefits from affinity scopes. To enable that, worker_pools need to distinguish the strict affinity that they have to follow (because that's the restriction coming from the user) and the soft affinity that they want to apply when dispatching work items. Note that two worker_pools with different soft dispatching requirements have to be separate; otherwise, for example, we'd be ping-ponging worker threads across NUMA boundaries constantly.

This patch adds workqueue_attrs->__pod_cpumask. The new field is double underscored as it's only used internally to distinguish worker_pools. A worker_pool's ->cpumask is now always the same as the online subset of allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's subset of that ->cpumask. Going back to the example above, both worker_pools would have ->cpumask containing both CPUs 0 and 2, but one's ->__pod_cpumask would contain 0 while the other's 2.

* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask that the pool's workers must stay within. This is currently always ->__pod_cpumask as all boundaries are still strict.

* As a workqueue_attrs can now track both the associated workqueues' cpumask and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external out-argument. Drop @cpumask and instead store the result in ->__pod_cpumask.

* The above also simplifies apply_wqattrs_prepare() as the same workqueue_attrs can be used to create all pods associated with a workqueue. tmp_attrs is dropped.

* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq update is needed instead of only comparing ->cpumask, so that ->__pod_cpumask is compared too. It could directly compare ->__pod_cpumask, but the code is easier to understand and more robust this way.

The only user-visible behavior change is that two workqueues with different cpumasks can no longer share worker_pools even when their pod subsets coincide. Going back to the example, let's say there's another workqueue with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped to two worker_pools - one with CPU 0, the other with 2 and 3. The former has the same cpumask as the first pod of the earlier example and would have shared the same worker_pool, but that's no longer the case after this patch. The worker_pools would have the same ->__pod_cpumask but their ->cpumask's wouldn't match.

While this is necessary to support non-strict affinity scopes, there can be further optimizations to maintain sharing among strict affinity scopes. However, non-strict affinity scopes are going to be preferable for most use cases and we don't see a very diverse mixture of unbound workqueue cpumasks anyway, so the additional overhead doesn't seem to justify the extra complexity.

v2:
- wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.

Signed-off-by: Tejun Heo <tj@kernel.org>
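To make the strict/soft split concrete, here is an editorial sketch of pool_allowed_cpus() as it looks once this patch and the non-strict patch above are combined; the exact body in the tree may differ.

```c
/* Sketch: the cpumask the pool's workers must stay within.  With strict
 * affinity it is the pod subset; otherwise the full allowed cpumask. */
static const struct cpumask *pool_allowed_cpus(struct worker_pool *pool)
{
	return pool->attrs->affn_strict ? pool->attrs->__pod_cpumask
					: pool->attrs->cpumask;
}
```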
2023-08-07  workqueue: Add multiple affinity scopes and interface to select them  (Tejun Heo, 1 file changed, -1/+4)
Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE the default. The code changes to actually add the additional scopes are trivial.

Also add module parameter "workqueue.default_affinity_scope" to override the default scope and an "affinity_scope" sysfs file to configure it per workqueue. wq_dump.py and documentation are updated accordingly.

This enables significant flexibility in configuring how unbound workqueues behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu workqueue. On the other hand, "system" removes all locality boundaries.

Many modern machines have multiple L3 caches, often while being mostly uniform in terms of memory access. Thus, workqueue's previous behavior of spreading work items in each NUMA node had negative performance implications from unnecessarily crossing L3 boundaries between issue and execution. However, picking a finer grained affinity scope also has a downside in that an issuer in one group can't utilize CPUs in other groups.

While dependent on the specifics of the workload, there's usually a noticeable penalty in crossing L3 boundaries, so let's default to CACHE. This issue will be further addressed and documented with examples in future patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
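For orientation, an editorial sketch of how the affinity scopes end up looking across this and the neighbouring patches; the exact enumerator names and order are assumptions.

```c
/* Rough sketch of the affinity-scope enum; treat names and order as assumptions. */
enum wq_affn_scope {
	WQ_AFFN_DFL,		/* use the system-wide default (later patch) */
	WQ_AFFN_CPU,		/* one pod per CPU */
	WQ_AFFN_SMT,		/* one pod per SMT sibling group */
	WQ_AFFN_CACHE,		/* one pod per LLC, the new default */
	WQ_AFFN_NUMA,		/* one pod per NUMA node */
	WQ_AFFN_SYSTEM,		/* one pod spanning the whole system */

	WQ_AFFN_NR_TYPES,
};
```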
2023-08-07  workqueue: Generalize unbound CPU pods  (Tejun Heo, 1 file changed, -4/+27)
While renamed to pod, the code still assumes that the pods are defined by NUMA boundaries. Let's generalize it:

* workqueue_attrs->affn_scope is added. Each enum represents the type of boundaries that define the pods. There are currently two scopes - WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before - one pod per NUMA node. The latter defines one global pod across the whole system.

* struct wq_pod_type is added which describes how pods are configured for each affinity scope. For each pod, it lists the member CPUs and the preferred NUMA node for memory allocations. The reverse mapping from CPU to pod is also available.

* wq_pod_enabled is dropped. Pod is now always enabled. The previously disabled behavior is now implemented through WQ_AFFN_SYSTEM.

* get_unbound_pool() wants to determine the NUMA node to allocate memory from for the new pool. The variables are renamed from node to pod but the logic still assumes they're one and the same. Clearly distinguish them - walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's NUMA node.

* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA node. Take @cpu instead and determine the cpumask to use from the pod_type matching @attrs.

* apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of NULL so that it can indicate -EINVAL on invalid affinity scopes.

This patch allows CPUs to be grouped into pods however desired per type. While this patch causes some internal behavior changes, nothing material should change for workqueue users.

v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is WQ_AFFN_NR_TYPES, which indicates that the function is called with a worker_pool's attrs instead of a workqueue's.

Signed-off-by: Tejun Heo <tj@kernel.org>
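A short editorial sketch of the pod descriptor the log describes; the field names are assumptions based on the description, not copied from the patch.

```c
/* Sketch: how pods could be described for one affinity scope. */
struct wq_pod_type {
	int nr_pods;			/* number of pods */
	cpumask_var_t *pod_cpus;	/* pod -> member CPUs */
	int *pod_node;			/* pod -> preferred NUMA node */
	int *cpu_pod;			/* CPU -> pod (reverse mapping) */
};
```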
2023-08-07  workqueue: Initialize unbound CPU pods later in the boot  (Tejun Heo, 1 file changed, -0/+1)
During boot, to initialize unbound CPU pods, wq_pod_init() was called from workqueue_init(). This is early enough for NUMA nodes to be set up but before SMP is brought up and CPU topology information is populated. Workqueue is in the process of improving CPU locality for unbound workqueues and will need access to topology information during pod init. This adds a new init function workqueue_init_topology() which is called after CPU topology information is available and replaces wq_pod_init(). As unbound CPU pods are now initialized after workqueues are activated, we need to revisit the workqueues to apply the pod configuration. Workqueues which are created before workqueue_init_topology() are set up so that they always use the default worker pool. After pods are set up in workqueue_init_topology(), wq_update_pod() is called on all existing workqueues to update the pool associations accordingly. Note that wq_update_pod_attrs_buf allocation is moved to workqueue_init_early(). This isn't necessary right now but enables further generalization of pod handling in the future. This patch changes the initialization sequence but the end result should be the same. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07  workqueue: Rename workqueue_attrs->no_numa to ->ordered  (Tejun Heo, 1 file changed, -3/+3)
With the recent removal of NUMA related module param and sysfs knob, workqueue_attrs->no_numa is now only used to implement ordered workqueues. Let's rename the field so that it's less confusing especially with the planned CPU affinity awareness improvements. Just a rename. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07  workqueue: Make unbound workqueues to use per-cpu pool_workqueues  (Tejun Heo, 1 file changed, -6/+2)
A pwq (pool_workqueue) represents an association between a workqueue and a worker_pool. When a work item is queued, the workqueue selects the pwq to use, which in turn determines the pool, and queues the work item to the pool through the pwq. pwq is also what implements the maximum concurrency limit - @max_active.

As a per-cpu workqueue should be associated with a different worker_pool on each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq. However, unbound workqueues were sharing a pwq within each NUMA node by default. The sharing has several downsides:

* Because @max_active is per-pwq, the meaning of @max_active changes depending on the machine configuration and whether workqueue NUMA locality support is enabled.

* Makes per-cpu and unbound code deviate.

* Gets in the way of making workqueue CPU locality awareness more flexible.

This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu workqueues do by making the following changes:

* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound workqueues.

* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs the specified pwq to the target CPU's wq->cpu_pwq.

* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq. This makes the return value of wq_calc_node_cpumask() unnecessary. It now returns void.

* @max_active now means the same thing for both per-cpu and unbound workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer used in the workqueue implementation and will be removed later.

* All unbound pwq operations which used to be per-numa-node are now per-cpu.

For most unbound workqueue users, this shouldn't cause noticeable changes. Work item issue and completion will be a bit faster, flush_workqueue() would become a bit more expensive, and the total concurrency limit would likely become higher. All @max_active==1 use cases are currently being audited for conversion into alloc_ordered_workqueue() and they shouldn't be affected once the audit and conversion is complete.

One area where the behavior change may be more noticeable is workqueue_congested(), as the reported congestion state is now per CPU instead of per NUMA node. There are only two users of this interface - drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are cc'd. Inputs on the behavior change would be very much appreciated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
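As an editorial illustration of the unified lookup this describes (a sketch, assuming wq->cpu_pwq is a per-CPU, RCU-protected pwq pointer as stated above):

```c
/* Sketch: with per-cpu pwqs, unbound and per-cpu wqs share one lookup path. */
static struct pool_workqueue *lookup_pwq(struct workqueue_struct *wq, int cpu)
{
	/* callers hold rcu_read_lock(); unbound pwqs are RCU protected */
	return rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
}
```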
2023-07-10  workqueue: Warn attempt to flush system-wide workqueues.  (Tetsuo Handa, 1 file changed, -41/+3)
Based on commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro"), all in-tree users stopped flushing system-wide workqueues. Therefore, start emitting a runtime message so that all out-of-tree users will understand that they need to update their code.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-23  workqueue: clean up WORK_* constant types, clarify masking  (Linus Torvalds, 1 file changed, -7/+8)
Dave Airlie reports that gcc-13.1.1 has started complaining about some of the workqueue code in 32-bit arm builds:

  kernel/workqueue.c: In function ‘get_work_pwq’:
  kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
    713 |         return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
        |                        ^

  [ ... a couple of other cases ... ]

and while it's not immediately clear exactly why gcc started complaining about it now, I suspect some C23-induced enum type handling fixup in gcc-13 is the cause.

Whatever the reason for starting to complain, the code and data types are indeed disgusting enough that the complaint is warranted. The wq code ends up creating various "helper constants" (like that WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of confused. The mask needs to be 'unsigned long', not some unspecified enum type.

To make matters worse, the actual "mask and cast to a pointer" is repeated a couple of times, and the cast isn't even always done to the right pointer, but - as in the error case above - to a 'void *' with the compiler then finishing the job. That's not how we roll in the kernel.

So create the masks using the proper types rather than some ambiguous enumeration, and use a nice helper that actually does the type conversion in one well-defined place. Incidentally, this magically makes clang generate better code. That, admittedly, is really just a sign of clang having been seriously confused before, and cleaning up the typing unconfuses the compiler too.

Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
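A hedged sketch of the kind of properly-typed mask plus single-purpose helper the commit describes; the exact names and flag-bit arithmetic in the tree may differ.

```c
/* Sketch: build the mask as unsigned long and do the cast in one place. */
#define WORK_STRUCT_FLAG_MASK		((1UL << WORK_STRUCT_FLAG_BITS) - 1)
#define WORK_STRUCT_WQ_DATA_MASK	(~WORK_STRUCT_FLAG_MASK)

static inline struct pool_workqueue *work_struct_pwq(unsigned long data)
{
	return (struct pool_workqueue *)(data & WORK_STRUCT_WQ_DATA_MASK);
}
```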
2023-03-23  workqueue: Introduce show_freezable_workqueues  (Jungseung Lee, 1 file changed, -0/+1)
Currently show_all_workqueues is called if freezing fails at the time of freezing the workqueues, which shows the status of all workqueues and of all worker pools. In this case we may need to dump the state of only the workqueues that are freezable and busy.

This patch defines show_freezable_workqueues, which uses show_one_workqueue, a granular function that shows the state of individual workqueues, so that only the state of freezable workqueues is dumped at that time.

tj: Minor message adjustment.

Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-04  workqueue: Add a new flag to spot the potential UAF error  (Richard Clark, 1 file changed, -0/+1)
Currently, if the user unintentionally queues a new work item into a wq after destroy_workqueue(wq), the work can still be queued and scheduled without any noticeable kernel message before the end of an RCU grace period. As a debug-aid facility, this commit adds a new flag __WQ_DESTROYING to spot that issue by triggering a kernel WARN message.

Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
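A simplified editorial sketch of how such a flag is typically used; the real check in the queueing path is more involved (it also has to allow chained work while draining), and the locking shown here is abbreviated.

```c
/* Sketch: destroy_workqueue() marks the wq, the queueing path warns. */
void destroy_workqueue(struct workqueue_struct *wq)
{
	/* mark that destruction is in progress (under the wq's mutex) */
	wq->flags |= __WQ_DESTROYING;
	/* ... drain remaining work and free the wq ... */
}

static void __queue_work(int cpu, struct workqueue_struct *wq,
			 struct work_struct *work)
{
	/* reject and warn on queueing to a dying workqueue */
	if (WARN_ON_ONCE(wq->flags & __WQ_DESTROYING))
		return;
	/* ... normal queueing path ... */
}
```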
2022-07-12  Merge tag 'amd-drm-next-5.20-2022-07-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next  (Dave Airlie, 1 file changed, -0/+1)
amd-drm-next-5.20-2022-07-05:

amdgpu:
- Various spelling and grammar fixes
- Various eDP fixes
- Various DMCUB fixes
- VCN fixes
- GMC 11 fixes
- RAS fixes
- TMZ support for GC 10.3.7
- GPUVM TLB flush fixes
- SMU 13.0.x updates
- DCN 3.2 Support
- DCN 3.2.1 Support
- MES updates
- GFX11 modifiers support
- USB-C fixes
- MMHUB 3.0.1 support
- SDMA 6.0 doorbell fixes
- Initial devcoredump support
- Enable high priority gfx queue on asics which support it
- Enable GPU reset for SMU 13.0.4
- OLED display fixes
- MPO fixes
- DC frame size fixes
- ASPM support for PCIE 7.4/7.6
- GPU reset support for SMU 13.0.0
- GFX11 updates
- VCN JPEG fix
- BACO support for SMU 13.0.7
- VCN instance handling fix
- GFX8 GPUVM TLB flush fix
- GPU reset rework
- VCN 4.0.2 support
- GTT size fixes
- DP link training fixes
- LSDMA 6.0.1 support
- Various backlight fixes
- Color encoding fixes
- Backlight config cleanup
- VCN 4.x unified queue cleanup

amdkfd:
- MMU notifier fixes
- Updates for GC 10.3.6 and 10.3.7
- P2P DMA support using dma-buf
- Add available memory IOCTL
- SDMA 6.0.1 fix
- MES fixes
- HMM profiler support

radeon:
- License fix
- Backlight config cleanup

UAPI:
- Add available memory IOCTL to amdkfd
  Proposed userspace: https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg75743.html
- HMM profiler support for amdkfd
  Proposed userspace: https://lists.freedesktop.org/archives/amd-gfx/2022-June/080805.html

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220705212633.6037-1-alexander.deucher@amd.com
2022-06-11  workqueue: Switch to new kerneldoc syntax for named variable macro argument  (Jonathan Neuschäfer, 1 file changed, -1/+1)
The syntax without dots is available since commit 43756e347f21 ("scripts/kernel-doc: Add support for named variable macro arguments"). The same HTML output is produced with and without this patch. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-10  Revert "workqueue: remove unused cancel_work()"  (Andrey Grodzovsky, 1 file changed, -0/+1)
This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f.

amdgpu needs this function in order to prematurely stop pending reset works when another reset work is already in progress.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
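A brief editorial sketch of the kind of driver-side use described; the structure and field names are hypothetical, not taken from amdgpu.

```c
/* Hypothetical driver state for illustration only. */
struct my_dev {
	atomic_t reset_in_progress;
	struct work_struct reset_work;
};

/* Sketch: drop a reset work that is still pending (not yet running) when
 * another reset is already in flight; cancel_work() does not wait for a
 * running instance to finish, unlike cancel_work_sync(). */
static void my_dev_skip_duplicate_reset(struct my_dev *mdev)
{
	if (atomic_read(&mdev->reset_in_progress))
		cancel_work(&mdev->reset_work);
}
```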
2022-06-07  workqueue: Wrap flush_workqueue() using a macro  (Tetsuo Handa, 1 file changed, -8/+56)
Since the flush operation synchronously waits for completion, flushing system-wide WQs (e.g. system_wq) might introduce the possibility of deadlock due to an unexpected locking dependency. Tejun Heo commented at [1] that it makes no sense at all to call flush_workqueue() on the shared WQs as the caller has no idea what it's gonna end up waiting for.

Although there is flush_scheduled_work(), which flushes the system_wq WQ with the "Think twice before calling this function! It's very easy to get into trouble if you don't take great care." warning message, syzbot found a circular locking dependency caused by flushing the system_wq WQ [2].

Therefore, let's change direction: developers had better use their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. Steps for converting system-wide WQs into local WQs are explained at [3], and a conversion to stop flushing system-wide WQs is in progress.

Now we want some mechanism for preventing developers who are not aware of this conversion from starting to flush system-wide WQs again. Since WARN_ON() would be a complete but awkward approach for teaching developers about this problem, let's use __compiletime_warning() as an incomplete but handy approach. For completeness, we will also insert WARN_ON() into __flush_workqueue() after all in-tree users have stopped calling flush_scheduled_work().

Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
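A simplified, hedged sketch of the wrapping technique described; the in-tree macro checks all of the system-wide workqueues rather than just system_wq, and the warning helper's exact name is an assumption.

```c
/* Sketch: emit a compile-time warning when a system-wide wq is flushed. */
extern void __warn_flushing_systemwide_wq(void)
	__compiletime_warning("Please avoid flushing system-wide workqueues.");

#define flush_workqueue(wq)						\
({									\
	struct workqueue_struct *_wq = (wq);				\
									\
	if (__builtin_constant_p(_wq == system_wq) && _wq == system_wq)	\
		__warn_flushing_systemwide_wq();			\
	__flush_workqueue(_wq);						\
})
```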
2021-10-20  workqueue: Introduce show_one_worker_pool and show_one_workqueue.  (Imran Khan, 1 file changed, -1/+2)
Currently show_workqueue_state shows the state of all workqueues and of all worker pools. In certain cases we may need to dump the state of only a specific workqueue or worker pool. For example, in destroy_workqueue we only need to show the state of the workqueue which is getting destroyed.

So rename show_workqueue_state to show_all_workqueues (to signify that it dumps the state of all busy workqueues) and divide it into more granular functions (show_one_workqueue and show_one_worker_pool) that show the states of individual workqueues and worker pools and can be used in cases such as the one mentioned above.

Also, as mentioned earlier, make destroy_workqueue dump data pertaining to only the workqueue that is being destroyed, and make users of the earlier interface (show_workqueue_state) use the new interface (show_all_workqueues).

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-09-13  workqueue: annotate alloc_workqueue() as printf  (Rolf Eike Beer, 1 file changed, -3/+2)
This also enables format checking of alloc_ordered_workqueue() calls.

Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-17  workqueue: Remove unused WORK_NO_COLOR  (Lai Jiangshan, 1 file changed, -7/+2)
WORK_NO_COLOR has no user now, just remove it. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-17  workqueue: Rename "delayed" (delayed by active management) to "inactive"  (Lai Jiangshan, 1 file changed, -2/+2)
There are two kinds of "delayed" work items in the workqueue subsystem. One is for timer-delayed work items, which are visible to workqueue users. The other kind is for work items delayed by active management, which cannot be directly visible to workqueue users. We used the word "delayed" for both kinds, which caused some ambiguity.

This patch renames the latter kind (delayed by active management) to "inactive", because it is used for workqueue active management and most of its related symbols are named with "active" or "activate". All "delayed" and "DELAYED" are carefully checked and renamed one by one to avoid accidentally changing the name of the other kind, for timer-delayed.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-09  workqueue: Fix typo in comments  (Cai Huoqing, 1 file changed, -1/+1)
Fix typo:
  *assing  ==> assign
  *alloced ==> allocated
  *Retun   ==> Return
  *excute  ==> execute

v1->v2:
  *reverse 'iff'
  *update changelog

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-01-19  workqueue: fix annotation for WQ_SYSFS  (Menglong Dong, 1 file changed, -1/+1)
'wq_sysfs_register()' in annotation for 'WQ_SYSFS' is unavailable, change it to 'workqueue_sysfs_register()'. Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-06-01  workqueue: fix a piece of comment about reserved bits for work flags  (Lai Jiangshan, 1 file changed, -1/+1)
Commit 8a2e8e5dec7e ("workqueue: fix cwq->nr_active underflow") allocated one more bit from the work flags and updated part of the comments (128 bytes -> 256 bytes), but it failed to update the info about the number of reserved bits.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-04-03  Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds, 1 file changed, -2/+2)
Pull workqueue updates from Tejun Heo: "Nothing too interesting. Just two trivial patches"

* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Mark up unlocked access to wq->first_flusher
  workqueue: Make workqueue_init*() return void
2020-03-04  workqueue: Make workqueue_init*() return void  (Yu Chen, 1 file changed, -2/+2)
The return values of workqueue_init() and workqueue_init_early() are always 0, and there is no usage of their return value. So just make them return void.

Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12  workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()  (Andrea Parri, 1 file changed, -0/+16)
It's desirable to be able to rely on the following property: All stores preceding (in program order) a call to a successful queue_work() will be visible from the CPU which will execute the queued work by the time such work executes, e.g.,

  { x is initially 0 }

  CPU0                              CPU1

  WRITE_ONCE(x, 1);                 [ "work" is being executed ]
  r0 = queue_work(wq, work);        r1 = READ_ONCE(x);

  Forbids: r0 == true && r1 == 0

The current implementation of queue_work() provides such a memory-ordering property:

  - In __queue_work(), the ->lock spinlock is acquired.

  - On the other side, in worker_thread(), this same ->lock is held when dequeueing work.

So the locking ordering makes things work out.

Add this property to the DocBook headers of {queue,schedule}_work().

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
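As an editorial illustration of what the documented guarantee buys a caller (a sketch; the struct and function names are made up):

```c
/* Sketch: a producer fills in a message and queues the work; the work
 * function is guaranteed to see the preceding store if queue_work()
 * returned true (i.e. the work was not already pending). */
struct msg {
	int payload;
	struct work_struct work;
};

static void msg_work_fn(struct work_struct *work)
{
	struct msg *m = container_of(work, struct msg, work);

	pr_info("payload=%d\n", m->payload);	/* observes payload == 1 */
}

static void msg_send(struct workqueue_struct *wq, struct msg *m)
{
	m->payload = 1;			/* store before queueing ... */
	if (!queue_work(wq, &m->work))	/* ... ordered by the pwq lock */
		pr_debug("work already pending\n");
}
```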
2019-09-13  workqueue: unconfine alloc/apply/free_workqueue_attrs()  (Daniel Jordan, 1 file changed, -0/+4)
padata will use these interfaces in a later patch, so unconfine them.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
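A hedged sketch of the kind of out-of-workqueue.c use this enables; the exact signatures and the CPU-hotplug locking requirements around apply_workqueue_attrs() at this point in history are glossed over and may differ.

```c
/* Sketch: reshape an unbound workqueue's cpumask from another subsystem. */
static int reshape_unbound_wq(struct workqueue_struct *wq,
			      const struct cpumask *new_mask)
{
	struct workqueue_attrs *attrs;
	int err;

	attrs = alloc_workqueue_attrs();
	if (!attrs)
		return -ENOMEM;

	cpumask_copy(attrs->cpumask, new_mask);
	err = apply_workqueue_attrs(wq, attrs);	/* rebuilds the pwqs */
	free_workqueue_attrs(attrs);
	return err;
}
```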
2019-06-27  workqueue: Make alloc/apply/free_workqueue_attrs() static  (Thomas Gleixner, 1 file changed, -4/+0)
None of those functions have any users outside of workqueue.c. Confine them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-06  Merge tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  (Linus Torvalds, 1 file changed, -0/+2)
Pull driver core updates from Greg KH: "Here is the big driver core patchset for 5.1-rc1

More patches than "normal" here this merge window, due to some work in the driver core by Alexander Duyck to rework the async probe functionality to work better for a number of devices, and independent work from Rafael for the device link functionality to make it work "correctly".

Also in here is:
- lots of BUS_ATTR() removals, the macro is about to go away
- firmware test fixups
- ihex fixups and simplification
- component additions (also includes i915 patches)
- lots of minor coding style fixups and cleanups.

All of these have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits)
  driver core: platform: remove misleading err_alloc label
  platform: set of_node in platform_device_register_full()
  firmware: hardcode the debug message for -ENOENT
  driver core: Add missing description of new struct device_link field
  driver core: Fix PM-runtime for links added during consumer probe
  drivers/component: kerneldoc polish
  async: Add cmdline option to specify drivers to be async probed
  driver core: Fix possible supplier PM-usage counter imbalance
  PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  driver: platform: Support parsing GpioInt 0 in platform_get_irq()
  selftests: firmware: fix verify_reqs() return value
  Revert "selftests: firmware: remove use of non-standard diff -Z option"
  Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config"
  device: Fix comment for driver_data in struct device
  kernfs: Allocating memory for kernfs_iattrs with kmem_cache.
  sysfs: remove unused include of kernfs-internal.h
  driver core: Postpone DMA tear-down until after devres release
  driver core: Document limitation related to DL_FLAG_RPM_ACTIVE
  PM-runtime: Take suppliers into account in __pm_runtime_set_status()
  device.h: Add __cold to dev_<level> logging functions
  ...
2019-02-28  kernel/workqueue: Use dynamic lockdep keys for workqueues  (Bart Van Assche, 1 file changed, -24/+4)
The following commit: 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") improved deadlock checking in the workqueue implementation. Unfortunately that patch also introduced a few false positive lockdep complaints. This patch suppresses these false positives by allocating the workqueue mutex lockdep key dynamically. An example of a false positive lockdep complaint suppressed by this patch can be found below. The root cause of the lockdep complaint shown below is that the direct I/O code can call alloc_workqueue() from inside a work item created by another alloc_workqueue() call and that both workqueues share the same lockdep key. This patch avoids that that lockdep complaint is triggered by allocating the work queue lockdep keys dynamically. In other words, this patch guarantees that a unique lockdep key is associated with each work queue mutex. ====================================================== WARNING: possible circular locking dependency detected 4.19.0-dbg+ #1 Not tainted fio/4129 is trying to acquire lock: 00000000a01cfe1a ((wq_completion)"dio/%s"sb->s_id){+.+.}, at: flush_workqueue+0xd0/0x970 but task is already holding lock: 00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&sb->s_type->i_mutex_key#14){+.+.}: down_write+0x3d/0x80 __generic_file_fsync+0x77/0xf0 ext4_sync_file+0x3c9/0x780 vfs_fsync_range+0x66/0x100 dio_complete+0x2f5/0x360 dio_aio_complete_work+0x1c/0x20 process_one_work+0x481/0x9f0 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #1 ((work_completion)(&dio->complete_work)){+.+.}: process_one_work+0x447/0x9f0 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 ((wq_completion)"dio/%s"sb->s_id){+.+.}: lock_acquire+0xc5/0x200 flush_workqueue+0xf3/0x970 drain_workqueue+0xec/0x220 destroy_workqueue+0x23/0x350 sb_init_dio_done_wq+0x6a/0x80 do_blockdev_direct_IO+0x1f33/0x4be0 __blockdev_direct_IO+0x79/0x86 ext4_direct_IO+0x5df/0xbb0 generic_file_direct_write+0x119/0x220 __generic_file_write_iter+0x131/0x2d0 ext4_file_write_iter+0x3fa/0x710 aio_write+0x235/0x330 io_submit_one+0x510/0xeb0 __x64_sys_io_submit+0x122/0x340 do_syscall_64+0x71/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: (wq_completion)"dio/%s"sb->s_id --> (work_completion)(&dio->complete_work) --> &sb->s_type->i_mutex_key#14 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#14); lock((work_completion)(&dio->complete_work)); lock(&sb->s_type->i_mutex_key#14); lock((wq_completion)"dio/%s"sb->s_id); *** DEADLOCK *** 1 lock held by fio/4129: #0: 00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710 stack backtrace: CPU: 3 PID: 4129 Comm: fio Not tainted 4.19.0-dbg+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x86/0xc5 print_circular_bug.isra.32+0x20a/0x218 __lock_acquire+0x1c68/0x1cf0 lock_acquire+0xc5/0x200 flush_workqueue+0xf3/0x970 drain_workqueue+0xec/0x220 destroy_workqueue+0x23/0x350 sb_init_dio_done_wq+0x6a/0x80 do_blockdev_direct_IO+0x1f33/0x4be0 __blockdev_direct_IO+0x79/0x86 ext4_direct_IO+0x5df/0xbb0 generic_file_direct_write+0x119/0x220 __generic_file_write_iter+0x131/0x2d0 ext4_file_write_iter+0x3fa/0x710 aio_write+0x235/0x330 io_submit_one+0x510/0xeb0 __x64_sys_io_submit+0x122/0x340 
do_syscall_64+0x71/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Berg <johannes.berg@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/r/20190214230058.196511-20-bvanassche@acm.org [ Reworked the changelog a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
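For illustration, an editorial sketch of the dynamic-key technique described above (field and helper names are assumptions, not the literal patch): each workqueue registers its own lockdep key at allocation time so that flush-related dependencies are not lumped into one shared class.

```c
/* Sketch: give every workqueue its own dynamically registered lockdep key. */
static int wq_init_lockdep(struct workqueue_struct *wq)
{
	lockdep_register_key(&wq->key);
	wq->lock_name = kasprintf(GFP_KERNEL, "(wq_completion)%s", wq->name);
	if (!wq->lock_name)
		return -ENOMEM;

	lockdep_init_map(&wq->lockdep_map, wq->lock_name, &wq->key, 0);
	return 0;
}
```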
2019-01-31  workqueue: Provide queue_work_node to queue work near a given NUMA node  (Alexander Duyck, 1 file changed, -0/+2)
Provide a new function, queue_work_node, which is meant to schedule work on a "random" CPU of the requested NUMA node. The main motivation for this is to help asynchronous init improve boot times for devices that are local to a specific node.

For now we just default to the first CPU that is in the intersection of the cpumask of the node and the online cpumask. The only exception is if the CPU is local to the node, in which case we will just use the current CPU. This should work for our purposes as we are currently only using this for unbound work, so the CPU will be translated to a node anyway instead of being directly used.

As we are only using the first CPU to represent the NUMA node for now I am limiting the scope of