<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/cgroup/cgroup.c, branch v6.12.80</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>cgroup: fix race between task migration and iteration</title>
<updated>2026-03-25T10:08:30+00:00</updated>
<author>
<name>Qingye Zhao</name>
<email>zhaoqingye@honor.com</email>
</author>
<published>2026-02-11T09:24:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=9cca530c7cc1b3e02cb8fa7f80060dd4b38562ce'/>
<id>9cca530c7cc1b3e02cb8fa7f80060dd4b38562ce</id>
<content type='text'>
commit 5ee01f1a7343d6a3547b6802ca2d4cdce0edacb1 upstream.

When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset-&gt;tasks to cset-&gt;mg_tasks via:

    list_move_tail(&amp;task-&gt;cg_list, &amp;cset-&gt;mg_tasks);

If a css_task_iter currently has it-&gt;task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to -&gt;mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset-&gt;tasks, as
well as tasks queued on cset-&gt;mg_tasks, can be skipped by iteration.

Fix this by calling css_set_skip_task_iters() before unlinking
task-&gt;cg_list from cset-&gt;tasks. This advances all active iterators to
the next task on cset-&gt;tasks, so iteration continues correctly even
when a task is concurrently being migrated.
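
The ordering fix can be sketched as follows. This is an illustrative
fragment only: surrounding code in cgroup_migrate_add_task() is elided,
and the exact call-site layout is assumed from the description above.

```c
/* Sketch (not verbatim): skip racing iterators while @task is still
 * linked on cset-&gt;tasks, so each it-&gt;task_pos advances along the
 * tasks list; only then move the task onto cset-&gt;mg_tasks. */
css_set_skip_task_iters(cset, task);
list_move_tail(&amp;task-&gt;cg_list, &amp;cset-&gt;mg_tasks);
```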

This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:

  1) Spawn three long-running tasks (PIDs 101, 102, 103).
  2) Create a test cgroup and move the tasks into it.
  3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
  4) In one shell, read cgroup.procs from the test cgroup.
  5) Within the delay window, in another shell migrate PID 102 by
     writing it to a different cgroup.procs file.

Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.

Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset-&gt;tasks. Iterators may
also start after the task has been moved to cset-&gt;mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.

The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
compared to the correctness fix provided here.

Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Qingye Zhao &lt;zhaoqingye@honor.com&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 5ee01f1a7343d6a3547b6802ca2d4cdce0edacb1 upstream.

When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset-&gt;tasks to cset-&gt;mg_tasks via:

    list_move_tail(&amp;task-&gt;cg_list, &amp;cset-&gt;mg_tasks);

If a css_task_iter currently has it-&gt;task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to -&gt;mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset-&gt;tasks, as
well as tasks queued on cset-&gt;mg_tasks, can be skipped by iteration.

Fix this by calling css_set_skip_task_iters() before unlinking
task-&gt;cg_list from cset-&gt;tasks. This advances all active iterators to
the next task on cset-&gt;tasks, so iteration continues correctly even
when a task is concurrently being migrated.
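
The ordering fix can be sketched as follows. This is an illustrative
fragment only: surrounding code in cgroup_migrate_add_task() is elided,
and the exact call-site layout is assumed from the description above.

```c
/* Sketch (not verbatim): skip racing iterators while @task is still
 * linked on cset-&gt;tasks, so each it-&gt;task_pos advances along the
 * tasks list; only then move the task onto cset-&gt;mg_tasks. */
css_set_skip_task_iters(cset, task);
list_move_tail(&amp;task-&gt;cg_list, &amp;cset-&gt;mg_tasks);
```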

This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:

  1) Spawn three long-running tasks (PIDs 101, 102, 103).
  2) Create a test cgroup and move the tasks into it.
  3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
  4) In one shell, read cgroup.procs from the test cgroup.
  5) Within the delay window, in another shell migrate PID 102 by
     writing it to a different cgroup.procs file.

Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.

Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset-&gt;tasks. Iterators may
also start after the task has been moved to cset-&gt;mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.

The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
compared to the correctness fix provided here.

Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Qingye Zhao &lt;zhaoqingye@honor.com&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: Fix kernfs_node UAF in css_free_rwork_fn</title>
<updated>2026-02-06T15:55:49+00:00</updated>
<author>
<name>T.J. Mercier</name>
<email>tjmercier@google.com</email>
</author>
<published>2026-01-29T19:10:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=993047031c9fcb7f4a58389692e844f7be423cc7'/>
<id>993047031c9fcb7f4a58389692e844f7be423cc7</id>
<content type='text'>
This fix patch is not upstream, and is applicable only to kernels 6.10
(where the cgroup_rstat_lock tracepoint was added) through 6.15, after
which commit 5da3bfa029d6 ("cgroup: use separate rstat trees for each
subsystem") reordered cgroup_rstat_flush as part of a new feature
addition and inadvertently fixed this UAF.

css_free_rwork_fn first releases the last reference on the cgroup's
kernfs_node, and then calls cgroup_rstat_exit which attempts to use it
in the cgroup_rstat_lock tracepoint:

kernfs_put(cgrp-&gt;kn);
cgroup_rstat_exit
  cgroup_rstat_flush
    __cgroup_rstat_lock
      trace_cgroup_rstat_locked:
        TP_fast_assign(
          __entry-&gt;root = cgrp-&gt;root-&gt;hierarchy_id;
          __entry-&gt;id = cgroup_id(cgrp);

Where cgroup_id is:
static inline u64 cgroup_id(const struct cgroup *cgrp)
{
	return cgrp-&gt;kn-&gt;id;
}

Fix this by reordering the kernfs_put after cgroup_rstat_exit.
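
The reordering can be sketched as below; this is illustrative only,
with the rest of css_free_rwork_fn() elided.

```c
/* Sketch: tear down rstat first, while cgrp-&gt;kn is still alive
 * (the tracepoint may read cgroup_id(cgrp), i.e. cgrp-&gt;kn-&gt;id)... */
cgroup_rstat_exit(cgrp);
/* ...and only afterwards drop the last kernfs_node reference. */
kernfs_put(cgrp-&gt;kn);
```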

[78782.605161][ T9861] BUG: KASAN: slab-use-after-free in trace_event_raw_event_cgroup_rstat+0x110/0x1dc
[78782.605182][ T9861] Read of size 8 at addr ffffff890270e610 by task kworker/6:1/9861
[78782.605199][ T9861] CPU: 6 UID: 0 PID: 9861 Comm: kworker/6:1 Tainted: G        W  OE      6.12.23-android16-5-gabaf21382e8f-4k #1 0308449da8ad70d2d3649ae989c1d02f0fbf562c
[78782.605220][ T9861] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[78782.605226][ T9861] Hardware name: Qualcomm Technologies, Inc. Alor QRD + WCN7750 WLAN + Kundu PD2536F_EX (DT)
[78782.605235][ T9861] Workqueue: cgroup_destroy css_free_rwork_fn
[78782.605251][ T9861] Call trace:
[78782.605254][ T9861]  dump_backtrace+0x120/0x170
[78782.605267][ T9861]  show_stack+0x2c/0x40
[78782.605276][ T9861]  dump_stack_lvl+0x84/0xb4
[78782.605286][ T9861]  print_report+0x144/0x7a4
[78782.605301][ T9861]  kasan_report+0xe0/0x140
[78782.605315][ T9861]  __asan_load8+0x98/0xa0
[78782.605329][ T9861]  trace_event_raw_event_cgroup_rstat+0x110/0x1dc
[78782.605339][ T9861]  __traceiter_cgroup_rstat_locked+0x78/0xc4
[78782.605355][ T9861]  __cgroup_rstat_lock+0xe8/0x1dc
[78782.605368][ T9861]  cgroup_rstat_flush_locked+0x7dc/0xaec
[78782.605383][ T9861]  cgroup_rstat_flush+0x34/0x108
[78782.605396][ T9861]  cgroup_rstat_exit+0x2c/0x120
[78782.605409][ T9861]  css_free_rwork_fn+0x504/0xa18
[78782.605421][ T9861]  process_scheduled_works+0x378/0x8e0
[78782.605435][ T9861]  worker_thread+0x5a8/0x77c
[78782.605446][ T9861]  kthread+0x1c4/0x270
[78782.605455][ T9861]  ret_from_fork+0x10/0x20
[78782.605470][ T9861] Allocated by task 2864 on cpu 7 at 78781.564561s:
[78782.605481][ T9861]  kasan_save_track+0x44/0x9c
[78782.605497][ T9861]  kasan_save_alloc_info+0x40/0x54
[78782.605507][ T9861]  __kasan_slab_alloc+0x70/0x8c
[78782.605521][ T9861]  kmem_cache_alloc_noprof+0x1a0/0x428
[78782.605534][ T9861]  __kernfs_new_node+0xd4/0x3e4
[78782.605545][ T9861]  kernfs_new_node+0xbc/0x168
[78782.605554][ T9861]  kernfs_create_dir_ns+0x58/0xe8
[78782.605565][ T9861]  cgroup_mkdir+0x25c/0xc9c
[78782.605576][ T9861]  kernfs_iop_mkdir+0x130/0x214
[78782.605586][ T9861]  vfs_mkdir+0x290/0x388
[78782.605599][ T9861]  do_mkdirat+0xfc/0x27c
[78782.605612][ T9861]  __arm64_sys_mkdirat+0x5c/0x78
[78782.605625][ T9861]  invoke_syscall+0x90/0x1e8
[78782.605634][ T9861]  el0_svc_common+0x134/0x168
[78782.605643][ T9861]  do_el0_svc+0x34/0x44
[78782.605652][ T9861]  el0_svc+0x38/0x84
[78782.605667][ T9861]  el0t_64_sync_handler+0x70/0xbc
[78782.605681][ T9861]  el0t_64_sync+0x19c/0x1a0
[78782.605695][ T9861] Freed by task 69 on cpu 1 at 78782.573275s:
[78782.605705][ T9861]  kasan_save_track+0x44/0x9c
[78782.605719][ T9861]  kasan_save_free_info+0x54/0x70
[78782.605729][ T9861]  __kasan_slab_free+0x68/0x8c
[78782.605743][ T9861]  kmem_cache_free+0x118/0x488
[78782.605755][ T9861]  kernfs_free_rcu+0xa0/0xb8
[78782.605765][ T9861]  rcu_do_batch+0x324/0xaa0
[78782.605775][ T9861]  rcu_nocb_cb_kthread+0x388/0x690
[78782.605785][ T9861]  kthread+0x1c4/0x270
[78782.605794][ T9861]  ret_from_fork+0x10/0x20
[78782.605809][ T9861] Last potentially related work creation:
[78782.605814][ T9861]  kasan_save_stack+0x40/0x70
[78782.605829][ T9861]  __kasan_record_aux_stack+0xb0/0xcc
[78782.605839][ T9861]  kasan_record_aux_stack_noalloc+0x14/0x24
[78782.605849][ T9861]  __call_rcu_common+0x54/0x390
[78782.605863][ T9861]  call_rcu+0x18/0x28
[78782.605875][ T9861]  kernfs_put+0x17c/0x28c
[78782.605884][ T9861]  css_free_rwork_fn+0x4f4/0xa18
[78782.605897][ T9861]  process_scheduled_works+0x378/0x8e0
[78782.605910][ T9861]  worker_thread+0x5a8/0x77c
[78782.605923][ T9861]  kthread+0x1c4/0x270
[78782.605932][ T9861]  ret_from_fork+0x10/0x20
[78782.605947][ T9861] The buggy address belongs to the object at ffffff890270e5b0
[78782.605947][ T9861]  which belongs to the cache kernfs_node_cache of size 144
[78782.605957][ T9861] The buggy address is located 96 bytes inside of
[78782.605957][ T9861]  freed 144-byte region [ffffff890270e5b0, ffffff890270e640)

Fixes: fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints")
Signed-off-by: T.J. Mercier &lt;tjmercier@google.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This fix patch is not upstream, and is applicable only to kernels 6.10
(where the cgroup_rstat_lock tracepoint was added) through 6.15, after
which commit 5da3bfa029d6 ("cgroup: use separate rstat trees for each
subsystem") reordered cgroup_rstat_flush as part of a new feature
addition and inadvertently fixed this UAF.

css_free_rwork_fn first releases the last reference on the cgroup's
kernfs_node, and then calls cgroup_rstat_exit which attempts to use it
in the cgroup_rstat_lock tracepoint:

kernfs_put(cgrp-&gt;kn);
cgroup_rstat_exit
  cgroup_rstat_flush
    __cgroup_rstat_lock
      trace_cgroup_rstat_locked:
        TP_fast_assign(
          __entry-&gt;root = cgrp-&gt;root-&gt;hierarchy_id;
          __entry-&gt;id = cgroup_id(cgrp);

Where cgroup_id is:
static inline u64 cgroup_id(const struct cgroup *cgrp)
{
	return cgrp-&gt;kn-&gt;id;
}

Fix this by reordering the kernfs_put after cgroup_rstat_exit.
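
The reordering can be sketched as below; this is illustrative only,
with the rest of css_free_rwork_fn() elided.

```c
/* Sketch: tear down rstat first, while cgrp-&gt;kn is still alive
 * (the tracepoint may read cgroup_id(cgrp), i.e. cgrp-&gt;kn-&gt;id)... */
cgroup_rstat_exit(cgrp);
/* ...and only afterwards drop the last kernfs_node reference. */
kernfs_put(cgrp-&gt;kn);
```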

[78782.605161][ T9861] BUG: KASAN: slab-use-after-free in trace_event_raw_event_cgroup_rstat+0x110/0x1dc
[78782.605182][ T9861] Read of size 8 at addr ffffff890270e610 by task kworker/6:1/9861
[78782.605199][ T9861] CPU: 6 UID: 0 PID: 9861 Comm: kworker/6:1 Tainted: G        W  OE      6.12.23-android16-5-gabaf21382e8f-4k #1 0308449da8ad70d2d3649ae989c1d02f0fbf562c
[78782.605220][ T9861] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[78782.605226][ T9861] Hardware name: Qualcomm Technologies, Inc. Alor QRD + WCN7750 WLAN + Kundu PD2536F_EX (DT)
[78782.605235][ T9861] Workqueue: cgroup_destroy css_free_rwork_fn
[78782.605251][ T9861] Call trace:
[78782.605254][ T9861]  dump_backtrace+0x120/0x170
[78782.605267][ T9861]  show_stack+0x2c/0x40
[78782.605276][ T9861]  dump_stack_lvl+0x84/0xb4
[78782.605286][ T9861]  print_report+0x144/0x7a4
[78782.605301][ T9861]  kasan_report+0xe0/0x140
[78782.605315][ T9861]  __asan_load8+0x98/0xa0
[78782.605329][ T9861]  trace_event_raw_event_cgroup_rstat+0x110/0x1dc
[78782.605339][ T9861]  __traceiter_cgroup_rstat_locked+0x78/0xc4
[78782.605355][ T9861]  __cgroup_rstat_lock+0xe8/0x1dc
[78782.605368][ T9861]  cgroup_rstat_flush_locked+0x7dc/0xaec
[78782.605383][ T9861]  cgroup_rstat_flush+0x34/0x108
[78782.605396][ T9861]  cgroup_rstat_exit+0x2c/0x120
[78782.605409][ T9861]  css_free_rwork_fn+0x504/0xa18
[78782.605421][ T9861]  process_scheduled_works+0x378/0x8e0
[78782.605435][ T9861]  worker_thread+0x5a8/0x77c
[78782.605446][ T9861]  kthread+0x1c4/0x270
[78782.605455][ T9861]  ret_from_fork+0x10/0x20
[78782.605470][ T9861] Allocated by task 2864 on cpu 7 at 78781.564561s:
[78782.605481][ T9861]  kasan_save_track+0x44/0x9c
[78782.605497][ T9861]  kasan_save_alloc_info+0x40/0x54
[78782.605507][ T9861]  __kasan_slab_alloc+0x70/0x8c
[78782.605521][ T9861]  kmem_cache_alloc_noprof+0x1a0/0x428
[78782.605534][ T9861]  __kernfs_new_node+0xd4/0x3e4
[78782.605545][ T9861]  kernfs_new_node+0xbc/0x168
[78782.605554][ T9861]  kernfs_create_dir_ns+0x58/0xe8
[78782.605565][ T9861]  cgroup_mkdir+0x25c/0xc9c
[78782.605576][ T9861]  kernfs_iop_mkdir+0x130/0x214
[78782.605586][ T9861]  vfs_mkdir+0x290/0x388
[78782.605599][ T9861]  do_mkdirat+0xfc/0x27c
[78782.605612][ T9861]  __arm64_sys_mkdirat+0x5c/0x78
[78782.605625][ T9861]  invoke_syscall+0x90/0x1e8
[78782.605634][ T9861]  el0_svc_common+0x134/0x168
[78782.605643][ T9861]  do_el0_svc+0x34/0x44
[78782.605652][ T9861]  el0_svc+0x38/0x84
[78782.605667][ T9861]  el0t_64_sync_handler+0x70/0xbc
[78782.605681][ T9861]  el0t_64_sync+0x19c/0x1a0
[78782.605695][ T9861] Freed by task 69 on cpu 1 at 78782.573275s:
[78782.605705][ T9861]  kasan_save_track+0x44/0x9c
[78782.605719][ T9861]  kasan_save_free_info+0x54/0x70
[78782.605729][ T9861]  __kasan_slab_free+0x68/0x8c
[78782.605743][ T9861]  kmem_cache_free+0x118/0x488
[78782.605755][ T9861]  kernfs_free_rcu+0xa0/0xb8
[78782.605765][ T9861]  rcu_do_batch+0x324/0xaa0
[78782.605775][ T9861]  rcu_nocb_cb_kthread+0x388/0x690
[78782.605785][ T9861]  kthread+0x1c4/0x270
[78782.605794][ T9861]  ret_from_fork+0x10/0x20
[78782.605809][ T9861] Last potentially related work creation:
[78782.605814][ T9861]  kasan_save_stack+0x40/0x70
[78782.605829][ T9861]  __kasan_record_aux_stack+0xb0/0xcc
[78782.605839][ T9861]  kasan_record_aux_stack_noalloc+0x14/0x24
[78782.605849][ T9861]  __call_rcu_common+0x54/0x390
[78782.605863][ T9861]  call_rcu+0x18/0x28
[78782.605875][ T9861]  kernfs_put+0x17c/0x28c
[78782.605884][ T9861]  css_free_rwork_fn+0x4f4/0xa18
[78782.605897][ T9861]  process_scheduled_works+0x378/0x8e0
[78782.605910][ T9861]  worker_thread+0x5a8/0x77c
[78782.605923][ T9861]  kthread+0x1c4/0x270
[78782.605932][ T9861]  ret_from_fork+0x10/0x20
[78782.605947][ T9861] The buggy address belongs to the object at ffffff890270e5b0
[78782.605947][ T9861]  which belongs to the cache kernfs_node_cache of size 144
[78782.605957][ T9861] The buggy address is located 96 bytes inside of
[78782.605957][ T9861]  freed 144-byte region [ffffff890270e5b0, ffffff890270e640)

Fixes: fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints")
Signed-off-by: T.J. Mercier &lt;tjmercier@google.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Do not limit bpf_cgroup_from_id to current's namespace</title>
<updated>2025-11-13T20:34:06+00:00</updated>
<author>
<name>Kumar Kartikeya Dwivedi</name>
<email>memxor@gmail.com</email>
</author>
<published>2025-09-15T03:26:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=dea5e008d5c7236bbddebf1339bf4d0153144928'/>
<id>dea5e008d5c7236bbddebf1339bf4d0153144928</id>
<content type='text'>
[ Upstream commit 2c895133950646f45e5cf3900b168c952c8dbee8 ]

The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called in
many contexts where the current task is arbitrary. A recent example was
its use in sched_ext's ops.tick(), to obtain the root cgroup pointer.
Since the current task can be whatever user space task happened to be
preempted by the timer tick, this makes the behavior of the helper
unreliable.

Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.

There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.
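
The shape of the refactor can be sketched as below. Function bodies are
elided and the wrapper's namespace check is paraphrased from the
description, so treat the details as assumptions rather than the exact
upstream code.

```c
/* Sketch: raw lookup with no namespace restriction; this is what
 * bpf_cgroup_from_id now uses. */
struct cgroup *__cgroup_get_from_id(u64 id);

/* The original entry point becomes a wrapper that keeps the
 * current-task cgroup namespace check for its other callers. */
struct cgroup *cgroup_get_from_id(u64 id)
{
	struct cgroup *cgrp = __cgroup_get_from_id(id);
	/* ...reject cgrp if not visible from current's cgroup namespace... */
	return cgrp;
}
```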

Reported-by: Dan Schatzberg &lt;dschatzberg@meta.com&gt;
Signed-off-by: Kumar Kartikeya Dwivedi &lt;memxor@gmail.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 2c895133950646f45e5cf3900b168c952c8dbee8 ]

The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called in
many contexts where the current task is arbitrary. A recent example was
its use in sched_ext's ops.tick(), to obtain the root cgroup pointer.
Since the current task can be whatever user space task happened to be
preempted by the timer tick, this makes the behavior of the helper
unreliable.

Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.

There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.
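
The shape of the refactor can be sketched as below. Function bodies are
elided and the wrapper's namespace check is paraphrased from the
description, so treat the details as assumptions rather than the exact
upstream code.

```c
/* Sketch: raw lookup with no namespace restriction; this is what
 * bpf_cgroup_from_id now uses. */
struct cgroup *__cgroup_get_from_id(u64 id);

/* The original entry point becomes a wrapper that keeps the
 * current-task cgroup namespace check for its other callers. */
struct cgroup *cgroup_get_from_id(u64 id)
{
	struct cgroup *cgrp = __cgroup_get_from_id(id);
	/* ...reject cgrp if not visible from current's cgroup namespace... */
	return cgrp;
}
```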

Reported-by: Dan Schatzberg &lt;dschatzberg@meta.com&gt;
Signed-off-by: Kumar Kartikeya Dwivedi &lt;memxor@gmail.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: split cgroup_destroy_wq into 3 workqueues</title>
<updated>2025-09-25T09:13:42+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2025-08-19T01:07:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=ded4d207a3209a834b6831ceec7f39b934c74802'/>
<id>ded4d207a3209a834b6831ceec7f39b934c74802</id>
<content type='text'>
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
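
The split can be sketched as below; the workqueue flags and max_active
values are assumptions for illustration, and only the three-queue
layout follows the description above.

```c
/* Sketch: one workqueue per destruction stage, so a root-destruction
 * work item waiting in one queue can no longer block the offline work
 * it depends on, which now runs from a different queue. */
cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
cgroup_free_wq    = alloc_workqueue("cgroup_free", 0, 1);
```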

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie &lt;gaoyingjie@uniontech.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
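
The split can be sketched as below; the workqueue flags and max_active
values are assumptions for illustration, and only the three-queue
layout follows the description above.

```c
/* Sketch: one workqueue per destruction stage, so a root-destruction
 * work item waiting in one queue can no longer block the offline work
 * it depends on, which now runs from a different queue. */
cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
cgroup_free_wq    = alloc_workqueue("cgroup_free", 0, 1);
```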

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie &lt;gaoyingjie@uniontech.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: Fix compilation issue due to cgroup_mutex not being exported</title>
<updated>2025-05-29T09:01:59+00:00</updated>
<author>
<name>gaoxu</name>
<email>gaoxu2@honor.com</email>
</author>
<published>2025-04-17T07:30:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=97edaa0ec64c5354e7dd794560201b1c98761d9c'/>
<id>97edaa0ec64c5354e7dd794560201b1c98761d9c</id>
<content type='text'>
[ Upstream commit 87c259a7a359e73e6c52c68fcbec79988999b4e6 ]

When adding a folio_memcg() call in the zram module for
Android16-6.12, the following error occurs during compilation:
ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined!

This error is caused by the indirect call to lockdep_is_held(&amp;cgroup_mutex)
within folio_memcg. The export setting for cgroup_mutex is controlled by
the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while
CONFIG_PROVE_RCU is not, this compilation error will occur.

To resolve this issue, add a parallel CONFIG_LOCKDEP condition so that
cgroup_mutex is properly exported when needed.
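
The fix amounts to widening the preprocessor condition guarding the
export; a sketch (the exact upstream form may differ):

```c
#if defined(CONFIG_PROVE_RCU) || defined(CONFIG_LOCKDEP)
EXPORT_SYMBOL_GPL(cgroup_mutex);
#endif
```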

Signed-off-by: gao xu &lt;gaoxu2@honor.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 87c259a7a359e73e6c52c68fcbec79988999b4e6 ]

When adding a folio_memcg() call in the zram module for
Android16-6.12, the following error occurs during compilation:
ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined!

This error is caused by the indirect call to lockdep_is_held(&amp;cgroup_mutex)
within folio_memcg. The export setting for cgroup_mutex is controlled by
the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while
CONFIG_PROVE_RCU is not, this compilation error will occur.

To resolve this issue, add a parallel CONFIG_LOCKDEP condition so that
cgroup_mutex is properly exported when needed.
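
The fix amounts to widening the preprocessor condition guarding the
export; a sketch (the exact upstream form may differ):

```c
#if defined(CONFIG_PROVE_RCU) || defined(CONFIG_LOCKDEP)
EXPORT_SYMBOL_GPL(cgroup_mutex);
#endif
```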

Signed-off-by: gao xu &lt;gaoxu2@honor.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup/cpuset-v1: Add missing support for cpuset_v2_mode</title>
<updated>2025-05-02T05:59:00+00:00</updated>
<author>
<name>T.J. Mercier</name>
<email>tjmercier@google.com</email>
</author>
<published>2025-04-16T21:17:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=5d92e582d16215b67680b1cb90909f9c6901981c'/>
<id>5d92e582d16215b67680b1cb90909f9c6901981c</id>
<content type='text'>
[ Upstream commit 1bf67c8fdbda21fadd564a12dbe2b13c1ea5eda7 ]

Android has mounted the v1 cpuset controller using filesystem type
"cpuset" (not "cgroup") since 2015 [1], and depends on the resulting
behavior where the controller name is not added as a prefix for cgroupfs
files. [2]

Later, a problem was discovered where cpu hotplug onlining did not
affect the cpuset/cpus files, which Android carried an out-of-tree patch
to address for a while. An attempt was made to upstream this patch, but
the recommendation was to use the "cpuset_v2_mode" mount option
instead. [3]

An effort was made to do so, but this fails with "cgroup: Unknown
parameter 'cpuset_v2_mode'" because commit e1cba4b85daa ("cgroup: Add
mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not
update the special cased cpuset_mount(), and only the cgroup (v1)
filesystem type was updated.

Add parameter parsing to the cpuset filesystem type so that
cpuset_v2_mode works like the cgroup filesystem type:

$ mkdir /dev/cpuset
$ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset
$ mount|grep cpuset
none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent)

[1] https://cs.android.com/android/_/android/platform/system/core/+/b769c8d24fd7be96f8968aa4c80b669525b930d3
[2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192
[3] https://lore.kernel.org/lkml/f795f8be-a184-408a-0b5a-553d26061385@redhat.com/T/

Fixes: e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup")
Signed-off-by: T.J. Mercier &lt;tjmercier@google.com&gt;
Acked-by: Waiman Long &lt;longman@redhat.com&gt;
Reviewed-by: Kamalesh Babulal &lt;kamalesh.babulal@oracle.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 1bf67c8fdbda21fadd564a12dbe2b13c1ea5eda7 ]

Android has mounted the v1 cpuset controller using filesystem type
"cpuset" (not "cgroup") since 2015 [1], and depends on the resulting
behavior where the controller name is not added as a prefix for cgroupfs
files. [2]

Later, a problem was discovered where cpu hotplug onlining did not
affect the cpuset/cpus files, for which Android carried an out-of-tree
patch for a while. An attempt was made to upstream this patch, but
the recommendation was to use the "cpuset_v2_mode" mount option
instead. [3]

An effort was made to do so, but this fails with "cgroup: Unknown
parameter 'cpuset_v2_mode'" because commit e1cba4b85daa ("cgroup: Add
mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not
update the special-cased cpuset_mount(); only the cgroup (v1)
filesystem type was updated.

Add parameter parsing to the cpuset filesystem type so that
cpuset_v2_mode works like the cgroup filesystem type:

$ mkdir /dev/cpuset
$ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset
$ mount|grep cpuset
none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent)

[1] https://cs.android.com/android/_/android/platform/system/core/+/b769c8d24fd7be96f8968aa4c80b669525b930d3
[2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192
[3] https://lore.kernel.org/lkml/f795f8be-a184-408a-0b5a-553d26061385@redhat.com/T/

Fixes: e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup")
Signed-off-by: T.J. Mercier &lt;tjmercier@google.com&gt;
Acked-by: Waiman Long &lt;longman@redhat.com&gt;
Reviewed-by: Kamalesh Babulal &lt;kamalesh.babulal@oracle.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup/cpuset: Fix race between newly created partition and dying one</title>
<updated>2025-04-20T08:15:04+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2025-03-30T21:52:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=cdb6e724e7c5713d13c5ad3340e9d71c3dd8c9fb'/>
<id>cdb6e724e7c5713d13c5ad3340e9d71c3dd8c9fb</id>
<content type='text'>
[ Upstream commit a22b3d54de94f82ca057cc2ebf9496fa91ebf698 ]

There is a possible race between removing a cgroup directory that is
a partition root and the creation of a new partition.  The partition
to be removed can be dying but still online; it does not currently
participate in checking for exclusive CPU conflicts, but the exclusive
CPUs are still there in subpartitions_cpus and isolated_cpus. These
two cpumasks are global states that affect the operation of cpuset
partitions. The exclusive CPUs in dying cpusets will only be removed
when the cpuset_css_offline() function is called after an RCU delay.

As a result, it is possible that a new partition can be created with
exclusive CPUs that overlap with those of a dying one. When that dying
partition is finally offlined, it removes those overlapping exclusive
CPUs from subpartitions_cpus and possibly isolated_cpus, resulting in an
incorrect CPU configuration.

This bug was found when a warning was triggered in
remote_partition_disable() during testing because the subpartitions_cpus
mask was empty.

One possible way to fix this is to iterate the dying cpusets as well and
avoid using the exclusive CPUs in those dying cpusets. However, this
can still cause random partition creation failures or other anomalies
due to racing. A better way to fix this race is to reset the partition
state at the moment when a cpuset is being killed.

Introduce a new css_killed() CSS function pointer and call it, if
defined, before setting CSS_DYING flag in kill_css(). Also update the
css_is_dying() helper to use the CSS_DYING flag introduced by commit
33c35aa48178 ("cgroup: Prevent kill_css() from being called more than
once") for proper synchronization.

Add a new cpuset_css_killed() function to reset the partition state of
a valid partition root if it is being killed.
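The mechanism above can be sketched as a small model (hedged: the class and attribute names here are illustrative, not the kernel's; the real hook is wired through the cgroup_subsys operations):

```python
# Model of the css_killed() hook: kill_css() invokes the optional
# callback before marking the CSS dying, so a cpuset partition root can
# release its exclusive CPUs immediately instead of after the RCU delay.
class Css:
    def __init__(self, killed_cb=None):
        self.dying = False
        self.owns_exclusive_cpus = True  # stand-in for subpartitions_cpus state
        self.killed_cb = killed_cb

def kill_css(css):
    if css.killed_cb:
        css.killed_cb(css)  # runs before the CSS_DYING flag is set
    css.dying = True

def cpuset_css_killed(css):
    # cpuset-side callback: reset partition state while being killed.
    css.owns_exclusive_cpus = False

cs = Css(killed_cb=cpuset_css_killed)
kill_css(cs)
# Exclusive CPUs are dropped at kill time, so a new partition created
# before the offline cannot overlap with the dying one.
assert cs.dying and not cs.owns_exclusive_cpus
```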

Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit a22b3d54de94f82ca057cc2ebf9496fa91ebf698 ]

There is a possible race between removing a cgroup directory that is
a partition root and the creation of a new partition.  The partition
to be removed can be dying but still online; it does not currently
participate in checking for exclusive CPU conflicts, but the exclusive
CPUs are still there in subpartitions_cpus and isolated_cpus. These
two cpumasks are global states that affect the operation of cpuset
partitions. The exclusive CPUs in dying cpusets will only be removed
when the cpuset_css_offline() function is called after an RCU delay.

As a result, it is possible that a new partition can be created with
exclusive CPUs that overlap with those of a dying one. When that dying
partition is finally offlined, it removes those overlapping exclusive
CPUs from subpartitions_cpus and possibly isolated_cpus, resulting in an
incorrect CPU configuration.

This bug was found when a warning was triggered in
remote_partition_disable() during testing because the subpartitions_cpus
mask was empty.

One possible way to fix this is to iterate the dying cpusets as well and
avoid using the exclusive CPUs in those dying cpusets. However, this
can still cause random partition creation failures or other anomalies
due to racing. A better way to fix this race is to reset the partition
state at the moment when a cpuset is being killed.

Introduce a new css_killed() CSS function pointer and call it, if
defined, before setting CSS_DYING flag in kill_css(). Also update the
css_is_dying() helper to use the CSS_DYING flag introduced by commit
33c35aa48178 ("cgroup: Prevent kill_css() from being called more than
once") for proper synchronization.

Add a new cpuset_css_killed() function to reset the partition state of
a valid partition root if it is being killed.

Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: fix race between fork and cgroup.kill</title>
<updated>2025-02-21T13:01:35+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeel.butt@linux.dev</email>
</author>
<published>2025-01-31T00:05:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=edd408444537ecdf89d719acf331cebc0c5b1351'/>
<id>edd408444537ecdf89d719acf331cebc0c5b1351</id>
<content type='text'>
commit b69bb476dee99d564d65d418e9a20acca6f32c3f upstream.

Tejun reported the following race between fork() and cgroup.kill at [1].

Tejun:
  I was looking at cgroup.kill implementation and wondering whether there
  could be a race window. So, __cgroup_kill() does the following:

   k1. Set CGRP_KILL.
   k2. Iterate tasks and deliver SIGKILL.
   k3. Clear CGRP_KILL.

  The copy_process() does the following:

   c1. Copy a bunch of stuff.
   c2. Grab siglock.
   c3. Check fatal_signal_pending().
   c4. Commit to forking.
   c5. Release siglock.
   c6. Call cgroup_post_fork() which puts the task on the css_set and tests
       CGRP_KILL.

  The intention seems to be that either a forking task gets SIGKILL and
  terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
  don't see what guarantees that k3 can't happen before c6. ie. After a
  forking task passes c5, k2 can take place and then before the forking task
  reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
  What am I missing?

This is indeed a race. One way to fix this race is by taking
cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork()
side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork()
to cgroup_post_fork(). However, that would be heavy-handed, as it adds
one more potential stall scenario for cgroup.kill, which is usually
invoked under extreme situations like memory pressure.

To fix this race, let's maintain a sequence number per cgroup which gets
incremented on each __cgroup_kill() call. On the fork() side,
cgroup_can_fork() caches the sequence number locally and rechecks it
against the cgroup's sequence number at the cgroup_post_fork() site. If
the sequence numbers mismatch, it means __cgroup_kill() has been called
and we should send SIGKILL to the newly created task.
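The sequence-number scheme can be modeled as follows (a hedged sketch; the real kernel code reads and compares the counter under the appropriate locks, and the helper names are only illustrative):

```python
# Model of the per-cgroup kill sequence number. __cgroup_kill() bumps
# it; cgroup_can_fork() snapshots it; cgroup_post_fork() rechecks it.
class Cgroup:
    def __init__(self):
        self.kill_seq = 0

def cgroup_kill(cgrp):          # k-side: runs k1..k3, bumping the counter
    cgrp.kill_seq += 1

def cgroup_can_fork(cgrp):      # c-side: snapshot before the race window
    return cgrp.kill_seq

def child_needs_sigkill(cgrp, cached_seq):
    # A mismatch means a kill landed between c5 and c6, even though the
    # CGRP_KILL flag would already have been cleared again by k3.
    return cgrp.kill_seq != cached_seq

cgrp = Cgroup()
seq = cgroup_can_fork(cgrp)     # no kill in the window: child survives
assert not child_needs_sigkill(cgrp, seq)

seq = cgroup_can_fork(cgrp)
cgroup_kill(cgrp)               # kill lands inside the window
assert child_needs_sigkill(cgrp, seq)
```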

Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b69bb476dee99d564d65d418e9a20acca6f32c3f upstream.

Tejun reported the following race between fork() and cgroup.kill at [1].

Tejun:
  I was looking at cgroup.kill implementation and wondering whether there
  could be a race window. So, __cgroup_kill() does the following:

   k1. Set CGRP_KILL.
   k2. Iterate tasks and deliver SIGKILL.
   k3. Clear CGRP_KILL.

  The copy_process() does the following:

   c1. Copy a bunch of stuff.
   c2. Grab siglock.
   c3. Check fatal_signal_pending().
   c4. Commit to forking.
   c5. Release siglock.
   c6. Call cgroup_post_fork() which puts the task on the css_set and tests
       CGRP_KILL.

  The intention seems to be that either a forking task gets SIGKILL and
  terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
  don't see what guarantees that k3 can't happen before c6. ie. After a
  forking task passes c5, k2 can take place and then before the forking task
  reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
  What am I missing?

This is indeed a race. One way to fix this race is by taking
cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork()
side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork()
to cgroup_post_fork(). However, that would be heavy-handed, as it adds
one more potential stall scenario for cgroup.kill, which is usually
invoked under extreme situations like memory pressure.

To fix this race, let's maintain a sequence number per cgroup which gets
incremented on each __cgroup_kill() call. On the fork() side,
cgroup_can_fork() caches the sequence number locally and rechecks it
against the cgroup's sequence number at the cgroup_post_fork() site. If
the sequence numbers mismatch, it means __cgroup_kill() has been called
and we should send SIGKILL to the newly created task.

Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup/bpf: only cgroup v2 can be attached by bpf programs</title>
<updated>2024-12-05T13:01:29+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2024-10-18T08:15:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=f390525e496920af49a67759e0442b5e76f3525b'/>
<id>f390525e496920af49a67759e0442b5e76f3525b</id>
<content type='text'>
[ Upstream commit 2190df6c91373fdec6db9fc07e427084f232f57e ]

Only cgroup v2 can be attached by bpf programs, so this patch ensures
that cgroup_bpf_inherit and cgroup_bpf_offline are only called for
cgroup v2. This fixes the memory leak mentioned by commit 04f8ef5643bc
("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which
has been reverted.
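The gist of the change can be modeled as a simple guard (hedged sketch; on_dfl stands in for the kernel's check that a cgroup lives on the default, i.e. v2, hierarchy):

```python
# Model: bpf setup/teardown only happens for v2 cgroups, so v1 cgroups
# never take the extra reference that previously leaked.
class Cgroup:
    def __init__(self, on_dfl):
        self.on_dfl = on_dfl     # True when on the default (v2) hierarchy
        self.bpf_online = False

def cgroup_bpf_inherit(cgrp):
    cgrp.bpf_online = True

def cgroup_bpf_offline(cgrp):
    cgrp.bpf_online = False

def cgroup_create(cgrp):
    if cgrp.on_dfl:              # only cgroup v2 gets bpf state
        cgroup_bpf_inherit(cgrp)

def cgroup_destroy(cgrp):
    if cgrp.on_dfl:              # matching teardown, v2 only
        cgroup_bpf_offline(cgrp)

v2 = Cgroup(on_dfl=True)
cgroup_create(v2)
assert v2.bpf_online
cgroup_destroy(v2)
assert not v2.bpf_online

v1 = Cgroup(on_dfl=False)        # v1: bpf hooks never engaged
cgroup_create(v1)
assert not v1.bpf_online
```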

Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 2190df6c91373fdec6db9fc07e427084f232f57e ]

Only cgroup v2 can be attached by bpf programs, so this patch ensures
that cgroup_bpf_inherit and cgroup_bpf_offline are only called for
cgroup v2. This fixes the memory leak mentioned by commit 04f8ef5643bc
("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which
has been reverted.

Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline"</title>
<updated>2024-12-05T13:01:29+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2024-10-18T08:15:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=dd6ade970de9378ad86ef027c1a6b9ef15e2d393'/>
<id>dd6ade970de9378ad86ef027c1a6b9ef15e2d393</id>
<content type='text'>
[ Upstream commit feb301c60970bd2a1310a53ce2d6e4375397a51b ]

This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba.

Only cgroup v2 can be attached by BPF programs. Revert this commit so
that cgroup_bpf_inherit and cgroup_bpf_offline are no longer called in
cgroup v1. The memory leak issue will be fixed by the next patch.

Fixes: 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit feb301c60970bd2a1310a53ce2d6e4375397a51b ]

This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba.

Only cgroup v2 can be attached by BPF programs. Revert this commit so
that cgroup_bpf_inherit and cgroup_bpf_offline are no longer called in
cgroup v1. The memory leak issue will be fixed by the next patch.

Fixes: 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
