linux.git/kernel/sched, branch v6.12.81

sched/fair: Fix zero_vruntime tracking fix

2026-04-11T12:24:39+00:00

[ Upstream commit 1319ea57529e131822bab56bf417c8edc2db9ae8 ]

John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
zero_vruntime tracking").

The combination of yield and that commit was specific enough to
hypothesize the following scenario:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must be in
between these two entities.

Therefore, the runnable task will be eligible, and be promoted a full
slice (all the tasks do is yield after all). This causes it to jump over
the other task and now the other task is eligible and current is no
longer. So we schedule.

Since we are runnable, there is no {de,en}queue. All we have is the
__{en,de}queue_entity() from {put_prev,set_next}_task(). But per the
fingered commit, those two no longer move zero_vruntime.

All that moves zero_vruntime are tick and full {de,en}queue.

This means, that if the two tasks playing leapfrog can reach the
critical speed to reach the overflow point inside one tick's worth of
time, we're up a creek.

Additionally, when multiple cgroups are involved, there is no guarantee
the tick will in fact hit every cgroup in a timely manner. Statistically
speaking it will, but that same statistics does not rule out the
possibility of one cgroup not getting a tick for a significant amount of
time -- however unlikely.

Therefore, just like with the yield() case, force an update at the end
of every slice. This ensures the update is never more than a single
slice behind and the whole thing is within 2 lag bounds as per the
comment on entity_key().

Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Vincent Guittot 
Tested-by: K Prateek Nayak 
Tested-by: John Stultz 
Link: https://patch.msgid.link/20260401132355.081530332@infradead.org
Signed-off-by: Sasha Levin

sched/fair: Use protect_slice() instead of direct comparison

2026-04-11T12:24:39+00:00

[ Upstream commit 9cdb4fe20cd239c848b5c3f5753d83a9443ba329 ]

Replace the test by the relevant protect_slice() function.

Signed-off-by: Vincent Guittot 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dhaval Giani (AMD) 
Link: https://lkml.kernel.org/r/20250708165630.1948751-2-vincent.guittot@linaro.org
Stable-dep-of: 1319ea57529e ("sched/fair: Fix zero_vruntime tracking fix")
Signed-off-by: Sasha Levin

sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update

2026-04-02T11:09:29+00:00

[ Upstream commit 7a8464555d2e5f038758bb19e72ab4710b79e9cd ]

bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding
any lock, making dsq->seq a lock-free concurrently accessed variable.
However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain
increment without the matching WRITE_ONCE() on the write side:

    dsq->seq++;
    ^^^^^^^^^^^
    plain write -- KCSAN data race

The KCSAN documentation requires that if one accessor uses READ_ONCE()
or WRITE_ONCE() on a variable to annotate lock-free access, all other
accesses must also use the appropriate accessor. A plain write leaves
the pair incomplete and will trigger KCSAN warnings.

Fix by using WRITE_ONCE() for the write side of the update:

    WRITE_ONCE(dsq->seq, dsq->seq + 1);

This is consistent with bpf_iter_scx_dsq_new() and makes the
concurrent access annotation complete and KCSAN-clean.

Signed-off-by: zhidao su 
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

sched: idle: Consolidate the handling of two special cases

2026-03-25T10:08:53+00:00

[ Upstream commit f4c31b07b136839e0fb3026f8a5b6543e3b14d2f ]

There are two special cases in the idle loop that are handled
inconsistently even though they are analogous.

The first one is when a cpuidle driver is absent and the default CPU
idle time power management implemented by the architecture code is used.
In that case, the scheduler tick is stopped every time before invoking
default_idle_call().

The second one is when a cpuidle driver is present, but there is only
one idle state in its table.  In that case, the scheduler tick is never
stopped at all.

Since each of these approaches has its drawbacks, reconcile them with
the help of one simple heuristic.  Namely, stop the tick if the CPU has
been woken up by it in the previous iteration of the idle loop, or let
it tick otherwise.

Signed-off-by: Rafael J. Wysocki 
Reviewed-by: Christian Loehle 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Qais Yousef 
Reviewed-by: Aboorva Devarajan 
Fixes: ed98c3491998 ("sched: idle: Do not stop the tick before cpuidle_idle_call()")
[ rjw: Added Fixes tag, changelog edits ]
Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Sasha Levin

sched/fair: Fix zero_vruntime tracking

2026-03-25T10:08:46+00:00

commit b3d99f43c72b56cf7a104a364e7fb34b0702828b upstream.

It turns out that zero_vruntime tracking is broken when there is but a single
task running. Current update paths are through __{en,de}queue_entity(), and
when there is but a single task, pick_next_task() will always return that one
task, and put_prev_set_next_task() will end up in neither function.

This can cause entity_key() to grow indefinitely large and cause overflows,
leading to much pain and suffering.

Furtermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
are called from {set_next,put_prev}_entity() has problems because:

 - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
   This means the avg_vruntime() will see the removal but not current, missing
   the entity for accounting.

 - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
   NULL. This means the avg_vruntime() will see the addition *and* current,
   leading to double accounting.

Both cases are incorrect/inconsistent.

Noting that avg_vruntime is already called on each {en,de}queue, remove the
explicit avg_vruntime() calls (which removes an extra 64bit division for each
{en,de}queue) and have avg_vruntime() update zero_vruntime itself.

Additionally, have the tick call avg_vruntime() -- discarding the result, but
for the side-effect of updating zero_vruntime.

While there, optimize avg_vruntime() by noting that the average of one value is
rather trivial to compute.

Test case:
  # taskset -c -p 1 $$
  # taskset -c 2 bash -c 'while :; do :; done&'
  # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"

PRE:
    .zero_vruntime                 : 31316.407903
  >R            bash   487     50787.345112   E       50789.145972           2.800000     50780.298364        16     120         0.000000         0.000000         0.000000        /
    .zero_vruntime                 : 382548.253179
  >R            bash   487    427275.204288   E      427276.003584           2.800000    427268.157540        23     120         0.000000         0.000000         0.000000        /

POST:
    .zero_vruntime                 : 17259.709467
  >R            bash   526     17259.709467   E       17262.509467           2.800000     16915.031624         9     120         0.000000         0.000000         0.000000        /
    .zero_vruntime                 : 18702.723356
  >R            bash   526     18702.723356   E       18705.523356           2.800000     18358.045513         9     120         0.000000         0.000000         0.000000        /

Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
Reported-by: K Prateek Nayak 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: K Prateek Nayak 
Tested-by: Shubhang Kaushik 
Link: https://patch.msgid.link/20260219080624.438854780%40infradead.org
Tested-by: Eric Hagberg 
Signed-off-by: Greg Kroah-Hartman

sched_ext: Remove redundant css_put() in scx_cgroup_init()

2026-03-25T10:08:46+00:00

commit 1336b579f6079fb8520be03624fcd9ba443c930b upstream.

The iterator css_for_each_descendant_pre() walks the cgroup hierarchy
under cgroup_lock(). It does not increment the reference counts on
yielded css structs.

According to the cgroup documentation, css_put() should only be used
to release a reference obtained via css_get() or css_tryget_online().
Since the iterator does not use either of these to acquire a reference,
calling css_put() in the error path of scx_cgroup_init() causes a
refcount underflow.

Remove the unbalanced css_put() to prevent a potential Use-After-Free
(UAF) vulnerability.

Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Cheng-Yang Chou 
Reviewed-by: Andrea Righi 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

sched_ext: Fix starvation of scx_enable() under fair-class saturation

2026-03-25T10:08:42+00:00

[ Upstream commit b06ccbabe2506fd70b9167a644978b049150224a ]

During scx_enable(), the READY -> ENABLED task switching loop changes the
calling thread's sched_class from fair to ext. Since fair has higher
priority than ext, saturating fair-class workloads can indefinitely starve
the enable thread, hanging the system. This was introduced when the enable
path switched from preempt_disable() to scx_bypass() which doesn't protect
against fair-class starvation. Note that the original preempt_disable()
protection wasn't complete either - in partial switch modes, the calling
thread could still be starved after preempt_enable() as it may have been
switched to ext class.

Fix it by offloading the enable body to a dedicated system-wide RT
(SCHED_FIFO) kthread which cannot be starved by either fair or ext class
tasks. scx_enable() lazily creates the kthread on first use and passes the
ops pointer through a struct scx_enable_cmd containing the kthread_work,
then synchronously waits for completion.

The workfn runs on a different kthread from sch->helper (which runs
disable_work), so it can safely flush disable_work on the error path
without deadlock.

Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo 
[ adapted per-scheduler scx_sched struct references to globals ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

sched_ext: Disable preemption between scx_claim_exit() and kicking helper work

2026-03-25T10:08:42+00:00

[ Upstream commit 83236b2e43dba00bee5b82eb5758816b1a674f6a ]

scx_claim_exit() atomically sets exit_kind, which prevents scx_error() from
triggering further error handling. After claiming exit, the caller must kick
the helper kthread work which initiates bypass mode and teardown.

If the calling task gets preempted between claiming exit and kicking the
helper work, and the BPF scheduler fails to schedule it back (since error
handling is now disabled), the helper work is never queued, bypass mode
never activates, tasks stop being dispatched, and the system wedges.

Disable preemption across scx_claim_exit() and the subsequent work kicking
in all callers - scx_disable() and scx_vexit(). Add
lockdep_assert_preemption_disabled() to scx_claim_exit() to enforce the
requirement.

Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi 
Signed-off-by: Tejun Heo 
[ adapted per-scheduler struct (sch->exit_kind, scx_disable, scx_vexit) to global variables (scx_exit_kind, scx_ops_disable, scx_ops_exit_kind) ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

sched: idle: Make skipping governor callbacks more consistent

2026-03-25T10:08:29+00:00

[ Upstream commit d557640e4ce589a24dca5ca7ce3b9680f471325f ]

If the cpuidle governor .select() callback is skipped because there
is only one idle state in the cpuidle driver, the .reflect() callback
should be skipped as well, at least for consistency (if not for
correctness), so do it.

Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
Signed-off-by: Rafael J. Wysocki 
Reviewed-by: Christian Loehle 
Reviewed-by: Aboorva Devarajan 
Reviewed-by: Frederic Weisbecker 
Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki
Signed-off-by: Sasha Levin

sched/debug: Fix updating of ppos on server write ops

2026-03-04T12:20:58+00:00

[ Upstream commit 6080fb211672aec6ce8f2f5a2e0b4eae736f2027 ]

Updating "ppos" on error conditions does not make much sense. The pattern
is to return the error code directly without modifying the position, or
modify the position on success and return the number of bytes written.

Since on success, the return value of apply is 0, there is no point in
modifying ppos either. Fix it by removing all this and just returning
error code or number of bytes written on success.

Signed-off-by: Joel Fernandes 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Juri Lelli 
Reviewed-by: Andrea Righi 
Acked-by: Tejun Heo 
Tested-by: Christian Loehle 
Link: https://patch.msgid.link/20260126100050.3854740-3-arighi@nvidia.com
Signed-off-by: Sasha Levin