<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/sched/rt.c, branch v6.12.80</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>sched/rt: Skip currently executing CPU in rto_next_cpu()</title>
<updated>2026-03-04T12:19:40+00:00</updated>
<author>
<name>Chen Jinghuang</name>
<email>chenjinghuang2@huawei.com</email>
</author>
<published>2026-01-22T01:25:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=a6a73403733e86748421f2eeaf028c85683ef896'/>
<id>a6a73403733e86748421f2eeaf028c85683ef896</id>
<content type='text'>
[ Upstream commit 94894c9c477e53bcea052e075c53f89df3d2a33e ]

CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
RT task, and a CFS task stuck in kernel space. When other CPUs switch from
RT to non-RT tasks, RT load balancing (LB) is triggered; with
HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution
of rto_push_irq_work_func. During push_rt_task on CPU0,
if next_task-&gt;prio &lt; rq-&gt;donor-&gt;prio, resched_curr() sets NEED_RESCHED,
and after the push operation completes, CPU0 calls rto_next_cpu().
Since only CPU0 is overloaded in this scenario, rto_next_cpu() should
ideally return -1 (no further IPI needed).

However, multiple CPUs invoke tell_cpu_to_push() during LB, each
incrementing rd-&gt;rto_loop_next. Even when rd-&gt;rto_cpu is set to -1, the
mismatch between rd-&gt;rto_loop and rd-&gt;rto_loop_next forces rto_next_cpu()
to restart its search from -1. With CPU0 remaining overloaded (satisfying
rt_nr_migratory &amp;&amp; rt_nr_total &gt; 1), it gets reselected, causing CPU0 to
queue irq_work to itself and send self-IPIs repeatedly. As long as CPU0
stays overloaded and other CPUs run pull_rt_task(), CPU0 falls into an
infinite self-IPI loop, which triggers a CPU hardlockup due to the
continuous self-interrupts.

The triggering scenario is as follows:

         cpu0                      cpu1                    cpu2
                                pull_rt_task
                              tell_cpu_to_push
                 &lt;------------irq_work_queue_on
rto_push_irq_work_func
       push_rt_task
    resched_curr(rq)                                   pull_rt_task
    rto_next_cpu                                     tell_cpu_to_push
                      &lt;-------------------------- atomic_inc(rto_loop_next)
rd-&gt;rto_loop != next
     rto_next_cpu
   irq_work_queue_on
rto_push_irq_work_func

Fix the redundant self-IPIs by filtering out the initiating CPU in
rto_next_cpu(), as sketched below. This fix has been verified to
eliminate the spurious self-IPIs and prevent the CPU hardlockup.
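
A minimal sketch of the idea, following the structure of the pre-existing
rto_next_cpu(); this illustrates the filter and is not the literal patch:

  /* Sketch: skip the CPU that is executing this IRQ work. */
  static int rto_next_cpu(struct root_domain *rd)
  {
          int next;
          int cpu;

          for (;;) {
                  /* When rto_cpu is -1 this acts like cpumask_first(). */
                  cpu = cpumask_next(rd-&gt;rto_cpu, rd-&gt;rto_mask);

                  /*
                   * The initiating CPU has just done its own push work;
                   * queueing irq_work to it again would only restart the
                   * self-IPI loop described above, so skip over it.
                   */
                  if (cpu == smp_processor_id())
                          cpu = cpumask_next(cpu, rd-&gt;rto_mask);

                  rd-&gt;rto_cpu = cpu;

                  if (cpu &lt; nr_cpu_ids)
                          return cpu;

                  rd-&gt;rto_cpu = -1;

                  /* Pairs with the atomic_inc() in tell_cpu_to_push(). */
                  next = atomic_read_acquire(&amp;rd-&gt;rto_loop_next);
                  if (rd-&gt;rto_loop == next)
                          break;

                  rd-&gt;rto_loop = next;
          }

          return -1;
  }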

Fixes: 4bdced5c9a29 ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt (Google) &lt;rostedt@goodmis.org&gt;
Suggested-by: K Prateek Nayak &lt;kprateek.nayak@amd.com&gt;
Signed-off-by: Chen Jinghuang &lt;chenjinghuang2@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Steven Rostedt (Google) &lt;rostedt@goodmis.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://patch.msgid.link/20260122012533.673768-1-chenjinghuang2@huawei.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/rt: Fix race in push_rt_task</title>
<updated>2026-01-08T09:14:27+00:00</updated>
<author>
<name>Harshit Agarwal</name>
<email>harshit@nutanix.com</email>
</author>
<published>2025-02-25T18:05:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=debfbc047196df1f6bfd52f2d028c21dce67f0de'/>
<id>debfbc047196df1f6bfd52f2d028c21dce67f0de</id>
<content type='text'>
commit 690e47d1403e90b7f2366f03b52ed3304194c793 upstream.

Overview
========
When a CPU calls push_rt_task and picks a task to push to another CPU's
runqueue, it calls the find_lock_lowest_rq method, which takes a double
lock on both CPUs' runqueues. If one of the locks isn't readily
available, this may lead to dropping the current runqueue lock and
reacquiring both locks at once. During this window it is possible that
the task has already been migrated and is running on some other CPU.
These cases are already handled. However, if the task has been migrated,
has already run, and another CPU is now trying to wake it up (ttwu) such
that it is queued again on a runqueue (on_rq is 1), and if the task last
ran on the same CPU, then the current checks pass even though the task
was migrated out and is no longer in the pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures, such as assert failures, page
faults, null pointer dereferences, and queue corruption errors, all
coming from the scheduler itself.

Some of the crashes:
-&gt; kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx &gt;= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-&gt; BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-&gt; BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq-&gt;cpu != task_cpu(p))

-&gt; kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq-&gt;nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-&gt; BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X &lt; Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) Previous task has completed execution and thus, CPU Z enters
     schedule, locks its own rq after CPU A releases it.
  b) CPU Z dequeues previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (e.g. by doing IO or waiting to acquire a
     lock), which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     p.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A failed to notice task p had been dequeued from CPU Z while
     CPU A was waiting for locks in double_lock_balance. If CPU A knew
     that task p had been dequeued, it would return NULL forcing
     push_rt_task to give up the task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Y's rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to A dequeuing p from Y's queue and various crashes down the
line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is in any case not suitable for
being pushed, as sketched below.
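
Sketched against the find_lock_lowest_rq() retry path, and assuming the
existing pick_next_pushable_task() helper (which returns the head of
rq-&gt;rt.pushable_tasks), the enhanced check looks roughly like this:

  /*
   * Sketch: in find_lock_lowest_rq(), after double_lock_balance() may
   * have dropped and reacquired rq-&gt;lock.
   */
  if (unlikely(is_migration_disabled(task) ||
               !cpumask_test_cpu(lowest_rq-&gt;cpu, &amp;task-&gt;cpus_mask) ||
               task != pick_next_pushable_task(rq))) {
          /*
           * The task is no longer at the head of this runqueue's
           * pushable list: it was migrated away or woken elsewhere
           * while the locks were dropped, so give up on pushing it.
           */
          double_unlock_balance(rq, lowest_rq);
          lowest_rq = NULL;
          break;
  }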

Testing
=======
The fix was tested on a cluster of 3 nodes, where panics due to this bug
were hit every couple of days. A fix similar to this one was deployed on
such a cluster and remained stable for more than 30 days.

Co-developed-by: Jon Kohler &lt;jon@nutanix.com&gt;
Signed-off-by: Jon Kohler &lt;jon@nutanix.com&gt;
Co-developed-by: Gauri Patwardhan &lt;gauri.patwardhan@nutanix.com&gt;
Signed-off-by: Gauri Patwardhan &lt;gauri.patwardhan@nutanix.com&gt;
Co-developed-by: Rahul Chunduru &lt;rahul.chunduru@nutanix.com&gt;
Signed-off-by: Rahul Chunduru &lt;rahul.chunduru@nutanix.com&gt;
Signed-off-by: Harshit Agarwal &lt;harshit@nutanix.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: "Steven Rostedt (Google)" &lt;rostedt@goodmis.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Tested-by: Will Ton &lt;william.ton@nutanix.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com
Signed-off-by: Rajani Kantha &lt;681739313@139.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Fix accounting after global limits change</title>
<updated>2025-08-20T16:30:31+00:00</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2025-06-27T11:51:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=1b9f54ac0bbc85ba5bc80d65a97ecdea8e273df4'/>
<id>1b9f54ac0bbc85ba5bc80d65a97ecdea8e273df4</id>
<content type='text'>
[ Upstream commit 440989c10f4e32620e9e2717ca52c3ed7ae11048 ]

A global limits change (sched_rt_handler() logic) currently leaves stale
and/or incorrect values in variables related to accounting (e.g.
extra_bw).

Properly clean up per-runqueue variables before implementing the change
and rebuild scheduling domains (so that accounting is also properly
restored) after such a change is complete.

Reported-by: Marcel Ziswiler &lt;marcel.ziswiler@codethink.co.uk&gt;
Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Marcel Ziswiler &lt;marcel.ziswiler@codethink.co.uk&gt; # nuc &amp; rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-4-juri.lelli@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: Add put_prev_task(.next)</title>
<updated>2024-09-03T13:26:32+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2024-08-13T22:25:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=b2d70222dbf2a2ff7a972a685d249a5d75afa87f'/>
<id>b2d70222dbf2a2ff7a972a685d249a5d75afa87f</id>
<content type='text'>
In order to tell the previous sched_class what the next task is, add
put_prev_task(.next).

Notably, SCX will use this hook (see the signature sketch below) to:

 1) determine whether the next task will leave the SCX sched class, and
    push the current task to another CPU if possible;
 2) gather statistics on how often, and by which other classes, it is
    preempted.
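
A sketch of the hook's assumed new shape (the added @next argument names
the task about to run next; NULL when no successor is known):

  /* kernel/sched/sched.h, inside struct sched_class (sketch): */
  void (*put_prev_task)(struct rq *rq, struct task_struct *p,
                        struct task_struct *next);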

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
</content>
</entry>
<entry>
<title>sched: Rework pick_next_task()</title>
<updated>2024-09-03T13:26:31+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2024-08-13T22:25:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=fd03c5b8585562d60f8b597b4332d28f48abfe7d'/>
<id>fd03c5b8585562d60f8b597b4332d28f48abfe7d</id>
<content type='text'>
The current rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

And many classes implement it directly as such. Change things around
to make pick_next_task() optional while also changing the definition to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

The reason is that sched_ext would like to have a 'final' call that
knows the next task. By placing put_prev_task() right next to
set_next_task() (as it already is for sched_core) this becomes
trivial.

As a bonus, this is a nice cleanup on its own.
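
In core-scheduler terms the fallback composes roughly as below, assuming
a put_prev_set_next_task() helper that pairs put_prev_task() with
set_next_task(.first = true); a sketch, not the literal core loop:

  for_each_class(class) {
          if (class-&gt;pick_next_task) {
                  /* Class provides the combined operation itself. */
                  p = class-&gt;pick_next_task(rq, prev);
                  if (p)
                          return p;
          } else {
                  /* Compose it from the three primitive operations. */
                  p = class-&gt;pick_task(rq);
                  if (p) {
                          put_prev_set_next_task(rq, prev, p);
                          return p;
                  }
          }
  }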

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
</content>
</entry>
<entry>
<title>sched: Allow sched_class::dequeue_task() to fail</title>
<updated>2024-08-17T09:06:41+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2024-04-03T07:50:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=863ccdbb918a77e3f011571f943020bf7f0b114b'/>
<id>863ccdbb918a77e3f011571f943020bf7f0b114b</id>
<content type='text'>
Change the function signature of sched_class::dequeue_task() to return
a boolean, allowing future patches to 'fail' dequeue.
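
Concretely, the hook's signature changes as below (sketch; a true return
means the task was actually dequeued, and the existing per-class
implementations simply return true for now):

  -	void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
  +	bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);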

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Tested-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
</content>
</entry>
<entry>
<title>sched/rt: Remove default bandwidth control</title>
<updated>2024-07-29T10:22:37+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2024-05-27T12:06:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d'/>
<id>5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d</id>
<content type='text'>
Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED is enabled.

Enable fair_server with parameters equivalent to RT throttling.

Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@kernel.org&gt;
Signed-off-by: "Vineeth Pillai (Google)" &lt;vineeth@bitbyteword.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
</content>
</entry>
<entry>
<title>sysctl: treewide: constify the ctl_table argument of proc_handlers</title>
<updated>2024-07-24T18:59:29+00:00</updated>
<author>
<name>Joel Granados</name>
<email>j.granados@samsung.com</email>
</author>
<published>2024-07-24T18:59:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508'/>
<id>78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508</id>
<content type='text'>
const-qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata, which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax handlers
  were adjusted.

* The ctl_table argument in proc_watchdog_common was const-qualified.
  This function is itself called from a proc_handler and calls back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration (see the example below).
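
For illustration, a representative prototype before and after the
conversion (proc_dointvec() shown):

```
- int proc_dointvec(struct ctl_table *table, int write, void *buffer,
-		  size_t *lenp, loff_t *ppos);
+ int proc_dointvec(const struct ctl_table *table, int write, void *buffer,
+		  size_t *lenp, loff_t *ppos);
```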

Co-developed-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Signed-off-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Co-developed-by: Joel Granados &lt;j.granados@samsung.com&gt;
Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
</content>
</entry>
<entry>
<title>sched: Fix spelling in comments</title>
<updated>2024-05-27T15:00:21+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2024-05-27T14:54:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=402de7fc880fef055bc984957454b532987e9ad0'/>
<id>402de7fc880fef055bc984957454b532987e9ad0</id>
<content type='text'>
Do a spell-checking pass.

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: linux-kernel@vger.kernel.org
</content>
</entry>
<entry>
<title>scheduler: Remove the now superfluous sentinel elements from ctl_table array</title>
<updated>2024-04-24T07:43:54+00:00</updated>
<author>
<name>Joel Granados</name>
<email>j.granados@samsung.com</email>
</author>
<published>2023-06-27T13:30:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=f532376e881fb370bc2901a9e1cf0b7e461638ad'/>
<id>f532376e881fb370bc2901a9e1cf0b7e461638ad</id>
<content type='text'>
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which will
reduce the overall build-time size of the kernel and its run-time memory
bloat by ~64 bytes per sentinel (for further information, see:
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove the sentinel element from the ctl_table arrays, as sketched below.
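
For kernel/sched/rt.c this amounts to dropping the empty terminating
entry from sched_rt_sysctls[] (a sketch, with field values abridged):

   static struct ctl_table sched_rt_sysctls[] = {
   	{ .procname = "sched_rt_period_us",    /* ... */ },
   	{ .procname = "sched_rt_runtime_us",   /* ... */ },
   	{ .procname = "sched_rr_timeslice_ms", /* ... */ },
  -	{}	/* empty sentinel, no longer needed */
   };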

Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
</content>
</entry>
</feed>
