<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/locking/rtmutex.c, branch v6.12.80</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2024-09-19T13:55:58+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-19T13:55:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=2004cef11ea072838f99bd95cefa5c8e45df0847'/>
<id>2004cef11ea072838f99bd95cefa5c8e45df0847</id>
<content type='text'>
Pull scheduler updates from Ingo Molnar:

 - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
   de Oliveira's last major contribution to the kernel:

     "SCHED_DEADLINE servers can help fixing starvation issues of low
      priority tasks (e.g., SCHED_OTHER) when higher priority tasks
      monopolize CPU cycles. Today we have RT Throttling; DEADLINE
      servers should be able to replace and improve that."

   (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
   Esmat, Huang Shijie)

 - Preparatory changes for sched_ext integration:
     - Use set_next_task(.first) where required
     - Fix up set_next_task() implementations
     - Clean up DL server vs. core sched
     - Split up put_prev_task_balance()
     - Rework pick_next_task()
     - Combine the last put_prev_task() and the first set_next_task()
     - Rework dl_server
     - Add put_prev_task(.next)

   (Peter Zijlstra, with a fix by Tejun Heo)

 - Complete the EEVDF transition and refine EEVDF scheduling:
     - Implement delayed dequeue
     - Allow shorter slices to wakeup-preempt
     - Use sched_attr::sched_runtime to set request/slice suggestion
     - Document the new feature flags
     - Remove unused and duplicate-functionality fields
     - Simplify &amp; unify pick_next_task_fair()
     - Misc debuggability enhancements

   (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
   Schneider and Chuyi Zhou)

 - Initialize the vruntime of a new task when it is first enqueued,
   resulting in significant decrease in latency of newly woken tasks
   (Zhang Qiao)

 - Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
   (K Prateek Nayak, Peter Zijlstra)

 - Clean up and clarify the usage of Clean up usage of rt_task()
   (Qais Yousef)

 - Preempt SCHED_IDLE entities in strict cgroup hierarchies
   (Tianchen Ding)

 - Clarify the documentation of time units for deadline scheduler
   parameters (Christian Loehle)

 - Remove the HZ_BW chicken-bit feature flag introduced a year ago,
   the original change seems to be working fine (Phil Auld)

 - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
   Peilin He, Qais Yousefm and Vincent Guittot)

* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/cpufreq: Use NSEC_PER_MSEC for deadline task
  cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
  sched/deadline: Clarify nanoseconds in uapi
  sched/deadline: Convert schedtool example to chrt
  sched/debug: Fix the runnable tasks output
  sched: Fix sched_delayed vs sched_core
  kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
  kthread: Fix task state in kthread worker if being frozen
  sched/pelt: Use rq_clock_task() for hw_pressure
  sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
  sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
  sched: Add put_prev_task(.next)
  sched: Rework dl_server
  sched: Combine the last put_prev_task() and the first set_next_task()
  sched: Rework pick_next_task()
  sched: Split up put_prev_task_balance()
  sched: Clean up DL server vs core sched
  sched: Fixup set_next_task() implementations
  sched: Use set_next_task(.first) where required
  sched/fair: Properly deactivate sched_delayed task upon class change
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull scheduler updates from Ingo Molnar:

 - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
   de Oliveira's last major contribution to the kernel:

     "SCHED_DEADLINE servers can help fixing starvation issues of low
      priority tasks (e.g., SCHED_OTHER) when higher priority tasks
      monopolize CPU cycles. Today we have RT Throttling; DEADLINE
      servers should be able to replace and improve that."

   (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
   Esmat, Huang Shijie)

 - Preparatory changes for sched_ext integration:
     - Use set_next_task(.first) where required
     - Fix up set_next_task() implementations
     - Clean up DL server vs. core sched
     - Split up put_prev_task_balance()
     - Rework pick_next_task()
     - Combine the last put_prev_task() and the first set_next_task()
     - Rework dl_server
     - Add put_prev_task(.next)

   (Peter Zijlstra, with a fix by Tejun Heo)

 - Complete the EEVDF transition and refine EEVDF scheduling:
     - Implement delayed dequeue
     - Allow shorter slices to wakeup-preempt
     - Use sched_attr::sched_runtime to set request/slice suggestion
     - Document the new feature flags
     - Remove unused and duplicate-functionality fields
     - Simplify &amp; unify pick_next_task_fair()
     - Misc debuggability enhancements

   (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
   Schneider and Chuyi Zhou)

 - Initialize the vruntime of a new task when it is first enqueued,
   resulting in significant decrease in latency of newly woken tasks
   (Zhang Qiao)

 - Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
   (K Prateek Nayak, Peter Zijlstra)

 - Clean up and clarify the usage of Clean up usage of rt_task()
   (Qais Yousef)

 - Preempt SCHED_IDLE entities in strict cgroup hierarchies
   (Tianchen Ding)

 - Clarify the documentation of time units for deadline scheduler
   parameters (Christian Loehle)

 - Remove the HZ_BW chicken-bit feature flag introduced a year ago,
   the original change seems to be working fine (Phil Auld)

 - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
   Peilin He, Qais Yousefm and Vincent Guittot)

* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/cpufreq: Use NSEC_PER_MSEC for deadline task
  cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
  sched/deadline: Clarify nanoseconds in uapi
  sched/deadline: Convert schedtool example to chrt
  sched/debug: Fix the runnable tasks output
  sched: Fix sched_delayed vs sched_core
  kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
  kthread: Fix task state in kthread worker if being frozen
  sched/pelt: Use rq_clock_task() for hw_pressure
  sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
  sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
  sched: Add put_prev_task(.next)
  sched: Rework dl_server
  sched: Combine the last put_prev_task() and the first set_next_task()
  sched: Rework pick_next_task()
  sched: Split up put_prev_task_balance()
  sched: Clean up DL server vs core sched
  sched: Fixup set_next_task() implementations
  sched: Use set_next_task(.first) where required
  sched/fair: Properly deactivate sched_delayed task upon class change
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>rtmutex: Drop rt_mutex::wait_lock before scheduling</title>
<updated>2024-08-15T13:38:53+00:00</updated>
<author>
<name>Roland Xu</name>
<email>mu001999@outlook.com</email>
</author>
<published>2024-08-15T02:58:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=d33d26036a0274b472299d7dcdaa5fb34329f91b'/>
<id>d33d26036a0274b472299d7dcdaa5fb34329f91b</id>
<content type='text'>
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held.  In the
good case it returns with the lock held and in the deadlock case it emits a
warning and goes into an endless scheduling loop with the lock held, which
triggers the 'scheduling in atomic' warning.

Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning
and dropping into the schedule for ever loop.

[ tglx: Moved unlock before the WARN(), removed the pointless comment,
  	massaged changelog, added Fixes tag ]

Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter")
Signed-off-by: Roland Xu &lt;mu001999@outlook.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300MB0635.AUSP300.PROD.OUTLOOK.COM
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held.  In the
good case it returns with the lock held and in the deadlock case it emits a
warning and goes into an endless scheduling loop with the lock held, which
triggers the 'scheduling in atomic' warning.

Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning
and dropping into the schedule for ever loop.

[ tglx: Moved unlock before the WARN(), removed the pointless comment,
  	massaged changelog, added Fixes tag ]

Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter")
Signed-off-by: Roland Xu &lt;mu001999@outlook.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300MB0635.AUSP300.PROD.OUTLOOK.COM
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()</title>
<updated>2024-08-07T16:32:38+00:00</updated>
<author>
<name>Qais Yousef</name>
<email>qyousef@layalina.io</email>
</author>
<published>2024-06-10T19:20:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=ae04f69de0bef93c7086cf2983dbc8e8fd624ebe'/>
<id>ae04f69de0bef93c7086cf2983dbc8e8fd624ebe</id>
<content type='text'>
Some find the name realtime overloaded. Use rt_or_dl() as an
alternative, hopefully better, name.

Suggested-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Qais Yousef &lt;qyousef@layalina.io&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Some find the name realtime overloaded. Use rt_or_dl() as an
alternative, hopefully better, name.

Suggested-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Qais Yousef &lt;qyousef@layalina.io&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/rt: Clean up usage of rt_task()</title>
<updated>2024-08-07T16:32:37+00:00</updated>
<author>
<name>Qais Yousef</name>
<email>qyousef@layalina.io</email>
</author>
<published>2024-06-10T19:20:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=130fd056dd82b02db9a661c013071af35309be1a'/>
<id>130fd056dd82b02db9a661c013071af35309be1a</id>
<content type='text'>
rt_task() checks if a task has RT priority. But depends on your
dictionary, this could mean it belongs to RT class, or is a 'realtime'
task, which includes RT and DL classes.

Since this has caused some confusion already on discussion [1], it
seemed a clean up is due.

I define the usage of rt_task() to be tasks that belong to RT class.
Make sure that it returns true only for RT class and audit the users and
replace the ones required the old behavior with the new realtime_task()
which returns true for RT and DL classes. Introduce similar
realtime_prio() to create similar distinction to rt_prio() and update
the users that required the old behavior to use the new function.

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to realtime_task_policy() as the old name is
confusing against the new realtime_task().

No functional changes were intended.

[1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/

Signed-off-by: Qais Yousef &lt;qyousef@layalina.io&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Reviewed-by: "Steven Rostedt (Google)" &lt;rostedt@goodmis.org&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
rt_task() checks if a task has RT priority. But depends on your
dictionary, this could mean it belongs to RT class, or is a 'realtime'
task, which includes RT and DL classes.

Since this has caused some confusion already on discussion [1], it
seemed a clean up is due.

I define the usage of rt_task() to be tasks that belong to RT class.
Make sure that it returns true only for RT class and audit the users and
replace the ones required the old behavior with the new realtime_task()
which returns true for RT and DL classes. Introduce similar
realtime_prio() to create similar distinction to rt_prio() and update
the users that required the old behavior to use the new function.

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to realtime_task_policy() as the old name is
confusing against the new realtime_task().

No functional changes were intended.

[1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/

Signed-off-by: Qais Yousef &lt;qyousef@layalina.io&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Reviewed-by: "Steven Rostedt (Google)" &lt;rostedt@goodmis.org&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
</pre>
</div>
</content>
</entry>
<entry>
<title>locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()</title>
<updated>2024-03-01T12:02:05+00:00</updated>
<author>
<name>Uros Bizjak</name>
<email>ubizjak@gmail.com</email>
</author>
<published>2024-01-24T10:49:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=ce3576ebd62d99f79c1dc98824e2ef6d6ab68434'/>
<id>ce3576ebd62d99f79c1dc98824e2ef6d6ab68434</id>
<content type='text'>
Use try_cmpxchg() instead of cmpxchg(*ptr, old, new) == old.

The x86 CMPXCHG instruction returns success in the ZF flag, so this change
saves a compare after CMPXCHG (and related move instruction in front of CMPXCHG).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG
fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE() to prevent
the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: Will Deacon &lt;will.deacon@arm.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Link: https://lore.kernel.org/r/20240124104953.612063-1-ubizjak@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use try_cmpxchg() instead of cmpxchg(*ptr, old, new) == old.

The x86 CMPXCHG instruction returns success in the ZF flag, so this change
saves a compare after CMPXCHG (and related move instruction in front of CMPXCHG).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG
fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE() to prevent
the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: Will Deacon &lt;will.deacon@arm.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Link: https://lore.kernel.org/r/20240124104953.612063-1-ubizjak@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>locking/rtmutex: Add a lockdep assert to catch potential nested blocking</title>
<updated>2023-09-20T07:31:14+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2023-09-08T16:22:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=45f67f30a22f264bc7a0a61255c2ee1a838e9403'/>
<id>45f67f30a22f264bc7a0a61255c2ee1a838e9403</id>
<content type='text'>
There used to be a BUG_ON(current-&gt;pi_blocked_on) in the lock acquisition
functions, but that vanished in one of the rtmutex overhauls.

Bring it back in form of a lockdep assert to catch code paths which take
rtmutex based locks with current::pi_blocked_on != NULL.

Reported-by: Crystal Wood &lt;swood@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: "Peter Zijlstra (Intel)" &lt;peterz@infradead.org&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230908162254.999499-7-bigeasy@linutronix.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There used to be a BUG_ON(current-&gt;pi_blocked_on) in the lock acquisition
functions, but that vanished in one of the rtmutex overhauls.

Bring it back in form of a lockdep assert to catch code paths which take
rtmutex based locks with current::pi_blocked_on != NULL.

Reported-by: Crystal Wood &lt;swood@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: "Peter Zijlstra (Intel)" &lt;peterz@infradead.org&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230908162254.999499-7-bigeasy@linutronix.de
</pre>
</div>
</content>
</entry>
<entry>
<title>locking/rtmutex: Use rt_mutex specific scheduler helpers</title>
<updated>2023-09-20T07:31:13+00:00</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2023-09-08T16:22:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=d14f9e930b9073de264c106bf04968286ef9b3a4'/>
<id>d14f9e930b9073de264c106bf04968286ef9b3a4</id>
<content type='text'>
Have rt_mutex use the rt_mutex specific scheduler helpers to avoid
recursion vs rtlock on the PI state.

[[ peterz: adapted to new names ]]

Reported-by: Crystal Wood &lt;swood@redhat.com&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230908162254.999499-6-bigeasy@linutronix.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Have rt_mutex use the rt_mutex specific scheduler helpers to avoid
recursion vs rtlock on the PI state.

[[ peterz: adapted to new names ]]

Reported-by: Crystal Wood &lt;swood@redhat.com&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230908162254.999499-6-bigeasy@linutronix.de
</pre>
</div>
</content>
</entry>
<entry>
<title>locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES</title>
<updated>2023-09-20T07:31:11+00:00</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2023-09-08T16:22:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=af9f006393b53409be0ca83ae234bef840cdef4a'/>
<id>af9f006393b53409be0ca83ae234bef840cdef4a</id>
<content type='text'>
With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire()
always fails and all lock operations take the slow path.

Provide a new helper inline rt_mutex_try_acquire() which maps to
rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case
it invokes rt_mutex_slowtrylock() which can acquire a non-contended
rtmutex under full debug coverage.

Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230908162254.999499-3-bigeasy@linutronix.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire()
always fails and all lock operations take the slow path.

Provide a new helper inline rt_mutex_try_acquire() which maps to
rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case
it invokes rt_mutex_slowtrylock() which can acquire a non-contended
rtmutex under full debug coverage.

Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230908162254.999499-3-bigeasy@linutronix.de
</pre>
</div>
</content>
</entry>
<entry>
<title>locking/rtmutex: Fix task-&gt;pi_waiters integrity</title>
<updated>2023-07-17T11:59:10+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-07-07T14:19:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=f7853c34241807bb97673a5e97719123be39a09e'/>
<id>f7853c34241807bb97673a5e97719123be39a09e</id>
<content type='text'>
Henry reported that rt_mutex_adjust_prio_check() has an ordering
problem and puts the lie to the comment in [7]. Sharing the sort key
between lock-&gt;waiters and owner-&gt;pi_waiters *does* create problems,
since unlike what the comment claims, holding [L] is insufficient.

Notably, consider:

	A
      /   \
     M1   M2
     |     |
     B     C

That is, task A owns both M1 and M2, B and C block on them. In this
case a concurrent chain walk (B &amp; C) will modify their resp. sort keys
in [7] while holding M1-&gt;wait_lock and M2-&gt;wait_lock. So holding [L]
is meaningless, they're different Ls.

This then gives rise to a race condition between [7] and [11], where
the requeue of pi_waiters will observe an inconsistent tree order.

	B				C

  (holds M1-&gt;wait_lock,		(holds M2-&gt;wait_lock,
   holds B-&gt;pi_lock)		 holds A-&gt;pi_lock)

  [7]
  waiter_update_prio();
  ...
  [8]
  raw_spin_unlock(B-&gt;pi_lock);
  ...
  [10]
  raw_spin_lock(A-&gt;pi_lock);

				[11]
				rt_mutex_enqueue_pi();
				// observes inconsistent A-&gt;pi_waiters
				// tree order

Fixing this means either extending the range of the owner lock from
[10-13] to [6-13], with the immediate problem that this means [6-8]
hold both blocked and owner locks, or duplicating the sort key.

Since the locking in chain walk is horrible enough without having to
consider pi_lock nesting rules, duplicate the sort key instead.

By giving each tree their own sort key, the above race becomes
harmless, if C sees B at the old location, then B will correct things
(if they need correcting) when it walks up the chain and reaches A.

Fixes: fb00aca47440 ("rtmutex: Turn the plist into an rb-tree")
Reported-by: Henry Wu &lt;triangletrap12@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Tested-by: Henry Wu &lt;triangletrap12@gmail.com&gt;
Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Henry reported that rt_mutex_adjust_prio_check() has an ordering
problem and puts the lie to the comment in [7]. Sharing the sort key
between lock-&gt;waiters and owner-&gt;pi_waiters *does* create problems,
since unlike what the comment claims, holding [L] is insufficient.

Notably, consider:

	A
      /   \
     M1   M2
     |     |
     B     C

That is, task A owns both M1 and M2, B and C block on them. In this
case a concurrent chain walk (B &amp; C) will modify their resp. sort keys
in [7] while holding M1-&gt;wait_lock and M2-&gt;wait_lock. So holding [L]
is meaningless, they're different Ls.

This then gives rise to a race condition between [7] and [11], where
the requeue of pi_waiters will observe an inconsistent tree order.

	B				C

  (holds M1-&gt;wait_lock,		(holds M2-&gt;wait_lock,
   holds B-&gt;pi_lock)		 holds A-&gt;pi_lock)

  [7]
  waiter_update_prio();
  ...
  [8]
  raw_spin_unlock(B-&gt;pi_lock);
  ...
  [10]
  raw_spin_lock(A-&gt;pi_lock);

				[11]
				rt_mutex_enqueue_pi();
				// observes inconsistent A-&gt;pi_waiters
				// tree order

Fixing this means either extending the range of the owner lock from
[10-13] to [6-13], with the immediate problem that this means [6-8]
hold both blocked and owner locks, or duplicating the sort key.

Since the locking in chain walk is horrible enough without having to
consider pi_lock nesting rules, duplicate the sort key instead.

By giving each tree their own sort key, the above race becomes
harmless, if C sees B at the old location, then B will correct things
(if they need correcting) when it walks up the chain and reaches A.

Fixes: fb00aca47440 ("rtmutex: Turn the plist into an rb-tree")
Reported-by: Henry Wu &lt;triangletrap12@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Tested-by: Henry Wu &lt;triangletrap12@gmail.com&gt;
Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
</pre>
</div>
</content>
</entry>
<entry>
<title>rtmutex: Ensure that the top waiter is always woken up</title>
<updated>2023-02-06T13:49:13+00:00</updated>
<author>
<name>Wander Lairson Costa</name>
<email>wander@redhat.com</email>
</author>
<published>2023-02-02T12:30:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=db370a8b9f67ae5f17e3d5482493294467784504'/>
<id>db370a8b9f67ae5f17e3d5482493294467784504</id>
<content type='text'>
Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty, but no task actually holds the lock.

T1                T2                                  T3
==                ==                                  ==

                                                      spin_lock(L1)
                                                      | raw_spin_lock(L1-&gt;wait_lock)
                                                      | rtlock_slowlock_locked(L1)
                                                      | | task_blocks_on_rt_mutex(L1, T3)
                                                      | | | orig_waiter-&gt;lock = L1
                                                      | | | orig_waiter-&gt;task = T3
                                                      | | | raw_spin_unlock(L1-&gt;wait_lock)
                                                      | | | rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
                  spin_unlock(L2)                     | | | |
                  | rt_mutex_slowunlock(L2)           | | | |
                  | | raw_spin_lock(L2-&gt;wait_lock)    | | | |
                  | | wakeup(T1)                      | | | |
                  | | raw_spin_unlock(L2-&gt;wait_lock)  | | | |
                                                      | | | | waiter = T1-&gt;pi_blocked_on
                                                      | | | | waiter == rt_mutex_top_waiter(L2)
                                                      | | | | waiter-&gt;task == T1
                                                      | | | | raw_spin_lock(L2-&gt;wait_lock)
                                                      | | | | dequeue(L2, waiter)
                                                      | | | | update_prio(waiter, T1)
                                                      | | | | enqueue(L2, waiter)
                                                      | | | | waiter != rt_mutex_top_waiter(L2)
                                                      | | | | L2-&gt;owner == NULL
                                                      | | | | wakeup(T1)
                                                      | | | | raw_spin_unlock(L2-&gt;wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()

If the deadline of T1 is updated before the call to update_prio(), and the
new deadline is greater than the deadline of the second top waiter, then
after the requeue, T1 is no longer the top waiter, and the wrong task is
woken up which will then go back to sleep because it is not the top waiter.

This can be reproduced in PREEMPT_RT with stress-ng:

while true; do
    stress-ng --sched deadline --sched-period 1000000000 \
    	    --sched-runtime 800000000 --sched-deadline \
    	    1000000000 --mmapfork 23 -t 20
done

A similar issue was pointed out by Thomas versus the cases where the top
waiter drops out early due to a signal or timeout, which is a general issue
for all regular rtmutex use cases, e.g. futex.

The problematic code is in rt_mutex_adjust_prio_chain():

    	// Save the top waiter before dequeue/enqueue
	prerequeue_top_waiter = rt_mutex_top_waiter(lock);

	rt_mutex_dequeue(lock, waiter);
	waiter_update_prio(waiter, task);
	rt_mutex_enqueue(lock, waiter);

	// Lock has no owner?
	if (!rt_mutex_owner(lock)) {
	   	// Top waiter changed		      			   
  ----&gt;		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
  ----&gt;			wake_up_state(waiter-&gt;task, waiter-&gt;wake_state);

This only takes the case into account where @waiter is the new top waiter
due to the requeue operation.

But it fails to handle the case where @waiter is not longer the top
waiter due to the requeue operation.

Ensure that the new top waiter is woken up so in all cases so it can take
over the ownerless lock.

[ tglx: Amend changelog, add Fixes tag ]

Fixes: c014ef69b3ac ("locking/rtmutex: Add wake_state to rt_mutex_waiter")
Signed-off-by: Wander Lairson Costa &lt;wander@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230117172649.52465-1-wander@redhat.com
Link: https://lore.kernel.org/r/20230202123020.14844-1-wander@redhat.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty, but no task actually holds the lock.

T1                T2                                  T3
==                ==                                  ==

                                                      spin_lock(L1)
                                                      | raw_spin_lock(L1-&gt;wait_lock)
                                                      | rtlock_slowlock_locked(L1)
                                                      | | task_blocks_on_rt_mutex(L1, T3)
                                                      | | | orig_waiter-&gt;lock = L1
                                                      | | | orig_waiter-&gt;task = T3
                                                      | | | raw_spin_unlock(L1-&gt;wait_lock)
                                                      | | | rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
                  spin_unlock(L2)                     | | | |
                  | rt_mutex_slowunlock(L2)           | | | |
                  | | raw_spin_lock(L2-&gt;wait_lock)    | | | |
                  | | wakeup(T1)                      | | | |
                  | | raw_spin_unlock(L2-&gt;wait_lock)  | | | |
                                                      | | | | waiter = T1-&gt;pi_blocked_on
                                                      | | | | waiter == rt_mutex_top_waiter(L2)
                                                      | | | | waiter-&gt;task == T1
                                                      | | | | raw_spin_lock(L2-&gt;wait_lock)
                                                      | | | | dequeue(L2, waiter)
                                                      | | | | update_prio(waiter, T1)
                                                      | | | | enqueue(L2, waiter)
                                                      | | | | waiter != rt_mutex_top_waiter(L2)
                                                      | | | | L2-&gt;owner == NULL
                                                      | | | | wakeup(T1)
                                                      | | | | raw_spin_unlock(L2-&gt;wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()

If the deadline of T1 is updated before the call to update_prio(), and the
new deadline is greater than the deadline of the second top waiter, then
after the requeue, T1 is no longer the top waiter, and the wrong task is
woken up which will then go back to sleep because it is not the top waiter.

This can be reproduced in PREEMPT_RT with stress-ng:

while true; do
    stress-ng --sched deadline --sched-period 1000000000 \
    	    --sched-runtime 800000000 --sched-deadline \
    	    1000000000 --mmapfork 23 -t 20
done

A similar issue was pointed out by Thomas versus the cases where the top
waiter drops out early due to a signal or timeout, which is a general issue
for all regular rtmutex use cases, e.g. futex.

The problematic code is in rt_mutex_adjust_prio_chain():

    	// Save the top waiter before dequeue/enqueue
	prerequeue_top_waiter = rt_mutex_top_waiter(lock);

	rt_mutex_dequeue(lock, waiter);
	waiter_update_prio(waiter, task);
	rt_mutex_enqueue(lock, waiter);

	// Lock has no owner?
	if (!rt_mutex_owner(lock)) {
	   	// Top waiter changed		      			   
  ----&gt;		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
  ----&gt;			wake_up_state(waiter-&gt;task, waiter-&gt;wake_state);

This only takes the case into account where @waiter is the new top waiter
due to the requeue operation.

But it fails to handle the case where @waiter is not longer the top
waiter due to the requeue operation.

Ensure that the new top waiter is woken up so in all cases so it can take
over the ownerless lock.

[ tglx: Amend changelog, add Fixes tag ]

Fixes: c014ef69b3ac ("locking/rtmutex: Add wake_state to rt_mutex_waiter")
Signed-off-by: Wander Lairson Costa &lt;wander@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230117172649.52465-1-wander@redhat.com
Link: https://lore.kernel.org/r/20230202123020.14844-1-wander@redhat.com
</pre>
</div>
</content>
</entry>
</feed>
