<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/arch/powerpc/kernel/watchdog.c, branch v5.4.211</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>powerpc/watchdog: Fix missed watchdog reset due to memory ordering race</title>
<updated>2022-01-27T08:19:47+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=aea9d368480fb4c3cc88a6b074b69dee845a88f0'/>
<id>aea9d368480fb4c3cc88a6b074b69dee845a88f0</id>
<content type='text'>
[ Upstream commit 5dad4ba68a2483fc80d70b9dc90bbe16e1f27263 ]

It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.

Order the cpumask clear vs the subsequent test to close this race.

Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Debugged-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 5dad4ba68a2483fc80d70b9dc90bbe16e1f27263 ]

It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.

Order the cpumask clear vs the subsequent test to close this race.

Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Debugged-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: Use hrtimers for per-CPU heartbeat</title>
<updated>2019-04-30T01:31:02+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2019-04-09T04:40:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053'/>
<id>7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053</id>
<content type='text'>
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.

Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.

Cc: Gautham R. Shenoy &lt;ego@linux.vnet.ibm.com&gt;
Reported-by: Ravikumar Bangoria &lt;ravi.bangoria@in.ibm.com&gt;
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Ravi Bangoria &lt;ravi.bangoria@linux.ibm.com&gt;
Reported-by: Ravi Bangoria &lt;ravi.bangoria@linux.ibm.com&gt;
Reviewed-by: Gautham R. Shenoy &lt;ego@linux.vnet.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.

Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.

Cc: Gautham R. Shenoy &lt;ego@linux.vnet.ibm.com&gt;
Reported-by: Ravikumar Bangoria &lt;ravi.bangoria@in.ibm.com&gt;
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Ravi Bangoria &lt;ravi.bangoria@linux.ibm.com&gt;
Reported-by: Ravi Bangoria &lt;ravi.bangoria@linux.ibm.com&gt;
Reviewed-by: Gautham R. Shenoy &lt;ego@linux.vnet.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc: NMI IPI make NMI IPIs fully sychronous</title>
<updated>2018-07-24T12:03:14+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2018-04-25T05:17:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=5b73151fff63fb019db8171cb81c6c978533844b'/>
<id>5b73151fff63fb019db8171cb81c6c978533844b</id>
<content type='text'>
There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
for all CPUs to call in to the handler, but it does not wait for
completion of the handler. This is a needless complication, so remove
it and always wait synchronously.

The synchronous wait allows the caller to easily time out and clear
the wait for completion (zero nmi_ipi_busy_count) in the case of badly
behaved handlers. This would have prevented the recent smp_send_stop
NMI IPI bug from causing the system to hang.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
for all CPUs to call in to the handler, but it does not wait for
completion of the handler. This is a needless complication, so remove
it and always wait synchronously.

The synchronous wait allows the caller to easily time out and clear
the wait for completion (zero nmi_ipi_busy_count) in the case of badly
behaved handlers. This would have prevented the recent smp_send_stop
NMI IPI bug from causing the system to hang.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: provide more data in watchdog messages</title>
<updated>2018-05-10T13:25:11+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2018-05-05T07:26:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=4e49226ea8e1810d2c14d9e92a290bad239b512b'/>
<id>4e49226ea8e1810d2c14d9e92a290bad239b512b</id>
<content type='text'>
Provide timebase and timebase of last heartbeat in watchdog lockup
messages. Also provide a stack trace of when a CPU becomes un-stuck,
which can be useful -- it could be where irqs are re-enabled, so it
may be the end of the critical section which is responsible for the
latency which is useful information.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Provide timebase and timebase of last heartbeat in watchdog lockup
messages. Also provide a stack trace of when a CPU becomes un-stuck,
which can be useful -- it could be where irqs are re-enabled, so it
may be the end of the critical section which is responsible for the
latency which is useful information.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: don't update the watchdog timestamp if a lockup is detected</title>
<updated>2018-05-10T13:25:11+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2018-05-05T07:25:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=5a951c4e7e8df5d6df52bace1b4ff327885584d6'/>
<id>5a951c4e7e8df5d6df52bace1b4ff327885584d6</id>
<content type='text'>
The watchdog heartbeat timestamp is updated when the local heartbeat
timer fires (or touch_nmi_watchdog() is called).

This is an interesting data point, so don't overwrite it when the
soft-NMI interrupt detects a hard lockup. That code came from a pre-
merge version to prevent hard lockup messages flood, but that's taken
care of with the stuck CPU logic now, so there is no reason to
update the heartbeat timestamp here.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The watchdog heartbeat timestamp is updated when the local heartbeat
timer fires (or touch_nmi_watchdog() is called).

This is an interesting data point, so don't overwrite it when the
soft-NMI interrupt detects a hard lockup. That code came from a pre-
merge version to prevent hard lockup messages flood, but that's taken
care of with the stuck CPU logic now, so there is no reason to
update the heartbeat timestamp here.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: fix typo 'can by' to 'can be'</title>
<updated>2018-05-10T13:25:05+00:00</updated>
<author>
<name>Wolfram Sang</name>
<email>wsa+renesas@sang-engineering.com</email>
</author>
<published>2018-05-06T11:23:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=7c18659dd498b25c6651ba83d4267ba7f9458c9c'/>
<id>7c18659dd498b25c6651ba83d4267ba7f9458c9c</id>
<content type='text'>
Signed-off-by: Wolfram Sang &lt;wsa+renesas@sang-engineering.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Wolfram Sang &lt;wsa+renesas@sang-engineering.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: Print the NIP in soft_nmi_interrupt()</title>
<updated>2018-01-28T06:08:29+00:00</updated>
<author>
<name>Michael Ellerman</name>
<email>mpe@ellerman.id.au</email>
</author>
<published>2017-10-12T04:44:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=0bc00914010e3826a3d168d1c879f79800a11bae'/>
<id>0bc00914010e3826a3d168d1c879f79800a11bae</id>
<content type='text'>
When a CPU detects its locked up via soft_nmi_interrupt() we have
pt_regs, so print the regs-&gt;nip, which points to where we took the
soft-NMI.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When a CPU detects its locked up via soft_nmi_interrupt() we have
pt_regs, so print the regs-&gt;nip, which points to where we took the
soft-NMI.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: regs can't be null in soft_nmi_interrupt()</title>
<updated>2018-01-28T06:08:28+00:00</updated>
<author>
<name>Michael Ellerman</name>
<email>mpe@ellerman.id.au</email>
</author>
<published>2017-10-12T04:44:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=3ba45b7e46e2ae7c059f3335120aed5e2c749bd9'/>
<id>3ba45b7e46e2ae7c059f3335120aed5e2c749bd9</id>
<content type='text'>
soft_nmi_interrupt() is called directly from the asm exception
handling code, which passes regs as a pointer to the stack. So regs
can't be NULL, it may be full of junk, but that's a separate problem.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
soft_nmi_interrupt() is called directly from the asm exception
handling code, which passes regs as a pointer to the stack. So regs
can't be NULL, it may be full of junk, but that's a separate problem.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: Tweak watchdog printks</title>
<updated>2018-01-28T06:08:26+00:00</updated>
<author>
<name>Michael Ellerman</name>
<email>mpe@ellerman.id.au</email>
</author>
<published>2017-10-12T04:44:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=d8fa82e0e2b1ad9c66aa76810615ad389b414d44'/>
<id>d8fa82e0e2b1ad9c66aa76810615ad389b414d44</id>
<content type='text'>
Use pr_fmt() in the watchdog code, so we don't have to say "Watchdog"
so many times.

Rather than "CPU:%d" just spell it "CPU %d", "Hard" doesn't need a
capital in the middle of a sentence, and "LOCKUP other CPUS" should be
"LOCKUP on other CPUS".

Also make it clear when a CPU self detects a lockup by spelling it
out.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use pr_fmt() in the watchdog code, so we don't have to say "Watchdog"
so many times.

Rather than "CPU:%d" just spell it "CPU %d", "Hard" doesn't need a
capital in the middle of a sentence, and "LOCKUP other CPUS" should be
"LOCKUP on other CPUS".

Also make it clear when a CPU self detects a lockup by spelling it
out.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: improve watchdog comments</title>
<updated>2018-01-21T04:06:26+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2017-11-01T00:27:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=723b113319baa8d8b5f619a94348fa5b3a78bc94'/>
<id>723b113319baa8d8b5f619a94348fa5b3a78bc94</id>
<content type='text'>
The overview comments in the powerpc watchdog are out of date after
several iterations and changes of the code. Bring them up to date.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The overview comments in the powerpc watchdog are out of date after
several iterations and changes of the code. Bring them up to date.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
</pre>
</div>
</content>
</entry>
</feed>
