path: root/arch/powerpc/kernel
Age  Commit message  [Author, Files, Lines]
2020-11-22  powerpc/8xx: Always fault when _PAGE_ACCESSED is not set  [Christophe Leroy, 1 file, -6/+2]
commit 29daf869cbab69088fe1755d9dd224e99ba78b56 upstream. The kernel expects pte_young() to work regardless of CONFIG_SWAP. Make sure a minor fault is taken to set _PAGE_ACCESSED when it is not already set, regardless of the selection of CONFIG_SWAP. This adds at least 3 instructions to the TLB miss exception handlers fast path. Following patch will reduce this overhead. Also update the rotation instruction to the correct number of bits to reflect all changes done to _PAGE_ACCESSED over time. Fixes: d069cb4373fe ("powerpc/8xx: Don't touch ACCESSED when no SWAP.") Fixes: 5f356497c384 ("powerpc/8xx: remove unused _PAGE_WRITETHRU") Fixes: e0a8e0d90a9f ("powerpc/8xx: Handle PAGE_USER via APG bits") Fixes: 5b2753fc3e8a ("powerpc/8xx: Implementation of PAGE_EXEC") Fixes: a891c43b97d3 ("powerpc/8xx: Prepare handlers for _PAGE_HUGE for 512k pages.") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/af834e8a0f1fa97bfae65664950f0984a70c4750.1602492856.git.christophe.leroy@csgroup.eu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-22  powerpc/64s: flush L1D after user accesses  [Nicholas Piggin, 3 files, -57/+93]
commit 9a32a7e78bd0cd9a9b6332cbdc345ee5ffd0c5de upstream. IBM Power9 processors can speculatively operate on data in the L1 cache before it has been completely validated, via a way-prediction mechanism. It is not possible for an attacker to determine the contents of impermissible memory using this method, since these systems implement a combination of hardware and software security measures to prevent scenarios where protected data could be leaked. However these measures don't address the scenario where an attacker induces the operating system to speculatively execute instructions using data that the attacker controls. This can be used for example to speculatively bypass "kernel user access prevention" techniques, as discovered by Anthony Steinhauser of Google's Safeside Project. This is not an attack by itself, but there is a possibility it could be used in conjunction with side-channels or other weaknesses in the privileged code to construct an attack. This issue can be mitigated by flushing the L1 cache between privilege boundaries of concern. This patch flushes the L1 cache after user accesses. This is part of the fix for CVE-2020-4788. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-22  powerpc/64s: flush L1D on kernel entry  [Nicholas Piggin, 3 files, -6/+108]
commit f79643787e0a0762d2409b7b8334e83f22d85695 upstream. IBM Power9 processors can speculatively operate on data in the L1 cache before it has been completely validated, via a way-prediction mechanism. It is not possible for an attacker to determine the contents of impermissible memory using this method, since these systems implement a combination of hardware and software security measures to prevent scenarios where protected data could be leaked. However these measures don't address the scenario where an attacker induces the operating system to speculatively execute instructions using data that the attacker controls. This can be used for example to speculatively bypass "kernel user access prevention" techniques, as discovered by Anthony Steinhauser of Google's Safeside Project. This is not an attack by itself, but there is a possibility it could be used in conjunction with side-channels or other weaknesses in the privileged code to construct an attack. This issue can be mitigated by flushing the L1 cache between privilege boundaries of concern. This patch flushes the L1 cache on kernel entry. This is part of the fix for CVE-2020-4788. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
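For illustration, a minimal user-space sketch of the displacement-flush idea underlying these two mitigations. The sizes and flush-area layout here are assumptions; the kernel uses a dedicated per-CPU flush area and a tuned assembly loop, not C:

    #include <stdint.h>

    #define L1D_SIZE   (32 * 1024)  /* assumed L1D capacity */
    #define LINE_SIZE  128          /* Power9 cache line size */

    /* Touch a benign buffer large enough to cover every L1D set and way,
     * so any data the previous privilege level left in the cache is
     * evicted before control crosses the boundary. */
    static void l1d_displacement_flush(const volatile uint8_t *flush_area)
    {
        for (uintptr_t off = 0; off < 2 * L1D_SIZE; off += LINE_SIZE)
            (void)flush_area[off];
    }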
2020-11-22  powerpc/64s: move some exception handlers out of line  [Daniel Axtens, 1 file, -2/+8]
(backport only) We're about to grow the exception handlers, which will make a bunch of them no longer fit within the space available. We move them out of line. This is a fiddly and error-prone business, so in the interests of reviewability I haven't merged this in with the addition of the entry flush. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05  powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation  [Michael Neuling, 1 file, -1/+1]
commit 1da4a0272c5469169f78cd76cf175ff984f52f06 upstream. __get_user_atomic_128_aligned() stores to kaddr using stvx which is a VMX store instruction, hence kaddr must be 16 byte aligned otherwise the store won't occur as expected. Unfortunately when we call __get_user_atomic_128_aligned() in p9_hmi_special_emu(), the buffer we pass as kaddr (ie. vbuf) isn't guaranteed to be 16B aligned. This means that the write to vbuf in __get_user_atomic_128_aligned() has the bottom bits of the address truncated. This results in other local variables being overwritten. Also vbuf will not contain the correct data which results in the userspace emulation being wrong and hence undetected user data corruption. In the past we've been mostly lucky as vbuf has ended up aligned but this is fragile and isn't always true. CONFIG_STACKPROTECTOR in particular can change the stack arrangement enough that our luck runs out. This issue only occurs on POWER9 Nimbus <= DD2.1 bare metal. The fix is to align vbuf to a 16 byte boundary. Fixes: 5080332c2c89 ("powerpc/64s: Add workaround for P9 vector CI load issue") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201013043741.743413-1-mikey@neuling.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
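The fix itself is a one-liner; a sketch of its shape (variable name taken from the commit text, alignment attribute as in kernel C):

    /* vbuf is the buffer handed to __get_user_atomic_128_aligned(); stvx
     * silently drops the low address bits, so the buffer must be 16-byte
     * aligned rather than left to the stack layout's luck. */
    unsigned char vbuf[16] __attribute__((aligned(16)));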
2020-11-05  powerpc: Warn about use of smt_snooze_delay  [Joel Stanley, 1 file, -25/+17]
commit a02f6d42357acf6e5de6ffc728e6e77faf3ad217 upstream. It's not done anything for a long time. Save the percpu variable, and emit a warning to remind users to not expect it to do anything. This uses pr_warn_once instead of pr_warn_ratelimit as testing 'ppc64_cpu --smt=off' on a 24 core / 4 SMT system showed the warning to be noisy, as the online/offline loop is slow. Fixes: 3fa8cad82b94 ("powerpc/pseries/cpuidle: smt-snooze-delay cleanup.") Cc: stable@vger.kernel.org # v3.14 Signed-off-by: Joel Stanley <joel@jms.id.au> Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200902000012.3440389-1-joel@jms.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05  powerpc/rtas: Restrict RTAS requests from userspace  [Andrew Donnellan, 1 file, -0/+153]
commit bd59380c5ba4147dcbaad3e582b55ccfd120b764 upstream. A number of userspace utilities depend on making calls to RTAS to retrieve information and update various things. The existing API through which we expose RTAS to userspace exposes more RTAS functionality than we actually need, through the sys_rtas syscall, which allows root (or anyone with CAP_SYS_ADMIN) to make any RTAS call they want with arbitrary arguments. Many RTAS calls take the address of a buffer as an argument, and it's up to the caller to specify the physical address of the buffer as an argument. We allocate a buffer (the "RMO buffer") in the Real Memory Area that RTAS can access, and then expose the physical address and size of this buffer in /proc/powerpc/rtas/rmo_buffer. Userspace is expected to read this address, poke at the buffer using /dev/mem, and pass an address in the RMO buffer to the RTAS call. However, there's nothing stopping the caller from specifying whatever address they want in the RTAS call, and it's easy to construct a series of RTAS calls that can overwrite arbitrary bytes (even without /dev/mem access). Additionally, there are some RTAS calls that do potentially dangerous things and for which there are no legitimate userspace use cases. In the past, this would not have been a particularly big deal as it was assumed that root could modify all system state freely, but with Secure Boot and lockdown we need to care about this. We can't fundamentally change the ABI at this point, however we can address this by implementing a filter that checks RTAS calls against a list of permitted calls and forces the caller to use addresses within the RMO buffer. The list is based off the list of calls that are used by the librtas userspace library, and has been tested with a number of existing userspace RTAS utilities. For compatibility with any applications we are not aware of that require other calls, the filter can be turned off at build time. Cc: stable@vger.kernel.org Reported-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200820044512.7543-1-ajd@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
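A hypothetical sketch of the allowlist idea described above. The table entries and helper below are invented for illustration and are not the kernel's actual filter tables:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct rtas_filter {
        const char *name;
        int buf_arg;        /* argument that must point into the RMO, -1 if none */
    };

    static const struct rtas_filter allowed[] = {
        { "get-time-of-day", -1 },
        { "ibm,get-system-parameter", 1 },
    };

    static bool rtas_call_permitted(const char *name, size_t nargs,
                                    const unsigned long *args,
                                    unsigned long rmo_base, unsigned long rmo_size)
    {
        for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
            if (strcmp(allowed[i].name, name) != 0)
                continue;
            if (allowed[i].buf_arg < 0)
                return true;        /* no buffer argument to police */
            size_t idx = (size_t)allowed[i].buf_arg;
            return idx < nargs && args[idx] >= rmo_base &&
                   args[idx] < rmo_base + rmo_size;
        }
        return false;               /* default-deny anything unknown */
    }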
2020-10-30  powerpc/tau: Disable TAU between measurements  [Finn Thain, 1 file, -41/+24]
[ Upstream commit e63d6fb5637e92725cf143559672a34b706bca4f ] Enabling CONFIG_TAU_INT causes random crashes: Unrecoverable exception 1700 at c0009414 (msr=1000) Oops: Unrecoverable exception, sig: 6 [#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-pmac-00043-gd5f545e1a8593 #5 NIP: c0009414 LR: c0009414 CTR: c00116fc REGS: c0799eb8 TRAP: 1700 Not tainted (5.7.0-pmac-00043-gd5f545e1a8593) MSR: 00001000 <ME> CR: 22000228 XER: 00000100 GPR00: 00000000 c0799f70 c076e300 00800000 0291c0ac 00e00000 c076e300 00049032 GPR08: 00000001 c00116fc 00000000 dfbd3200 ffffffff 007f80a8 00000000 00000000 GPR16: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 c075ce04 GPR24: c075ce04 dfff8880 c07b0000 c075ce04 00080000 00000001 c079ef98 c079ef5c NIP [c0009414] arch_cpu_idle+0x24/0x6c LR [c0009414] arch_cpu_idle+0x24/0x6c Call Trace: [c0799f70] [00000001] 0x1 (unreliable) [c0799f80] [c0060990] do_idle+0xd8/0x17c [c0799fa0] [c0060ba4] cpu_startup_entry+0x20/0x28 [c0799fb0] [c072d220] start_kernel+0x434/0x44c [c0799ff0] [00003860] 0x3860 Instruction dump: XXXXXXXX XXXXXXXX XXXXXXXX 3d20c07b XXXXXXXX XXXXXXXX XXXXXXXX 7c0802a6 XXXXXXXX XXXXXXXX XXXXXXXX 4e800421 XXXXXXXX XXXXXXXX XXXXXXXX 7d2000a6 ---[ end trace 3a0c9b5cb216db6b ]--- Resolve this problem by disabling each THRMn comparator when handling the associated THRMn interrupt and by disabling the TAU entirely when updating THRMn thresholds. Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5a0ba3dc5612c7aac596727331284a3676c08472.1599260540.git.fthain@telegraphics.com.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-30  powerpc/tau: Check processor type before enabling TAU interrupt  [Finn Thain, 1 file, -19/+14]
[ Upstream commit 5e3119e15fed5b9a9a7e528665ff098a4a8dbdbc ] According to Freescale's documentation, MPC74XX processors have an erratum that prevents the TAU interrupt from working, so don't try to use it when running on those processors. Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c281611544768e758bd58fe812cf702a5bd2d042.1599260540.git.fthain@telegraphics.com.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29  powerpc/tau: Remove duplicated set_thresholds() call  [Finn Thain, 1 file, -5/+0]
[ Upstream commit 420ab2bc7544d978a5d0762ee736412fe9c796ab ] The commentary at the call site seems to disagree with the code. The conditional prevents calling set_thresholds() via the exception handler, which appears to crash. Perhaps that's because it immediately triggers another TAU exception. Anyway, calling set_thresholds() from TAUupdate() is redundant because tau_timeout() does so. Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/d7c7ee33232cf72a6a6bbb6ef05838b2e2b113c0.1599260540.git.fthain@telegraphics.com.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29  powerpc/tau: Convert from timer to workqueue  [Finn Thain, 1 file, -20/+18]
[ Upstream commit b1c6a0a10bfaf36ec82fde6f621da72407fa60a1 ] Since commit 19dbdcb8039cf ("smp: Warn on function calls from softirq context") the Thermal Assist Unit driver causes a warning like the following when CONFIG_SMP is enabled. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at kernel/smp.c:428 smp_call_function_many_cond+0xf4/0x38c Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-pmac #3 NIP: c00b37a8 LR: c00b3abc CTR: c001218c REGS: c0799c60 TRAP: 0700 Not tainted (5.7.0-pmac) MSR: 00029032 <EE,ME,IR,DR,RI> CR: 42000224 XER: 00000000 GPR00: c00b3abc c0799d18 c076e300 c079ef5c c0011fec 00000000 00000000 00000000 GPR08: 00000100 00000100 00008000 ffffffff 42000224 00000000 c079d040 c079d044 GPR16: 00000001 00000000 00000004 c0799da0 c079f054 c07a0000 c07a0000 00000000 GPR24: c0011fec 00000000 c079ef5c c079ef5c 00000000 00000000 00000000 00000000 NIP [c00b37a8] smp_call_function_many_cond+0xf4/0x38c LR [c00b3abc] on_each_cpu+0x38/0x68 Call Trace: [c0799d18] [ffffffff] 0xffffffff (unreliable) [c0799d68] [c00b3abc] on_each_cpu+0x38/0x68 [c0799d88] [c0096704] call_timer_fn.isra.26+0x20/0x7c [c0799d98] [c0096b40] run_timer_softirq+0x1d4/0x3fc [c0799df8] [c05b4368] __do_softirq+0x118/0x240 [c0799e58] [c0039c44] irq_exit+0xc4/0xcc [c0799e68] [c000ade8] timer_interrupt+0x1b0/0x230 [c0799ea8] [c0013520] ret_from_except+0x0/0x14 --- interrupt: 901 at arch_cpu_idle+0x24/0x6c LR = arch_cpu_idle+0x24/0x6c [c0799f70] [00000001] 0x1 (unreliable) [c0799f80] [c0060990] do_idle+0xd8/0x17c [c0799fa0] [c0060ba8] cpu_startup_entry+0x24/0x28 [c0799fb0] [c072d220] start_kernel+0x434/0x44c [c0799ff0] [00003860] 0x3860 Instruction dump: 8129f204 2f890000 40beff98 3d20c07a 8929eec4 2f890000 40beff88 0fe00000 81220000 552805de 550802ef 4182ff84 <0fe00000> 3860ffff 7f65db78 7f44d378 ---[ end trace 34a886e47819c2eb ]--- Don't call on_each_cpu() from a timer callback, call it from a worker thread instead. Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/bb61650bea4f4c91fb8e24b9a6f130a1438651a7.1599260540.git.fthain@telegraphics.com.au Signed-off-by: Sasha Levin <sashal@kernel.org>
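A minimal sketch of the pattern the patch adopts (the kernel APIs are real; the function names here are illustrative, not the driver's):

    #include <linux/workqueue.h>
    #include <linux/timer.h>
    #include <linux/smp.h>

    static void tau_sample_one(void *unused) { /* read THRM registers on this CPU */ }

    /* Process context: on_each_cpu() is legal here. */
    static void tau_work_fn(struct work_struct *work)
    {
        on_each_cpu(tau_sample_one, NULL, 0);
    }
    static DECLARE_WORK(tau_work, tau_work_fn);

    /* Softirq context: defer instead of calling on_each_cpu() directly. */
    static void tau_timer_fn(struct timer_list *t)
    {
        schedule_work(&tau_work);
    }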
2020-10-29  powerpc/tau: Use appropriate temperature sample interval  [Finn Thain, 1 file, -8/+4]
[ Upstream commit 66943005cc41f48e4d05614e8f76c0ca1812f0fd ] According to the MPC750 Users Manual, the SITV value in Thermal Management Register 3 is 13 bits long. The present code calculates the SITV value as 60 * 500 cycles. This would overflow to give 10 us on a 500 MHz CPU rather than the intended 60 us. (But according to the Microprocessor Datasheet, there is also a factor of 266 that has to be applied to this value on certain parts i.e. speed sort above 266 MHz.) Always use the maximum cycle count, as recommended by the Datasheet. Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/896f542e5f0f1d6cf8218524c2b67d79f3d69b3c.1599260540.git.fthain@telegraphics.com.au Signed-off-by: Sasha Levin <sashal@kernel.org>
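The overflow is easy to reproduce with the numbers from the message; a standalone worked example:

    #include <stdio.h>

    int main(void)
    {
        unsigned int cycles = 60 * 500;        /* 60 us at 500 MHz = 30000 cycles */
        unsigned int sitv   = cycles & 0x1fff; /* 13-bit field truncates to 5424 */

        /* 5424 cycles / 500 MHz ~= 10.8 us: the unintended ~10 us interval.
         * The fix programs the field's maximum (0x1fff) instead. */
        printf("intended %u cycles, programmed %u\n", cycles, sitv);
        return 0;
    }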
2020-10-01  powerpc/traps: Make unrecoverable NMIs die instead of panic  [Nicholas Piggin, 1 file, -3/+3]
[ Upstream commit 265d6e588d87194c2fe2d6c240247f0264e0c19b ] System Reset and Machine Check interrupts that are not recoverable due to being nested or interrupting when RI=0 currently panic. This is not necessary, and can often just kill the current context and recover. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Link: https://lore.kernel.org/r/20200508043408.886394-16-npiggin@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  powerpc/eeh: Only dump stack once if an MMIO loop is detected  [Oliver O'Halloran, 1 file, -1/+1]
[ Upstream commit 4e0942c0302b5ad76b228b1a7b8c09f658a1d58a ] Many drivers don't check for errors when they get a 0xFFs response from an MMIO load. As a result after an EEH event occurs a driver can get stuck in a polling loop unless it has some kind of internal timeout logic. Currently EEH tries to detect and report stuck drivers by dumping a stack trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump would occur every few seconds if the driver was spinning in a loop. This results in a lot of spurious stack traces in the kernel log. Fix this by limiting it to printing one stack trace for each PE freeze. If the driver is truly stuck the kernel's hung task detector is better suited to reporting the problem anyway. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Tested-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191016012536.22588-1-oohall@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-23  powerpc/dma: Fix dma_map_ops::get_required_mask  [Alexey Kardashevskiy, 1 file, -1/+2]
commit 437ef802e0adc9f162a95213a3488e8646e5fc03 upstream. There are 2 problems with it: 1. "<" vs expected "<<" 2. the shift number is an IOMMU page number mask, not an address mask as the IOMMU page shift is missing. This did not hit us before f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode") because we had additional code to handle the bypass mask, so this chunk (almost?) never executed. However there were reports that aacraid does not work with "iommu=nobypass". After f1565c24b596, aacraid (and probably others which call dma_get_required_mask() before setting the mask) was unable to enable 64bit DMA and fell back to using the IOMMU, which was known not to work; one of the problems is a double free of an IOMMU page. This fixes DMA for aacraid, both with and without "iommu=nobypass" in the kernel command line. Verified with "stress-ng -d 4". Fixes: 6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and ops") Cc: stable@vger.kernel.org # v3.2+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200908015106.79661-1-aik@ozlabs.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
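A reconstructed, simplified shape of the two bugs (using a compiler builtin in place of the kernel's fls_long(); not a verbatim diff):

    #include <stdint.h>

    /* Buggy: "<" where "<<" was intended -- the expression evaluates to a
     * boolean 0 or 1, never a mask -- and the IOMMU page shift is absent. */
    static uint64_t required_mask_buggy(uint64_t it_offset, uint64_t it_size)
    {
        return 1 < (64 - __builtin_clzll(it_offset + it_size) - 1);
    }

    /* Fixed shape: actually shift, and convert the IOMMU page number into
     * a byte address by adding the IOMMU page shift. */
    static uint64_t required_mask_fixed(uint64_t it_offset, uint64_t it_size,
                                        unsigned int it_page_shift)
    {
        unsigned int fls = 64 - __builtin_clzll(it_offset + it_size);

        return 1ULL << (fls + it_page_shift - 1);
    }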
2020-09-03  powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()  [Michael Ellerman, 1 file, -1/+1]
commit 0828137e8f16721842468e33df0460044a0c588b upstream. __init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc: Add support for context switching the TAR register") (Feb 2013), and only set FSCR_TAR. At that point FSCR (Facility Status and Control Register) was not context switched, so the setting was permanent after boot. Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit 54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again that was permanent after boot. Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on POWER8") (Aug 2013) added a limited context switch of FSCR, just the FSCR_DSCR bit was context switched based on thread.dscr_inherit. That commit said "This clears the H/FSCR DSCR bit initially", but it didn't, it left the initialisation of FSCR_DSCR in __init_FSCR(). However the initial context switch from init_task to pid 1 would clear FSCR_DSCR because thread.dscr_inherit was 0. That commit also introduced the requirement that FSCR_DSCR be clear for user processes, so that we can take the facility unavailable interrupt in order to manage dscr_inherit. Then in commit 152d523e6307 ("powerpc: Create context switch helpers save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to thread_struct. However it still wasn't fully context switched, we just took the existing value and set FSCR_DSCR if the new thread had dscr_inherit set. FSCR was still initialised at boot to FSCR_DSCR | FSCR_TAR, but that value was not propagated into the thread_struct, so the initial context switch set FSCR_DSCR back to 0. Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context switching") (Jun 2016) added a full context switch of the FSCR, and added an initialisation of init_task.thread.fscr to FSCR_TAR | FSCR_EBB, but omitted FSCR_DSCR. The end result is that swapper runs with FSCR_DSCR set because of the initialisation in __init_FSCR(), but no other processes do, they use the value from init_task.thread.fscr. Having FSCR_DSCR set for swapper allows it to access SPR 3 from userspace, but swapper never runs userspace, so it has no useful effect. It's also confusing to have the value initialised in two places to two different values. So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the point where there's a single value of FSCR, even if it's still set in two places. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Alistair Popple <alistair@popple.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-1-mpe@ellerman.id.au Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-19  powerpc/vdso: Fix vdso cpu truncation  [Milton Miller, 1 file, -1/+1]
[ Upstream commit a9f675f950a07d5c1dbcbb97aabac56f5ed085e3 ] The code in vdso_cpu_init that exposes the cpu and numa node to userspace via SPRG_VDSO incorrectly masks the cpu to 12 bits. This means that any kernel running on a box with more than 4096 threads (NR_CPUS advertises a limit of 8192 cpus) would expose userspace to two cpu contexts running at the same time with the same cpu number. Note: I'm not aware of any distro shipping a kernel with support for more than 4096 threads today, nor of any system image that currently exceeds 4096 threads. Found via code browsing. Fixes: 18ad51dd342a7eb09dbcd059d0b451b616d4dafc ("powerpc: Add VDSO version of getcpu") Signed-off-by: Milton Miller <miltonm@us.ibm.com> Signed-off-by: Anton Blanchard <anton@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200715233704.1352257-1-anton@ozlabs.org Signed-off-by: Sasha Levin <sashal@kernel.org>
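A standalone illustration of the truncation; the packing layout below is an assumption for clarity, not the exact SPRG_VDSO encoding:

    #include <stdio.h>

    int main(void)
    {
        unsigned int node = 1, cpu = 4096;
        unsigned int buggy = (node << 16) | (cpu & 0x0fff); /* cpu aliases to 0 */
        unsigned int fixed = (node << 16) | (cpu & 0xffff); /* room for 8192 cpus */

        printf("buggy cpu=%u, fixed cpu=%u\n", buggy & 0x0fff, fixed & 0xffff);
        return 0;
    }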
2020-06-25  powerpc/64: Don't initialise init_task->thread.regs  [Michael Ellerman, 1 file, -8/+1]
[ Upstream commit 7ffa8b7dc11752827329e4e84a574ea6aaf24716 ] Aneesh increased the size of struct pt_regs by 16 bytes and started seeing this WARN_ON: smp: Bringing up secondary CPUs ... ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at arch/powerpc/kernel/process.c:455 giveup_all+0xb4/0x110 Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+ #318 NIP: c00000000001a2b4 LR: c00000000001a29c CTR: c0000000031d0000 REGS: c0000000026d3980 TRAP: 0700 Not tainted (5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+) MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48048224 XER: 00000000 CFAR: c000000000019cc8 IRQMASK: 1 GPR00: c00000000001a264 c0000000026d3c20 c0000000026d7200 800000000280b033 GPR04: 0000000000000001 0000000000000000 0000000000000077 30206d7372203164 GPR08: 0000000000002000 0000000002002000 800000000280b033 3230303030303030 GPR12: 0000000000008800 c0000000031d0000 0000000000800050 0000000002000066 GPR16: 000000000309a1a0 000000000309a4b0 000000000309a2d8 000000000309a890 GPR20: 00000000030d0098 c00000000264da40 00000000fd620000 c0000000ff798080 GPR24: c00000000264edf0 c0000001007469f0 00000000fd620000 c0000000020e5e90 GPR28: c00000000264edf0 c00000000264d200 000000001db60000 c00000000264d200 NIP [c00000000001a2b4] giveup_all+0xb4/0x110 LR [c00000000001a29c] giveup_all+0x9c/0x110 Call Trace: [c0000000026d3c20] [c00000000001a264] giveup_all+0x64/0x110 (unreliable) [c0000000026d3c90] [c00000000001ae34] __switch_to+0x104/0x480 [c0000000026d3cf0] [c000000000e0b8a0] __schedule+0x320/0x970 [c0000000026d3dd0] [c000000000e0c518] schedule_idle+0x38/0x70 [c0000000026d3df0] [c00000000019c7c8] do_idle+0x248/0x3f0 [c0000000026d3e70] [c00000000019cbb8] cpu_startup_entry+0x38/0x40 [c0000000026d3ea0] [c000000000011bb0] rest_init+0xe0/0xf8 [c0000000026d3ed0] [c000000002004820] start_kernel+0x990/0x9e0 [c0000000026d3f90] [c00000000000c49c] start_here_common+0x1c/0x400 Which was unexpected. The warning is checking the thread.regs->msr value of the task we are switching from: usermsr = tsk->thread.regs->msr; ... WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC))); ie. if MSR_VSX is set then both of MSR_FP and MSR_VEC are also set. Dumping tsk->thread.regs->msr we see that it's: 0x1db60000 Which is not a normal looking MSR, in fact the only valid bit is MSR_VSX, all the other bits are reserved in the current definition of the MSR. We can see from the oops that it was swapper/0 that we were switching from when we hit the warning, ie. init_task. So its thread.regs points to the base (high addresses) in init_stack. 
Dumping the content of init_task->thread.regs, with the members of pt_regs annotated (the 16 bytes larger version), we see: 0000000000000000 c000000002780080 gpr[0] gpr[1] 0000000000000000 c000000002666008 gpr[2] gpr[3] c0000000026d3ed0 0000000000000078 gpr[4] gpr[5] c000000000011b68 c000000002780080 gpr[6] gpr[7] 0000000000000000 0000000000000000 gpr[8] gpr[9] c0000000026d3f90 0000800000002200 gpr[10] gpr[11] c000000002004820 c0000000026d7200 gpr[12] gpr[13] 000000001db60000 c0000000010aabe8 gpr[14] gpr[15] c0000000010aabe8 c0000000010aabe8 gpr[16] gpr[17] c00000000294d598 0000000000000000 gpr[18] gpr[19] 0000000000000000 0000000000001ff8 gpr[20] gpr[21] 0000000000000000 c00000000206d608 gpr[22] gpr[23] c00000000278e0cc 0000000000000000 gpr[24] gpr[25] 000000002fff0000 c000000000000000 gpr[26] gpr[27] 0000000002000000 0000000000000028 gpr[28] gpr[29] 000000001db60000 0000000004750000 gpr[30] gpr[31] 0000000002000000 000000001db60000 nip msr 0000000000000000 0000000000000000 orig_r3 ctr c00000000000c49c 0000000000000000 link xer 0000000000000000 0000000000000000 ccr softe 0000000000000000 0000000000000000 trap dar 0000000000000000 0000000000000000 dsisr result 0000000000000000 0000000000000000 ppr kuap 0000000000000000 0000000000000000 pad[2] pad[3] This looks suspiciously like stack frames, not a pt_regs. If we look closely we can see return addresses from the stack trace above, c000000002004820 (start_kernel) and c00000000000c49c (start_here_common). init_task->thread.regs is setup at build time in processor.h: #define INIT_THREAD { \ .ksp = INIT_SP, \ .regs = (struct pt_regs *)INIT_SP - 1, /* XXX bogus, I think */ \ The early boot code where we setup the initial stack is: LOAD_REG_ADDR(r3,init_thread_union) /* set up a stack pointer */ LOAD_REG_IMMEDIATE(r1,THREAD_SIZE) add r1,r3,r1 li r0,0 stdu r0,-STACK_FRAME_OVERHEAD(r1) Which creates a stack frame of size 112 bytes (STACK_FRAME_OVERHEAD). Which is far too small to contain a pt_regs. So the result is init_task->thread.regs is pointing at some stack frames on the init stack, not at a pt_regs. We have gotten away with this for so long because with pt_regs at its current size the MSR happens to point into the first frame, at a location that is not written to by the early asm. With the 16 byte expansion the MSR falls into the second frame, which is used by the compiler, and collides with a saved register that tends to be non-zero. As far as I can see this has been wrong since the original merge of 64-bit ppc support, back in 2002. Conceptually swapper should have no regs, it never entered from userspace, and in fact that's what we do on 32-bit. It's also presumably what the "bogus" comment is referring to. So I think the right fix is to just not-initialise regs at all. I'm slightly worried this will break some code that isn't prepared for a NULL regs, but we'll have to see. Remove the comment in head_64.S which refers to us setting up the regs (even though we never did), and is otherwise not really accurate any more. Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200428123130.73078-1-mpe@ellerman.id.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-25  powerpc/crashkernel: Take "mem=" option into account  [Pingfan Liu, 1 file, -3/+5]
[ Upstream commit be5470e0c285a68dc3afdea965032f5ddc8269d7 ] The "mem=" option is an easy way to put high pressure on memory during testing. Hence after applying the memory limit, the actual usable memory, rather than the total memory, should be considered when reserving memory for crashkernel. Otherwise boot-up may experience an OOM issue. E.g. it would reserve 4G prior to the change and 512M afterward, if passing crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and mem=5G on a 256G machine. This issue is powerpc specific because it puts higher priority on fadump and kdump reservation than on "mem=". Referring to the following code: if (fadump_reserve_mem() == 0) reserve_crashkernel(); ... /* Ensure that total memory size is page-aligned. */ limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE); memblock_enforce_memory_limit(limit); While on other arches, the effect of "mem=" takes a higher priority and passes through memblock_phys_mem_size() before calling reserve_crashkernel(). Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1585749644-4148-1-git-send-email-kernelfans@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
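The intent of the fix, as a sketch (a simplification of the code quoted above; the parameters stand in for the real memblock calls):

    /* Size the crashkernel reservation against the memory actually usable
     * after "mem=", not the total RAM. With mem=5G on a 256G box, the
     * "4G-16G:512M" range then selects 512M instead of the 4G that the
     * "128G-:4G" range would have selected. */
    static unsigned long long crash_base_mem(unsigned long long memory_limit,
                                             unsigned long long phys_mem_size)
    {
        return memory_limit ? memory_limit : phys_mem_size;
    }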
2020-06-22  powerpc/64s: Save FSCR to init_task.thread.fscr after feature init  [Michael Ellerman, 1 file, -0/+19]
commit 912c0a7f2b5daa3cbb2bc10f303981e493de73bd upstream. At boot the FSCR is initialised via one of two paths. On most systems it's set to a hard coded value in __init_FSCR(). On newer skiboot systems we use the device tree CPU features binding, where firmware can tell Linux what bits to set in FSCR (and HFSCR). In both cases the value that's configured at boot is not propagated into the init_task.thread.fscr value prior to the initial fork of init (pid 1), which means the value is not used by any processes other than swapper (the idle task). For the __init_FSCR() case this is OK, because the value in init_task.thread.fscr is initialised to something sensible. However it does mean that the value set in __init_FSCR() is not used other than for swapper, which is odd and confusing. The bigger problem is for the device tree CPU features case it prevents firmware from setting (or clearing) FSCR bits for use by user space. This means all existing kernels can not have features enabled/disabled by firmware if those features require setting/clearing FSCR bits. We can handle both cases by saving the FSCR value into init_task.thread.fscr after we have initialised it at boot. This fixes the bug for device tree CPU features, and will allow us to simplify the initialisation for the __init_FSCR() case in a future patch. Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-3-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
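The essence of the fix in one statement; the SPR accessor is the kernel's, but the helper name and its exact placement in the boot sequence are paraphrased here:

    #include <linux/init.h>
    #include <linux/sched/task.h>
    #include <asm/reg.h>

    static void __init save_fscr_to_init_task(void)
    {
        /* Run after __init_FSCR() or the DT CPU features code has
         * configured FSCR, so pid 1 inherits the firmware-chosen bits. */
        init_task.thread.fscr = mfspr(SPRN_FSCR);
    }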
2020-06-22  powerpc/64s: Don't let DT CPU features set FSCR_DSCR  [Michael Ellerman, 1 file, -0/+8]
commit 993e3d96fd08c3ebf7566e43be9b8cd622063e6d upstream. The device tree CPU features binding includes FSCR bit numbers which Linux is instructed to set by firmware. Whether that's a good idea or not, in the case of the DSCR the Linux implementation has a hard requirement that the FSCR_DSCR bit not be set by default. We use it to track when a process reads/writes to DSCR, so it must be clear to begin with. So if firmware tells us to set FSCR_DSCR we must ignore it. Currently this does not cause a bug in our DSCR handling because the value of FSCR that the device tree CPU features code establishes is only used by swapper. All other tasks use the value hard coded in init_task.thread.fscr. However we'd like to fix that in a future commit, at which point this will become necessary. Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-2-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-10  powerpc/pci/of: Parse unassigned resources  [Alexey Kardashevskiy, 1 file, -2/+10]
commit dead1c845dbe97e0061dae2017eaf3bd8f8f06ee upstream. The pseries platform uses the PCI_PROBE_DEVTREE method of PCI probing which reads "assigned-addresses" of every PCI device and initializes the device resources. However if the property is missing or zero sized, then there is no fallback of any kind and the PCI resources remain undiscovered, i.e. the pdev->resource[] array remains empty. This adds a fallback which parses the "reg" property in pretty much the same way except it marks resources as "unset", which later makes Linux assign those resources proper addresses. This has an effect when: 1. a hypervisor failed to assign any resource for a device; 2. /chosen/linux,pci-probe-only=0 is in the DT so the system may try assigning a resource. Neither is likely to happen under PowerVM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
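A hedged sketch of the fallback (simplified; the real parser also decodes the PCI address-space bits of the OF "reg" entries):

    #include <linux/of.h>
    #include <linux/ioport.h>

    static void parse_addrs_with_fallback(struct device_node *node,
                                          struct resource *res)
    {
        int len;
        const __be32 *addrs = of_get_property(node, "assigned-addresses", &len);

        if (!addrs || len == 0) {
            /* Fall back to "reg" and mark the resource unset so Linux
             * assigns it a proper address later. */
            addrs = of_get_property(node, "reg", &len);
            res->flags |= IORESOURCE_UNSET;
        }
        /* ... decode addrs into res->start / res->end ... */
    }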
2020-04-29  powerpc/setup_64: Set cache-line-size based on cache-block-size  [Chris Packham, 1 file, -0/+2]
commit 94c0b013c98583614e1ad911e8795ca36da34a85 upstream. If {i,d}-cache-block-size is set and {i,d}-cache-line-size is not, use the block-size value for both. Per the devicetree spec cache-line-size is only needed if it differs from the block size. Originally the code would fallback from block size to line size. An error message was printed if both properties were missing. Later the code was refactored to use clearer names and logic but it inadvertently made line size a required property, meaning on systems without a line size property we fall back to the default from the cputable. On powernv (OPAL) platforms, since the introduction of device tree CPU features (5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features")), that has led to the wrong value being used, as the fallback value is incorrect for Power8/Power9 CPUs. The incorrect values flow through to the VDSO and also to the sysconf values, SC_LEVEL1_ICACHE_LINESIZE etc. Fixes: bd067f83b084 ("powerpc/64: Fix naming of cache block vs. cache line") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reported-by: Qian Cai <cai@lca.pw> [mpe: Add even more detail to change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200416221908.7886-1-chris.packham@alliedtelesis.co.nz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
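A sketch of the restored fallback; the property names come from the commit, while the surrounding init code is paraphrased:

    #include <linux/of.h>

    static u32 dcache_line_size_from_dt(struct device_node *cpu)
    {
        u32 block = 0, line = 0;

        of_property_read_u32(cpu, "d-cache-block-size", &block);
        if (of_property_read_u32(cpu, "d-cache-line-size", &line))
            line = block;   /* per the DT spec, line size defaults to block size */
        return line;
    }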
2020-04-29Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs ↵Nicholas Piggin1-31/+13
enabled" [ Upstream commit abc3fce76adbdfa8f87272c784b388cd20b46049 ] This reverts commit ebb37cf3ffd39fdb6ec5b07111f8bb2f11d92c5f. That commit does not play well with soft-masked irq state manipulations in idle, interrupt replay, and possibly others due to tracing code sometimes using irq_work_queue (e.g., in trace_hardirqs_on()). That can cause PACA_IRQ_DEC to become set when it is not expected, and be ignored or cleared or cause warnings. The net result seems to be missing an irq_work until the next timer interrupt in the worst case which is usually not going to be noticed, however it could be a long time if the tick is disabled, which is against the spirit of irq_work and might cause real problems. The idea is still solid, but it would need more work. It's not really clear if it would be worth added complexity, so revert this for now (not a straight revert, but replace with a comment explaining why we might see interrupts happening, and gives git blame something to find). Fixes: ebb37cf3ffd3 ("powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200402120401.1115883-1-npiggin@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17  powerpc: Make setjmp/longjmp signature standard  [Clement Courbet, 1 file, -3/+0]
commit c17eb4dca5a353a9dbbb8ad6934fe57af7165e91 upstream. Declaring setjmp()/longjmp() as taking longs makes the signature non-standard, and makes clang complain. In the past, this has been worked around by adding -ffreestanding to the compile flags. The implementation looks like it only ever propagates the value (in longjmp) or sets it to 1 (in setjmp), and we only call longjmp with integer parameters. This allows removing -ffreestanding from the compilation flags. Fixes: c9029ef9c957 ("powerpc: Avoid clang warnings around setjmp and longjmp") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Clement Courbet <courbet@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200330080400.124803-1-courbet@google.com Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
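For reference, the standard C semantics the declarations now match; a small standalone example:

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf env;

    int main(void)
    {
        int val = setjmp(env);      /* standard: returns int, not long */
        if (val) {
            printf("longjmp delivered %d\n", val);
            return 0;
        }
        longjmp(env, 1);            /* standard: takes int, not long */
    }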
2020-04-17  powerpc/kprobes: Ignore traps that happened in real mode  [Christophe Leroy, 1 file, -0/+3]
commit 21f8b2fa3ca5b01f7a2b51b89ce97a3705a15aa0 upstream. When a program check exception happens while MMU translation is disabled, following Oops happens in kprobe_handler() in the following code: } else if (*addr != BREAKPOINT_INSTRUCTION) { BUG: Unable to handle kernel data access on read at 0x0000e268 Faulting instruction address: 0xc000ec34 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=16K PREEMPT CMPC885 Modules linked in: CPU: 0 PID: 429 Comm: cat Not tainted 5.6.0-rc1-s3k-dev-00824-g84195dc6c58a #3267 NIP: c000ec34 LR: c000ecd8 CTR: c019cab8 REGS: ca4d3b58 TRAP: 0300 Not tainted (5.6.0-rc1-s3k-dev-00824-g84195dc6c58a) MSR: 00001032 <ME,IR,DR,RI> CR: 2a4d3c52 XER: 00000000 DAR: 0000e268 DSISR: c0000000 GPR00: c000b09c ca4d3c10 c66d0620 00000000 ca4d3c60 00000000 00009032 00000000 GPR08: 00020000 00000000 c087de44 c000afe0 c66d0ad0 100d3dd6 fffffff3 00000000 GPR16: 00000000 00000041 00000000 ca4d3d70 00000000 00000000 0000416d 00000000 GPR24: 00000004 c53b6128 00000000 0000e268 00000000 c07c0000 c07bb6fc ca4d3c60 NIP [c000ec34] kprobe_handler+0x128/0x290 LR [c000ecd8] kprobe_handler+0x1cc/0x290 Call Trace: [ca4d3c30] [c000b09c] program_check_exception+0xbc/0x6fc [ca4d3c50] [c000e43c] ret_from_except_full+0x0/0x4 --- interrupt: 700 at 0xe268 Instruction dump: 913e0008 81220000 38600001 3929ffff 91220000 80010024 bb410008 7c0803a6 38210020 4e800020 38600000 4e800020 <813b0000> 6d2a7fe0 2f8a0008 419e0154 ---[ end trace 5b9152d4cdadd06d ]--- kprobe is not prepared to handle events in real mode and functions running in real mode should have been blacklisted, so kprobe_handler() can safely bail out telling 'this trap is not mine' for any trap that happened while in real-mode. If the trap happened with MSR_IR or MSR_DR cleared, return 0 immediately. Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Fixes: 6cc89bad60a6 ("powerpc/kprobes: Invoke handlers directly") Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/424331e2006e7291a1bfe40e7f3fa58825f565e1.1582054578.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
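The shape of the three added lines, paraphrased from the message (the MSR bit names are the kernel's; the surrounding handler is elided and the helper name is invented):

    #include <asm/reg.h>
    #include <asm/ptrace.h>

    static int kprobe_trap_is_ours(struct pt_regs *regs)
    {
        /* MMU translation off for instructions or data => real mode;
         * kprobes cannot handle that, so say "this trap is not mine". */
        if (!(regs->msr & MSR_IR) || !(regs->msr & MSR_DR))
            return 0;
        return 1;
    }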
2020-04-17  powerpc/64/tm: Don't let userspace set regs->trap via sigreturn  [Michael Ellerman, 1 file, -1/+3]
commit c7def7fbdeaa25feaa19caf4a27c5d10bd8789e4 upstream. In restore_tm_sigcontexts() we take the trap value directly from the user sigcontext with no checking: err |= __get_user(regs->trap, &sc->gp_regs[PT_TRAP]); This means we can be in the kernel with an arbitrary regs->trap value. Although that's not immediately problematic, there is a risk we could trigger one of the uses of CHECK_FULL_REGS(): #define CHECK_FULL_REGS(regs) BUG_ON(regs->trap & 1) It can also cause us to unnecessarily save non-volatile GPRs again in save_nvgprs(), which shouldn't be problematic but is still wrong. It's also possible it could trick the syscall restart machinery, which relies on regs->trap not being == 0xc00 (see 9a81c16b5275 ("powerpc: fix double syscall restarts")), though I haven't been able to make that happen. Finally it doesn't match the behaviour of the non-TM case, in restore_sigcontext() which zeroes regs->trap. So change restore_tm_sigcontexts() to zero regs->trap. This was discovered while testing Nick's upcoming rewrite of the syscall entry path. In that series the call to save_nvgprs() prior to signal handling (do_notify_resume()) is removed, which leaves the low-bit of regs->trap uncleared which can then trigger the FULL_REGS() WARNs in setup_tm_sigcontexts(). Fixes: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") Cc: stable@vger.kernel.org # v3.9+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200401023836.3286664-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
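The before/after shape of the change (the "before" line is quoted from the message; the "after" mirrors what the non-TM restore_sigcontext() already does):

    /* before: trusts an arbitrary trap value from userspace */
    err |= __get_user(regs->trap, &sc->gp_regs[PT_TRAP]);

    /* after: zero it, matching restore_sigcontext() */
    regs->trap = 0;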
2020-04-17  powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle  [Michael Ellerman, 1 file, -4/+23]
commit 53a712bae5dd919521a58d7bad773b949358add0 upstream. In order to implement KUAP (Kernel Userspace Access Protection) on Power9 we will be using the AMR, and therefore indirectly the UAMOR/AMOR. So save/restore these regs in the idle code. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [ajd: Backport to 4.19 tree, CVE-2020-11669] Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25  powerpc: Include .BTF section  [Naveen N. Rao, 1 file, -0/+6]
[ Upstream commit cb0cc635c7a9fa8a3a0f75d4d896721819c63add ] Selecting CONFIG_DEBUG_INFO_BTF results in the below warning from ld: ld: warning: orphan section `.BTF' from `.btf.vmlinux.bin.o' being placed in section `.BTF' Include .BTF section in vmlinux explicitly to fix the same. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200220113132.857132-1-naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-11  powerpc: fix hardware PMU exception bug on PowerVM compatibility mode systems  [Desnes A. Nunes do Rosario, 1 file, -1/+3]
commit fc37a1632d40c80c067eb1bc235139f5867a2667 upstream. PowerVM systems running compatibility mode on a few Power8 revisions are still vulnerable to the hardware defect that loses PMU exceptions arriving prior to a context switch. The software fix for this issue is enabled through the CPU_FTR_PMAO_BUG cpu_feature bit, nevertheless this bit also needs to be set for PowerVM compatibility mode systems. Fixes: 68f2f0d431d9ea4 ("powerpc: Add a cpu feature CPU_FTR_PMAO_BUG") Signed-off-by: Desnes A. Nunes do Rosario <desnesn@linux.ibm.com> Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200227134715.9715-1-desnesn@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28  powerpc/tm: Fix clearing MSR[TS] in current when reclaiming on signal delivery  [Gustavo Luiz Duarte, 3 files, -28/+39]
commit 2464cc4c345699adea52c7aef75707207cb8a2f6 upstream. After a treclaim, we expect to be in non-transactional state. If we don't clear the current thread's MSR[TS] before we get preempted, then tm_recheckpoint_new_task() will recheckpoint and we get rescheduled in suspended transaction state. When handling a signal caught in transactional state, handle_rt_signal64() calls get_tm_stackpointer() that treclaims the transaction using tm_reclaim_current() but without clearing the thread's MSR[TS]. This can cause the TM Bad Thing exception below if later we pagefault and get preempted trying to access the user's sigframe, using __put_user(). Afterwards, when we are rescheduled back into do_page_fault() (but now in suspended state since the thread's MSR[TS] was not cleared), upon executing 'rfid' after completion of the page fault handling, the exception is raised because a transition from suspended to non-transactional state is invalid. Unexpected TM Bad Thing exception at c00000000000de44 (msr 0x8000000302a03031) tm_scratch=800000010280b033 Oops: Unrecoverable exception, sig: 6 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries CPU: 25 PID: 15547 Comm: a.out Not tainted 5.4.0-rc2 #32 NIP: c00000000000de44 LR: c000000000034728 CTR: 0000000000000000 REGS: c00000003fe7bd70 TRAP: 0700 Not tainted (5.4.0-rc2) MSR: 8000000302a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[SE]> CR: 44000884 XER: 00000000 CFAR: c00000000000dda4 IRQMASK: 0 PACATMSCRATCH: 800000010280b033 GPR00: c000000000034728 c000000f65a17c80 c000000001662800 00007fffacf3fd78 GPR04: 0000000000001000 0000000000001000 0000000000000000 c000000f611f8af0 GPR08: 0000000000000000 0000000078006001 0000000000000000 000c000000000000 GPR12: c000000f611f84b0 c00000003ffcb200 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000f611f8140 GPR24: 0000000000000000 00007fffacf3fd68 c000000f65a17d90 c000000f611f7800 GPR28: c000000f65a17e90 c000000f65a17e90 c000000001685e18 00007fffacf3f000 NIP [c00000000000de44] fast_exception_return+0xf4/0x1b0 LR [c000000000034728] handle_rt_signal64+0x78/0xc50 Call Trace: [c000000f65a17c80] [c000000000034710] handle_rt_signal64+0x60/0xc50 (unreliable) [c000000f65a17d30] [c000000000023640] do_notify_resume+0x330/0x460 [c000000f65a17e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74 Instruction dump: 7c4ff120 e8410170 7c5a03a6 38400000 f8410060 e8010070 e8410080 e8610088 60000000 60000000 e8810090 e8210078 <4c000024> 48000000 e8610178 88ed0989 ---[ end trace 93094aa44b442f87 ]--- The simplified sequence of events that triggers the above exception is: ... # userspace in NON-TRANSACTIONAL state tbegin # userspace in TRANSACTIONAL state signal delivery # kernelspace in SUSPENDED state handle_rt_signal64() get_tm_stackpointer() treclaim # kernelspace in NON-TRANSACTIONAL state __put_user() page fault happens. We will never get back here because of the TM Bad Thing exception.
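A hedged sketch of where the missing step goes; the function and macro names appear in the message and kernel headers, but the helper and its exact placement are paraphrased:

    #include <asm/reg.h>
    #include <asm/tm.h>
    #include <asm/ptrace.h>

    static void reclaim_for_signal(struct pt_regs *regs)
    {
        tm_reclaim_current(TM_CAUSE_SIGNAL);
        /* The fix: drop the transactional-state bits so a preemption
         * cannot recheckpoint us back into suspended state. */
        regs->msr &= ~MSR_TS_MASK;
    }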