Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 43c6494fa1499912c8177e71450c0279041152a6 ]
Back in 2006 Ben added some workarounds for a misbehaviour in the
Spider IO bridge used on early Cell machines, see commit
014da7ff47b5 ("[POWERPC] Cell "Spider" MMIO workarounds"). Later these
were made to be generic, ie. not tied specifically to Spider.
The code stashes a token in the high bits (59-48) of virtual addresses
used for IO (eg. returned from ioremap()). This works fine when using
the Hash MMU, but when we're using the Radix MMU the bits used for the
token overlap with some of the bits of the virtual address.
This is because the maximum virtual address is larger with Radix, up
to c00fffffffffffff, and in fact we use that high part of the address
range for ioremap(), see RADIX_KERN_IO_START.
As it happens the bits that are used overlap with the bits that
differentiate an IO address vs a linear map address. If the resulting
address lies outside the linear mapping we will crash (see below), if
not we just corrupt memory.
virtio-pci 0000:00:00.0: Using 64-bit direct DMA at offset 800000000000000
Unable to handle kernel paging request for data at address 0xc000000080000014
...
CFAR: c000000000626b98 DAR: c000000080000014 DSISR: 42000000 IRQMASK: 0
GPR00: c0000000006c54fc c00000003e523378 c0000000016de600 0000000000000000
GPR04: c00c000080000014 0000000000000007 0fffffff000affff 0000000000000030
^^^^
...
NIP [c000000000626c5c] .iowrite8+0xec/0x100
LR [c0000000006c992c] .vp_reset+0x2c/0x90
Call Trace:
.pci_bus_read_config_dword+0xc4/0x120 (unreliable)
.register_virtio_device+0x13c/0x1c0
.virtio_pci_probe+0x148/0x1f0
.local_pci_probe+0x68/0x140
.pci_device_probe+0x164/0x220
.really_probe+0x274/0x3b0
.driver_probe_device+0x80/0x170
.__driver_attach+0x14c/0x150
.bus_for_each_dev+0xb8/0x130
.driver_attach+0x34/0x50
.bus_add_driver+0x178/0x2f0
.driver_register+0x90/0x1a0
.__pci_register_driver+0x6c/0x90
.virtio_pci_driver_init+0x2c/0x40
.do_one_initcall+0x64/0x280
.kernel_init_freeable+0x36c/0x474
.kernel_init+0x24/0x160
.ret_from_kernel_thread+0x58/0x7c
This hasn't been a problem because CONFIG_PPC_IO_WORKAROUNDS which
enables this code is usually not enabled. It is only enabled when it's
selected by PPC_CELL_NATIVE which is only selected by
PPC_IBM_CELL_BLADE and that in turn depends on BIG_ENDIAN. So in order
to hit the bug you need to build a big endian kernel, with IBM Cell
Blade support enabled, as well as Radix MMU support, and then boot
that on Power9 using Radix MMU.
Still we can fix the bug, so let's do that. We simply use fewer bits
for the token, taking the union of the restrictions on the address
from both Hash and Radix, we end up with 8 bits we can use for the
token. The only user of the token is iowa_mem_find_bus() which only
supports 8 token values, so 8 bits is plenty for that.
Fixes: 566ca99af026 ("powerpc/mm/radix: Add dummy radix_enabled()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 0f99153def98134403c9149128e59d3e1786cf04 upstream.
mpic_get_primary_version() is not defined when not using MPIC.
The compile error log like:
arch/powerpc/sysdev/built-in.o: In function `fsl_of_msi_probe':
fsl_msi.c:(.text+0x150c): undefined reference to `fsl_mpic_primary_get_version'
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Reported-by: Radu Rendec <radu.rendec@gmail.com>
Fixes: 807d38b73b6 ("powerpc/mpic: Add get_version API both for internal and external use")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1bd6a1c4b80a28d975287630644e6b47d0f977a5 upstream.
Crash memory ranges is an array of memory ranges of the crashing kernel
to be exported as a dump via /proc/vmcore file. The size of the array
is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
commit 142b45a72e22 ("memblock: Add array resizing support").
On large memory systems with a few DLPAR operations, the memblock memory
regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
systems, registering fadump results in crash or other system failures
like below:
task: c00007f39a290010 ti: c00000000b738000 task.ti: c00000000b738000
NIP: c000000000047df4 LR: c0000000000f9e58 CTR: c00000000010f180
REGS: c00000000b73b570 TRAP: 0300 Tainted: G L X (4.4.140+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22004484 XER: 20000000
CFAR: c000000000008500 DAR: 000007a450000000 DSISR: 40000000 SOFTE: 0
...
NIP [c000000000047df4] smp_send_reschedule+0x24/0x80
LR [c0000000000f9e58] resched_curr+0x138/0x160
Call Trace:
resched_curr+0x138/0x160 (unreliable)
check_preempt_curr+0xc8/0xf0
ttwu_do_wakeup+0x38/0x150
try_to_wake_up+0x224/0x4d0
__wake_up_common+0x94/0x100
ep_poll_callback+0xac/0x1c0
__wake_up_common+0x94/0x100
__wake_up_sync_key+0x70/0xa0
sock_def_readable+0x58/0xa0
unix_stream_sendmsg+0x2dc/0x4c0
sock_sendmsg+0x68/0xa0
___sys_sendmsg+0x2cc/0x2e0
__sys_sendmsg+0x5c/0xc0
SyS_socketcall+0x36c/0x3f0
system_call+0x3c/0x100
as array index overflow is not checked for while setting up crash memory
ranges causing memory corruption. To resolve this issue, dynamically
allocate memory for crash memory ranges and resize it incrementally,
in units of pagesize, on hitting array size limit.
Fixes: 2df173d9e85d ("fadump: Initialize elfcore header and add PT_LOAD program headers.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: Just use PAGE_SIZE directly, fixup variable placement]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a048a07d7f4535baa4cbad6bc024f175317ab938 upstream.
On some CPUs we can prevent a vulnerability related to store-to-load
forwarding by preventing store forwarding between privilege domains,
by inserting a barrier in kernel entry and exit paths.
This is known to be the case on at least Power7, Power8 and Power9
powerpc CPUs.
Barriers must be inserted generally before the first load after moving
to a higher privilege, and after the last store before moving to a
lower privilege, HV and PR privilege transitions must be protected.
Barriers are added as patch sections, with all kernel/hypervisor entry
points patched, and the exit points to lower privilge levels patched
similarly to the RFI flush patching.
Firmware advertisement is not implemented yet, so CPU flush types
are hard coded.
Thanks to Michal Suchánek for bug fixes and review.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e7347a86830f38dc3e40c8f7e28c04412b12a2e7 upstream.
This moves the definition of the default security feature flags
(i.e., enabled by default) closer to the security feature flags.
This can be used to restore current flags to the default flags.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8ad33041563a10b34988800c682ada14b2612533 upstream.
This landed in setup_64.c for no good reason other than we had nowhere
else to put it. Now that we have a security-related file, that is a
better place for it so move it.
[mpe: Add extern for rfi_flush to fix bisection break]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9a868f634349e62922c226834aa23e3d1329ae7f upstream.
This commit adds security feature flags to reflect the settings we
receive from firmware regarding Spectre/Meltdown mitigations.
The feature names reflect the names we are given by firmware on bare
metal machines. See the hostboot source for details.
Arguably these could be firmware features, but that then requires them
to be read early in boot so they're available prior to asm feature
patching, but we don't actually want to use them for patching. We may
also want to dynamically update them in future, which would be
incompatible with the way firmware features work (at the moment at
least). So for now just make them separate flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c4bc36628d7f8b664657d8bd6ad1c44c177880b7 upstream.
Add some additional values which have been defined for the
H_GET_CPU_CHARACTERISTICS hypercall.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit abf110f3e1cea40f5ea15e85f5d67c39c14568a7 upstream.
For PowerVM migration we want to be able to call setup_rfi_flush()
again after we've migrated the partition.
To support that we need to check that we're not trying to allocate the
fallback flush area after memblock has gone away (i.e., boot-time only).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f5246862f82f1e16bbf84cda4cddf287672b30fe ]
In commit 4f8b50bbbe63 ("irq_work, ppc: Fix up arch hooks") a new
function arch_irq_work_raise() was added without a prototype in header
irq_work.h.
Fix the following warning (treated as error in W=1):
arch/powerpc/kernel/time.c:523:6: error: no previous prototype for ‘arch_irq_work_raise’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3 upstream.
There is code duplicated over all architecture's headers for
futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
and comparison of the result.
Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.
This effectively distributes the Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.
And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause undefined behaviour report.
Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
Cc: linux-mips@linux-mips.org
Cc: Rich Felker <dalias@libc.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sparclinux@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-hexagon@vger.kernel.org
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-xtensa@linux-xtensa.org
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Stafford Horne <shorne@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-parisc@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-alpha@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 34dd25de9fe3f60bfdb31b473bf04b28262d0896 upstream.
This is the start of an effort to tidy up and standardise all the
delays. Existing loops have a range of delay/sleep periods from 1ms
to 20ms, and some have no delay. They all loop forever except rtc,
which times out after 10 retries, and that uses 10ms delays. So use
10ms as our standard delay. The OPAL maintainer agrees 10ms is a
reasonable starting point.
The idea is to use the same recipe everywhere, once this is proven to
work then it will be documented as an OPAL API standard. Then both
firmware and OS can agree, and if a particular call needs something
else, then that can be documented with reasoning.
This is not the end-all of this effort, it's just a relatively easy
change that fixes some existing high latency delays. There should be
provision for standardising timeouts and/or interruptible loops where
possible, so non-fatal firmware errors don't cause hangs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0bfdf598900fd62869659f360d3387ed80eb71cf upstream.
asm/barrier.h is not always included after asm/synch.h, which meant
it was missing __SUBARCH_HAS_LWSYNC, so in some files smp_wmb() would
be eieio when it should be lwsync. kernel/time/hrtimer.c is one case.
__SUBARCH_HAS_LWSYNC is only used in one place, so just fold it in
to where it's used. Previously with my small simulator config, 377
instances of eieio in the tree. After this patch there are 55.
Fixes: 46d075be585e ("powerpc: Optimise smp_wmb")
Cc: stable@vger.kernel.org # v2.6.29+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e41e53cd4fe331d0d1f06f8e4ed7e2cc63ee2c34 ]
virt_addr_valid() is supposed to tell you if it's OK to call virt_to_page() on
an address. What this means in practice is that it should only return true for
addresses in the linear mapping which are backed by a valid PFN.
We are failing to properly check that the address is in the linear mapping,
because virt_to_pfn() will return a valid looking PFN for more or less any
address. That bug is actually caused by __pa(), used in virt_to_pfn().
eg: __pa(0xc000000000010000) = 0x10000 # Good
__pa(0xd000000000010000) = 0x10000 # Bad!
__pa(0x0000000000010000) = 0x10000 # Bad!
This started happening after commit bdbc29c19b26 ("powerpc: Work around gcc
miscompilation of __pa() on 64-bit") (Aug 2013), where we changed the definition
of __pa() to work around a GCC bug. Prior to that we subtracted PAGE_OFFSET from
the value passed to __pa(), meaning __pa() of a 0xd or 0x0 address would give
you something bogus back.
Until we can verify if that GCC bug is no longer an issue, or come up with
another solution, this commit does the minimal fix to make virt_addr_valid()
work, by explicitly checking that the address is in the linear mapping region.
Fixes: bdbc29c19b26 ("powerpc: Work around gcc miscompilation of __pa() on 64-bit")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Tested-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 43e24e82f35291d4c1ca78877ce1b20d3aeb78f1 ]
On powerpc we can build the kernel with two different ABIs for mcount(), which
is used by ftrace. Kernels built with one ABI do not know how to load modules
built with the other ABI. The new style ABI is called "mprofile-kernel", for
want of a better name.
Currently if we build a module using the old style ABI, and the kernel with
mprofile-kernel, when we load the module we'll oops something like:
# insmod autofs4-no-mprofile-kernel.ko
ftrace-powerpc: Unexpected instruction f8810028 around bl _mcount
------------[ cut here ]------------
WARNING: CPU: 6 PID: 3759 at ../kernel/trace/ftrace.c:2024 ftrace_bug+0x2b8/0x3c0
CPU: 6 PID: 3759 Comm: insmod Not tainted 4.11.0-rc3-gcc-5.4.1-00017-g5a61ef74f269 #11
...
NIP [c0000000001eaa48] ftrace_bug+0x2b8/0x3c0
LR [c0000000001eaff8] ftrace_process_locs+0x4a8/0x590
Call Trace:
alloc_pages_current+0xc4/0x1d0 (unreliable)
ftrace_process_locs+0x4a8/0x590
load_module+0x1c8c/0x28f0
SyS_finit_module+0x110/0x140
system_call+0x38/0xfc
...
ftrace failed to modify
[<d000000002a31024>] 0xd000000002a31024
actual: 35:65:00:48
We can avoid this by including in the vermagic whether the kernel/module was
built with mprofile-kernel. Which results in:
# insmod autofs4-pg.ko
autofs4: version magic
'4.11.0-rc3-gcc-5.4.1-00017-g5a61ef74f269 SMP mod_unload modversions '
should be
'4.11.0-rc3-gcc-5.4.1-00017-g5a61ef74f269-dirty SMP mod_unload modversions mprofile-kernel'
insmod: ERROR: could not insert module autofs4-pg.ko: Invalid module format
Fixes: 8c50b72a3b4f ("powerpc/ftrace: Add Kconfig & Make glue for mprofile-kernel")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ca80d5d0a8175c9be04cfbce24180b8f5e0a744b ]
Power9 DD1 does not implement SAO. Although it's not widely used, its presence
or absence is visible to user space via arch_validate_prot() so it's moderately
important that we get the value right.
Fixes: 7dccfbc325bb ("powerpc/book3s: Add a cpu table entry for different POWER9 revs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b9eab08d012fa093947b230f9a87257c27fb829b ]
When attempting to load a livepatch module, I got the following error:
module_64: patch_module: Expect noop after relocate, got 3c820000
The error was triggered by the following code in
unregister_netdevice_queue():
14c: 00 00 00 48 b 14c <unregister_netdevice_queue+0x14c>
14c: R_PPC64_REL24 net_set_todo
150: 00 00 82 3c addis r4,r2,0
GCC didn't insert a nop after the branch to net_set_todo() because it's
a sibling call, so it never returns. The nop isn't needed after the
branch in that case.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-and-tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bdcb1aefc5b3f7d0f1dc8b02673602bca2ff7a4b upstream.
The fallback RFI flush is used when firmware does not provide a way
to flush the cache. It's a "displacement flush" that evicts useful
data by displacing it with an uninteresting buffer.
The flush has to take care to work with implementation specific cache
replacment policies, so the recipe has been in flux. The initial
slow but conservative approach is to touch all lines of a congruence
class, with dependencies between each load. It has since been
determined that a linear pattern of loads without dependencies is
sufficient, and is significantly faster.
Measuring the speed of a null syscall with RFI fallback flush enabled
gives the relative improvement:
P8 - 1.83x
P9 - 1.75x
The flush also becomes simpler and more adaptable to different cache
geometries.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Backport to 4.9]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 222f20f140623ef6033491d0103ee0875fe87d35 upstream.
This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Backport to 4.9, use RFI_TO_KERNEL in idle_book3s.S]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1b689a95ce7427075f9ac9fb4aea1af530742b7f upstream.
Commit 6e032b350cd1 ("powerpc/powernv: Check device-tree for RFI flush
settings") uses u64 in asm/hvcall.h without including linux/types.h
This breaks hvcall.h users that do not include the header themselves.
Fixes: 6e032b350cd1 ("powerpc/powernv: Check device-tree for RFI flush settings")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aa8a5e0062ac940f7659394f4817c948dc8c0667 upstream.
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.
This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.
The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.
In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.
In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.
For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.
In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[Balbir - back ported to stable with changes]
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 50e51c13b3822d14ff6df4279423e4b7b2269bc3 upstream.
The rfid/hrfid ((Hypervisor) Return From Interrupt) instruction is
used for switching from the kernel to userspace, and from the
hypervisor to the guest kernel. However it can and is also used for
other transitions, eg. from real mode kernel code to virtual mode
kernel code, and it's not always clear from the code what the
destination context is.
To make it clearer when reading the code, add macros which encode the
expected destination context.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 191eccb1580939fb0d47deb405b82a85b0379070 upstream.
A new hypervisor call has been defined to communicate various
characteristics of the CPU to guests. Add definitions for the hcall
number, flags and a wrapper function.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[Balbir fixed conflicts in backport]
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b492f7e4e07a28e706db26cf4943bb0911435426 upstream.
These functions compute an IP checksum by computing a 64-bit sum and
folding it to 32 bits (the "nofold" in their names refers to folding
down to 16 bits). However, doing (u32) (s + (s >> 32)) is not
sufficient to fold a 64-bit sum to 32 bits correctly. The addition
can produce a carry out from bit 31, which needs to be added in to
the sum to produce the correct result.
To fix this, we copy the from64to32() function from lib/checksum.c
and use that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6ad966d7303b70165228dba1ee8da1a05c10eefe ]
Paul's patch to fix checksum folding, commit b492f7e4e07a ("powerpc/64:
Fix checksum folding in csum_tcpudp_nofold and ip_fast_csum_nofold")
missed a case in csum_add(). Fix it.
Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 32b53c012e0bfe20b2745962a89db0dc72ef3270 ]
Memory hotplug is leading to hash page table calls, even on radix:
arch_add_memory
create_section_mapping
htab_bolt_mapping
BUG_ON(!ppc_md.hpte_insert);
To fix, refactor {create,remove}_section_mapping() into hash__ and
radix__ variants. Leave the radix versions stubbed for now.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a92a80ad386a1a6e3b36d576d52a1a456394b70 upstream.
There is no guarantee that the various isync's involved with
the context switch will order the update of the CPU mask with
the first TLB entry for the new context being loaded by the HW.
Be safe here and add a memory barrier to order any subsequent
load/store which may bring entries into the TLB.
The corresponding barrier on the other side already exists as
pte updates use pte_xchg() which uses __cmpxchg_u64 which has
a sync after the atomic operation.
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add comments in the code]
[mpe: Backport to 4.12, minor context change]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit b4624ff952ec7d268a9651cd9184a1995befc271 which is
commit ba4a648f12f4cd0a8003dd229b6ca8a53348ee4b upstream.
Michal Hocko writes:
JFYI. We have encountered a regression after applying this patch on a
large ppc machine. While the patch is the right thing to do it doesn't
work well with the current vmalloc area size on ppc and large machines
where NUMA nodes are very far from each other. Just for the reference
the boot fails on such a machine with bunch of warning preceeding it.
See http://lkml.kernel.org/r/20170724134240.GL25221@dhcp22.suse.cz
It seems the right thing to do is to enlarge the vmalloc space on ppc
but this is not the case in the upstream kernel yet AFAIK. It is also
questionable whether that is a stable material but I will decision on
you here.
We have reverted this patch from our 4.4 based kernel.
Newer kernels do not have enlarged vmalloc space yet AFAIK so they won't
work properly eiter. This bug is quite rare though because you need a
specific HW configuration to trigger the issue - namely NUMA nodes have
to be far away from each other in the physical memory space.
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2400fd822f467cb4c886c879d8ad99feac9cf319 upstream.
The workaround for the CELL timebase bug does not correctly mark cr0 as
being clobbered. This means GCC doesn't know that the asm block changes cr0 and
might leave the result of an unrelated comparison in cr0 across the block, which
we then trash, leading to basically random behaviour.
Fixes: 859deea949c3 ("[POWERPC] Cell timebase bug workaround")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Tweak change log and flag for stable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 01e6a61aceb82e13bec29502a8eb70d9574f97ad upstream.
Although it's not documented anywhere, there is an expectation that
atomic64_inc_not_zero() returns a result which fits in an int. This is
the behaviour implemented on all arches except powerpc.
This has caused at least one bug in practice, in the percpu-refcount
code, where the long result from our atomic64_inc_not_zero() was
truncated to an int leading to lost references and stuck systems. That
was worked around in that code in commit 966d2b04e070 ("percpu-refcount:
fix reference leak during percpu-atomic transition").
To the best of my grepping abilities there are no other callers
in-tree which truncate the value, but we should fix it anyway. Because
the breakage is subtle and potentially very harmful I'm also tagging
it for stable.
Code generation is largely unaffected because in most cases the
callers are just using the result for a test anyway. In particular the
case of fget() that was mentioned in commit a6cf7ed5119f
("powerpc/atomic: Implement atomic*_inc_not_zero") generates exactly
the same code.
Fixes: a6cf7ed5119f ("powerpc/atomic: Implement atomic*_inc_not_zero")
Noticed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 47ebb09d54856500c5a5e14824781902b3bb738e upstream.
Now that explicitly executed loaders are loaded in the mmap region, we
have more freedom to decide where we position PIE binaries in the
address space to avoid possible collisions with mmap or stack regions.
For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
address space for 32-bit pointers. On 32-bit use 4MB, which is the
traditional x86 minimum load location, likely to avoid historically
requiring a 4MB page table entry when only a portion of the first 4MB
would be used (since the NULL address is avoided).
Link: http://lkml.kernel.org/r/1498154792-49952-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ba4a648f12f4cd0a8003dd229b6ca8a53348ee4b upstream.
In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
switched to the generic implementation of cpu_to_node(), which uses a percpu
variable to hold the NUMA node for each CPU.
Unfortunately we neglected to notice that we use cpu_to_node() in the allocation
of our percpu areas, leading to a chicken and egg problem. In practice what
happens is when we are setting up the percpu areas, cpu_to_node() reports that
all CPUs are on node 0, so we allocate all percpu areas on node 0.
This is visible in the dmesg output, as all pcpu allocs being in group 0:
pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
To fix it we need an early_cpu_to_node() which can run prior to percpu being
setup. We already have the numa_cpu_lookup_table we can use, so just plumb it
in. With the patch dmesg output shows two groups, 0 and 1:
pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
We can also check the data_offset in the paca of various CPUs, with the fix we
see:
CPU 0: data_offset = 0x0ffe8b0000
CPU 24: data_offset = 0x1ffe5b0000
And we can see from dmesg that CPU 24 has an allocation on node 1:
node 0: [mem 0x0000000000000000-0x0000000fffffffff]
node 1: [mem 0x0000001000000000-0x0000001fffffffff]
Fixes: 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9765ad134a00a01cbcc69c78ff6defbfad209bc5 upstream.
powerpc expects IRQs to already be (soft) disabled when switch_mm() is
called, as made clear in the commit message of 9c1e105238c4 ("powerpc: Allow
perf_counters to access user memory at interrupt time").
Aside from any race conditions that might exist between switch_mm() and an IRQ,
there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't
followed at some point by an IRQ enable then interrupts will remain disabled
until we return to userspace.
It is true that when switch_mm() is called from the scheduler IRQs are off, but
not when it's called by use_mm(). Looking closer we see that last year in commit
f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
this was made more explicit by the addition of switch_mm_irqs_off() which is now
called by the scheduler, vs switch_mm() which is used by use_mm().
Arguably it is a bug in use_mm() to call switch_mm() in a different context than
it expects, but fixing that will take time.
This was discovered recently when vhost started throwing warnings such as:
BUG: sleeping function called from invalid context at kernel/mutex.c:578
in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760
no locks held by vhost-10760/10768.
irq event stamp: 10
hardirqs last enabled at (9): _raw_spin_unlock_irq+0x40/0x80
hardirqs last disabled at (10): switch_slb+0x2e4/0x490
softirqs last enabled at (0): copy_process+0x5e8/0x1260
softirqs last disabled at (0): (null)
Call Trace:
show_stack+0x88/0x390 (unreliable)
dump_stack+0x30/0x44
__might_sleep+0x1c4/0x2d0
mutex_lock_nested+0x74/0x5c0
cgroup_attach_task_all+0x5c/0x180
vhost_attach_cgroups_work+0x58/0x80 [vhost]
vhost_worker+0x24c/0x3d0 [vhost]
kthread+0xec/0x100
ret_from_kernel_thread+0x5c/0xd4
Prior to commit 04b96e5528ca ("vhost: lockless enqueuing") (Aug 2016) the
vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(),
which had the effect of reenabling IRQs. Since that commit removed the locking
in vhost_worker() the body of the vhost_worker() loop now runs with interrupts
off causing the warnings.
This patch addresses the problem by making the powerpc code mirror the x86 code,
ie. we disable interrupts in switch_mm(), and optimise the scheduler case by
defining switch_mm_irqs_off().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Flesh out/rewrite change log, add stable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4ab2537c4204b976e4ca350bbdc193b4649cad28 upstream.
In commit a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated
LPCR_VRMASD wrongly as below.
-#define LPCR_VRMASD (0x1ful << (63-16))
+#define LPCR_VRMASD_SH 47
+#define LPCR_VRMASD (ASM_CONST(1) << LPCR_VRMASD_SH)
We initialize the VRMA bits in LPCR to 0x00 in kvm. Hence using a
different mask value as above while updating lpcr should not have any
impact.
This patch updates it to the correct value.
Fixes: a4b349540a26 ("powerpc/mm: Cleanup LPCR defines")
Reported-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jia He <hejianet@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d7baee6901b34c4895eb78efdbf13a49079d7404 ]
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.
This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.
This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 88f54a3581eb9deaa3bd1aade40aef266d782385 ]
We are going to get rid of @current references in mmu_context_boos3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.
This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).
This should not cause any behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a5ecdad4847897007399d7a14c9109b65ce4c9b7 upstream.
Without this we will always find the feature disabled.
Fixes: 984d7a1ec6 ("powerpc/mm: Fixup kernel read only mapping")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9b256714979fad61ae11d90b53cf67dd5e6484eb upstream.
The IPIs come in as HVI not EE, so we need to test the appropriate
SRR1 bits. The encoding is such that it won't have false positives
on P7 and P8 so we can just test it like that. We also need to handle
the icp-opal variant of the flush.
Fixes: d74361881f0d ("powerpc/xics: Add ICP OPAL backend")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael |