Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit b8b2f37cf632434456182e9002d63cbc4cccc50c ]
When adding a PTE a ptesync is needed to order the update of the PTE
with subsequent accesses otherwise a spurious fault may be raised.
radix__set_pte_at() does not do this for performance gains. For
non-kernel memory this is not an issue as any faults of this kind are
corrected by the page fault handler. For kernel memory these faults
are not handled. The current solution is that there is a ptesync in
flush_cache_vmap() which should be called when mapping from the
vmalloc region.
However, map_kernel_page() does not call flush_cache_vmap(). This is
troublesome in particular for code patching with Strict RWX on radix.
In do_patch_instruction() the page frame that contains the instruction
to be patched is mapped and then immediately patched. With no ordering
or synchronization between setting up the PTE and writing to the page
it is possible for faults.
As the code patching is done using __put_user_asm_goto() the resulting
fault is obscured - but using a normal store instead it can be seen:
BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
Faulting instruction address: 0xc00000000008bd74
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: nop_module(PO+) [last unloaded: nop_module]
CPU: 4 PID: 757 Comm: sh Tainted: P O 5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
NIP: c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
REGS: c000000016f634a0 TRAP: 0300 Tainted: P O (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 44002884 XER: 00000000
CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
This results in the kind of issue reported here:
https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
Chris Riedl suggested a reliable way to reproduce the issue:
$ mount -t debugfs none /sys/kernel/debug
$ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done) &
Turning ftrace on and off does a large amount of code patching which
in usually less then 5min will crash giving a trace like:
ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
------------[ ftrace bug ]------------
ftrace failed to modify
[<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
actual: 11:3b:47:4b
Setting ftrace call site to call ftrace function
ftrace record flags: 80000001
(1)
expected tramp: c00000000006c96c
------------[ cut here ]------------
WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
Modules linked in: nop_module(PO-) [last unloaded: nop_module]
CPU: 4 PID: 809 Comm: sh Tainted: P O 5.10.0-rc5-01360-gf878ccaf250a #1
NIP: c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
REGS: c000000004c8b760 TRAP: 0700 Tainted: P O (5.10.0-rc5-01360-gf878ccaf250a)
MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28008848 XER: 20040000
CFAR: c0000000001a9c98 IRQMASK: 0
GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
NIP ftrace_bug+0x28c/0x2e8
LR ftrace_bug+0x288/0x2e8
Call Trace:
ftrace_bug+0x288/0x2e8 (unreliable)
ftrace_modify_all_code+0x168/0x210
arch_ftrace_update_code+0x18/0x30
ftrace_run_update_code+0x44/0xc0
ftrace_startup+0xf8/0x1c0
register_ftrace_function+0x4c/0xc0
function_trace_init+0x80/0xb0
tracing_set_tracer+0x2a4/0x4f0
tracing_set_trace_write+0xd4/0x130
vfs_write+0xf0/0x330
ksys_write+0x84/0x140
system_call_exception+0x14c/0x230
system_call_common+0xf0/0x27c
To fix this when updating kernel memory PTEs using ptesync.
Fixes: f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Tidy up change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210208032957.1232102-1-jniethe5@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
|
|
There are several occurrences of the following pattern:
for_each_memblock(memory, reg) {
start = __pfn_to_phys(memblock_region_memory_base_pfn(reg);
end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));
/* do something with start and end */
}
Using for_each_mem_range() iterator is more appropriate in such cases and
allows simpler and cleaner code.
[akpm@linux-foundation.org: fix arch/arm/mm/pmsa-v7.c build]
[rppt@linux.ibm.com: mips: fix cavium-octeon build caused by memblock refactoring]
Link: http://lkml.kernel.org/r/20200827124549.GD167163@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-13-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make it consistent with other usages.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201007114836.282468-5-aneesh.kumar@linux.ibm.com
|
|
Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic")
make sure different variables tracking lmb_size are updated to be 64 bit.
Fixes: af9d00e93a4f ("powerpc/mm/radix: Create separate mappings for hot-plugged memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201007114836.282468-4-aneesh.kumar@linux.ibm.com
|
|
This ensures we don't do a partial mapping of memory. With nvdimm, when
creating namespaces with size not aligned to 16MB, the kernel ends up partially
mapping the pages. This can result in kernel adding multiple hash page table
entries for the same range. A new namespace will result in
create_section_mapping() with start and end overlapping an already existing
bolted hash page table entry.
commit: 6acd7d5ef264 ("libnvdimm/namespace: Enforce memremap_compat_align()")
made sure that we always create namespaces aligned to 16MB. But we can do
better by avoiding mapping pages that are not aligned. This helps to catch
access to these partially mapped pages early.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200907072539.67310-1-aneesh.kumar@linux.ibm.com
|
|
If the hypervisor doesn't support hugepages, the kernel ends up allocating a large
number of page table pages. The early page table allocation was wrongly
setting the max memblock limit to ppc64_rma_size with radix translation
which resulted in boot failure as shown below.
Kernel panic - not syncing:
early_alloc_pgtable: Failed to allocate 16777216 bytes align=0x1000000 nid=-1 from=0x0000000000000000 max_addr=0xffffffffffffffff
CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-24.9-default+ #2
Call Trace:
[c0000000016f3d00] [c0000000007c6470] dump_stack+0xc4/0x114 (unreliable)
[c0000000016f3d40] [c00000000014c78c] panic+0x164/0x418
[c0000000016f3dd0] [c000000000098890] early_alloc_pgtable+0xe0/0xec
[c0000000016f3e60] [c0000000010a5440] radix__early_init_mmu+0x360/0x4b4
[c0000000016f3ef0] [c000000001099bac] early_init_mmu+0x1c/0x3c
[c0000000016f3f10] [c00000000109a320] early_setup+0x134/0x170
This was because the kernel was checking for the radix feature before we enable the
feature via mmu_features. This resulted in the kernel using hash restrictions on
radix.
Rework the early init code such that the kernel boot with memblock restrictions
as imposed by hash. At that point, the kernel still hasn't finalized the
translation the kernel will end up using.
We have three different ways of detecting radix.
1. dt_cpu_ftrs_scan -> used only in case of PowerNV
2. ibm,pa-features -> Used when we don't use cpu_dt_ftr_scan
3. CAS -> Where we negotiate with hypervisor about the supported translation.
We look at 1 or 2 early in the boot and after that, we look at the CAS vector to
finalize the translation the kernel will use. We also support a kernel command
line option (disable_radix) to switch to hash.
Update the memblock limit after mmu_early_init_devtree() if the kernel is going
to use radix translation. This forces some of the memblock allocations we do before
mmu_early_init_devtree() to be within the RMA limit.
Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines")
Reported-by: Shirisha Ganta <shiganta@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200828100852.426575-1-aneesh.kumar@linux.ibm.com
|
|
Certain warnings are emitted for powerpc code when building with a gcc-10
toolset:
WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch in
reference from the function remove_pmd_table() to the function
.meminit.text:split_kernel_mapping()
The function remove_pmd_table() references
the function __meminit split_kernel_mapping().
This is often because remove_pmd_table lacks a __meminit
annotation or the annotation of split_kernel_mapping is wrong.
Add the appropriate __init and __meminit annotations to make modpost not
complain. In all the cases there are just a single callsite from another
__init or __meminit function:
__meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
__init prom_init() -> setup_secure_guest()
__init xive_spapr_init() -> xive_spapr_disabled()
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200729133741.62789-1-vdronov@redhat.com
|
|
UAMOR values are not application-specific. The kernel initializes
its value based on different reserved keys. Remove the thread-specific
UAMOR value and don't switch the UAMOR on context switch.
Move UAMOR initialization to key initialization code and remove
thread_struct.uamor because it is not used anymore.
Before commit: 4a4a5e5d2aad ("powerpc/pkeys: key allocation/deallocation must not change pkey registers")
we used to update uamor based on key allocation and free.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-20-aneesh.kumar@linux.ibm.com
|
|
This will be used to enable/disable Kernel Userspace Execution
Prevention (KUEP).
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-15-aneesh.kumar@linux.ibm.com
|
|
To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.
Pseries systems supports max LMB size of 256MB. Hence on pseries,
we now end up mapping memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range. That was added that as part of
commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")
But PowerVM doesn't provide that hint yet. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.
With this change KVM guest will also be doing linear mapping with
2M page size.
The actual TLB benefit of mapping guest page table entries with
hugepage size can only be materialized if the partition scoped
entries are also using the same or higher page size. A guest using
1G hugetlbfs backing guest memory can have a performance impact with
the above change.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Fold in fix from Aneesh spotted by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-5-aneesh.kumar@linux.ibm.com
|
|
We split the page table mapping on memory unplug if the
linear range was mapped with huge page mapping (for ex: 1G)
The page table splitting code has a few issues:
1. Recursive locking
--------------------
Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.
2. BUG: sleeping function called from in_atomic() context
---------------------------------------------------------
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()
3. Bad unlock unbalance
-----------------------
Memory unplug path takes init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reaquire
ptl. Releasing ptl from a different thread than which acquired it
causes bad unlock unbalance.
These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mapping, thereby removing the need for splitting them during
unplug. The kernel always make sure the minimum unplug request is
SUBSECTION_SIZE for device memory and SECTION_SIZE for regular memory.
In preparation for such a change remove page table splitting support.
This essentially is a revert of
commit 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-4-aneesh.kumar@linux.ibm.com
|
|
remove_pagetable() isn't freeing PUD table. This causes memory
leak during memory unplug. Fix this.
Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()")
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-3-aneesh.kumar@linux.ibm.com
|
|
We can hit the following BUG_ON during memory unplug:
kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74
This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.
This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared for another memblock allocation and we can
free them correctly on removing page-table pages.
Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparision of how much more we need for a 64T and 2G
system after this patch:
1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.
128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)
With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K
2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.
1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)
With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-2-aneesh.kumar@linux.ibm.com
|
|
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Support for userspace to send requests directly to the on-chip GZIP
accelerator on Power9.
- Rework of our lockless page table walking (__find_linux_pte()) to
make it safe against parallel page table manipulations without
relying on an IPI for serialisation.
- A series of fixes & enhancements to make our machine check handling
more robust.
- Lots of plumbing to add support for "prefixed" (64-bit) instructions
on Power10.
- Support for using huge pages for the linear mapping on 8xx (32-bit).
- Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
driver.
- Removal of some obsolete 40x platforms and associated cruft.
- Initial support for booting on Power10.
- Lots of other small features, cleanups & fixes.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
Wolfram Sang, Xiongfeng Wang.
* tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
powerpc/pseries: Make vio and ibmebus initcalls pseries specific
cxl: Remove dead Kconfig options
powerpc: Add POWER10 architected mode
powerpc/dt_cpu_ftrs: Add MMA feature
powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
powerpc: Add support for ISA v3.1
powerpc: Add new HWCAP bits
powerpc/64s: Don't set FSCR bits in INIT_THREAD
powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
powerpc/64s: Don't let DT CPU features set FSCR_DSCR
powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
powerpc/module_64: Consolidate ftrace code
powerpc/32: Disable KASAN with pages bigger than 16k
powerpc/uaccess: Don't set KUEP by default on book3s/32
powerpc/uaccess: Don't set KUAP by default on book3s/32
powerpc/8xx: Reduce time spent in allow_user_access() and friends
...
|
|
Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.
[rppt@linux.ibm.com: powerpc/xmon: drop unused pgdir varialble in show_pte() function]
Link: http://lkml.kernel.org/r/20200519181454.GI1059226@linux.ibm.com
[rppt@linux.ibm.com; build fix]
Link: http://lkml.kernel.org/r/20200423141845.GI13521@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> # 8xx and 83xx
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: James Morse <james.morse@arm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200414153455.21744-9-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
_ALIGN_UP() is specific to powerpc
ALIGN() is generic and does the same
Replace _ALIGN_UP() by ALIGN()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/r/8a6d7e45f7904c73a0af539642d3962e2a3c7268.1587407777.git.christophe.leroy@c-s.fr
|
|
Now that all the lockless page table walk is careful w.r.t the PTE
address returned, we can now revert
commit: 13bd817bb884 ("powerpc/thp: Serialize pmd clear against a linux page table walk.")
We also drop the equivalent IPI from other pte updates routines. We still keep
IPI in hash pmdp collapse and that is to take care of parallel hash page table
insert. The radix pmdp collapse flush can possibly be removed once I am sure
generic code doesn't have the any expectations around parallel gup walk.
This speeds up Qemu guest RAM del/unplug time as below
128 core, 496GB guest:
Without patch:
munmap start: timer = 13162 ms, PID=7684
munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms
With patch:
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-21-aneesh.kumar@linux.ibm.com
|
|
In prepartion to support a pgprot_t argument for arch_add_memory().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Badger <ebadger@gigaio.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200306170846.9333-6-logang@deltatee.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200302010410.2957362-1-npiggin@gmail.com
|
|
Commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with
KVM") introduced a number of workarounds as coming out of a guest with
the mmu enabled would make the cpu would start running in hypervisor
state with the PID value from the guest. The cpu will then start
prefetching for the hypervisor with that PID value.
In Power9 DD2.2 the cpu behaviour was modified to fix this. When
accessing Quadrant 0 in hypervisor mode with LPID != 0 prefetching will
not be performed. This means that we can get rid of the workarounds for
Power9 DD2.2 and later revisions. Add a new cpu feature
CPU_FTR_P9_RADIX_PREFETCH_BUG to indicate if the workarounds are needed.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191206031722.25781-1-jniethe5@gmail.com
|
|
Patch series "hugetlbfs: convert macros to static inline, fix sparse
warning".
The definition for huge_pte_offset() in <linux/hugetlb.h> causes a
sparse warning in the !CONFIG_HUGETLB_PAGE. Fix this as well as
converting all macros in this block of definitions to static inlines for
better type checking.
When making the above changes, build errors were found in powerpc due to
duplicate definitions. A separate powerpc specific patch is included as
a requisite to remove the definitions and get them from
<linux/hugetlb.h>.
This patch (of 2):
This removes the power specific stubs created by commit aad71e3928be
("powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n") used when
!CONFIG_HUGETLB_PAGE. Instead, it addresses the build break by getting
the definitions from <linux/hugetlb.h>. This allows the macros in
<linux/hugetlb.h> to be replaced with static inlines.
Link: http://lkml.kernel.org/r/20191112194558.139389-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ben Dooks <ben.dooks@codethink.co.uk>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
More libnvdimm updates from Dan Williams:
- Complete the reworks to interoperate with powerpc dynamic huge page
sizes
- Fix a crash due to missed accounting for the powerpc 'struct
page'-memmap mapping granularity
- Fix badblock initialization for volatile (DRAM emulated) pmem ranges
- Stop triggering request_key() notifications to userspace when
NVDIMM-security is disabled / not present
- Miscellaneous small fixups
* tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/region: Enable MAP_SYNC for volatile regions
libnvdimm: prevent nvdimm from requesting key when security is disabled
libnvdimm/region: Initialize bad block for volatile namespaces
libnvdimm/nfit_test: Fix acpi_handle redefinition
libnvdimm/altmap: Track namespace boundaries in altmap
libnvdimm: Fix endian conversion issues
libnvdimm/dax: Pick the right alignment default when creating dax devices
powerpc/book3s64: Export has_transparent_hugepage() related functions.
|
|
In later patch, we want to use hash_transparent_hugepage() in a kernel module.
Export two related functions.
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Booting a POWER9 PowerNV system generates a few messages below with
"____ptrval____" due to the pointers printed without a specifier
extension (i.e unadorned %p) are hashed to prevent leaking information
about the kernel memory layout.
radix-mmu: Initializing Radix MMU
radix-mmu: Partition table (____ptrval____)
radix-mmu: Mapped 0x0000000000000000-0x0000000040000000 with 1.00 GiB
pages (exec)
radix-mmu: Mapped 0x0000000040000000-0x0000002000000000 with 1.00 GiB
pages
radix-mmu: Mapped 0x0000200000000000-0x0000202000000000 with 1.00 GiB
pages
radix-mmu: Process table (____ptrval____) and radix root for kernel:
(____ptrval____)
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1566570120-16529-1-git-send-email-cai@lca.pw
|
|
The various translation structure invalidations performed in early boot
when the MMU is off are not required, because everything is invalidated
immediately before a CPU first enables its MMU (see early_init_mmu
and early_init_mmu_secondary).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-6-npiggin@gmail.com
|
|
Radix guests are responsible for managing their own translation caches,
so make them match bare metal radix and hash, and make each CPU flush
all its translations right before enabling its MMU.
Radix guests may not flush partition scope translations, so in
tlbiel_all, make these flushes conditional on CPU_FTR_HVMODE. Process
scope translations are the only type visible to the guest.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-5-npiggin@gmail.com
|
|
No functional change.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-4-npiggin@gmail.com
|
|
This callback is only required because the partition table init comes
before process table allocation on powernv (aka bare metal aka native).
Change the order to allocate the process table first, and remove the
callback.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-2-npiggin@gmail.com
|
|
Merge our ppc-kvm topic branch to bring in the Ultravisor support
patches.
|
|
In ultravisor enabled systems, PTCR becomes ultravisor privileged only
for writing and an attempt to write to it will cause a Hypervisor
Emulation Assitance interrupt.
This patch uses the set_ptcr_when_no_uv() function to restrict PTCR
writing to only when ultravisor is disabled.
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-6-cclaudio@linux.ibm.com
|
|
book3s64's ioremap_range() is almost same as fallback ioremap_range(),
except that it calls radix__ioremap_range() when radix is enabled.
radix__ioremap_range() is also very similar to the other ones, expect
that it calls ioremap_page_range when slab is available.
PPC32 __ioremap_caller() have a loop doing the same thing as
ioremap_range() so use it on PPC32 as well.
Lets keep only one version of ioremap_range() which calls
ioremap_page_range() on all platforms when slab is available.
At the same time, drop the nid parameter which is not used.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4b1dca7096b01823b101be7338983578641547f1.1566309263.git.christophe.leroy@c-s.fr
|
|
create_physical_mapping expects physical addresses, but splitting
these mapping on hot unplug is supplying virtual (effective)
addresses.
Fixes: 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-2-npiggin@gmail.com
|
|
create_physical_mapping expects physical addresses, but creating and
splitting these mappings after boot is supplying virtual (effective)
addresses. This can be irritated by booting with mem= to limit memory
then probing an unused physical memory range:
echo <addr> > /sys/devices/system/memory/probe
This mostly works by accident, firstly because __va(__va(x)) == __va(x)
so the virtual address does not get corrupted. Secondly because pfn_pte
masks out the upper bits of the pfn beyond the physical address limit,
so a pfn constructed with a 0xc000000000000000 virtual linear address
will be masked back to the correct physical address in the pte.
Fixes: 6cc27341b21a8 ("powerpc/mm: add radix__create_section_mapping()")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-1-npiggin@gmail.com
|
|
Finish up what commit c2febafc6773 ("mm: convert generic code to 5-level
paging") started while levelling up P4D huge mapping support at par with
PUD and PMD. A new arch call back arch_ioremap_p4d_supported() is added
which just maintains status quo (P4D huge map not supported) on x86,
arm64 and powerpc.
When HAVE_ARCH_HUGE_VMAP is enabled its just a simple check from the
arch about the support, hence runtime effects are minimal.
Link: http://lkml.kernel.org/r/1561699231-20991-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel |