Age | Commit message (Collapse) | Author | Files | Lines |
|
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA
windows for PEs, their ownership transfer, create/set/unset the TCE
tables for the Dynamic DMA wundows(DDW). VFIOS uses these APIs for
support on POWER.
The commit 9d67c9433509 ("powerpc/iommu: Add "borrowing"
iommu_table_group_ops") implemented partial support for this API with
"borrow" mechanism wherein the DMA windows if created already by the
host driver, they would be available for VFIO to use. Also, it didn't
have the support to control/modify the window size or the IO page
size.
The current patch implements all the necessary iommu_table_group_ops
APIs there by avoiding the "borrrowing". So, just the way it is on the
PowerNV platform, with this patch the iommu table group ownership is
transferred to the VFIO PPC subdriver, the iommu table, DMA windows
creation/deletion all driven through the APIs.
The pSeries uses the query-pe-dma-window, create-pe-dma-window and
reset-pe-dma-window RTAS calls for DMA window creation, deletion and
reset to defaul. The RTAs calls do show some minor differences to the
way things are to be handled on the pSeries which are listed below.
* On pSeries, the default DMA window size is "fixed" cannot be custom
sized as requested by the user. For non-SRIOV VFs, It is fixed at 2GB
and for SRIOV VFs, its variable sized based on the capacity assigned
to it during the VF assignment to the LPAR. So, for the default DMA
window alone the size if requested less than tce32_size, the smaller
size is enforced using the iommu table->it_size.
* The DMA start address for 32-bit window is 0, and for the 64-bit
window in case of PowerNV is hardcoded to TVE select (bit 59) at 512PiB
offset. This address is returned at the time of create_table() API call
(even before the window is created), the subsequent set_window() call
actually opens the DMA window. On pSeries, the DMA start address for
32-bit window is known from the 'ibm,dma-window' DT property. However,
the 64-bit window start address is not known until the create-pe-dma
RTAS call is made. So, the create_table() which returns the DMA window
start address actually opens the DMA window and returns the DMA start
address as returned by the Hypervisor for the create-pe-dma RTAS call.
* The reset-pe-dma RTAS call resets the DMA windows and restores the
default DMA window, however it does not clear the TCE table entries
if there are any. In case of ownership transfer from platform domain
which used direct mapping, the patch chooses remove-pe-dma instead of
reset-pe for the 64-bit window intentionally so that the
clear_dma_window() is called.
Other than the DMA window management changes mentioned above, the
patch also brings back the userspace view for the single level TCE
as it existed before commit 090bad39b237a ("powerpc/powernv: Add
indirect levels to it_userspace") along with the relavent
refactoring.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171923275958.1397.907964437142542242.stgit@linux.ibm.com
|
|
Move function dev_has_iommu_table() to powerpc/kernel/iommu.c
as it is going to be used by machine specific iommu code as
well in subsequent patches.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171923274748.1397.6274953248403106679.stgit@linux.ibm.com
|
|
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c.
The pSeries specific table_group_ops are sitting in the generic powerpc
file. Move it to where it actually belong(pseries/iommu.c).
The functions are currently defined even for CONFIG_PPC_POWERNV
which are unused on PowerNV.
Only code movement, no functional changes intended.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171923269701.1397.15758640002786937132.stgit@linux.ibm.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:
- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.
- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""
* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...
|
|
Patch series "Memory allocation profiling", v6.
Overview:
Low overhead [1] per-callsite memory allocation profiling. Not just for
debug kernels, overhead low enough to be deployed in production.
Example output:
root@moria-kvm:~# sort -rn /proc/allocinfo
127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
56373248 4737 mm/slub.c:2259 func:alloc_slab_page
14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
3940352 962 mm/memory.c:4214 func:alloc_anon_folio
2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
...
Usage:
kconfig options:
- CONFIG_MEM_ALLOC_PROFILING
- CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
- CONFIG_MEM_ALLOC_PROFILING_DEBUG
adds warnings for allocations that weren't accounted because of a
missing annotation
sysctl:
/proc/sys/vm/mem_profiling
Runtime info:
/proc/allocinfo
Notes:
[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y
Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:
kmalloc pgalloc
(1 baseline) 6.764s 16.902s
(2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%)
(3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%)
(4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%)
(5 memcg) 13.388s (+97.94%) 48.460s (+186.71%)
(6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%)
(7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%)
Memory overhead:
Kernel size:
text data bss dec diff
(1) 26515311 18890222 17018880 62424413
(2) 26524728 19423818 16740352 62688898 264485
(3) 26524724 19423818 16740352 62688894 264481
(4) 26524728 19423818 16740352 62688898 264485
(5) 26541782 18964374 16957440 62463596 39183
Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags: 192 kB
PageExts: 262144 kB (256MB)
SlabExts: 9876 kB (9.6MB)
PcpuExts: 512 kB (0.5MB)
Total overhead is 0.2% of total memory.
Benchmarks:
Hackbench tests run 100 times:
hackbench -s 512 -l 200 -g 15 -f 25 -P
baseline disabled profiling enabled profiling
avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023)
stdev 0.0137 0.0188 0.0077
hackbench -l 10000
baseline disabled profiling enabled profiling
avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859)
stdev 0.0933 0.0286 0.0489
stress-ng tests:
stress-ng --class memory --seq 4 -t 60
stress-ng --class cpu --seq 4 -t 60
Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/
[2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/
This patch (of 37):
The next patch drops vmalloc.h from a system header in order to fix a
circular dependency; this adds it to all the files that were pulling it in
implicitly.
[kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
[surenb@google.com: fix arch/x86/mm/numa_32.c]
Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
[kent.overstreet@linux.dev: a few places were depending on sizes.h]
Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
[arnd@arndb.de: fix mm/kasan/hw_tags.c]
Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
[surenb@google.com: fix arc build]
Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The patch makes the iommu_group_get() call only when using it
thereby avoiding the unnecessary get & put for domain already
being set case.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/170800513841.2411.13524607664262048895.stgit@linux.ibm.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a crash when hot adding a PCI device to an LPAR since
recent changes
- Fix nested KVM level-2 guest reboot failure due to empty
'arch_compat'
Thanks to Amit Machhiwal, Aneesh Kumar K.V (IBM), Brian King, Gaurav
Batra, and Vaibhav Jain.
* tag 'powerpc-6.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'
powerpc/pseries/iommu: DLPAR add doesn't completely initialize pci_controller
|
|
When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:
BUG: Kernel NULL pointer dereference on read at 0x00000030
Faulting instruction address: 0xc0000000006bbe5c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
NIP: c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
REGS: c00000009924f240 TRAP: 0300 Not tainted (6.7.0-203405+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040006
CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
...
NIP sysfs_add_link_to_group+0x34/0x94
LR iommu_device_link+0x5c/0x118
Call Trace:
iommu_init_device+0x26c/0x318 (unreliable)
iommu_device_link+0x5c/0x118
iommu_init_device+0xa8/0x318
iommu_probe_device+0xc0/0x134
iommu_bus_notifier+0x44/0x104
notifier_call_chain+0xb8/0x19c
blocking_notifier_call_chain+0x64/0x98
bus_notify+0x50/0x7c
device_add+0x640/0x918
pci_device_add+0x23c/0x298
of_create_pci_dev+0x400/0x884
of_scan_pci_dev+0x124/0x1b0
__of_scan_bus+0x78/0x18c
pcibios_scan_phb+0x2a4/0x3b0
init_phb_dynamic+0xb8/0x110
dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
kobj_attr_store+0x2c/0x48
sysfs_kf_write+0x64/0x78
kernfs_fop_write_iter+0x1b0/0x290
vfs_write+0x350/0x4a0
ksys_write+0x84/0x140
system_call_exception+0x124/0x330
system_call_vectored_common+0x15c/0x2ec
Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.
The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().
During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.
Fix is to register the iommu device during DLPAR add as well.
Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240215221833.4817-1-gbatra@linux.ibm.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"This is a bit of a big batch for rc4, but just due to holiday hangover
and because I didn't send any fixes last week due to a late revert
request. I think next week should be back to normal.
- Fix ftrace bug on boot caused by exit text sections with
'-fpatchable-function-entry'
- Fix accuracy of stolen time on pseries since the switch to
VIRT_CPU_ACCOUNTING_GEN
- Fix a crash in the IOMMU code when doing DLPAR remove
- Set pt_regs->link on scv entry to fix BPF stack unwinding
- Add missing PPC_FEATURE_BOOKE on 64-bit e5500/e6500, which broke
gdb
- Fix boot on some 6xx platforms with STRICT_KERNEL_RWX enabled
- Fix build failures with KASAN enabled and 32KB stack size
- Some other minor fixes
Thanks to Arnd Bergmann, Benjamin Gray, Christophe Leroy, David
Engraf, Gaurav Batra, Jason Gunthorpe, Jiangfeng Xiao, Matthias
Schiffer, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A,
R Nageswara Sastry, Shivaprasad G Bhat, Shrikanth Hegde, Spoorthy,
Srikar Dronamraju, and Venkat Rao Bagalkote"
* tag 'powerpc-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach
powerpc/pseries: fix accuracy of stolen time
powerpc/ftrace: Ignore ftrace locations in exit text sections
powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-E
powerpc/kasan: Limit KASAN thread size increase to 32KB
Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add"
powerpc: 85xx: mark local functions static
powerpc: udbg_memcons: mark functions static
powerpc/kasan: Fix addr error caused by page alignment
powerpc/6xx: set High BAT Enable flag on G2_LE cores
selftests/powerpc/papr_vpd: Check devfd before get_system_loc_code()
powerpc/64: Set task pt_regs->link to the LR value on scv entry
powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add
powerpc/pseries/papr-sysparm: use u8 arrays for payloads
|
|
The function spapr_tce_platform_iommu_attach_dev() is missing to call
iommu_group_put() when the domain is already set. This refcount leak
shows up with BUG_ON() during DLPAR remove operation as:
KernelBug: Kernel bug in state 'None': kernel BUG at arch/powerpc/platforms/pseries/iommu.c:100!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
<snip>
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries
NIP: c0000000000ff4d4 LR: c0000000000ff4cc CTR: 0000000000000000
REGS: c0000013aed5f840 TRAP: 0700 Tainted: G I (6.8.0-rc3-autotest-g99bd3cb0d12e)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 44002402 XER: 20040000
CFAR: c000000000a0d170 IRQMASK: 0
...
NIP iommu_reconfig_notifier+0x94/0x200
LR iommu_reconfig_notifier+0x8c/0x200
Call Trace:
iommu_reconfig_notifier+0x8c/0x200 (unreliable)
notifier_call_chain+0xb8/0x19c
blocking_notifier_call_chain+0x64/0x98
of_reconfig_notify+0x44/0xdc
of_detach_node+0x78/0xb0
ofdt_write.part.0+0x86c/0xbb8
proc_reg_write+0xf4/0x150
vfs_write+0xf8/0x488
ksys_write+0x84/0x140
system_call_exception+0x138/0x330
system_call_vectored_common+0x15c/0x2ec
The patch adds the missing iommu_group_put() call.
Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions after probe")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Closes: https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e271d@linux.vnet.ibm.com/
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/170784021983.6249.10039296655906636112.stgit@linux.ibm.com
|
|
This reverts commit ed8b94f6e0acd652ce69bd69d678a0c769172df8.
Gaurav reported that there are still problems with the patch and it
should be reverted pending a fuller fix.
Link: https://lore.kernel.org/all/4f6fc1ac-7a76-4447-9d0e-f55c0be373f8@linux.ibm.com/
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When a PCI device is dynamically added, the kernel oopses with a NULL
pointer dereference:
BUG: Kernel NULL pointer dereference on read at 0x00000030
Faulting instruction address: 0xc0000000006bbe5c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
NIP: c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8
REGS: c00000009924f240 TRAP: 0300 Not tainted (6.7.0-203405+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040006
CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0
...
NIP sysfs_add_link_to_group+0x34/0x94
LR iommu_device_link+0x5c/0x118
Call Trace:
iommu_init_device+0x26c/0x318 (unreliable)
iommu_device_link+0x5c/0x118
iommu_init_device+0xa8/0x318
iommu_probe_device+0xc0/0x134
iommu_bus_notifier+0x44/0x104
notifier_call_chain+0xb8/0x19c
blocking_notifier_call_chain+0x64/0x98
bus_notify+0x50/0x7c
device_add+0x640/0x918
pci_device_add+0x23c/0x298
of_create_pci_dev+0x400/0x884
of_scan_pci_dev+0x124/0x1b0
__of_scan_bus+0x78/0x18c
pcibios_scan_phb+0x2a4/0x3b0
init_phb_dynamic+0xb8/0x110
dlpar_add_slot+0x170/0x3b8 [rpadlpar_io]
add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
kobj_attr_store+0x2c/0x48
sysfs_kf_write+0x64/0x78
kernfs_fop_write_iter+0x1b0/0x290
vfs_write+0x350/0x4a0
ksys_write+0x84/0x140
system_call_exception+0x124/0x330
system_call_vectored_common+0x15c/0x2ec
Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR add of PCI devices.
The above added iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().
During DLPAR add of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.
Fix is to register the iommu device during DLPAR add as well.
Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
[mpe: Trim oops and tweak some change log wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240122222407.39603-1-gbatra@linux.ibm.com
|
|
The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and
remove set_platform_dma_ops") refactored the code removing the
set_platform_dma_ops(). It missed out the table group
release_ownership() call which would have got called otherwise
during the guest shutdown via vfio_group_detach_container(). On
PPC64, this particular call actually sets up the 32-bit TCE table,
and enables the 64-bit DMA bypass etc. Now after guest shutdown,
the subsequent host driver (e.g megaraid-sas) probe post unbind
from vfio-pci fails like,
megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x7fffffffffffffff, table unavailable
megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0xffffffff, table unavailable
megaraid_sas 0031:01:00.0: Failed to set DMA mask
megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539
The patch brings back the call to table_group release_ownership()
call when switching back to PLATFORM domain from BLOCKED, while
also separates the domain_ops for both.
Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops")
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/170628173462.3742.18330000394415935845.stgit@ltcd48-lp2.aus.stglab.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Make default-domains mandatory for all IOMMU drivers
- Remove group refcounting
- Add generic_single_device_group() helper and consolidate drivers
- Cleanup map/unmap ops
- Scaling improvements for the IOVA rcache depot
- Convert dart & iommufd to the new domain_alloc_paging()
ARM-SMMU:
- Device-tree binding update:
- Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC
- SMMUv2:
- Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs
- SMMUv3:
- Large refactoring of the context descriptor code to move the CD
table into the master, paving the way for '->set_dev_pasid()'
support on non-SVA domains
- Minor cleanups to the SVA code
Intel VT-d:
- Enable debugfs to dump domain attached to a pasid
- Remove an unnecessary inline function
AMD IOMMU:
- Initial patches for SVA support (not complete yet)
S390 IOMMU:
- DMA-API conversion and optimized IOTLB flushing
And some smaller fixes and improvements"
* tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (102 commits)
iommu/dart: Remove the force_bypass variable
iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging()
iommu/dart: Convert to domain_alloc_paging()
iommu/dart: Move the blocked domain support to a global static
iommu/dart: Use static global identity domains
iommufd: Convert to alloc_domain_paging()
iommu/vt-d: Use ops->blocked_domain
iommu/vt-d: Update the definition of the blocking domain
iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain
Revert "iommu/vt-d: Remove unused function"
iommu/amd: Remove DMA_FQ type from domain allocation path
iommu: change iommu_map_sgtable to return signed values
iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()
iommu/vt-d: debugfs: Support dumping a specified page table
iommu/vt-d: debugfs: Create/remove debugfs file per {device, pasid}
iommu/vt-d: debugfs: Dump entry pointing to huge page
iommu/vt-d: Remove unused function
iommu/arm-smmu-v3-sva: Remove bond refcount
iommu/arm-smmu-v3-sva: Remove unused iommu_sva handle
iommu/arm-smmu-v3: Rename cdcfg to cd_table
...
|
|
Following the pattern of identity domains, just assign the BLOCKED domain
global statics to a value in ops. Update the core code to use the global
static directly.
Update powerpc to use the new scheme and remove its empty domain_alloc
callback.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/1-v2-bff223cf6409+282-dart_paging_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Sparse reports several function implementations annotated with extern.
This is clearly incorrect, likely just copied from an actual extern
declaration in another file.
Fix the sparse warnings by removing extern.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231011053711.93427-6-bgray@linux.ibm.com
|
|
POWER throws a splat at boot, it looks like the DMA ops were probably
changed while a driver was attached. Something is still weird about how
power sequences its bootup. Previously this was hidden since the core
iommu code did nothing during probe, now it calls
spapr_tce_platform_iommu_attach_dev().
Make spapr_tce_platform_iommu_attach_dev() do nothing on the probe time
call like it did before.
WARNING: CPU: 0 PID: 8 at arch/powerpc/kernel/iommu.c:407 __iommu_free+0x1e4/0x1f0
Modules linked in: sd_mod t10_pi crc64_rocksoft crc64 sg ibmvfc mlx5_core(+) scsi_transport_fc ibmveth mlxfw psample dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 0 PID: 8 Comm: kworker/0:0 Not tainted 6.6.0-rc3-next-20230929-auto #1
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.30 (NH1030_062) hv:phyp pSeries
Workqueue: events work_for_cpu_fn
NIP: c00000000005f6d4 LR: c00000000005f6d0 CTR: 00000000005ca81c
REGS: c000000003a27890 TRAP: 0700 Not tainted (6.6.0-rc3-next-20230929-auto)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48000824 XER: 00000008
CFAR: c00000000020f738 IRQMASK: 0
GPR00: c00000000005f6d0 c000000003a27b30 c000000001481800 000000000000017
GPR04: 00000000ffff7fff c000000003a27950 c000000003a27948 0000000000000027
GPR08: c000000c18c07c10 0000000000000001 0000000000000027 c000000002ac8a08
GPR12: 0000000000000000 c000000002ff0000 c00000000019cc88 c000000003042300
GPR16: 0000000000000000 0000000000000000 0000000000000000 c000000003071ab0
GPR20: c00000000349f80d c000000003215440 c000000003215480 61c8864680b583eb
GPR24: 0000000000000000 000000007fffffff 0800000020000000 0000000000000010
GPR28: 0000000000020000 0000800000020000 c00000000c5dc800 c00000000c5dc880
NIP [c00000000005f6d4] __iommu_free+0x1e4/0x1f0
LR [c00000000005f6d0] __iommu_free+0x1e0/0x1f0
Call Trace:
[c000000003a27b30] [c00000000005f6d0] __iommu_free+0x1e0/0x1f0 (unreliable)
[c000000003a27bc0] [c00000000005f848] iommu_free+0x28/0x70
[c000000003a27bf0] [c000000000061518] iommu_free_coherent+0x68/0xa0
[c000000003a27c20] [c00000000005e8d4] dma_iommu_free_coherent+0x24/0x40
[c000000003a27c40] [c00000000024698c] dma_free_attrs+0x10c/0x140
[c000000003a27c90] [c008000000dcb8d4] mlx5_cmd_cleanup+0x5c/0x90 [mlx5_core]
[c000000003a27cc0] [c008000000dc45a0] mlx5_mdev_uninit+0xc8/0x100 [mlx5_core]
[c000000003a27d00] [c008000000dc4ac4] probe_one+0x3ec/0x530 [mlx5_core]
[c000000003a27d90] [c0000000008c5edc] local_pci_probe+0x6c/0x110
[c000000003a27e10] [c000000000189c98] work_for_cpu_fn+0x38/0x60
[c000000003a27e40] [c00000000018d1d0] process_scheduled_works+0x230/0x4f0
[c000000003a27f10] [c00000000018ff14] worker_thread+0x1e4/0x500
[c000000003a27f90] [c00000000019cdb8] kthread+0x138/0x140
[c000000003a27fe0] [c00000000000df98] start_kernel_thread+0x14/0x18
Code: 481b004d 60000000 e89e0028 3c62ffe0 3863dd20 481b0039 60000000 e89e0038 3c62ffe0 3863dd38 481b0025 60000000 <0fe00000> 4bffff20 60000000 3c4c0142
---[ end trace 0000000000000000 ]---
iommu_free: invalid entry
entry = 0x8000000203d0
dma_addr = 0x8000000203d0000
Table = 0xc00000000c5dc800
bus# = 0x1
size = 0x20000
startOff = 0x800000000000
index = 0x70200016
Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops")
Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/d06cee81-c47f-9d62-dfc6-4c77b60058db@linux.vnet.ibm.com
Tested-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-2b52423411b9+164fc-iommu_ppc_defdomain_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
POWER is using the set_platform_dma_ops() callback to hook up its private
dma_ops, but this is buired under some indirection and is weirdly
happening for a BLOCKED domain as well.
For better documentation create a PLATFORM domain to manage the dma_ops,
since that is what it is for, and make the BLOCKED domain an alias for
it. BLOCKED is required for VFIO.
Also removes the leaky allocation of the BLOCKED domain by using a global
static.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
PCI and VIO buses. struct notifier_block is a linked list node, so this
causes any notifiers later registered to either bus type to also be
registered to the other since they share the same node.
This causes issues in (at least) the vgaarb code, which registers a
notifier for PCI buses. pci_notify() ends up being called on a vio
device, converted with to_pci_dev() even though it's not a PCI device,
and finally makes a bad access in vga_arbiter_add_pci_device() as
discovered with KASAN:
BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c000000264c26fdc by task swapper/0/1
Call Trace:
dump_stack_lvl+0x1bc/0x2b8 (unreliable)
print_report+0x3f4/0xc60
kasan_report+0x244/0x698
__asan_load4+0xe8/0x250
vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c
__machine_initcall_pseries_vio_device_init+0x94/0xf0
do_one_initcall+0x12c/0xaa8
kernel_init_freeable+0xa48/0xba8
kernel_init+0x64/0x400
ret_from_kernel_thread+0x5c/0x64
Fix this by creating separate notifier_block structs for each bus type.
Fixes: d6b9a81b2a45 ("powerpc: IOMMU fault injection")
Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
[mpe: Add #ifdef to fix CONFIG_IBMVIO=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230322035322.328709-1-ruscur@russell.cc
|
|
and PowerNV
A build failure with CONFIG_HAVE_PCI=y set without PSERIES or POWERNV
set was caught by the random configuration checker. Guard the sPAPR
specific IOMMU functions on CONFIG_PPC_PSERIES || CONFIG_PPC_POWERNV.
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/2015925968.3546872.1685990936823.JavaMail.zimbra@raptorengineeringinc.com
|
|
When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.
Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.
Fixes: 387273118714 ("powerps/pseries/dma: Add support for 2M IOMMU page size")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com>
Reviewed-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230504175913.83844-1-gbatra@linux.vnet.ibm.com
|
|
Now that power calls iommu_device_register() and populates its groups
using iommu_ops->device_group it should not be calling
iommu_group_remove_device().
The core code owns the groups and all the other related iommu data, it
will clean it up automatically.
Remove the bus notifiers and explicit calls to
iommu_group_remove_device().
Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/0-v1-1421774b874b+167-ppc_device_group_jgg@nvidia.com
|
|
Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver
uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the
Type1 VFIO driver. Recent development added 2 uses of iommu_ops to the
generic VFIO which broke POWER:
- a coherency capability check;
- blocking IOMMU domain - iommu_group_dma_owner_claimed()/...
This adds a simple iommu_ops which reports support for cache coherency
and provides a basic support for blocking domains. No other domain types
are implemented so the default domain is NULL.
Since now iommu_ops controls the group ownership, this takes it out of
VFIO.
This adds an IOMMU device into a pci_controller (=PHB) and registers it
in the IOMMU subsystem, iommu_ops is registered at this point. This
setup is done in postcore_initcall_sync.
This replaces iommu_group_add_device() with iommu_probe_device() as the
former misses necessary steps in connecting PCI devices to IOMMU
devices. This adds a comment about why explicit iommu_probe_device() is
still needed.
The previous discussion is here:
https://lore.kernel.org/r/20220707135552.3688927-1-aik@ozlabs.ru/
https://lore.kernel.org/r/20220701061751.1955857-1-aik@ozlabs.ru/
Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence")
Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
[mpe: Fix CONFIG_IOMMU_API=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/2000135730.16998523.1678123860135.JavaMail.zimbra@raptorengineeringinc.com
|
|
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows
for PEs: control the ownership, create/set/unset a table the hardware
for dynamic DMA windows (DDW). VFIO uses the API to implement support on
POWER.
So far only PowerNV IODA2 (POWER8 and newer machines) implemented this
and other cases (POWER7 or nested KVM) did not and instead reused
existing iommu_table structs. This means 1) no DDW 2) ownership transfer
is done directly in the VFIO SPAPR TCE driver.
Soon POWER is going to get its own iommu_ops and ownership control is
going to move there. This implements spapr_tce_table_group_ops which
borrows iommu_table tables. The upside is that VFIO needs to know less
about POWER.
The new ops returns the existing table from create_table() and only
checks if the same window is already set. This is only going to work if
the default DMA window starts table_group.tce32_start and as big as
pe->table_group.tce32_size (not the case for IODA2+ PowerNV).
This changes iommu_table_group_ops::take_ownership() to return an error
if borrowing a table failed.
This should not cause any visible change in behavior for PowerNV.
pSeries was not that well tested/supported anyway.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
[mpe: Fix CONFIG_IOMMU_API=n build (skiroot_defconfig), & formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/525438831.16998517.1678123820075.JavaMail.zimbra@raptorengineeringinc.com
|
|
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time. To make things simpler, just
call debugfs_lookup_and_remove() instead which handles all of the logic
at once.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230202141919.2298821-1-gregkh@linuxfoundation.org
|
|
The existing iommu_table_in_use() helper checks if the kernel is using
any of TCEs. There are some reserved TCEs:
1) the very first one if DMA window starts from 0 to avoid having a zero
but still valid DMA handle;
2) it_reserved_start..it_reserved_end to exclude MMIO32 window in case
the default window spans across that - this is the default for the first
DMA window on PowerNV.
When 1) is the case and 2) is not the helper does not skip 1) and returns
wrong status.
This only seems occurring when passing through a PCI device to a nested
guest (not something we support really well) so it has not been seen
before.
This fixes the bug by adding a special case for no MMIO32 reservation.
Fixes: 3c33066a2190 ("powerpc/kernel/iommu: Add new iommu_table_in_use() helper")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220714081119.3714605-1-aik@ozlabs.ru
|
| |