summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2019-05-31x86/CPU/hygon: Fix phys_proc_id calculation logic for multi-die processorsPu Wen1-0/+5
[ Upstream commit e0ceeae708cebf22c990c3d703a4ca187dc837f5 ] The Hygon family 18h multi-die processor platform supports 1, 2 or 4-Dies per socket. The topology looks like this: System View (with 1-Die 2-Socket): |------------| ------ ----- SOCKET0 | D0 | | D1 | SOCKET1 ------ ----- System View (with 2-Die 2-socket): -------------------- | -------------|------ | | | | ------------ ------------ SOCKET0 | D1 -- D0 | | D3 -- D2 | SOCKET1 ------------ ------------ System View (with 4-Die 2-Socket) : -------------------- | -------------|------ | | | | ------------ ------------ | D1 -- D0 | | D7 -- D6 | | | \/ | | | | \/ | | SOCKET0 | | /\ | | | | /\ | | SOCKET1 | D2 -- D3 | | D4 -- D5 | ------------ ------------ | | | | ------|------------| | -------------------- Currently phys_proc_id = initial_apicid >> bits calculates the physical processor ID from the initial_apicid by shifting *bits*. However, this does not work for 1-Die and 2-Die 2-socket systems. According to document [1] section 2.1.11.1, the bits is the value of CPUID_Fn80000008_ECX[12:15]. The possible values are 4, 5 or 6 which mean: 4 - 1 die 5 - 2 dies 6 - 3/4 dies. Hygon programs the initial ApicId the same way as AMD. The ApicId is read from CPUID_Fn00000001_EBX (see section 2.1.11.1 of referrence [1]) and the definition is as below (see section 2.1.10.2.1.3 of [1]): ------------------------------------------------- Bit | 6 | 5 4 | 3 | 2 1 0 | |-----------|---------|--------|----------------| IDs | Socket ID | Node ID | CCX ID | Core/Thread ID | ------------------------------------------------- So for 3/4-Die configurations, the bits variable is 6, which is the same as the ApicID definition field. For 1-Die and 2-Die configurations, bits is 4 or 5, which will cause the right shifted result to not be exactly the value of socket ID. However, the socket ID should be obtained from ApicId[6]. To fix the problem and match the ApicID field definition, set the shift bits to 6 for all Hygon family 18h multi-die CPUs. Because AMD doesn't have 2-Socket systems with 1-Die/2-Die processors (see reference [2]), this doesn't need to be changed on the AMD side but only for Hygon. References: [1] https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf [2] https://www.amd.com/en/products/specifications/processors [bp: heavily massage commit message. ] Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1553355740-19999-1-git-send-email-puwen@hygon.cn Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/platform/uv: Fix missing checks of kcalloc() return valuesKangjie Lu1-1/+6
[ Upstream commit 766460852cfaeca4042e5f3aeb9616b3689147bc ] Handle potential errors returned from kcalloc(). [ bp: rewrite commit message. ] Signed-off-by: Kangjie Lu <kjlu@umn.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Banman <abanman@hpe.com> Cc: Andy Shevchenko <andy@infradead.org> Cc: Colin Ian King <colin.king@canonical.com> Cc: Darren Hart <dvhart@infradead.org> Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mike Travis <mike.travis@hpe.com> Cc: Nicolai Stange <nstange@suse.de> Cc: pakki001@umn.edu Cc: platform-driver-x86@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Varsha Rao <rvarsha016@gmail.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190325202924.4624-1-kjlu@umn.edu Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/mce: Handle varying MCA bank countsYazen Ghannam2-22/+14
[ Upstream commit 006c077041dc73b9490fffc4c6af5befe0687110 ] Linux reads MCG_CAP[Count] to find the number of MCA banks visible to a CPU. Currently, this number is the same for all CPUs and a warning is shown if there is a difference. The number of banks is overwritten with the MCG_CAP[Count] value of each following CPU that boots. According to the Intel SDM and AMD APM, the MCG_CAP[Count] value gives the number of banks that are available to a "processor implementation". The AMD BKDGs/PPRs further clarify that this value is per core. This value has historically been the same for every core in the system, but that is not an architectural requirement. Future AMD systems may have different MCG_CAP[Count] values per core, so the assumption that all CPUs will have the same MCG_CAP[Count] value will no longer be valid. Also, the first CPU to boot will allocate the struct mce_banks[] array using the number of banks based on its MCG_CAP[Count] value. The machine check handler and other functions use the global number of banks to iterate and index into the mce_banks[] array. So it's possible to use an out-of-bounds index on an asymmetric system where a following CPU sees a MCG_CAP[Count] value greater than its predecessors. Thus, allocate the mce_banks[] array to the maximum number of banks. This will avoid the potential out-of-bounds index since the value of mca_cfg.banks is capped to MAX_NR_BANKS. Set the value of mca_cfg.banks equal to the max of the previous value and the value for the current CPU. This way mca_cfg.banks will always represent the max number of banks detected on any CPU in the system. This will ensure that all CPUs will access all the banks that are visible to them. A CPU that can access fewer than the max number of banks will find the registers of the extra banks to be read-as-zero. Furthermore, print the resulting number of MCA banks in use. Do this in mcheck_late_init() so that the final value is printed after all CPUs have been initialized. Finally, get bank count from target CPU when doing injection with mce-inject module. [ bp: Remove out-of-bounds example, passify and cleanup commit message. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20180727214009.78289-1-Yazen.Ghannam@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/mce: Fix machine_check_poll() tests for error typesTony Luck1-7/+37
[ Upstream commit f19501aa07f18268ab14f458b51c1c6b7f72a134 ] There has been a lurking "TBD" in the machine check poll routine ever since it was first split out from the machine check handler. The potential issue is that the poll routine may have just begun a read from the STATUS register in a machine check bank when the hardware logs an error in that bank and signals a machine check. That race used to be pretty small back when machine checks were broadcast, but the addition of local machine check means that the poll code could continue running and clear the error from the bank before the local machine check handler on another CPU gets around to reading it. Fix the code to be sure to only process errors that need to be processed in the poll code, leaving other logged errors alone for the machine check handler to find and process. [ bp: Massage a bit and flip the "== 0" check to the usual !(..) test. ] Fixes: b79109c3bbcf ("x86, mce: separate correct machine check poller and fatal exception handler") Fixes: ed7290d0ee8f ("x86, mce: implement new status bits") Reported-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Ashok Raj <ashok.raj@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Link: https://lkml.kernel.org/r/20190312170938.GA23035@agluck-desk Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31sh: sh7786: Add explicit I/O cast to sh7786_mm_sel()Geert Uytterhoeven1-1/+1
[ Upstream commit 8440bb9b944c02222c7a840d406141ed42e945cd ] When compile-testing on arm: arch/sh/include/cpu-sh4/cpu/sh7786.h: In function ‘sh7786_mm_sel’: arch/sh/include/cpu-sh4/cpu/sh7786.h:135:21: warning: passing argument 1 of ‘__raw_readl’ makes pointer from integer without a cast [-Wint-conversion] return __raw_readl(0xFC400020) & 0x7; ^~~~~~~~~~ In file included from include/linux/io.h:25:0, from arch/sh/include/cpu-sh4/cpu/sh7786.h:14, from drivers/pinctrl/sh-pfc/pfc-sh7786.c:15: arch/arm/include/asm/io.h:113:21: note: expected ‘const volatile void *’ but argument is of type ‘unsigned int’ #define __raw_readl __raw_readl ^ arch/arm/include/asm/io.h:114:19: note: in expansion of macro ‘__raw_readl’ static inline u32 __raw_readl(const volatile void __iomem *addr) ^~~~~~~~~~~ __raw_readl() on SuperH is a macro that casts the passed I/O address to the correct type, while the implementations on most other architectures expect to be passed the correct pointer type. Add an explicit cast to fix this. Note that this also gets rid of a sparse warning on SuperH: arch/sh/include/cpu-sh4/cpu/sh7786.h:135:16: warning: incorrect type in argument 1 (different base types) arch/sh/include/cpu-sh4/cpu/sh7786.h:135:16: expected void const volatile [noderef] <asn:2>*<noident> arch/sh/include/cpu-sh4/cpu/sh7786.h:135:16: got unsigned int Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/uaccess: Fix up the fixupPeter Zijlstra1-1/+2
[ Upstream commit b69656fa7ea2f75e47d7bd5b9430359fa46488af ] New tooling got confused about this: arch/x86/lib/memcpy_64.o: warning: objtool: .fixup+0x7: return with UACCESS enabled While the code isn't wrong, it is tedious (if at all possible) to figure out what function a particular chunk of .fixup belongs to. This then confuses the objtool uaccess validation. Instead of returning directly from the .fixup, jump back into the right function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/ia32: Fix ia32_restore_sigcontext() AC leakPeter Zijlstra1-12/+17
[ Upstream commit 67a0514afdbb8b2fc70b771b8c77661a9cb9d3a9 ] Objtool spotted that we call native_load_gs_index() with AC set. Re-arrange the code to avoid that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/uaccess, signal: Fix AC=1 bloatPeter Zijlstra1-12/+17
[ Upstream commit 88e4718275c1bddca6f61f300688b4553dc8584b ] Occasionally GCC is less agressive with inlining and the following is observed: arch/x86/kernel/signal.o: warning: objtool: restore_sigcontext()+0x3cc: call to force_valid_ss.isra.5() with UACCESS enabled arch/x86/kernel/signal.o: warning: objtool: do_signal()+0x384: call to frame_uc_flags.isra.0() with UACCESS enabled Cure this by moving this code out of the AC=1 region, since it really isn't needed for the user access. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31arm64: cpu_ops: fix a leaked reference by adding missing of_node_putWen Yang1-0/+1
[ Upstream commit 92606ec9285fb84cd9b5943df23f07d741384bfc ] The call to of_get_next_child returns a node pointer with refcount incremented thus it must be explicitly decremented after the last usage. Detected by coccinelle with the following warnings: ./arch/arm64/kernel/cpu_ops.c:102:1-7: ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line 69, but without a corresponding object release within this function. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/build: Keep local relocations with ld.lldKees Cook1-1/+1
[ Upstream commit 7c21383f3429dd70da39c0c7f1efa12377a47ab6 ] The LLVM linker (ld.lld) defaults to removing local relocations, which causes KASLR boot failures. ld.bfd and ld.gold already handle this correctly. This adds the explicit instruction "--discard-none" during the link phase. There is no change in output for ld.bfd and ld.gold, but ld.lld now produces an image with all the needed relocations. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: clang-built-linux@googlegroups.com Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190404214027.GA7324@beast Link: https://github.com/ClangBuiltLinux/linux/issues/404 Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31s390/mm: silence compiler warning when compiling without CONFIG_PGSTEThomas Huth1-0/+2
[ Upstream commit 81a8f2beb32a5951ecf04385301f50879abc092b ] If CONFIG_PGSTE is not set (e.g. when compiling without KVM), GCC complains: CC arch/s390/mm/pgtable.o arch/s390/mm/pgtable.c:413:15: warning: ‘pmd_alloc_map’ defined but not used [-Wunused-function] static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr) ^~~~~~~~~~~~~ Wrap the function with "#ifdef CONFIG_PGSTE" to silence the warning. Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/microcode: Fix the ancient deprecated microcode loading methodBorislav Petkov1-1/+2
[ Upstream commit 24613a04ad1c0588c10f4b5403ca60a73d164051 ] Commit 2613f36ed965 ("x86/microcode: Attempt late loading only when new microcode is present") added the new define UCODE_NEW to denote that an update should happen only when newer microcode (than installed on the system) has been found. But it missed adjusting that for the old /dev/cpu/microcode loading interface. Fix it. Fixes: 2613f36ed965 ("x86/microcode: Attempt late loading only when new microcode is present") Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Link: https://lkml.kernel.org/r/20190405133010.24249-3-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31perf/x86/intel/cstate: Add Icelake supportKan Liang1-0/+2
[ Upstream commit f08c47d1f86c6dc666c7e659d94bf6d4492aa9d7 ] Icelake uses the same C-state residency events as Sandy Bridge. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-10-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31perf/x86/intel/rapl: Add Icelake supportKan Liang1-0/+2
[ Upstream commit b3377c3acb9e54cf86efcfe25f2e792bca599ed4 ] Icelake support the same RAPL counters as Skylake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-11-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31perf/x86/msr: Add Icelake supportKan Liang1-0/+1
[ Upstream commit cf50d79a8cfe5adae37fec026220b009559bbeed ] Icelake is the same as the existing Skylake parts. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-12-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31arm64: vdso: Fix clock_getres() for CLOCK_REALTIMEVincenzo Frascino4-5/+8
[ Upstream commit 81fb8736dd81da3fe94f28968dac60f392ec6746 ] clock_getres() in the vDSO library has to preserve the same behaviour of posix_get_hrtimer_res(). In particular, posix_get_hrtimer_res() does: sec = 0; ns = hrtimer_resolution; where 'hrtimer_resolution' depends on whether or not high resolution timers are enabled, which is a runtime decision. The vDSO incorrectly returns the constant CLOCK_REALTIME_RES. Fix this by exposing 'hrtimer_resolution' in the vDSO datapage and returning that instead. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> [will: Use WRITE_ONCE(), move adr off COARSE path, renumber labels, use 'w' reg] Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/irq/64: Limit IST stack overflow check to #DB stackThomas Gleixner1-5/+14
[ Upstream commit 7dbcf2b0b770eeb803a416ee8dcbef78e6389d40 ] Commit 37fe6a42b343 ("x86: Check stack overflow in detail") added a broad check for the full exception stack area, i.e. it considers the full exception stack area as valid. That's wrong in two aspects: 1) It does not check the individual areas one by one 2) #DF, NMI and #MCE are not enabling interrupts which means that a regular device interrupt cannot happen in their context. In fact if a device interrupt hits one of those IST stacks that's a bug because some code path enabled interrupts while handling the exception. Limit the check to the #DB stack and consider all other IST stacks as 'overflow' or invalid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160143.682135110@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWXRussell Currey1-1/+3
[ Upstream commit 56c46bba9bbfe229b4472a5be313c44c5b714a39 ] With STRICT_KERNEL_RWX enabled anything marked __init is placed at a 16M boundary. This is necessary so that it can be repurposed later with different permissions. However, in kernels with text larger than 16M, this pushes early_setup past 32M, incapable of being reached by the branch instruction. Fix this by setting the CTR and branching there instead. Fixes: 1e0fc9d1eb2b ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: Russell Currey <ruscur@russell.cc> [mpe: Fix it to work on BE by using DOTSYM()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31powerpc/numa: improve control of topology updatesNathan Lynch1-6/+12
[ Upstream commit 2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f ] When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/mm: Remove in_nmi() warning from 64-bit implementation of vmalloc_fault()Jiri Kosina1-2/+0
[ Upstream commit a65c88e16f32aa9ef2e8caa68ea5c29bd5eb0ff0 ] In-NMI warnings have been added to vmalloc_fault() via: ebc8827f75 ("x86: Barf when vmalloc and kmemcheck faults happen in NMI") back in the time when our NMI entry code could not cope with nested NMIs. These days, it's perfectly fine to take a fault in NMI context and we don't have to care about the fact that IRET from the fault handler might cause NMI nesting. This warning has already been removed from 32-bit implementation of vmalloc_fault() in: 6863ea0cda8 ("x86/mm: Remove in_nmi() warning from vmalloc_fault()") but the 64-bit version was omitted. Remove the bogus warning also from 64-bit implementation of vmalloc_fault(). Reported-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6863ea0cda8 ("x86/mm: Remove in_nmi() warning from vmalloc_fault()") Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1904240902280.9803@cbobk.fhfr.pm Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/uaccess: Dont leak the AC flag into __put_user() argument evaluationPeter Zijlstra1-3/+4
[ Upstream commit 6ae865615fc43d014da2fd1f1bba7e81ee622d1b ] The __put_user() macro evaluates it's @ptr argument inside the __uaccess_begin() / __uaccess_end() region. While this would normally not be expected to be an issue, an UBSAN bug (it ignored -fwrapv, fixed in GCC 8+) would transform the @ptr evaluation for: drivers/gpu/drm/i915/i915_gem_execbuffer.c: if (unlikely(__put_user(offset, &urelocs[r-stack].presumed_offset))) { into a signed-overflow-UB check and trigger the objtool AC validation. Finish this commit: 2a418cf3f5f1 ("x86/uaccess: Don't leak the AC flag into __put_user() value evaluation") and explicitly evaluate all 3 arguments early. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: luto@kernel.org Fixes: 2a418cf3f5f1 ("x86/uaccess: Don't leak the AC flag into __put_user() value evaluation") Link: http://lkml.kernel.org/r/20190424072208.695962771@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/build: Move _etext to actual end of .textKees Cook1-3/+3
[ Upstream commit 392bef709659abea614abfe53cf228e7a59876a4 ] When building x86 with Clang LTO and CFI, CFI jump regions are automatically added to the end of the .text section late in linking. As a result, the _etext position was being labelled before the appended jump regions, causing confusion about where the boundaries of the executable region actually are in the running kernel, and broke at least the fault injection code. This moves the _etext mark to outside (and immediately after) the .text area, as it already the case on other architectures (e.g. arm64, arm). Reported-and-tested-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190423183827.GA4012@beast Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result valueWill Deacon1-1/+1
[ Upstream commit 84ff7a09c371bc7417eabfda19bf7f113ec917b6 ] Rather embarrassingly, our futex() FUTEX_WAKE_OP implementation doesn't explicitly set the return value on the non-faulting path and instead leaves it holding the result of the underlying atomic operation. This means that any FUTEX_WAKE_OP atomic operation which computes a non-zero value will be reported as having failed. Regrettably, I wrote the buggy code back in 2011 and it was upstreamed as part of the initial arm64 support in 2012. The reasons we appear to get away with this are: 1. FUTEX_WAKE_OP is rarely used and therefore doesn't appear to get exercised by futex() test applications 2. If the result of the atomic operation is zero, the system call behaves correctly 3. Prior to version 2.25, the only operation used by GLIBC set the futex to zero, and therefore worked as expected. From 2.25 onwards, FUTEX_WAKE_OP is not used by GLIBC at all. Fix the implementation by ensuring that the return value is either 0 to indicate that the atomic operation completed successfully, or -EFAULT if we encountered a fault when accessing the user mapping. Cc: <stable@kernel.org> Fixes: 6170a97460db ("arm64: Atomic operations") Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31s390/kexec_file: Fix detection of text segment in ELF loaderPhilipp Rudo1-1/+6
[ Upstream commit 729829d775c9a5217abc784b2f16087d79c4eec8 ] To register data for the next kernel (command line, oldmem_base, etc.) the current kernel needs to find the ELF segment that contains head.S. This is currently done by checking ifor 'phdr->p_paddr == 0'. This works fine for the current kernel build but in theory the first few pages could be skipped. Make the detection more robust by checking if the entry point lies within the segment. Signed-off-by: Philipp Rudo <prudo@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/modules: Avoid breaking W^X while loading modulesNadav Amit2-8/+22
[ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ] When modules and BPF filters are loaded, there is a time window in which some memory is both writable and executable. An attacker that has already found another vulnerability (e.g., a dangling pointer) might be able to exploit this behavior to overwrite kernel code. Prevent having writable executable PTEs in this stage. In addition, avoiding having W+X mappings can also slightly simplify the patching of modules code on initialization (e.g., by alternatives and static-key), as would be done in the next patch. This was actually the main motivation for this patch. To avoid having W+X mappings, set them initially as RW (NX) and after they are set as RO set them as X as well. Setting them as executable is done as a separate step to avoid one core in which the old PTE is cached (hence writable), and another which sees the updated PTE (executable), which would break the W^X protection. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31powerpc/watchdog: Use hrtimers for per-CPU heartbeatNicholas Piggin1-41/+40
[ Upstream commit 7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053 ] Using a jiffies timer creates a dependency on the tick_do_timer_cpu incrementing jiffies. If that CPU has locked up and jiffies is not incrementing, the watchdog heartbeat timer for all CPUs stops and creates false positives and confusing warnings on local CPUs, and also causes the SMP detector to stop, so the root cause is never detected. Fix this by using hrtimer based timers for the watchdog heartbeat, like the generic kernel hardlockup detector. Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reported-by: Ravikumar Bangoria <ravi.bangoria@in.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Reported-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31x86/ftrace: Set trampoline pages as executableNadav Amit1-0/+8
[ Upstream commit 3c0dab44e22782359a0a706cbce72de99a22aa75 ] Since alloc_module() will not set the pages as executable soon, set ftrace trampoline pages as executable after they are allocated. For the time being, do not change ftrace to use the text_poke() interface. As a result, ftrace still breaks W^X. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-10-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variableQian Cai1-1/+2
[ Upstream commit 74dd022f9e6260c3b5b8d15901d27ebcc5f21eda ] When building with -Wunused-but-set-variable, the compiler shouts about a number of pte_unmap() users, since this expands to an empty macro on arm64: | mm/gup.c: In function 'gup_pte_range': | mm/gup.c:1727:16: warning: variable 'ptem' set but not used | [-Wunused-but-set-variable] | mm/gup.c: At top level: | mm/memory.c: In function 'copy_pte_range': | mm/memory.c:821:24: warning: variable 'orig_dst_pte' set but not used | [-Wunused-but-set-variable] | mm/memory.c:821:9: warning: variable 'orig_src_pte' set but not used | [-Wunused-but-set-variable] | mm/swap_state.c: In function 'swap_ra_info': | mm/swap_state.c:641:15: warning: variable 'orig_pte' set but not used | [-Wunused-but-set-variable] | mm/madvise.c: In function 'madvise_free_pte_range': | mm/madvise.c:318:9: warning: variable 'orig_pte' set but not used | [-Wunused-but-set-variable] Rewrite pte_unmap() as a static inline function, which silences the warnings. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31ARM: vdso: Remove dependency with the arch_timer driver internalsMarc Zyngier2-2/+5
[ Upstream commit 1f5b62f09f6b314c8d70b9de5182dae4de1f94da ] The VDSO code uses the kernel helper that was originally designed to abstract the access between 32 and 64bit systems. It worked so far because this function is declared as 'inline'. As we're about to revamp that part of the code, the VDSO would break. Let's fix it by doing what should have been done from the start, a proper system register access. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31powerpc/perf: Fix loop exit condition in nest_imc_event_initAnju T Sudhakar2-2/+2
[ Upstream commit 860b7d2286236170a36f94946d03ca9888d32571 ] The data structure (i.e struct imc_mem_info) to hold the memory address information for nest imc units is allocated based on the number of nodes in the system. nest_imc_event_init() traverse this struct array to calculate the memory base address for the event-cpu. If we fail to find a match for the event cpu's chip-id in imc_mem_info struct array, then the do-while loop will iterate until we crash. Fix this by changing the loop exit condition based on the number of non zero vbase elements in the array, since the allocation is done for nr_chips + 1. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 885dcd709ba91 ("powerpc/perf: Add nest IMC PMU support") Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31powerpc/boot: Fix missing check of lseek() return valueBo YU1-1/+5
[ Upstream commit 5d085ec04a000fefb5182d3b03ee46ca96d8389b ] This is detected by Coverity scan: CID: 1440481 Signed-off-by: Bo YU <tsu.yubo@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31powerpc/perf: Return accordingly on invalid chip-id inAnju T Sudhakar1-0/+5
[ Upstream commit a913e5e8b43be1d3897a141ce61c1ec071cad89c ] Nest hardware counter memory resides in a per-chip reserve-memory. During nest_imc_event_init(), chip-id of the event-cpu is considered to calculate the base memory addresss for that cpu. Return, proper error condition if the chip_id calculated is invalid. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 885dcd709ba91 ("powerpc/perf: Add nest IMC PMU support") Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31arm64: errata: Add workaround for Cortex-A76 erratum #1463225Will Deacon5-3/+110
commit 969f5ea627570e91c9d54403287ee3ed657f58fe upstream. Revisions of the Cortex-A76 CPU prior to r4p0 are affected by an erratum that can prevent interrupts from being taken when single-stepping. This patch implements a software workaround to prevent userspace from effectively being able to disable interrupts. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31arm64/iommu: handle non-remapped addresses in ->mmap and ->get_sgtableChristoph Hellwig1-0/+10
commit a98d9ae937d256ed679a935fc82d9deaa710d98e upstream. DMA allocations that can't sleep may return non-remapped addresses, but we do not properly handle them in the mmap and get_sgtable methods. Resolve non-vmalloc addresses using virt_to_page to handle this corner case. Cc: <stable@vger.kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31arm64: Kconfig: Make ARM64_PSEUDO_NMI depend on BROKEN for nowWill Deacon1-0/+1
commit 96a13f57b946be7a6c10405e4bd780c0b6b6fe63 upstream. Although we merged support for pseudo-nmi using interrupt priority masking in 5.1, we've since uncovered a number of non-trivial issues with the implementation. Although there are patches pending to address these problems, we're facing issues that prevent us from merging them at this current time: https://lkml.kernel.org/r/1556553607-46531-1-git-send-email-julien.thierry@arm.com For now, simply mark this optional feature as BROKEN in the hope that we can fix things properly in the near future. Cc: <stable@vger.kernel.org> # 5.1 Cc: Julien Thierry <julien.thierry@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31arm64/kernel: kaslr: reduce module randomization range to 2 GBArd Biesheuvel2-4/+4
commit b2eed9b58811283d00fa861944cb75797d4e52a7 upstream. The following commit 7290d5809571 ("module: use relative references for __ksymtab entries") updated the ksymtab handling of some KASLR capable architectures so that ksymtab entries are emitted as pairs of 32-bit relative references. This reduces the size of the entries, but more importantly, it gets rid of statically assigned absolute addresses, which require fixing up at boot time if the kernel is self relocating (which takes a 24 byte RELA entry for each member of the ksymtab struct). Since ksymtab entries are always part of the same module as the symbol they export, it was assumed at the time that a 32-bit relative reference is always sufficient to capture the offset between a ksymtab entry and its target symbol. Unfortunately, this is not always true: in the case of per-CPU variables, a per-CPU variable's base address (which usually differs from the actual address of any of its per-CPU copies) is allocated in the vicinity of the ..data.percpu section in the core kernel (i.e., in the per-CPU reserved region which follows the section containing the core kernel's statically allocated per-CPU variables). Since we randomize the module space over a 4 GB window covering the core kernel (based on the -/+ 4 GB range of an ADRP/ADD pair), we may end up putting the core kernel out of the -/+ 2 GB range of 32-bit relative references of module ksymtab entries that refer to per-CPU variables. So reduce the module randomization range a bit further. We lose 1 bit of randomization this way, but this is something we can tolerate. Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Ard Biesheuvel <ard.biesheuvel@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31KVM: nVMX: Fix using __this_cpu_read() in preemptible contextWanpeng Li1-2/+2
commit 541e886f7972cc647804dbb4909189e67987a945 upstream. BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590 caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel] CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G OE 5.1.0-rc4+ #1 Call Trace: dump_stack+0x67/0x95 __this_cpu_preempt_check+0xd2/0xe0 nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel] nested_vmx_run+0xda/0x2b0 [kvm_intel] handle_vmlaunch+0x13/0x20 [kvm_intel] vmx_handle_exit+0xbd/0x660 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm] kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm] do_vfs_ioctl+0xa5/0x6e0 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x6f/0x6c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Accessing per-cpu variable should disable preemption, this patch extends the preemption disable region for __this_cpu_read(). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Fixes: 52017608da33 ("KVM: nVMX: add option to perform early consistency checks via H/W") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31kvm: svm/avic: fix off-by-one in checking host APIC IDSuthikulpanit, Suravee1-1/+5
commit c9bcd3e3335d0a29d89fabd2c385e1b989e6f1b0 upstream. Current logic does not allow VCPU to be loaded onto CPU with APIC ID 255. This should be allowed since the host physical APIC ID field in the AVIC Physical APIC table entry is an 8-bit value, and APIC ID 255 is valid in system with x2APIC enabled. Instead, do not allow VCPU load if the host APIC ID cannot be represented by an 8-bit value. Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK instead of AVIC_MAX_PHYSICAL_ID_COUNT. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31kvm: Check irqchip mode before assign irqfdPeter Xu2-0/+8
commit 654f1f13ea56b92bacade8ce2725aea0457f91c0 upstream. When assigning kvm irqfd we didn't check the irqchip mode but we allow KVM_IRQFD to succeed with all the irqchip modes. However it does not make much sense to create irqfd even without the kernel chips. Let's provide a arch-dependent helper to check whether a specific irqfd is allowed by the arch. At least for x86, it should make sense to check: - when irqchip mode is NONE, all irqfds should be disallowed, and, - when irqchip mode is SPLIT, irqfds that are with resamplefd should be disallowed. For either of the case, previously we'll silently ignore the irq or the irq ack event if the irqchip mode is incorrect. However that can cause misterious guest behaviors and it can be hard to triage. Let's fail KVM_I