summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu/mce
AgeCommit message (Collapse)AuthorFilesLines
2023-08-30Merge tag 'x86_apic_for_6.6-rc1' of ↵Linus Torvalds3-4/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Dave Hansen: "This includes a very thorough rework of the 'struct apic' handlers. Quite a variety of them popped up over the years, especially in the 32-bit days when odd apics were much more in vogue. The end result speaks for itself, which is a removal of a ton of code and static calls to replace indirect calls. If there's any breakage here, it's likely to be around the 32-bit museum pieces that get light to no testing these days. Summary: - Rework apic callbacks, getting rid of unnecessary ones and coalescing lots of silly duplicates. - Use static_calls() instead of indirect calls for apic->foo() - Tons of cleanups an crap removal along the way" * tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) x86/apic: Turn on static calls x86/apic: Provide static call infrastructure for APIC callbacks x86/apic: Wrap IPI calls into helper functions x86/apic: Mark all hotpath APIC callback wrappers __always_inline x86/xen/apic: Mark apic __ro_after_init x86/apic: Convert other overrides to apic_update_callback() x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb() x86/apic: Provide apic_update_callback() x86/xen/apic: Use standard apic driver mechanism for Xen PV x86/apic: Provide common init infrastructure x86/apic: Wrap apic->native_eoi() into a helper x86/apic: Nuke ack_APIC_irq() x86/apic: Remove pointless arguments from [native_]eoi_write() x86/apic/noop: Tidy up the code x86/apic: Remove pointless NULL initializations x86/apic: Sanitize APIC ID range validation x86/apic: Prepare x2APIC for using apic::max_apic_id x86/apic: Simplify X2APIC ID validation x86/apic: Add max_apic_id member x86/apic: Wrap APIC ID validation into an inline ...
2023-08-28Merge tag 'ras_core_for_v6.6_rc1' of ↵Linus Torvalds3-3/+57
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Add a quirk for AMD Zen machines where Instruction Fetch unit poison consumption MCEs are not delivered synchronously but still within the same context, which can lead to erroneously increased error severity and unneeded kernel panics - Do not log errors caught by polling shared MCA banks as they materialize as duplicated error records otherwise * tag 'ras_core_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/MCE: Always save CS register on AMD Zen IF Poison errors x86/mce: Prevent duplicate error records
2023-08-18x86/MCE: Always save CS register on AMD Zen IF Poison errorsYazen Ghannam2-1/+30
The Instruction Fetch (IF) units on current AMD Zen-based systems do not guarantee a synchronous #MC is delivered for poison consumption errors. Therefore, MCG_STATUS[EIPV|RIPV] will not be set. However, the microarchitecture does guarantee that the exception is delivered within the same context. In other words, the exact rIP is not known, but the context is known to not have changed. There is no architecturally-defined method to determine this behavior. The Code Segment (CS) register is always valid on such IF unit poison errors regardless of the value of MCG_STATUS[EIPV|RIPV]. Add a quirk to save the CS register for poison consumption from the IF unit banks. This is needed to properly determine the context of the error. Otherwise, the severity grading function will assume the context is IN_KERNEL due to the m->cs value being 0 (the initialized value). This leads to unnecessary kernel panics on data poison errors due to the kernel believing the poison consumption occurred in kernel context. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230814200853.29258-1-yazen.ghannam@amd.com
2023-08-09x86/apic: Wrap IPI calls into helper functionsDave Hansen1-2/+1
Move them to one place so the static call conversion gets simpler. No functional change. [ dhansen: merge against recent x86/apic changes ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09x86/apic: Nuke ack_APIC_irq()Dave Hansen2-2/+2
Yet another wrapper of a wrapper gone along with the outdated comment that this compiles to a single instruction. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-07-22x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocksYazen Ghannam1-2/+2
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs. Therefore, the threshold_bank structure for bank 4, and its threshold_block structures, will be initialized once at boot time. And the kobject for the shared bank will be added to each of the CPUs that share it. Furthermore, the threshold_blocks for the shared bank will be added again to the bank's kobject. These additions will increase the refcount for the bank's kobject. For example, a shared bank with two blocks and shared across two CPUs will be set up like this: CPU0 init bank create and add; bank refcount = 1; threshold_create_bank() block 0 init and add; bank refcount = 2; allocate_threshold_blocks() block 1 init and add; bank refcount = 3; allocate_threshold_blocks() CPU1 init bank add; bank refcount = 3; threshold_create_bank() block 0 add; bank refcount = 4; __threshold_add_blocks() block 1 add; bank refcount = 5; __threshold_add_blocks() Currently in threshold_remove_bank(), if the bank is shared then __threshold_remove_blocks() is called. Here the shared bank's kobject and the bank's blocks' kobjects are deleted. This is done on the first call even while the structures are still shared. Subsequent calls from other CPUs that share the structures will attempt to delete the kobjects. During kobject_del(), kobject->sd is removed. If the kobject is not part of a kset with default_groups, then subsequent kobject_del() calls seem safe even with kobject->sd == NULL. Originally, the AMD MCA thresholding structures did not use default_groups. And so the above behavior was not apparent. However, a recent change implemented default_groups for the thresholding structures. Therefore, kobject_del() will go down the sysfs_remove_groups() code path. In this case, the first kobject_del() may succeed and remove kobject->sd. But subsequent kobject_del() calls will give a WARNing in kernfs_remove_by_name_ns() since kobject->sd == NULL. Use kobject_put() on the shared bank's kobject when "removing" blocks. This decrements the bank's refcount while keeping kobjects enabled until the bank is no longer shared. At that point, kobject_put() will be called on the blocks which drives their refcount to 0 and deletes them and also decrementing the bank's refcount. And finally kobject_put() will be called on the bank driving its refcount to 0 and deleting it. The same example above: CPU1 shutdown bank is shared; bank refcount = 5; threshold_remove_bank() block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks() block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks() CPU0 shutdown bank is no longer shared; bank refcount = 3; threshold_remove_bank() block 0 put block; bank refcount = 2; deallocate_threshold_blocks() block 1 put block; bank refcount = 1; deallocate_threshold_blocks() put bank; bank refcount = 0; threshold_remove_bank() Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type") Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
2023-07-21x86/mce: Prevent duplicate error recordsBorislav Petkov (AMD)3-2/+27
A legitimate use case of the MCA infrastructure is to have the firmware log all uncorrectable errors and also, have the OS see all correctable errors. The uncorrectable, UCNA errors are usually configured to be reported through an SMI. CMCI, which is the correctable error reporting interrupt, uses SMI too and having both enabled, leads to unnecessary overhead. So what ends up happening is, people disable CMCI in the wild and leave on only the UCNA SMI. When CMCI is disabled, the MCA infrastructure resorts to polling the MCA banks. If a MCA MSR is shared between the logical threads, one error ends up getting logged multiple times as the polling runs on every logical thread. Therefore, introduce locking on the Intel side of the polling routine to prevent such duplicate error records from appearing. Based on a patch by Aristeu Rozanski <aris@ruivo.org>. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tony Luck <tony.luck@intel.com> Acked-by: Aristeu Rozanski <aris@ruivo.org> Link: https://lore.kernel.org/r/20230515143225.GC4090740@cathedrallabs.org
2023-06-27Merge tag 'locking-core-2023-06-27' of ↵Linus Torvalds1-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface. Instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types. - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations. The generated definitions are much cleaner now, and come with documentation. - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type. This gets rid of one use of lockdep_set_novalidate_class() in the bcache code. - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds. * tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg() locking/atomic: treewide: delete arch_atomic_*() kerneldoc locking/atomic: docs: Add atomic operations to the driver basic API documentation locking/atomic: scripts: generate kerneldoc comments docs: scripts: kernel-doc: accept bitwise negation like ~@var locking/atomic: scripts: simplify raw_atomic*() definitions locking/atomic: scripts: simplify raw_atomic_long*() definitions locking/atomic: scripts: split pfx/name/sfx/order locking/atomic: scripts: restructure fallback ifdeffery locking/atomic: scripts: build raw_atomic_long*() directly locking/atomic: treewide: use raw_atomic*_<op>() locking/atomic: scripts: add trivial raw_atomic*_<op>() locking/atomic: scripts: factor out order template generation locking/atomic: scripts: remove leftover "${mult}" locking/atomic: scripts: remove bogus order parameter locking/atomic: xtensa: add preprocessor symbols locking/atomic: x86: add preprocessor symbols locking/atomic: sparc: add preprocessor symbols locking/atomic: sh: add preprocessor symbols ...
2023-06-05x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errorsYazen Ghannam1-2/+4
The MI200 (Aldebaran) series of devices introduced a new SMCA bank type for Unified Memory Controllers. The MCE subsystem already has support for this new type. The MCE decoder module will decode the common MCA error information for the new bank type, but it will not pass the information to the AMD64 EDAC module for detailed memory error decoding. Have the MCE decoder module recognize the new bank type as an SMCA UMC memory error and pass the MCA information to AMD64 EDAC. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-3-muralimk@amd.com
2023-06-05locking/atomic: treewide: use raw_atomic*_<op>()Mark Rutland1-8/+8
Now that we have raw_atomic*_<op>() definitions, there's no need to use arch_atomic*_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic*_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com
2023-05-16x86/MCE: Check a hw error's address to determine proper recovery actionYazen Ghannam1-1/+1
Make sure that machine check errors with a usable address are properly marked as poison. This is needed for errors that occur on memory which have MCG_STATUS[RIPV] clear - i.e., the interrupted process cannot be restarted reliably. One example is data poison consumption through the instruction fetch units on AMD Zen-based systems. The MF_MUST_KILL flag is passed to memory_failure() when MCG_STATUS[RIPV] is not set. So the associated process will still be killed. What this does, practically, is get rid of one more check to kill_current_task with the eventual goal to remove it completely. Also, make the handling identical to what is done on the notifier path (uc_decode_notifier() does that address usability check too). The scenario described above occurs when hardware can precisely identify the address of poisoned memory, but execution cannot reliably continue for the interrupted hardware thread. [ bp: Massage commit message. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20230322005131.174499-1-tony.luck@intel.com
2023-04-25Merge tag 'ras_core_for_v6.4_rc1' of ↵Linus Torvalds2-13/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Just cleanups and fixes this time around: make threshold_ktype const, an objtool fix and use proper size for a bitmap * tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/MCE/AMD: Use an u64 for bank_map x86/mce: Always inline old MCA stubs x86/MCE/AMD: Make kobj_type structure constant
2023-03-19x86/MCE/AMD: Use an u64 for bank_mapMuralidhara M K1-7/+7
Thee maximum number of MCA banks is 64 (MAX_NR_BANKS), see a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64"). However, the bank_map which contains a bitfield of which banks to initialize is of type unsigned int and that overflows when those bit numbers are >= 32, leading to UBSAN complaining correctly: UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38 shift exponent 32 is too large for 32-bit type 'int' Change the bank_map to a u64 and use the proper BIT_ULL() macro when modifying bits in there. [ bp: Rewrite commit message. ] Fixes: a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64") Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127151601.1068324-1-muralimk@amd.com
2023-03-12x86/mce: Make sure logged MCEs are processed after sysfs updateYazen Ghannam1-0/+1
A recent change introduced a flag to queue up errors found during boot-time polling. These errors will be processed during late init once the MCE subsystem is fully set up. A number of sysfs updates call mce_restart() which goes through a subset of the CPU init flow. This includes polling MCA banks and logging any errors found. Since the same function is used as boot-time polling, errors will be queued. However, the system is now past late init, so the errors will remain queued until another error is found and the workqueue is triggered. Call mce_schedule_work() at the end of mce_restart() so that queued errors are processed. Fixes: 3bff147b187d ("x86/mce: Defer processing of early errors") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230301221420.2203184-1-yazen.ghannam@amd.com
2023-03-08x86/mce: Always inline old MCA stubsBorislav Petkov (AMD)1-5/+5
The stubs for the ancient MCA support (CONFIG_X86_ANCIENT_MCE) are normally optimized away on 64-bit builds. However, an allmodconfig one causes the compiler to add sanitizer calls gunk into them and they exist as constprop calls. Which objtool then complains about: vmlinux.o: warning: objtool: do_machine_check+0xad8: call to \ pentium_machine_check.constprop.0() leaves .noinstr.text section due to them missing noinstr. One could tag them "noinstr" but what should really happen is, they should be forcefully inlined so that all that gunk gets optimized away and the warning doesn't even have a chance to fire. Do so. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230222191054.4701-1-bp@alien8.de
2023-03-06x86/MCE/AMD: Make kobj_type structure constantThomas Weißschuh1-1/+1
Since ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230217-kobj_type-mce-amd-v1-1-40ef94816444@weissschuh.net
2023-02-21Merge tag 'ras_core_for_v6.3_rc1' of ↵Linus Torvalds3-30/+58
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Add support for reporting more bits of the physical address on error, on newer AMD CPUs - Mask out bits which don't belong to the address of the error being reported * tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Mask out non-address bits from machine check bank x86/mce: Add support for Extended Physical Address MCA changes x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
2023-01-10x86/mce: Mask out non-address bits from machine check bankTony Luck1-5/+9
Systems that support various memory encryption schemes (MKTME, TDX, SEV) use high order physical address bits to indicate which key should be used for a specific memory location. When a memory error is reported, some systems may report those key bits in the IA32_MCi_ADDR machine check MSR. The Intel SDM has a footnote for the contents of the address register that says: "Useful bits in this field depend on the address methodology in use when the register state is saved." AMD Processor Programming Reference has a more explicit description of the MCA_ADDR register: "For physical addresses, the most significant bit is given by Core::X86::Cpuid::LongModeInfo[PhysAddrSize]." Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical address bits within the machine check bank address register. Use this mask for recoverable machine check handling and in the EDAC driver to ignore any key bits that may be present. [ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ] Reported-by: Isaku Yamahata <isaku.yamahata@intel.com> Reported-by: Fan Du <fan.du@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
2023-01-07x86/mce/dev-mcelog: use strscpy() to instead of strncpy()Xu Panda1-2/+1
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/202212031419324523731@zte.com.cn
2022-12-28x86/mce: Add support for Extended Physical Address MCA changesSmita Koralahalli3-8/+33
Newer AMD CPUs support more physical address bits. That is, the MCA_ADDR registers on Scalable MCA systems contain the ErrorAddr in bits [56:0] instead of [55:0]. Hence, the existing LSB field from bits [61:56] in MCA_ADDR must be moved around to accommodate the larger ErrorAddr size. MCA_CONFIG[McaLsbInStatusSupported] indicates this change. If set, the LSB field will be found in MCA_STATUS rather than MCA_ADDR. Each logical CPU has unique MCA bank in hardware and is not shared with other logical CPUs. Additionally, on SMCA systems, each feature bit may be different for each bank within same logical CPU. Check for MCA_CONFIG[McaLsbInStatusSupported] for each MCA bank and for each CPU. Additionally, all MCA banks do not support maximum ErrorAddr bits in MCA_ADDR. Some banks might support fewer bits but the remaining bits are marked as reserved. [ Yazen: Rebased and fixed up formatting. bp: Massage comments. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20221206173607.1185907-5-yazen.ghannam@amd.com
2022-12-28x86/mce: Define a function to extract ErrorAddr from MCA_ADDRSmita Koralahalli3-18/+17
Move MCA_ADDR[ErrorAddr] extraction into a separate helper function. This will be further refactored to support extended ErrorAddr bits in MCA_ADDR in newer AMD CPUs. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/all/20220225193342.215780-3-Smita.KoralahalliChannabasappa@amd.com/
2022-10-31x86/mce: Use severity table to handle uncorrected errors in kernelTony Luck1-3/+5
mce_severity_intel() has a special case to promote UC and AR errors in kernel context to PANIC severity. The "AR" case is already handled with separate entries in the severity table for all instruction fetch errors, and those data fetch errors that are not in a recoverable area of the kernel (i.e. have an extable fixup entry). Add an entry to the severity table for UC errors in kernel context that reports severity = PANIC. Delete the special case code from mce_severity_intel(). Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220922195136.54575-2-tony.luck@intel.com
2022-10-27x86/MCE/AMD: Clear DFR errors found in THR handlerYazen Ghannam1-13/+20
AMD's MCA Thresholding feature counts errors of all severity levels, not just correctable errors. If a deferred error causes the threshold limit to be reached (it was the error that caused the overflow), then both a deferred error interrupt and a thresholding interrupt will be triggered. The order of the interrupts is not guaranteed. If the threshold interrupt handler is executed first, then it will clear MCA_STATUS for the error. It will not check or clear MCA_DESTAT which also holds a copy of the deferred error. When the deferred error interrupt handler runs it will not find an error in MCA_STATUS, but it will find the error in MCA_DESTAT. This will cause two errors to be logged. Check for deferred errors when handling a threshold interrupt. If a bank contains a deferred error, then clear the bank's MCA_DESTAT register. Define a new helper function to do the deferred error check and clearing of MCA_DESTAT. [ bp: Simplify, convert comment to passive voice. ] Fixes: 37d43acfd79f ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220621155943.33623-1-yazen.ghannam@amd.com
2022-08-29x86/mce: Retrieve poison range from hardwareJane Chu1-1/+12
When memory poison consumption machine checks fire, MCE notifier handlers like nfit_handle_mce() record the impacted physical address range which is reported by the hardware in the MCi_MISC MSR. The error information includes data about blast radius, i.e. how many cachelines did the hardware determine are impacted. A recent change 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity") updated nfit_handle_mce() to stop hard coding the blast radius value of 1 cacheline, and instead rely on the blast radius reported in 'struct mce' which can be up to 4K (64 cachelines). It turns out that apei_mce_report_mem_error() had a similar problem in that it hard coded a blast radius of 4K rather than reading the blast radius from the error information. Fix apei_mce_report_mem_error() to convey the proper poison granularity. Signed-off-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com Link: https://lore.kernel.org/r/20220826233851.1319100-1-jane.chu@oracle.com
2022-06-28x86/mce: Check whether writes to MCA_STATUS are getting ignoredSmita Koralahalli2-1/+48
The platform can sometimes - depending on its settings - cause writes to MCA_STATUS MSRs to get ignored, regardless of HWCR[McStatusWrEn]'s value. For further info see PPR for AMD Family 19h, Model 01h, Revision B1 Processors, doc ID 55898 at https://bugzilla.kernel.org/show_bug.cgi?id=206537. Therefore, probe for ignored writes to MCA_STATUS to determine if hardware error injection is at all possible. [ bp: Heavily massage commit message and patch. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220214233640.70510-2-Smita.KoralahalliChannabasappa@amd.com
2022-05-27Merge tag 'libnvdimm-for-5.19' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and DAX updates from Dan Williams: "New support for clearing memory errors when a file is in DAX mode, alongside with some other fixes and cleanups. Previously it was only possible to clear these errors using a truncate or hole-punch operation to trigger the filesystem to reallocate the block, now, any page aligned write can opportunistically clear errors as well. This change spans x86/mm, nvdimm, and fs/dax, and has received the appropriate sign-offs. Thanks to Jane for her work on this. Summary: - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)" * tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: pmem: implement pmem_recovery_write() pmem: refactor pmem_clear_poison() dax: add .recovery_write dax_operation dax: introduce DAX_RECOVERY_WRITE dax access mode mce: fix set_mce_nospec to always unmap the whole page x86/mce: relocate set{clear}_mce_nospec() functions acpi/nfit: rely on mce->misc to determine poison granularity testing: nvdimm: asm/mce.h is not needed in nfit.c testing: nvdimm: iomap: make __nfit_test_ioremap a macro nvdimm: Allow overwrite in the presence of disabled dimms tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-24Merge tag 'acpi-5.19-rc1' of ↵Linus Torvalds1-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "These update the ACPICA kernel code to upstream revision 20220331, improve handling of PCI devices that are in D3cold during system initialization, add support for a few features, fix bugs and clean up code. Specifics: - Update ACPICA code in the kernel to upstream revision 20220331 including the following changes: - Add support for the Windows 11 _OSI string (Mario Limonciello) - Add the CFMWS subtable to the CEDT table (Lawrence Hileman). - iASL: NHLT: Treat Terminator as specific_config (Piotr Maziarz). - iASL: NHLT: Fix parsing undocumented bytes at the end of Endpoint Descriptor (Piotr Maziarz). - iASL: NHLT: Rename linux specific strucures to device_info (Piotr Maziarz). - Add new ACPI 6.4 semantics to Load() and LoadTable() (Bob Moore). - Clean up double word in comment (Tom Rix). - Update copyright notices to the year 2022 (Bob Moore). - Remove some tabs and // comments - automated cleanup (Bob Moore). - Replace zero-length array with flexible-array member (Gustavo A. R. Silva). - Interpreter: Add units to time variable names (Paul Menzel). - Add support for ARM Performance Monitoring Unit Table (Besar Wicaksono). - Inform users about ACPI spec violation related to sleep length (Paul Menzel). - iASL/MADT: Add OEM-defined subtable (Bob Moore). - Interpreter: Fix some typo mistakes (Selvarasu Ganesan). - Updates for revision E.d of IORT (Shameer Kolothum). - Use ACPI_FORMAT_UINT64 for 64-bit output (Bob Moore). - Improve debug messages in the ACPI device PM code (Rafael Wysocki). - Block ASUS B1400CEAE from suspend to idle by default (Mario Limonciello). - Improve handling of PCI devices that are in D3cold during system initialization (Rafael Wysocki). - Fix BERT error region memory mapping (Lorenzo Pieralisi). - Add support for NVIDIA 16550-compatible port subtype to the SPCR parsing code (Jeff Brasen). - Use static for BGRT_SHOW kobj_attribute defines (Tom Rix). - Fix missing prototype warning for acpi_agdi_init() (Ilkka Koskinen). - Fix missing ERST record ID in the APEI code (Liu Xinpeng). - Make APEI error injection to refuse to inject into the zero page (Tony Luck). - Correct description of INT3407 / INT3532 DPTF attributes in sysfs (Sumeet Pawnikar). - Add support for high frequency impedance notification to the DPTF driver (Sumeet Pawnikar). - Make mp_config_acpi_gsi() a void function (Li kunyu). - Unify Package () representation for properties in the ACPI device properties documentation (Andy Shevchenko). - Include UUID in _DSM evaluation warning (Michael Niewöhner)" * tag 'acpi-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (41 commits) Revert "ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms" ACPI: utils: include UUID in _DSM evaluation warning ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default x86: ACPI: Make mp_config_acpi_gsi() a void function ACPI: DPTF: Add support for high frequency impedance notification ACPI: AGDI: Fix missing prototype warning for acpi_agdi_init() ACPI: bus: Avoid non-ACPI device objects in walks over children ACPI: DPTF: Correct description of INT3407 / INT3532 attributes ACPI: BGRT: use static for BGRT_SHOW kobj_attribute defines ACPI, APEI, EINJ: Refuse to inject into the zero page ACPI: PM: Always print final debug message in acpi_device_set_power() ACPI: SPCR: Add support for NVIDIA 16550-compatible port subtype ACPI: docs: enumeration: Unify Package () for properties (part 2) ACPI: APEI: Fix missing ERST record id ACPICA: Update version to 20220331 ACPICA: exsystem.c: Use ACPI_FORMAT_UINT64 for 64-bit output ACPICA: IORT: Updates for revision E.d ACPICA: executer/exsystem: Fix some typo mistakes ACPICA: iASL/MADT: Add OEM-defined subtable ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms ...
2022-05-16mce: fix set_mce_nospec to always unmap the whole pageJane Chu1-3/+3
The set_memory_uc() approach doesn't work well in all cases. As Dan pointed out when "The VMM unmapped the bad page from guest physical space and passed the machine check to the guest." "The guest gets virtual #MC on an access to that page. When the guest tries to do set_memory_uc() and instructs cpa_flush() to do clean caches that results in taking another fault / exception perhaps because the VMM unmapped the page from the guest." Since the driver has special knowledge to handle NP or UC, mark the poisoned page with NP and let driver handle it when it comes down to repair. Please refer to discussions here for more details. https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/ Now since poisoned page is marked as not-present, in order to avoid writing to a not-present page and trigger kernel Oops, also fix pmem_do_write(). Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jane Chu <jane.chu@oracle.com> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-25x86/mce: Add messages for panic errors in AMD's MCE gradingCarlos Bilbao1-1/+10
When a machine error is graded as PANIC by the AMD grading logic, the MCE handler calls mce_panic(). The notification chain does not come into effect so the AMD EDAC driver does not decode the errors. In these cases, the messages displayed to the user are more cryptic and miss information that might be relevant, like the context in which the error took place. Add messages to the grading logic for machine errors so that it is clear what error it was. [ bp: Massage commit message. ] Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20220405183212.354606-3-carlos.bilbao@amd.com
2022-04-25x86/mce: Simplify AMD severity grading logicCarlos Bilbao1-65/+36
The MCE handler needs to understand the severity of the machine errors to act accordingly. Simplify the AMD grading logic following a logic that closely resembles the descriptions of the public PPR documents. This will help include more fine-grained grading of errors in the future. [ bp: Touchups. ] Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20220405183212.354606-2-carlos.bilbao@amd.com
2022-04-13ACPI: APEI: Fix missing ERST record idLiu Xinpeng1-5/+3
Read a record is cleared by others, but the deleted record cache entry is still created by erst_get_record_id_next. When next enumerate the records, get the cached deleted record, then erst_read() return -ENOENT and try to get next record, loop back to first ID will return 0 in function __erst_record_id_cache_add_one and then set record_id as APEI_ERST_INVALID_RECORD_ID, finished this time read operation. It will result in read the records just in the cache hereafter. This patch cleared the deleted record cache, fix the issue that "./erst-inject -p" shows record counts not equal to "./erst-inject -n". A reproducer of the problem(retry many times): [root@localhost erst-inject]# ./erst-inject -c 0xaaaaa00011 [root@localhost erst-inject]# ./erst-inject -p rc: 273 rcd sig: CPER rcd id: 0xaaaaa00012 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00013 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00014 [root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000006 [root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000007 [root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000008 [root@localhost erst-inject]# ./erst-inject -p rc: 273 rcd sig: CPER rcd id: 0xaaaaa00012 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00013 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00014 [root@localhost erst-inject]# ./erst-inject -n total error record count: 6 Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn> Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-04-05x86/MCE/AMD: Fix memory leak when threshold_create_bank() failsAmmar Faizi1-13/+19
In mce_threshold_create_device(), if threshold_create_bank() fails, the previously allocated threshold banks array @bp will be leaked because the call to mce_threshold_remove_device() will not free it. This happens because mce_threshold_remove_device() fetches the pointer through the threshold_banks per-CPU variable but bp is written there only after the bank creation is successful, and not before, when threshold_create_bank() fails. Add a helper which unwinds all the bank creation work previously done and pass into it the previously allocated threshold banks array for freeing. [ bp: Massage. ] Fixes: 6458de97fc15 ("x86/mce/amd: Straighten CPU hotplug path") Co-developed-by: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org> Signed-off-by: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220329104705.65256-3-ammarfaizi2@gnuweeb.org
2022-04-05x86/mce: Avoid unnecessary padding in struct mce_bankSmita Koralahalli1-1/+3
Convert struct mce_bank member "init" from bool to a bitfield to get rid of unnecessary padding. $ pahole -C mce_bank arch/x86/kernel/cpu/mce/core.o before: /* size: 16, cachelines: 1, members: 2 */ /* padding: 7 */ /* last cacheline: 16 bytes */ after: /* size: 16, cachelines: 1, members: 3 */ /* last cacheline: 16 bytes */ No functional changes. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220225193342.215780-2-Smita.KoralahalliChannabasappa@amd.com
2022-03-25Merge tag 'ras_core_for_v5.18_rc1' of ↵Linus Torvalds3-90/+139
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - More noinstr fixes - Add an erratum workaround for Intel CPUs which, in certain circumstances, end up consuming an unrelated uncorrectable memory error when using fast string copy insns - Remove the MCE tolerance level control as it is not really needed or used anymore * tag 'ras_core_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Remove the tolerance level control x86/mce: Work around an erratum on fast string copy instructions x86/mce: Use arch atomic and bit helpers
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-3/+5
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handlerluofei1-3/+5
When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing for mce handler, but should return a distinct value, otherwise mce handler regards the error page has been identified and isolated, which may lead to calling set_mce_nospec() to change page attribute, etc. Here memory_failure() return -EOPNOTSUPP to indicate that the error event is filtered, mce handler should not take any action for this situation and hwpoison injector should treat as correct. Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com Signed-off-by: luofei <luofei@unicloud.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-23x86/mce: Remove the tolerance level controlBorislav Petkov3-47/+30
This is pretty much unused and not really useful. What is more, all relevant MCA hardware has recoverable machine checks support so there's no real need to tweak MCA tolerance levels in order to *maybe* extend machine lifetime. So rip it out. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/YcDq8PxvKtTENl/e@zn.tnic
2022-02-19x86/mce: Work around an erratum on fast string copy instructionsJue Wang2-1/+68
A rare kernel panic scenario can happen when the following conditions are met due to an erratum on fast string copy instructions: 1) An uncorrected error. 2) That error must be in first cache line of a page. 3) Kernel must execute page_copy from the page immediately before that page. The fast string copy instructions ("REP; MOVS*") could consume an uncorrectable memory error in the cache line _right after_ the desired region to copy and raise an MCE. Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string copy and will avoid such spurious machine checks. However, that is less preferable due to the permanent performance impact. Considering memory poison is rare, it's desirable to keep fast string copy enabled until an MCE is seen. Intel has confirmed the following: 1. The CPU erratum of fast string copy only applies to Skylake, Cascade Lake and Cooper Lake generations. Directly return from the MCE handler: 2. Will result in complete execution of the "REP; MOVS*" with no data loss or corruption. 3. Will not result in another MCE firing on the next poisoned cache line due to "REP; MOVS*". 4. Will resume execution from a correct point in code. 5. Will result in the same instruction that triggered the MCE firing a second MCE immediately for any other software recoverable data fetch errors. 6. Is not safe without disabling the fast string copy, as the next fast string copy of the same buffer on the same CPU would result in a PANIC MCE. This should mitigate the erratum completely with the only caveat that the fast string copy is disabled on the affected hyper thread thus performance degradation. This is still better than the OS crashing on MCEs raised on an irrelevant process due to "REP; MOVS*' accesses in a kernel context, e.g., copy_page.