summaryrefslogtreecommitdiff
path: root/drivers/edac
AgeCommit message (Collapse)AuthorFilesLines
2025-02-27EDAC/qcom: Correct interrupt enable register configurationKomal Bajaj1-2/+2
commit c158647c107358bf1be579f98e4bb705c1953292 upstream. The previous implementation incorrectly configured the cmn_interrupt_2_enable register for interrupt handling. Using cmn_interrupt_2_enable to configure Tag, Data RAM ECC interrupts would lead to issues like double handling of the interrupts (EL1 and EL3) as cmn_interrupt_2_enable is meant to be configured for interrupts which needs to be handled by EL3. EL1 LLCC EDAC driver needs to use cmn_interrupt_0_enable register to configure Tag, Data RAM ECC interrupts instead of cmn_interrupt_2_enable. Fixes: 27450653f1db ("drivers: edac: Add EDAC driver support for QCOM SoCs") Signed-off-by: Komal Bajaj <quic_kbajaj@quicinc.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20241119064608.12326-1-quic_kbajaj@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27EDAC/amd64: Simplify ECC check on unified memory controllersBorislav Petkov (AMD)1-22/+10
commit 747367340ca6b5070728b86ae36ad6747f66b2fb upstream. The intent of the check is to see whether at least one UMC has ECC enabled. So do that instead of tracking which ones are enabled in masks which are too small in size anyway and lead to not loading the driver on Zen4 machines with UMCs enabled over UMC8. Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh") Reported-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Avadhut Naik <avadhut.naik@amd.com> Reviewed-by: Avadhut Naik <avadhut.naik@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05EDAC/igen6: Avoid segmentation fault on module unloadOrange Kao1-0/+2
[ Upstream commit fefaae90398d38a1100ccd73b46ab55ff4610fba ] The segmentation fault happens because: During modprobe: 1. In igen6_probe(), igen6_pvt will be allocated with kzalloc() 2. In igen6_register_mci(), mci->pvt_info will point to &igen6_pvt->imc[mc] During rmmod: 1. In mci_release() in edac_mc.c, it will kfree(mci->pvt_info) 2. In igen6_remove(), it will kfree(igen6_pvt); Fix this issue by setting mci->pvt_info to NULL to avoid the double kfree. Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219360 Signed-off-by: Orange Kao <orange@aiven.io> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241104124237.124109-2-orange@aiven.io Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicatorQiuxu Zhuo3-0/+25
[ Upstream commit a36667037a0c0e36c59407f8ae636295390239a5 ] The Granite Rapids CPUs with Flat2LM memory configurations may mistakenly report near-memory errors as far-memory errors, resulting in the invalid decoded ADXL results: EDAC skx: Bad imc -1 Fix this incorrect far-memory error source indicator by prefetching the decoded far-memory controller ID, and adjust the error source indicator to near-memory if the far-memory controller ID is invalid. Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com> Link: https://lore.kernel.org/r/20241015072236.24543-3-qiuxu.zhuo@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05EDAC/skx_common: Differentiate memory error sourcesQiuxu Zhuo2-18/+23
[ Upstream commit 2397f795735219caa9c2fe61e7bcdd0652e670d3 ] The current skx_common determines whether the memory error source is the near memory of the 2LM system and then retrieves the decoded error results from the ADXL components (near-memory vs. far-memory) accordingly. However, some memory controllers may have limitations in correctly reporting the memory error source, leading to the retrieval of incorrect decoded parts from the ADXL. To address these limitations, instead of simply determining whether the memory error is from the near memory of the 2LM system, it is necessary to distinguish the memory error source details as follows: Memory error from the near memory of the 2LM system. Memory error from the far memory of the 2LM system. Memory error from the 1LM system. Not a memory error. This will enable the i10nm_edac driver to take appropriate actions for those memory controllers that have limitations in reporting the memory error source. Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com> Link: https://lore.kernel.org/r/20241015072236.24543-2-qiuxu.zhuo@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05EDAC/fsl_ddr: Fix bad bit shift operationsPriyanka Singh1-9/+13
[ Upstream commit 9ec22ac4fe766c6abba845290d5139a3fbe0153b ] Fix undefined behavior caused by left-shifting a negative value in the expression: cap_high ^ (1 << (bad_data_bit - 32)) The variable bad_data_bit ranges from 0 to 63. When it is less than 32, bad_data_bit - 32 becomes negative, and left-shifting by a negative value in C is undefined behavior. Fix this by combining cap_high and cap_low into a 64-bit variable. [ bp: Massage commit message, simplify error bits handling. ] Fixes: ea2eb9a8b620 ("EDAC, fsl-ddr: Separate FSL DDR driver from MPC85xx") Signed-off-by: Priyanka Singh <priyanka.singh@nxp.com> Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241016-imx95_edac-v3-3-86ae6fc2756a@nxp.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05EDAC/bluefield: Fix potential integer overflowDavid Thompson1-1/+1
[ Upstream commit 1fe774a93b46bb029b8f6fa9d1f25affa53f06c6 ] The 64-bit argument for the "get DIMM info" SMC call consists of mem_ctrl_idx left-shifted 16 bits and OR-ed with DIMM index. With mem_ctrl_idx defined as 32-bits wide the left-shift operation truncates the upper 16 bits of information during the calculation of the SMC argument. The mem_ctrl_idx stack variable must be defined as 64-bits wide to prevent any potential integer overflow, i.e. loss of data from upper 16 bits. Fixes: 82413e562ea6 ("EDAC, mellanox: Add ECC support for BlueField DDR4") Signed-off-by: David Thompson <davthompson@nvidia.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shravan Kumar Ramani <shravankr@nvidia.com> Link: https://lore.kernel.org/r/20240930151056.10158-1-davthompson@nvidia.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-05EDAC/qcom: Make irq configuration optionalRajendra Nayak1-3/+5
On most modern qualcomm SoCs, the configuration necessary to enable the Tag/Data RAM related irqs being propagated to the SoC irq controller is already done in firmware (in DSF or 'DDR System Firmware') On some like the x1e80100, these registers aren't even accesible to the kernel causing a crash when edac device is probed. Hence, make the irq configuration optional in the driver and mark x1e80100 as the SoC on which this should be avoided. Fixes: af16b00578a7 ("arm64: dts: qcom: Add base X1E80100 dtsi and the QCP dts") Reported-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Rajendra Nayak <quic_rjendra@quicinc.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240903101510.3452734-1-quic_rjendra@quicinc.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2024-09-16Merge tag 'edac_updates_for_v6.12' of ↵Linus Torvalds11-1729/+115
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Drop a now obsolete ppc4xx_edac driver - Fix conversion to physical memory addresses on Intel's Elkhart Lake and Ice Lake hardware when the system address is above the (Top-Of-Memory) TOM address - Pay attention to the memory hole on Zynq UltraScale+ MPSoC DDR controllers when injecting errors for testing purposes - Add support for translating normalized error addresses reported by an AMD memory controller into system physical addresses using an UEFI mechanism called platform runtime mechanism (PRM). - The usual cleanups and fixes * tag 'edac_updates_for_v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC: Drop obsolete PPC4xx driver EDAC/sb_edac: Fix the compile warning of large frame size EDAC/{skx_common,i10nm}: Remove the AMAP register for determing DDR5 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common EDAC/igen6: Fix conversion of system address to physical memory address EDAC/synopsys: Fix error injection on Zynq UltraScale+ RAS/AMD/ATL: Translate normalized to system physical addresses using PRM ACPI: PRM: Add PRM handler direct call support
2024-09-09Merge remote-tracking branches 'ras/edac-amd-atl', 'ras/edac-misc' and ↵Borislav Petkov (AMD)11-1729/+115
'ras/edac-drivers' into edac-updates * ras/edac-amd-atl: RAS/AMD/ATL: Translate normalized to system physical addresses using PRM ACPI: PRM: Add PRM handler direct call support * ras/edac-misc: EDAC/synopsys: Fix error injection on Zynq UltraScale+ * ras/edac-drivers: EDAC: Drop obsolete PPC4xx driver EDAC/sb_edac: Fix the compile warning of large frame size EDAC/{skx_common,i10nm}: Remove the AMAP register for determing DDR5 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common EDAC/igen6: Fix conversion of system address to physical memory address Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-09-05EDAC: Drop obsolete PPC4xx driverRob Herring (Arm)4-1602/+0
Since 47d13a269bbd ("powerpc/40x: Remove 40x platforms.") support for PPC40x platforms has been removed. While the EDAC driver also mentions PPC440 and PPC460 processors, the driver refuses to probe on anything other than PPC405. It's unlikely support will ever be added at this point for these other old platforms, so the driver can be removed. Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/r/20240904192224.3060307-2-robh@kernel.org
2024-09-03EDAC/sb_edac: Fix the compile warning of large frame sizeQiuxu Zhuo1-17/+18
Compiling sb_edac driver with GCC 11.4.0 and the W=1 option reported the following warning: drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’: drivers/edac/sb_edac.c:3249:1: warning: the frame size of 1032 bytes is larger than 1024 bytes [-Wframe-larger-than=] As there is no concurrent invocation of sbridge_mce_output_error(), fix this warning by moving the large-size variables 'msg' and 'msg_full' from the stack to the pre-allocated data segment. [Tony: Fix checkpatch warnings for code alignment & use of strcpy()] Reported-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240829120903.84152-1-qiuxu.zhuo@intel.com
2024-09-03EDAC/{skx_common,i10nm}: Remove the AMAP register for determing DDR5Qiuxu Zhuo2-8/+3
The configuration flag 'res_config->support_ddr5 = true' sufficiently indicates DDR5 memory support for Sapphire Rapids and Granite Rapids. Additionally, the i10nm_edac driver doesn't need to use the AMAP register for setting the 'fine_grain_bank' of each DIMM. Therefore, remove the AMAP register for determining DDR5. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240829061309.57738-1-qiuxu.zhuo@intel.com
2024-09-03EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_commonQiuxu Zhuo4-100/+59
Commit afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module") made skx_common.o a separate module. With skx_common.o now a separate module, move the common debug code setup_{skx,i10nm}_debug() and teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to reduce code duplication. Additionally, prefix these function names with 'skx' to maintain consistency with other names in the file. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240829055101.56245-1-qiuxu.zhuo@intel.com
2024-09-03EDAC/igen6: Fix conversion of system address to physical memory addressQiuxu Zhuo1-1/+1
The conversion of system address to physical memory address (as viewed by the memory controller) by igen6_edac is incorrect when the system address is above the TOM (Total amount Of populated physical Memory) for Elkhart Lake and Ice Lake (Neural Network Processor). Fix this conversion. Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/stable/20240814061011.43545-1-qiuxu.zhuo%40intel.com
2024-08-01EDAC/synopsys: Fix error injection on Zynq UltraScale+Shubhrajyoti Datta1-1/+34
The Zynq UltraScale+ MPSoC DDR has a disjoint memory from 2GB to 32GB. The DDR host interface has a contiguous memory so while injecting errors, the driver should remove the hole else the injection fails as the address translation is incorrect. Introduce a get_mem_info() function pointer and set it for Zynq UltraScale+ platform to return host address. Fixes: 1a81361f75d8 ("EDAC, synopsys: Add Error Injection support for ZynqMP DDR controller") Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240711100656.31376-1-shubhrajyoti.datta@amd.com
2024-07-28minmax: make generic MIN() and MAX() macros available everywhereLinus Torvalds1-1/+0
This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches: - trivial duplicated macro definitions have been removed Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically. - non-trivial duplicated macro definitions are guarded with #ifndef This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case. - strange use case #1 A couple of drivers decided that the way they want to describe their versioning is with #define MAJ 1 #define MIN 2 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as #define DRV_VERSION "1.2" instead. - strange use case #2 A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enum's instead. The new function-line macros only expand when followed by an open parenthesis, and thus don't clash with enum use. Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28minmax: add a few more MIN_T/MAX_T usersLinus Torvalds1-2/+2
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-15Merge tag 'x86_misc_for_v6.11_rc1' of ↵Linus Torvalds2-35/+38
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Borislav Petkov: - Make error checking of AMD SMN accesses more robust in the callers as they're the only ones who can interpret the results properly - The usual cleanups and fixes, left and right * tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kmsan: Fix hook for unaligned accesses x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos x86/pci/xen: Fix PCIBIOS_* return code handling x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling x86/of: Return consistent error type from x86_of_pci_irq_enable() hwmon: (k10temp) Rename _data variable hwmon: (k10temp) Remove unused HAVE_TDIE() macro hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters hwmon: (k10temp) Define a helper function to read CCD temperature x86/amd_nb: Enhance SMN access error checking hwmon: (k10temp) Check return value of amd_smn_read() EDAC/amd64: Check return value of amd_smn_read() EDAC/amd64: Remove unused register accesses tools/x86/kcpuid: Add missing dir via Makefile x86, arm: Add missing license tag to syscall tables files
2024-07-15Merge tag 'edac_updates_for_v6.11' of ↵Linus Torvalds17-38/+63
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - The AMD memory controllers data fabric version 4.5 supports non-power-of-2 denormalization in the sense that certain bits of the system physical address cannot be reconstructed from the normalized address reported by the RAS hardware. Add support for handling such addresses - Switch the EDAC drivers to the new Intel CPU model defines - The usual fixes and cleanups all over the place * tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC: Add missing MODULE_DESCRIPTION() macros EDAC/dmc520: Use devm_platform_ioremap_resource() EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization RAS/AMD/ATL: Validate address map when information is gathered RAS/AMD/ATL: Expand helpers for adding and removing base and hole RAS/AMD/ATL: Read DRAM hole base early RAS/AMD/ATL: Add amd_atl pr_fmt() prefix RAS/AMD/ATL: Add a missing module description EDAC, i10nm: make skx_common.o a separate module EDAC/skx: Switch to new Intel CPU model defines EDAC/sb_edac: Switch to new Intel CPU model defines EDAC, pnd2: Switch to new Intel CPU model defines EDAC/i10nm: Switch to new Intel CPU model defines EDAC/ghes: Add missing newline to pr_info() statement RAS/AMD/ATL: Add missing newline to pr_info() statement EDAC/thunderx: Remove unused struct error_syndrome
2024-06-29EDAC: Add missing MODULE_DESCRIPTION() macrosJeff Johnson6-0/+6
With ARCH=arm64 make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/edac/layerscape_edac_mod.o Add the missing invocation of the MODULE_DESCRIPTION() macro to all files which have a MODULE_LICENSE(). This includes mpc85xx_edac.c and four octeon_edac-*.c files which, although they did not produce a warning with the arm64 allmodconfig configuration, may cause this warning with other configurations. [ bp: s/module/driver/ for layerscape_edac ] Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240617-md-arm64-drivers-edac-v2-1-6d6c5dd1e5da@quicinc.com
2024-06-23EDAC/dmc520: Use devm_platform_ioremap_resource()Jai Arora1-3/+1
platform_get_resource() and devm_ioremap_resource() are wrapped up in the devm_platform_ioremap_resource() helper. Use the helper and get rid of the local variable for struct resource *. Signed-off-by: Jai Arora <jai.arora@samsung.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240618110226.97395-1-jai.arora@samsung.com
2024-06-14EDAC/igen6: Add Intel Arrow Lake-U/H SoCs supportQiuxu Zhuo1-0/+8
Arrow Lake-U/H SoCs share same IBECC registers with Meteor Lake-P SoCs. Add Arrow Lake-U/H SoC compute die IDs for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240614030354.69180-1-qiuxu.zhuo@intel.com
2024-06-12EDAC/amd64: Check return value of amd_smn_read()Yazen Ghannam1-14/+37
Check the return value of amd_smn_read() before saving a value. This ensures invalid values aren't saved. The struct umc instance is initialized to 0 during memory allocation. Therefore, a bad read will keep the value as 0 providing the expected Read-as-Zero behavior. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-2-ffde21931c3f@amd.com
2024-06-12EDAC/amd64: Remove unused register accessesYazen Ghannam2-21/+1
A number of UMC registers are read only for the purpose of debug printing. They are not used in any calculations. Nor do they have any specific debug value. Remove them. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-1-ffde21931c3f@amd.com
2024-06-04EDAC/igen6: Convert PCIBIOS_* return codes to errnosIlpo Järvinen1-2/+2
errcmd_enable_error_reporting() uses pci_{read,write}_config_word() that return PCIBIOS_* codes. The return code is then returned all the way into the probe function igen6_probe() that returns it as is. The probe functions, however, should return normal errnos. Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal errno before returning it from errcmd_enable_error_reporting(). Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240527132236.13875-2-ilpo.jarvinen@linux.intel.com
2024-06-04EDAC/amd64: Convert PCIBIOS_* return codes to errnosIlpo Järvinen1-3/+5
gpu_get_node_map() uses pci_read_config_dword() that returns PCIBIOS_* codes. The return code is then returned all the way into the module init function amd64_edac_init() that returns it as is. The module init functions, however, should return normal errnos. Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal errno before returning it from gpu_get_node_map(). For consistency, convert also the other similar cases which return PCIBIOS_* codes even if they do not have any bugs at the moment. Fixes: 4251566ebc1c ("EDAC/amd64: Cache and use GPU node map") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240527132236.13875-1-ilpo.jarvinen@linux.intel.com
2024-05-29EDAC, i10nm: make skx_common.o a separate moduleArnd Bergmann3-8/+27
Commit 598afa050403 ("kbuild: warn objects shared among multiple modules") was added to track down cases where the same object is linked into multiple modules. This can cause serious problems if some modules are builtin while others are not. That test triggers this warning: scripts/Makefile.build:236: drivers/edac/Makefile: skx_common.o is added to multiple modules: i10nm_edac skx_edac Make this a separate module instead. [Tony: Added more background details to commit message] Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240529095132.1929397-1-arnd@kernel.org/
2024-05-28EDAC/skx: Switch to new Intel CPU model definesTony Luck1-1/+1
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-39-tony.luck@intel.com
2024-05-28EDAC/sb_edac: Switch to new Intel CPU model definesTony Luck1-7/+7
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-38-tony.luck@intel.com
2024-05-28EDAC, pnd2: Switch to new Intel CPU model definesTony Luck1-2/+2
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-37-tony.luck@intel.com
2024-05-28EDAC/i10nm: Switch to new Intel CPU model definesTony Luck1-10/+10
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-36-tony.luck@intel.com
2024-05-28EDAC/ghes: Add missing newline to pr_info() statementVasyl Gomonovych1-1/+1
Add a missing newline character even if printk() adds newlines to non-\n-terminated strings because in the unlikely case a KERN_CONT print statement is added after the unterminated statement, the two will get glued together which is not the expected behavior. [ bp: Rewrite commit message. ] Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240517204951.2019031-1-gomonovych@gmail.com
2024-05-27EDAC/thunderx: Remove unused struct error_syndromeDr. David Alan Gilbert1-6/+0
struct error_syndrome appears never to have been used. Remove it, together with the MAX_SYNDROME_REGS it used. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240516133404.251397-1-linux@treblig.org
2024-05-14Merge tag 'edac_updates_for_v6.10' of ↵Linus Torvalds20-147/+50
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Have skx_edac decode error addresses belonging to SGX properly - Remove a bunch of unused struct members - Other cleanups * tag 'edac_updates_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/skx_common: Allow decoding of SGX addresses EDAC/mc_sysfs: Convert sprintf()/snprintf() to sysfs_emit() EDAC: Remove unused struct members EDAC: Remove dynamic attributes from edac_device_alloc_ctl_info() EDAC/device: Remove edac_dev_sysfs_block_attribute::store() EDAC/device: Remove edac_dev_sysfs_block_attribute::{block,value} EDAC/amd64: Remove unused struct member amd64_pvt::ext_nbcfg
2024-05-06EDAC/synopsys: Fix ECC status and IRQ control race conditionSerge Semin1-13/+37
The race condition around the ECCCLR register access happens in the IRQ disable method called in the device remove() procedure and in the ECC IRQ handler: 1. Enable IRQ: a. ECCCLR = EN_CE | EN_UE 2. Disable IRQ: a. ECCCLR = 0 3. IRQ handler: a. ECCCLR = CLR_CE | CLR_CE_CNT | CLR_CE | CLR_CE_CNT b. ECCCLR = 0 c. ECCCLR = EN_CE | EN_UE So if the IRQ disabling procedure is called concurrently with the IRQ handler method the IRQ might be actually left enabled due to the statement 3c. The root cause of the problem is that ECCCLR register (which since v3.10a has been called as ECCCTL) has intermixed ECC status data clear flags and the IRQ enable/disable flags. Thus the IRQ disabling (clear EN flags) and handling (write 1 to clear ECC status data) procedures must be serialised around the ECCCTL register modification to prevent the race. So fix the problem described above by adding the spin-lock around the ECCCLR modifications and preventing the IRQ-handler from modifying the IRQs enable flags (there is no point in disabling the IRQ and then re-enabling it again within a single IRQ handler call, see the statements 3a/3b and 3c above). Fixes: f7824ded4149 ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR") Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240222181324.28242-2-fancer.lancer@gmail.com
2024-04-25EDAC/versal: Do not log total error countsShubhrajyoti Datta1-2/+2
When logging errors, the driver currently logs the total error count. However, it should log the current error only. Fix it. [ bp: Rewrite text. ] Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver") Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240425121942.26378-4-shubhrajyoti.datta@amd.com
2024-04-25EDAC/versal: Check user-supplied data before injecting an errorShubhrajyoti Datta1-0/+3
The function inject_data_ue_store() lacks a NULL check for the user passed values. To prevent below kernel crash include a NULL check. Call trace: kstrtoull kstrtou8 inject_data_ue_store full_proxy_write vfs_write ksys_write __arm64_sys_write invoke_syscall el0_svc_common.constprop.0 do_el0_svc el0_svc el0t_64_sync_handler el0t_64_sync Fixes: 83bf24051a60 ("EDAC/versal: Make the bit position of injected errors configurable") Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240425121942.26378-3-shubhrajyoti.datta@amd.com
2024-04-25EDAC/versal: Do not register for NOC errorsShubhrajyoti Datta1-4/+1
The NOC errors are not handled in the driver. Remove the request for registration. Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240425121942.26378-2-shubhrajyoti.datta@amd.com
2024-04-08EDAC/skx_common: Allow decoding of SGX addressesQiuxu Zhuo1-1/+1
There are no "struct page" associations with SGX pages, causing the check pfn_to_online_page() to fail. This results in the inability to decode the SGX addresses and warning messages like: Invalid address 0x34cc9a98840 in IA32_MC17_ADDR Add an additional check to allow the decoding of the error address and to skip the warning message, if the error address is an SGX address. Fixes: 1e92af09fab1 ("EDAC/skx_common: Filter out the invalid address") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240408120419.50234-1-qiuxu.zhuo@intel.com
2024-04-04EDAC/mc_sysfs: Convert sprintf()/snprintf() to sysfs_emit()Li Zhijian1-24/+23
Per Documentation/filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Generated by: make coccicheck M=<path/to/file> MODE=patch \ COCCI=scripts/coccinelle/api/device_attr_show.cocci No functional change intended. [ bp: Massage. ] Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240314084628.1322006-1-lizhijian@fujitsu.com
2024-03-27EDAC: Remove unused struct membersJiri Slaby (SUSE)2-8/+0
Remove unused - edac_pci_ctl_info::edac_subsys - edac_pci_ctl_info::complete - edac_device_ctl_info::removal_complete members. Found by https://github.com/jirislaby/clang-struct. [ bp: Squash three almost identical trivial patches into one. ] Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-6-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC: Remove dynamic attributes from edac_device_alloc_ctl_info()Jiri Slaby (SUSE)15-79/+20
Dynamic attributes are not passed from any caller of edac_device_alloc_ctl_info(). Drop this unused/untested functionality completely. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-5-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC/device: Remove edac_dev_sysfs_block_attribute::store()Jiri Slaby (SUSE)3-26/+6
No one uses this store hook (both BLOCK_ATTR() pass NULL). It actually never was since its addition in fd309a9d8e63 ("drivers/edac: fix leaf sysfs attribute") so drop it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-4-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC/device: Remove edac_dev_sysfs_block_attribute::{block,value}Jiri Slaby (SUSE)2-8/+0
They're unused. And they were never used since their addition in fd309a9d8e63 ("drivers/edac: fix leaf sysfs attribute") Drop it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-3-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC/amd64: Remove unused struct member amd64_pvt::ext_nbcfgJiri Slaby (SUSE)1-1/+0
Commit cfe40fdb4a46 ("amd64_edac: add driver header") added amd64_pvt struct with ext_nbcfg in it. But no one used that member since then. Therefore, remove it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20240213112051.27715-2-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-11Merge tag 'edac_updates_for_v6.9' of ↵Linus Torvalds6-316/+177
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime. - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library. - igen6: Add support for Alder Lake-N SoC - i10nm: Add Grand Ridge support - The usual fixlets and cleanups * tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/versal: Convert to platform remove callback returning void RAS/AMD/FMPM: Fix off by one when unwinding on error RAS/AMD/FMPM: Add debugfs interface to print record entries RAS/AMD/FMPM: Save SPA values RAS: Export helper to get ras_debugfs_dir RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2() RAS: Introduce a FRU memory poison manager RAS/AMD/ATL: Add MI300 row retirement support Documentation: Move RAS section to admin-guide EDAC/versal: Make the bit position of injected errors configurable EDAC/i10nm: Add Intel Grand Ridge micro-server support EDAC/igen6: Add one more Intel Alder Lake-N SoC support RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300() RAS/AMD/ATL: Add MI300 support Documentation: RAS: Add index and address translation section EDAC/amd64: Use new AMD Address Translation Library RAS: Introduce AMD Address Translation Library EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-03-11Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and ↵Borislav Petkov (AMD)4-316/+174
'ras/edac-amd-atl' into edac-updates-for-v6.9 * ras/edac-drivers: EDAC/i10nm: Add Intel Grand Ridge micro-server support EDAC/igen6: Add one more Intel Alder Lake-N SoC support * ras/edac-misc: EDAC/versal: Convert to platform remove callback returning void EDAC/versal: Make the bit position of injected errors configurable EDAC/synopsys: Convert to devm_platform_ioremap_resource() * ras/edac-amd-atl: RAS/AMD/FMPM: Fix off by one when unwinding on error RAS/AMD/FMPM: Add debugfs interface to print record entries RAS/AMD/FMPM: Save SPA values RAS: Export helper to get ras_debugfs_dir RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2() RAS: Introduce a FRU memory poison manager RAS/AMD/ATL: Add MI300 row retirement support Documentation: Move RAS section to admin-guide RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300() RAS/AMD/ATL: Add MI300 support Documentation: RAS: Add index and address translation section EDAC/amd64: Use new AMD Address Translation Library RAS: Introduce AMD Address Translation Library Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-08EDAC/versal: Convert to platform remove callback returning voidUwe Kleine-König1-4/+2
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve this, there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Link: https://lore.kernel.org/r/83deca1ce260f7e17ff3cb106c9a6946d4ca4505.1709886922.git.u.kleine-koenig@pengutronix.de
2024-02-15x86/cpu/amd: Provide a separate accessor for Node IDThomas Gleixner2-4/+4
AMD (ab)uses topology_die_id() to store the Node ID information and topology_max_dies_per_pkg to store the number of nodes per package. This collides with the proper processor die level enumeration which is coming on AMD with CPUID 8000_0026, unless there is a correlation between the two. There is zero documentation about that. So provide new storage and new accessors which for now still access die_id and topology_max_die_per_pkg(). Will be mopped up after AMD and HYGON are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta &l