summaryrefslogtreecommitdiff
path: root/drivers/edac
AgeCommit message (Collapse)AuthorFilesLines
2022-02-23EDAC: Fix calculation of returned address and next offset in edac_align_ptr()Eliav Farber1-1/+1
commit f8efca92ae509c25e0a4bd5d0a86decea4f0c41e upstream. Do alignment logic properly and use the "ptr" local variable for calculating the remainder of the alignment. This became an issue because struct edac_mc_layer has a size that is not zero modulo eight, and the next offset that was prepared for the private data was unaligned, causing an alignment exception. The patch in Fixes: which broke this actually wanted to "what we actually care about is the alignment of the actual pointer that's about to be returned." But it didn't check that alignment. Use the correct variable "ptr" for that. [ bp: Massage commit message. ] Fixes: 8447c4d15e35 ("edac: Do alignment logic properly in edac_align_ptr()") Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220113100622.12783-2-farbere@amazon.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08EDAC/xgene: Fix deferred probingSergey Shtylyov1-1/+1
commit dfd0dfb9a7cc04acf93435b440dd34c2ca7b4424 upstream. The driver overrides error codes returned by platform_get_irq_optional() to -EINVAL for some strange reason, so if it returns -EPROBE_DEFER, the driver will fail the probe permanently instead of the deferred probing. Switch to propagating the proper error codes to platform driver code upwards. [ bp: Massage commit message. ] Fixes: 0d4429301c4a ("EDAC: Add APM X-Gene SoC EDAC driver") Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220124185503.6720-3-s.shtylyov@omp.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08EDAC/altera: Fix deferred probingSergey Shtylyov1-1/+1
commit 279eb8575fdaa92c314a54c0d583c65e26229107 upstream. The driver overrides the error codes returned by platform_get_irq() to -ENODEV for some strange reason, so if it returns -EPROBE_DEFER, the driver will fail the probe permanently instead of the deferred probing. Switch to propagating the proper error codes to platform driver code upwards. [ bp: Massage commit message. ] Fixes: 71bcada88b0f ("edac: altera: Add Altera SDRAM EDAC support") Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220124185503.6720-2-s.shtylyov@omp.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27EDAC/synopsys: Use the quirk for version instead of ddr versionDinh Nguyen1-2/+1
[ Upstream commit bd1d6da17c296bd005bfa656952710d256e77dd3 ] Version 2.40a supports DDR_ECC_INTR_SUPPORT for a quirk, so use that quirk to determine a call to setup_address_map(). Signed-off-by: Dinh Nguyen <dinguyen@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Michal Simek <michal.simek@xilinx.com> Link: https://lkml.kernel.org/r/20211012190709.1504152-1-dinguyen@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18EDAC/amd64: Handle three rank interleaving modeYazen Ghannam1-1/+21
[ Upstream commit 9f4873fb6af7966de8fcbd95c36b61351c1c4b1f ] AMD Rome systems and later support interleaving between three identical ranks within a channel. Check for this mode by counting the number of enabled chip selects and comparing their masks. If there are exactly three enabled chip selects and their masks are identical, then three rank interleaving is enabled. The size of a rank is determined from its mask value. However, three rank interleaving doesn't follow the method of swapping an interleave bit with the most significant bit. Rather, the interleave bit is flipped and the most significant bit remains the same. There is only a single interleave bit in this case. Account for this when determining the chip select size by keeping the most significant bit at its original value and ignoring any zero bits. This will return a full bitmask in [MSB:1]. Fixes: e53a3b267fb0 ("EDAC/amd64: Find Chip Select memory size using Address Mask") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211005154419.2060504-1-yazen.ghannam@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/HaswellEric Badger1-1/+1
commit 537bddd069c743759addf422d0b8f028ff0f8dbc upstream. The computation of TOHM is off by one bit. This missed bit results in too low a value for TOHM, which can cause errors in regular memory to incorrectly report: EDAC MC0: 1 CE Error at MMIOH area, on addr 0x000000207fffa680 on any memory Fixes: 50d1bb93672f ("sb_edac: add support for Haswell based systems") Cc: stable@vger.kernel.org Reported-by: Meeta Saggi <msaggi@purestorage.com> Signed-off-by: Eric Badger <ebadger@purestorage.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20211010170127.848113-1-ebadger@purestorage.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20EDAC/armada-xp: Fix output of uncorrectable error counterHans Potsch1-1/+1
commit d9b7748ffc45250b4d7bcf22404383229bc495f5 upstream. The number of correctable errors is displayed as uncorrectable errors because the "SBE" error count is passed to both calls of edac_mc_handle_error(). Pass the correct uncorrectable error count to the second edac_mc_handle_error() call when logging uncorrectable errors. [ bp: Massage commit message. ] Fixes: 7f6998a41257 ("ARM: 8888/1: EDAC: Add driver for the Marvell Armada XP SDRAM and L2 cache ECC") Signed-off-by: Hans Potsch <hans.potsch@nokia.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211006121332.58788-1-hans.potsch@nokia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-30EDAC/dmc520: Assign the proper type to dimm->edac_modeBorislav Petkov1-1/+1
commit 54607282fae6148641a08d81a6e0953b541249c7 upstream. dimm->edac_mode contains values of type enum edac_type - not the corresponding capability flags. Fix that. Fixes: 1088750d7839 ("EDAC: Add EDAC driver for DMC520") Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20210916085258.7544-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-30EDAC/synopsys: Fix wrong value type assignment for edac_modeSai Krishna Potthuri1-1/+1
commit 5297cfa6bdf93e3889f78f9b482e2a595a376083 upstream. dimm->edac_mode contains values of type enum edac_type - not the corresponding capability flags. Fix that. Issue caught by Coverity check "enumerated type mixed with another type." [ bp: Rewrite commit message, add tags. ] Fixes: ae9b56e3996d ("EDAC, synps: Add EDAC support for zynq ddr ecc controller") Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20210818072315.15149-1-shubhrajyoti.datta@xilinx.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15EDAC/i10nm: Fix NVDIMM detectionQiuxu Zhuo1-3/+3
[ Upstream commit 2294a7299f5e51667b841f63c6d69474491753fb ] MCDDRCFG is a per-channel register and uses bit{0,1} to indicate the NVDIMM presence on DIMM slot{0,1}. Current i10nm_edac driver wrongly uses MCDDRCFG as per-DIMM register and fails to detect the NVDIMM. Fix it by reading MCDDRCFG as per-channel register and using its bit{0,1} to check whether the NVDIMM is populated on DIMM slot{0,1}. Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Reported-by: Fan Du <fan.du@intel.com> Tested-by: Wen Jin <wen.jin@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210818175701.1611513-2-tony.luck@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15EDAC/mce_amd: Do not load edac_mce_amd module on guestsSmita Koralahalli1-0/+3
[ Upstream commit 767f4b620edadac579c9b8b6660761d4285fa6f9 ] Hypervisors likely do not expose the SMCA feature to the guest and loading this module leads to false warnings. This module should not be loaded in guests to begin with, but people tend to do so, especially when testing kernels in VMs. And then they complain about those false warnings. Do the practical thing and do not load this module when running as a guest to avoid all that complaining. [ bp: Rewrite commit message. ] Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lkml.kernel.org/r/20210628172740.245689-1-Smita.KoralahalliChannabasappa@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14EDAC/Intel: Do not load EDAC driver when running as a guestLuck, Tony4-0/+12
[ Upstream commit f0a029fff4a50eb01648810a77ba1873e829fdd4 ] There's little to no point in loading an EDAC driver running in a guest: 1) The CPU model reported by CPUID may not represent actual h/w 2) The hypervisor likely does not pass in access to memory controller devices 3) Hypervisors generally do not pass corrected error details to guests Add a check in each of the Intel EDAC drivers for X86_FEATURE_HYPERVISOR and simply return -ENODEV in the init routine. Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210615174419.GA1087688@agluck-desk2.amr.corp.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14EDAC/ti: Add missing MODULE_DEVICE_TABLEBixuan Cui1-0/+1
[ Upstream commit 0a37f32ba5272b2d4ec8c8d0f6b212b81b578f7e ] The module misses MODULE_DEVICE_TABLE() for of_device_id tables and thus never autoloads on ID matches. Add the missing declaration. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Tero Kristo <kristo@kernel.org> Link: https://lkml.kernel.org/r/20210512033727.26701-1-cuibixuan@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07EDAC/amd64: Do not load on family 0x15, model 0x13Borislav Petkov1-3/+7
[ Upstream commit 6c13d7ff81e6d2f01f62ccbfa49d1b8d87f274d0 ] Those were only laptops and are very very unlikely to have ECC memory. Currently, when the driver attempts to load, it issues: EDAC amd64: Error: F1 not found: device 0x1601 (broken BIOS?) because the PCI device is the wrong one (it uses the F15h default one). So do not load the driver on them as that is pointless. Reported-by: Don Curtis <bugrprt21882@online.de> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Don Curtis <bugrprt21882@online.de> Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1179763 Link: https://lkml.kernel.org/r/20201218160622.20146-1-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30EDAC/amd64: Fix PCI component registrationBorislav Petkov1-12/+14
commit 706657b1febf446a9ba37dc51b89f46604f57ee9 upstream. In order to setup its PCI component, the driver needs any node private instance in order to get a reference to the PCI device and hand that into edac_pci_create_generic_ctl(). For convenience, it uses the 0th memory controller descriptor under the assumption that if any, the 0th will be always present. However, this assumption goes wrong when the 0th node doesn't have memory and the driver doesn't initialize an instance for it: EDAC amd64: F17h detected (node 0). ... EDAC amd64: Node 0: No DIMMs detected. But looking up node instances is not really needed - all one needs is the pointer to the proper device which gets discovered during instance init. So stash that pointer into a variable and use it when setting up the EDAC PCI component. Clear that variable when the driver needs to unwind due to some instances failing init to avoid any registration imbalance. Cc: <stable@vger.kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201122150815.13808-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30EDAC/i10nm: Use readl() to access MMIO registersQiuxu Zhuo1-4/+7
commit 83ff51c4e3fecf6b8587ce4d46f6eac59f5d7c5a upstream. Instead of raw access, use readl() to access MMIO registers of memory controller to avoid possible compiler re-ordering. Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeIdYazen Ghannam1-1/+1
[ Upstream commit 8de0c9917cc1297bc5543b61992d5bdee4ce621a ] The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and later systems. This function is used in amd64_edac_mod to do system-specific decoding for DRAM ECC errors. The function takes a "NodeId" as a parameter. In AMD documentation, NodeId is used to identify a physical die in a system. This can be used to identify a node in the AMD_NB code and also it is used with umc_normaddr_to_sysaddr(). However, the input used for decode_dram_ecc() is currently the NUMA node of a logical CPU. In the default configuration, the NUMA node and physical die will be equivalent, so this doesn't have an impact. But the NUMA node configuration can be adjusted with optional memory interleaving modes. This will cause the NUMA node enumeration to not match the physical die enumeration. The mismatch will cause the address translation function to fail or report incorrect results. Use struct cpuinfo_x86.cpu_die_id for the node_id parameter to ensure the physical ID is used. Fixes: fbe63acf62f5 ("EDAC, mce_amd: Use cpu_to_node() to find the node ID") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-4-Yazen.Ghannam@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-12Merge tag 'efi-core-2020-10-12' of ↵Linus Torvalds1-2/+15
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI changes from Ingo Molnar: - Preliminary RISC-V enablement - the bulk of it will arrive via the RISCV tree. - Relax decompressed image placement rules for 32-bit ARM - Add support for passing MOK certificate table contents via a config table rather than a EFI variable. - Add support for 18 bit DIMM row IDs in the CPER records. - Work around broken Dell firmware that passes the entire Boot#### variable contents as the command line - Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we can identify it in the memory map listings. - Don't abort the boot on arm64 if the EFI RNG protocol is available but returns with an error - Replace slashes with exclamation marks in efivarfs file names - Split efi-pstore from the deprecated efivars sysfs code, so we can disable the latter on !x86. - Misc fixes, cleanups and updates. * tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits) efi: mokvar: add missing include of asm/early_ioremap.h efi: efivars: limit availability to X86 builds efi: remove some false dependencies on CONFIG_EFI_VARS efi: gsmi: fix false dependency on CONFIG_EFI_VARS efi: efivars: un-export efivars_sysfs_init() efi: pstore: move workqueue handling out of efivars efi: pstore: disentangle from deprecated efivars module efi: mokvar-table: fix some issues in new code efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure efivarfs: Replace invalid slashes with exclamation marks in dentries. efi: Delete deprecated parameter comments efi/libstub: Fix missing-prototypes in string.c efi: Add definition of EFI_MEMORY_CPU_CRYPTO and ability to report it cper,edac,efi: Memory Error Record: bank group/address and chip id edac,ghes,cper: Add Row Extension to Memory Error Record efi/x86: Add a quirk to support command line arguments on Dell EFI firmware efi/libstub: Add efi_warn and *_once logging helpers integrity: Load certs from the EFI MOK config table integrity: Move import of MokListRT certs to a separate routine efi: Support for MOK variable config table ...
2020-10-12Merge tag 'ras_updates_for_v5.10' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. - Misc fixes, improvements and cleanups, as always. * tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype
2020-10-12Merge tag 'edac_updates_for_v5.10' of ↵Linus Torvalds17-50/+420
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add Amazon's Annapurna Labs memory controller EDAC driver (Talel Shenhar) - New AMD CPUs support (Yazen Ghannam) - The usual misc fixes and cleanups all over the subsystem * tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location EDAC/aspeed: Use module_platform_driver() to simplify EDAC, sb_edac: Simplify switch statement EDAC/ti: Fix handling of platform_get_irq() error EDAC/aspeed: Fix handling of platform_get_irq() error EDAC/i5100: Fix error handling order in i5100_init_one() EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara EDAC/socfpga: Transfer SoCFPGA EDAC maintainership EDAC/thunderx: Make symbol lmc_dfs_ents static EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding EDAC/mce_amd: Add new error descriptions for existing types EDAC: Replace HTTP links with HTTPS ones
2020-10-12Merge branch 'edac-drivers' into edac-updates-for-v5.10Borislav Petkov3-0/+362
Signed-off-by: Borislav Petkov <bp@suse.de>
2020-10-09EDAC/amd64: Set proper family type for Family 19h Models 20h-2FhYazen Ghannam1-0/+6
AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models 70h-7Fh. The same family ops and number of channels also apply. Use the Family17h Model 70h family_type and ops for Family 19h Models 20h-2Fh. Update the controller name to match the system. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com
2020-09-18EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_locationXiongfeng Wang1-5/+17
Reading those sysfs entries gives: [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/max_location memory 3 [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/dimm0/dimm_location memory 0 [root@localhost /]# Add newlines after the value it prints for better readability. [ bp: Make len a signed int and change the check to catch wraparound. Increment the pointer p only when the length check passes. Use scnprintf(). ] Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/1600051734-8993-1-git-send-email-wangxiongfeng2@huawei.com
2020-09-18EDAC/aspeed: Use module_platform_driver() to simplifyLiu Shixin1-17/+1
Use module_platform_driver() which makes the code simpler by eliminating boilerplate code. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Joel Stanley <joel@jms.id.au> Link: https://lkml.kernel.org/r/20200914065358.3726216-1-liushixin2@huawei.com
2020-09-17cper,edac,efi: Memory Error Record: bank group/address and chip idAlex Kluver1-0/+9
Updates to the UEFI 2.8 Memory Error Record allow splitting the bank field into bank address and bank group, and using the last 3 bits of the extended field as a chip identifier. When needed, print correct version of bank field, bank group, and chip identification. Based on UEFI 2.8 Table 299. Memory Error Record. Signed-off-by: Alex Kluver <alex.kluver@hpe.com> Reviewed-by: Russ Anderson <russ.anderson@hpe.com> Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200819143544.155096-3-alex.kluver@hpe.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-09-17edac,ghes,cper: Add Row Extension to Memory Error RecordAlex Kluver1-2/+6
Memory errors could be printed with incorrect row values since the DIMM size has outgrown the 16 bit row field in the CPER structure. UEFI Specification Version 2.8 has increased the size of row by allowing it to use the first 2 bits from a previously reserved space within the structure. When needed, add the extension bits to the row value printed. Based on UEFI 2.8 Table 299. Memory Error Record Signed-off-by: Alex Kluver <alex.kluver@hpe.com> Tested-by: Russ Anderson <russ.anderson@hpe.com> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200819143544.155096-2-alex.kluver@hpe.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-09-15EDAC/ghes: Check whether the driver is on the safe list correctlyBorislav Petkov1-0/+4
With CONFIG_DEBUG_TEST_DRIVER_REMOVE=y, a system would try to probe, unregister and probe again a driver. When ghes_edac is attempted to be loaded on a system which is not on the safe platforms list, ghes_edac_register() would return early. The unregister counterpart ghes_edac_unregister() would still attempt to unregister and exit early at the refcount test, leading to the refcount underflow below. In order to not do *anything* on the unregister path too, reuse the force_load parameter and check it on that path too, before fumbling with the refcount. ghes_edac: ghes_edac_register: entry ghes_edac: ghes_edac_register: return -ENODEV ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 10 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xb9/0x100 Modules linked in: CPU: 10 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #12 Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018 RIP: 0010:refcount_warn_saturate+0xb9/0x100 Code: 82 e8 fb 8f 4d 00 90 0f 0b 90 90 c3 80 3d 55 4c f5 00 00 75 88 c6 05 4c 4c f5 00 01 90 48 c7 c7 d0 8a 10 82 e8 d8 8f 4d 00 90 <0f> 0b 90 90 c3 80 3d 30 4c f5 00 00 0f 85 61 ff ff ff c6 05 23 4c RSP: 0018:ffffc90000037d58 EFLAGS: 00010292 RAX: 0000000000000026 RBX: ffff88840b8da000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff8216b24f RDI: 00000000ffffffff RBP: ffff88840c662e00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88840ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000800002211000 CR4: 00000000003506e0 Call Trace: ghes_edac_unregister ghes_remove platform_drv_remove really_probe driver_probe_device device_driver_attach __driver_attach ? device_driver_attach ? device_driver_attach bus_for_each_dev bus_add_driver driver_register ? bert_init ghes_init do_one_initcall ? rcu_read_lock_sched_held kernel_init_freeable ? rest_init kernel_init ret_from_fork ... ghes_edac: ghes_edac_unregister: FALSE, refcount: -1073741824 Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200911164950.GB19320@zn.tnic
2020-09-15EDAC/ghes: Clear scanned data on unloadBorislav Petkov1-0/+1
Commit b972fdba8665 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()") didn't clear all the information from the scanned system and, more specifically, left ghes_hw.num_dimms to its previous value. On a second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use the leftover num_dimms value which is not 0 and thus the 0 check in enumerate_dimms() will get bypassed and it would go directly to the pointer deref: d = &hw->dimms[hw->num_dimms]; which is, of course, NULL: #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7 Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018 RIP: 0010:enumerate_dimms.cold+0x7b/0x375 Reset the whole ghes_hw on driver unregister so that no stale values are used on a second system scan. Fixes: b972fdba8665 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()") Cc: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic
2020-09-08EDAC, sb_edac: Simplify switch statementTom Rix1-4/+1
clang static analyzer reports this problem sb_edac.c:959:2: warning: Undefined or garbage value returned to caller return type; ^~~~~~~~~~~ This is a false positive. However by initializing the type to DEV_UNKNOWN the 3 case can be removed from the switch, saving a comparison and jump. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200907153225.7294-1-trix@redhat.com
2020-09-01EDAC/ti: Fix handling of platform_get_irq() errorKrzysztof Kozlowski1-1/+2
platform_get_irq() returns a negative error number on error. In such a case, comparison to 0 would pass the check therefore check the return value properly, whether it is negative. [ bp: Massage commit message. ] Fixes: 86a18ee21e5e ("EDAC, ti: Add support for TI keystone and DRA7xx EDAC") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tero Kristo <t-kristo@ti.com> Link: https://lkml.kernel.org/r/20200827070743.26628-2-krzk@kernel.org
2020-09-01EDAC/aspeed: Fix handling of platform_get_irq() errorKrzysztof Kozlowski1-2/+2
platform_get_irq() returns a negative error number on error. In such a case, comparison to 0 would pass the check therefore check the return value properly, whether it is negative. [ bp: Massage commit message. ] Fixes: 9b7e6242ee4e ("EDAC, aspeed: Add an Aspeed AST2500 EDAC driver") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Stefan Schaeckeler <schaecsn@gmx.net> Link: https://lkml.kernel.org/r/20200827070743.26628-1-krzk@kernel.org
2020-09-01EDAC/i5100: Fix error handling order in i5100_init_one()Dinghao Liu1-6/+5
When pci_get_device_func() fails, the driver doesn't need to execute pci_dev_put(). mci should still be freed, though, to prevent a memory leak. When pci_enable_device() fails, the error injection PCI device "einj" doesn't need to be disabled either. [ bp: Massage commit message, rename label to "bail_mc_free". ] Fixes: 52608ba205461 ("i5100_edac: probe for device 19 function 0") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200826121437.31606-1-dinghao.liu@zju.edu.cn
2020-08-30Merge tag 'edac_urgent_for_v5.9_rc3' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fix from Borislav Petkov: "A fix to properly clear ghes_edac driver state on driver remove so that a subsequent load can probe the system properly (Shiju Jose)" * tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()
2020-08-27EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()Shiju Jose1-4/+6
After b9cae27728d1 ("EDAC/ghes: Scan the system once on driver init") and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes a NULL pointer after the second ->probe() (aka ghes_edac_register()) which the config option causes to be called. This happens because the static variable which holds down whether the system has been scanned already, doesn't get reset in ghes_edac_unregister(). Then, on the second probe, ghes_scan_system() doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining NULL. Clear the variable and rename it to something more descriptive so that a second probe succeeds. [ bp: Rewrite commit message. ] Fixes: b9cae27728d1 ("EDAC/ghes: Scan the system once on driver init") Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva2-2/+2
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-20x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmapYazen Ghannam1-3/+1
The Extended Error Code Bitmap (xec_bitmap) for a Scalable MCA bank type was intended to be used by the kernel to filter out invalid error codes on a system. However, this is unnecessary after a few product releases because the hardware will only report valid error codes. Thus, there's no need for it with future systems. Remove the xec_bitmap field and all references to it. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200720145353.43924-1-Yazen.Ghannam@amd.com
2020-08-18EDAC/{i7core,sb,pnd2,skx}: Fix error event severityTony Luck4-7/+7
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto the stack as part of machine check delivery is valid or not. Various drivers copied a code fragment that uses the RIPV bit to determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where RIPV is set as "FATAL"). Reverse the tests so that the error is marked fatal when RIPV is not set. Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
2020-08-17EDAC/thunderx: Make symbol lmc_dfs_ents staticWei Yongjun1-1/+1
Symbol 'lmc_dfs_ents' is not used outside of thunderx_edac.c, so make it static: drivers/edac/thunderx_edac.c:457:22: warning: symbol 'lmc_dfs_ents' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Robert Richter <rrichter@marvell.com> Link: https://lkml.kernel.org/r/20200714142308.46612-1-weiyongjun1@huawei.com
2020-08-17EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driverTalel Shenhar3-0/+362
The Amazon's Annapurna Labs Memory Controller EDAC supports ECC capability for error detection and correction (Single bit error correction, Double detection). This driver introduces EDAC driver for that capability. [ bp: Remove "EDAC" string from Kconfig tristate as it is redundant. ] Signed-off-by: Talel Shenhar <talel@amazon.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Link: https://lkml.kernel.org/r/20200816185551.19108-3-talel@amazon.com
2020-08-17EDAC/mce_amd: Add new error descriptions for existing typesYazen Ghannam1-1/+10
A few existing MCA bank types will have new error types in future SMCA systems. Add the descriptions for the new error types. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200708153515.1911642-1-Yazen.Ghannam@amd.com
2020-08-17EDAC: Replace HTTP links with HTTPS onesAlexander A. Klimov8-13/+13
Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`: If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. [ bp: Merge all EDAC patches into a single one. ] Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tero Kristo <t-kristo@ti.com> # ti_edac Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de
2020-08-15Merge tag 'edac_updates_for_5.9_pt2' of ↵Linus Torvalds1-3/+47
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull edac fix from Tony Luck: "Fix for the ie31200 driver that missed the first pull" * tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/ie31200: Fallback if host bridge device is already initialized
2020-08-10EDAC/ie31200: Fallback if host bridge device is already initializedJason Baron1-3/+47
The Intel uncore driver may claim some of the pci ids from ie31200 which means that the ie31200 edac driver will not initialize them as part of pci_register_driver(). Let's add a fallback for this case to 'pci_get_device()' to get a reference on the device such that it can still be configured. This is similar in approach to other edac drivers. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Borislav Petkov <bp@suse.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com
2020-08-03Merge tag 'edac_updates_for_5.9' of ↵Linus Torvalds7-141/+204
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Tony Luck: "Boris is on vacation and aske me to send you the EDAC changes" * tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC: Fix reference count leaks EDAC: Remove edac_get_dimm_by_index() EDAC/ghes: Scan the system once on driver init EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt EDAC/ghes: Setup DIMM label from DMI and use it in error reports EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations EDAC/mc: Call edac_inc_ue_error() before panic EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier
2020-08-03Merge tag 'ras-core-2020-08-03' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Ingo Molnar: "Boris is on vacation and he asked us to send you the pending RAS bits: - Print the PPIN field on CPUs that fill them out - Fix an MCE injection bug - Simplify a kzalloc in dev_mcelog_init_device()" * tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce, EDAC/mce_amd: Print PPIN in machine check records x86/mce/dev-mcelog: Use struct_size() helper in kzalloc() x86/mce/inject: Fix a wrong assignment of i_mce.status
2020-06-23x86/mce, EDAC/mce_amd: Print PPIN in machine check recordsSmita Koralahalli1-0/+3
Print the Protected Processor Identification Number (PPIN) on processors which support it. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com
2020-06-22Merge branch 'edac-ghes' into edac-for-nextBorislav Petkov1-130/+193
2020-06-18EDAC/amd64: Read back the scrub rate PCI register on F15hBorislav Petkov1-0/+2
Commit: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") added support for F15h, model 0x60 CPUs but in doing so, missed to read back SCRCTRL PCI config register on F15h CPUs which are *not* model 0x60. Add that read so that doing $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate can show the previously set DRAM scrub rate. Fixes: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") Reported-by: Anders Andersson <pipatron@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> #v4.4.. Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
2020-06-17EDAC: Fix reference count leaksQiushi Wu2-1/+2
When kobject_init_and_add() returns an error, it should be handled because kobject_init_and_add() takes a reference even when it fails. If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Therefore, replace calling kfree() and call kobject_put() and add a missing kobject_put() in the edac_device_register_sysfs_main_kobj() error path. [ bp: Massage and merge into a single patch. ] Fixes: b2ed215a3338 ("Kobject: change drivers/edac to use kobject_init_and_add") Signed-off-by: Qiushi Wu <wu000273@umn.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu
2020-06-16EDAC/ghes: Scan the system once on driver initBorislav Petkov1-113/+166
Change the hardware scanning and figuring out how many DIMMs a machine has to a single, one-time thing which happens once on driver init. After that scanning completes, struct ghes_hw_desc contains a representation of the hardware which the driver can then use for later initialization. Then, copy the DIMM information into the respective EDAC core representation of those. Get rid of ghes_edac_dimm_fill and use a struct dimm_info array directly. This way, hw detection and further driver initialization is nicely and logically split. Further additions should all be added to ghes_scan_system() and the hw representation extended as needed. There should be no functionality change resulting from this patch. Signed-off-by: Borislav Petkov <bp@suse.de>