linux.git/drivers/ras, branch v6.12.80

APEI/GHES: ARM processor Error: don't go past allocated memory

2026-03-04T12:20:54+00:00

[ Upstream commit 87880af2d24e62a84ed19943dbdd524f097172f2 ]

If the BIOS generates a very small ARM Processor Error, or
an incomplete one, the current logic will fail when dereferencing

	err->section_length
and
	ctx_info->size

Add checks to avoid that. With these checks in place, such GHESv2
records won't cause Oopses like this:

[    1.492129] Internal error: Oops: 0000000096000005 [#1]  SMP
[    1.495449] Modules linked in:
[    1.495820] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.18.0-rc1-00017-gabadcc3553dd-dirty #18 PREEMPT
[    1.496125] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 02/02/2022
[    1.496433] Workqueue: kacpi_notify acpi_os_execute_deferred
[    1.496967] pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[    1.497199] pc : log_arm_hw_error+0x5c/0x200
[    1.497380] lr : ghes_handle_arm_hw_error+0x94/0x220

0xffff8000811c5324 is in log_arm_hw_error (../drivers/ras/ras.c:75).
70		err_info = (struct cper_arm_err_info *)(err + 1);
71		ctx_info = (struct cper_arm_ctx_info *)(err_info + err->err_info_num);
72		ctx_err = (u8 *)ctx_info;
73
74		for (n = 0; n < err->context_info_num; n++) {
75			sz = sizeof(struct cper_arm_ctx_info) + ctx_info->size;
76			ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + sz);
77			ctx_len += sz;
78		}
79

and similar ones while trying to access section_length on an
error dump that is too small.
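
For illustration only, the kind of check described above can be sketched
in plain userspace C. This is not the code from this patch: the struct
layouts are simplified stand-ins for the CPER ARM structures, and
ctx_bytes_checked() and ERR_INFO_SIZE are hypothetical names. The idea
is to validate the header, section_length and every ctx size against
the bytes actually available before using them:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the CPER ARM structures. */
struct arm_err_hdr {
	uint32_t validation_bits;
	uint16_t err_info_num;
	uint16_t context_info_num;
	uint32_t section_length;
};

struct arm_ctx_hdr {
	uint16_t version;
	uint16_t type;
	uint32_t size;		/* bytes of register data after this header */
};

#define ERR_INFO_SIZE 32u	/* assumed fixed size of one error info entry */

/*
 * Walk the context entries of a record held in buf[0..len).
 * Returns the number of bytes they occupy, or -1 if the record is
 * truncated or internally inconsistent.
 */
static long ctx_bytes_checked(const uint8_t *buf, size_t len)
{
	struct arm_err_hdr hdr;
	struct arm_ctx_hdr ctx;
	size_t off, end, ctx_start;
	unsigned int n;

	/* The header itself must fit before section_length is read. */
	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));

	/* section_length must cover the header and fit in the buffer. */
	if (hdr.section_length < sizeof(hdr) || hdr.section_length > len)
		return -1;
	end = hdr.section_length;

	off = sizeof(hdr) + (size_t)hdr.err_info_num * ERR_INFO_SIZE;
	if (off > end)
		return -1;
	ctx_start = off;

	for (n = 0; n < hdr.context_info_num; n++) {
		/* The ctx header must fit before ctx.size is read. */
		if (end - off < sizeof(ctx))
			return -1;
		memcpy(&ctx, buf + off, sizeof(ctx));
		off += sizeof(ctx);

		/* The register data must fit as well. */
		if (ctx.size > end - off)
			return -1;
		off += ctx.size;
	}

	return (long)(off - ctx_start);
}
```

Comparing against the remaining length (end - off) rather than adding
first keeps the checks free of integer overflow.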

Signed-off-by: Mauro Carvalho Chehab 
Reviewed-by: Jonathan Cameron 
Acked-by: Ard Biesheuvel 
Reviewed-by: Hanjun Guo 
[ rjw: Subject tweaks ]
Link: https://patch.msgid.link/7fd9f38413be05ee2d7cfdb0dc31ea2274cf1a54.1767871950.git.mchehab+huawei@kernel.org
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Sasha Levin

RAS: Report all ARM processor CPER information to userspace

2025-12-18T12:55:04+00:00

[ Upstream commit 05954511b73e748d0370549ad9dd9cd95297d97a ]

The ARM processor CPER record was added in UEFI v2.6 and remained
unchanged up to v2.10.

Yet, the original arm_event trace code added by

  e9279e83ad1f ("trace, ras: add ARM processor error trace event")

is incomplete, as it only traces some fields of UEFI 2.6 table N.16, not
exporting any information from tables N.17 to N.29 of the record.

This is not enough for the user to be able to figure out exactly what
has happened or to take appropriate action.

According to the UEFI v2.9 specification chapter N.2.4.4, the ARM
processor error section includes:

- several (ERR_INFO_NUM) ARM processor error information structures
  (Tables N.17 to N.20);
- several (CONTEXT_INFO_NUM) ARM processor context information
  structures (Tables N.21 to N.29);
- several vendor specific error information structures. The
  size is given by Section Length minus the size of the other
  fields.

In addition, the trace event exports two fields that the GHES driver
parses when firmware reports them:

- error severity
- CPU logical index

Report all of this information to userspace via the ARM tracepoint so
that userspace can properly record the error and take decisions related
to CPU core isolation according to error severity and other info.

The updated ARM trace event now contains the following fields:

======================================  =============================
UEFI field on table N.16                ARM Processor trace fields
======================================  =============================
Validation                              handled when filling data for
                                        affinity MPIDR and running
                                        state.
ERR_INFO_NUM                            pei_len
CONTEXT_INFO_NUM                        ctx_len
Section Length                          indirectly reported by
                                        pei_len, ctx_len and oem_len
Error affinity level                    affinity
MPIDR_EL1                               mpidr
MIDR_EL1                                midr
Running State                           running_state
PSCI State                              psci_state
Processor Error Information Structure   pei_err - count at pei_len
Processor Context                       ctx_err - count at ctx_len
Vendor Specific Error Info              oem - count at oem_len
======================================  =============================

It should be noted that decoding of tables N.17 to N.29, if needed, will
be handled in userspace. That gives more flexibility, as there won't be
any need to flood the kernel with micro-architecture specific error
decoding.

Also, decoding the other fields requires complex logic, and should be
done for each of the several values inside the record field.  So, let
userspace daemons like rasdaemon decode them, parsing such tables and
providing vendor-specific and micro-architecture-specific decoders.
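
As a rough illustration of how the vendor-specific length relates to the
other fields ("Section Length minus the size of the other fields"), here
is a minimal userspace sketch. The struct and function names are
hypothetical, and hdr_len stands for the size of the fixed table N.16
header, which the caller is assumed to supply:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one ARM processor error section. */
struct arm_section_layout {
	uint32_t section_length;	/* Section Length from table N.16 */
	uint32_t hdr_len;		/* size of the fixed header */
	uint32_t pei_len;		/* bytes of error information structures */
	uint32_t ctx_len;		/* bytes of context information structures */
};

/*
 * Compute the vendor-specific length as the remainder of the section.
 * Returns 0 and stores the result in *oem_len, or -1 if the other
 * fields already exceed section_length.
 */
static int vendor_len(const struct arm_section_layout *l, uint32_t *oem_len)
{
	/* Sum in 64 bits so three 32-bit values cannot wrap around. */
	uint64_t used = (uint64_t)l->hdr_len + l->pei_len + l->ctx_len;

	if (used > l->section_length)
		return -1;

	*oem_len = l->section_length - (uint32_t)used;
	return 0;
}
```

A section of 100 bytes with a 40 byte header, 32 bytes of error
information and 16 bytes of context would leave 12 bytes of
vendor-specific data under these assumptions.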

 [mchehab: modified description, solved merge conflicts and fixed coding style]

Signed-off-by: Jason Tian 
Co-developed-by: Shengwei Luo 
Signed-off-by: Shengwei Luo 
Signed-off-by: Mauro Carvalho Chehab 
Signed-off-by: Daniel Ferguson  # rebased
Reviewed-by: Jonathan Cameron 
Tested-by: Shiju Jose 
Acked-by: Borislav Petkov (AMD) 
Fixes: e9279e83ad1f ("trace, ras: add ARM processor error trace event")
Link: https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-section
Signed-off-by: Ard Biesheuvel 
Signed-off-by: Sasha Levin

RAS/AMD/FMPM: Get masked address

2025-04-25T08:47:56+00:00

commit 58029c39cdc54ac4f4dc40b4a9c05eed9f9b808a upstream.

Some operations require checking, or ignoring, specific bits in an address
value. For example, this can be comparing address values to identify unique
structures.

Currently, the full address value is compared when filtering for duplicates.
This results in overcounting and the creation of extra records, which gives
the impression that more unique events occurred than actually did.

Mask the address for physical rows on MI300.
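
A minimal sketch of the idea, in userspace C. The mask below is a
made-up example and does not reflect the real MI300 address layout;
masked_addr() and same_row() are hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical: bits [9:2] vary within one physical row.
 * This is NOT the real MI300 layout, only an example mask.
 */
#define ROW_VARIABLE_BITS 0x3FCULL

/* Clear the bits that do not identify the physical row. */
static uint64_t masked_addr(uint64_t addr)
{
	return addr & ~ROW_VARIABLE_BITS;
}

/* Two error addresses are duplicates if they fall in the same row. */
static bool same_row(uint64_t a, uint64_t b)
{
	return masked_addr(a) == masked_addr(b);
}
```

With this mask, 0x1000 and 0x1004 compare as the same row even though
the full address values differ, so only one record would be kept.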

  [ bp: Simplify. ]

Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

RAS/AMD/ATL: Include row[13] bit in row retirement

2025-04-25T08:47:56+00:00

commit 6c44e5354d4d16d9d891a419ca3f57abfe18ce7a upstream.

Based on feedback from hardware folks, row[13] is part of the variable
bits within a physical row (along with all column bits).

Only half the physical addresses affected by a row are calculated if
this bit is not included.

Add the row[13] bit to the row retirement flow.
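
To see why one missing bit halves the result, consider a generic sketch
that expands a base address over every combination of a set of variable
bits. The function name and the bit positions in the note below are
illustrative assumptions, not the ATL implementation:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store base combined with every subset of var_mask into out[], up to
 * max entries. Returns the total number of combinations, which is
 * 2^(number of bits set in var_mask) and may exceed max.
 */
static size_t expand_row(uint64_t base, uint64_t var_mask,
			 uint64_t *out, size_t max)
{
	uint64_t sub = 0;
	size_t n = 0;

	base &= ~var_mask;
	do {
		if (n < max)
			out[n] = base | sub;
		n++;
		/* Step to the next subset of var_mask; wraps to 0 at the end. */
		sub = (sub - var_mask) & var_mask;
	} while (sub != 0);

	return n;
}
```

With two variable bits (mask 0x6) this yields 4 addresses. Adding one
more bit, for example bit 13 (mask 0x2006), yields 8, which is the
doubling the message describes.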

Fixes: 3b566b30b414 ("RAS/AMD/ATL: Add MI300 row retirement support")
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250401-fix-fmpm-extra-records-v1-1-840bcf7a8ac5@amd.com
Signed-off-by: Greg Kroah-Hartman

RAS/AMD/ATL: Translate normalized to system physical addresses using PRM

2024-08-01T12:36:29+00:00

AMD Zen-based systems report memory error addresses through machine
check banks representing Unified Memory Controllers (UMCs) in the form
of UMC relative "normalized" addresses. A normalized address must be
converted to a system physical address to be usable by the OS.

Future AMD platforms will provide a UEFI PRM module that implements a
number of address translation PRM handlers. This will provide an
interface for the OS to call platform specific code without requiring
the use of SMM or other heavy firmware operations.

Add support for the normalized to system physical address translation
PRM handler in the AMD Address Translation Library and prefer it over
native code if available. The GUID and parameter buffer structure are
specific to the normalized to system physical address handler provided
by the address translation PRM module included in future AMD systems.

The address translation PRM module is documented in chapter 22 of the
publicly available "AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
ACPI v6.5 Porting Guide".
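
The "prefer PRM, otherwise use native code" selection can be sketched as
follows. This is a userspace illustration with hypothetical names; the
demo_* functions are stand-ins and do not perform real translation. It
assumes an unavailable handler reports -ENODEV and that only this case
falls back to native code:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical translator: returns 0 and fills *sys_addr, or -errno. */
typedef int (*xlate_fn)(uint64_t norm_addr, uint64_t *sys_addr);

/* Try the PRM-backed translator first, then the native one. */
static int xlate_prefer_prm(xlate_fn prm, xlate_fn native,
			    uint64_t norm_addr, uint64_t *sys_addr)
{
	int ret = prm ? prm(norm_addr, sys_addr) : -ENODEV;

	/* Assumption: only "handler not available" triggers the fallback. */
	if (ret == -ENODEV && native)
		return native(norm_addr, sys_addr);

	return ret;
}

/* Stand-in translators for demonstration only. */
static int demo_prm_ok(uint64_t norm_addr, uint64_t *sys_addr)
{
	*sys_addr = norm_addr + 0x1000;
	return 0;
}

static int demo_prm_absent(uint64_t norm_addr, uint64_t *sys_addr)
{
	(void)norm_addr;
	(void)sys_addr;
	return -ENODEV;
}

static int demo_native(uint64_t norm_addr, uint64_t *sys_addr)
{
	*sys_addr = norm_addr + 0x2000;
	return 0;
}
```

Whether other PRM failures should also fall back to native code is a
policy choice this sketch does not try to settle.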

  [ bp: Massage commit message. ]

Signed-off-by: John Allen 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/20240730151731.15363-3-john.allen@amd.com

Merge tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

2024-07-16T01:20:24+00:00

Pull EDAC updates from Borislav Petkov:

 - The AMD memory controllers data fabric version 4.5 supports
   non-power-of-2 denormalization in the sense that certain bits of the
   system physical address cannot be reconstructed from the normalized
   address reported by the RAS hardware. Add support for handling such
   addresses

 - Switch the EDAC drivers to the new Intel CPU model defines

 - The usual fixes and cleanups all over the place

* tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC: Add missing MODULE_DESCRIPTION() macros
  EDAC/dmc520: Use devm_platform_ioremap_resource()
  EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support
  RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
  RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
  RAS/AMD/ATL: Validate address map when information is gathered
  RAS/AMD/ATL: Expand helpers for adding and removing base and hole
  RAS/AMD/ATL: Read DRAM hole base early
  RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
  RAS/AMD/ATL: Add a missing module description
  EDAC, i10nm: make skx_common.o a separate module
  EDAC/skx: Switch to new Intel CPU model defines
  EDAC/sb_edac: Switch to new Intel CPU model defines
  EDAC, pnd2: Switch to new Intel CPU model defines
  EDAC/i10nm: Switch to new Intel CPU model defines
  EDAC/ghes: Add missing newline to pr_info() statement
  RAS/AMD/ATL: Add missing newline to pr_info() statement
  EDAC/thunderx: Remove unused struct error_syndrome

Merge remote-tracking branches 'ras/edac-amd-atl' and 'ras/edac-misc' into edac-updates

2024-07-15T09:59:10+00:00

* ras/edac-amd-atl:
  RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
  RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
  RAS/AMD/ATL: Validate address map when information is gathered
  RAS/AMD/ATL: Expand helpers for adding and removing base and hole
  RAS/AMD/ATL: Read DRAM hole base early
  RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
  RAS/AMD/ATL: Add a missing module description

* ras/edac-misc:
  EDAC: Add missing MODULE_DESCRIPTION() macros
  EDAC/dmc520: Use devm_platform_ioremap_resource()
  EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support
  EDAC, i10nm: make skx_common.o a separate module
  EDAC/skx: Switch to new Intel CPU model defines
  EDAC/sb_edac: Switch to new Intel CPU model defines
  EDAC, pnd2: Switch to new Intel CPU model defines
  EDAC/i10nm: Switch to new Intel CPU model defines
  EDAC/ghes: Add missing newline to pr_info() statement
  RAS/AMD/ATL: Add missing newline to pr_info() statement
  EDAC/thunderx: Remove unused struct error_syndrome

Signed-off-by: Borislav Petkov (AMD)

RAS/AMD/ATL: Use system settings for MI300 DRAM to normalized address translation

2024-06-16T09:22:57+00:00

The currently used normalized address format is not applicable to all
MI300 systems. This leads to incorrect results during address
translation.

Drop the fixed layout and construct the normalized address from system
settings.

Fixes: 87a612375307 ("RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support")
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Cc: 
Link: https://lore.kernel.org/r/20240607-mi300-dram-xl-fix-v1-2-2f11547a178c@amd.com

RAS/AMD/ATL: Fix MI300 bank hash

2024-06-10T05:56:33+00:00

Apply the SID bits to the correct offset in the Bank value. Do this in
the temporary value so they don't need to be masked off later.

Fixes: 87a612375307 ("RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support")
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Cc: 
Link: https://lore.kernel.org/r/20240607-mi300-dram-xl-fix-v1-1-2f11547a178c@amd.com

RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA

2024-06-09T21:44:05+00:00

Both the AMD ATL and the FMPM driver define INVALID_SPA. Include the
definition from the ATL internal.h header in the FMPM driver.

Signed-off-by: John Allen 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/20240606203313.51197-7-john.allen@amd.com