linux.git/drivers/ras, branch v6.18.21

APEI/GHES: ARM processor Error: don't go past allocated memory

2026-03-04T12:19:35+00:00

[ Upstream commit 87880af2d24e62a84ed19943dbdd524f097172f2 ]

If the BIOS generates a very small ARM Processor Error record, or
an incomplete one, the current logic will fail while dereferencing

	err->section_length
and
	ctx_info->size

Add checks to avoid that. With these checks in place, such GHESv2
records no longer cause Oopses like this:

[    1.492129] Internal error: Oops: 0000000096000005 [#1]  SMP
[    1.495449] Modules linked in:
[    1.495820] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.18.0-rc1-00017-gabadcc3553dd-dirty #18 PREEMPT
[    1.496125] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 02/02/2022
[    1.496433] Workqueue: kacpi_notify acpi_os_execute_deferred
[    1.496967] pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[    1.497199] pc : log_arm_hw_error+0x5c/0x200
[    1.497380] lr : ghes_handle_arm_hw_error+0x94/0x220

0xffff8000811c5324 is in log_arm_hw_error (../drivers/ras/ras.c:75).
70		err_info = (struct cper_arm_err_info *)(err + 1);
71		ctx_info = (struct cper_arm_ctx_info *)(err_info + err->err_info_num);
72		ctx_err = (u8 *)ctx_info;
73
74		for (n = 0; n < err->context_info_num; n++) {
75			sz = sizeof(struct cper_arm_ctx_info) + ctx_info->size;
76			ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + sz);
77			ctx_len += sz;
78		}
79

and similar ones while trying to access section_length on an
error dump with too small size.
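
As an illustration of the kind of check described above, here is a minimal
userspace sketch, not the actual kernel patch. The demo_* structures and
demo_ctx_len() are hypothetical, simplified stand-ins for the CPER ARM
structures and for the loop in log_arm_hw_error(); the point is only that
each length field is validated against the buffer before it is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the CPER ARM structures. */
struct demo_arm_hdr {
	uint32_t validation_bits;
	uint16_t err_info_num;
	uint16_t context_info_num;
	uint32_t section_length;	/* total record length in bytes */
};

struct demo_err_info {
	uint8_t raw[32];
};

struct demo_ctx_hdr {
	uint16_t version;
	uint16_t type;
	uint32_t size;			/* register bytes after this header */
};

/*
 * Walk the context entries of a record stored in buf[0..buf_len).
 * Returns the total context length in bytes, or -1 when the record
 * is truncated or its length fields are inconsistent.
 */
static long demo_ctx_len(const uint8_t *buf, size_t buf_len)
{
	struct demo_arm_hdr hdr;
	size_t off, end;
	long ctx_len = 0;

	assert(buf != NULL);

	/* The header must fit before section_length can be read. */
	if (buf_len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));

	/* The declared length must cover the header and fit the buffer. */
	if (hdr.section_length < sizeof(hdr) || hdr.section_length > buf_len)
		return -1;
	end = hdr.section_length;

	/* Skip the error information structures. */
	off = sizeof(hdr) +
	      (size_t)hdr.err_info_num * sizeof(struct demo_err_info);
	if (off > end)
		return -1;

	for (unsigned int n = 0; n < hdr.context_info_num; n++) {
		struct demo_ctx_hdr ctx;
		size_t sz;

		/* The entry header must fit before ctx.size can be read. */
		if (end - off < sizeof(ctx))
			return -1;
		memcpy(&ctx, buf + off, sizeof(ctx));

		/* The payload must fit in what is left of the record. */
		if (ctx.size > end - off - sizeof(ctx))
			return -1;

		sz = sizeof(ctx) + ctx.size;
		off += sz;
		ctx_len += (long)sz;
	}

	return ctx_len;
}
```

A caller would treat a negative return value as a malformed record and
skip the context data instead of walking past the end of the buffer.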

Signed-off-by: Mauro Carvalho Chehab 
Reviewed-by: Jonathan Cameron 
Acked-by: Ard Biesheuvel 
Reviewed-by: Hanjun Guo 
[ rjw: Subject tweaks ]
Link: https://patch.msgid.link/7fd9f38413be05ee2d7cfdb0dc31ea2274cf1a54.1767871950.git.mchehab+huawei@kernel.org
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Sasha Levin

RAS: Report all ARM processor CPER information to userspace

2025-12-18T13:03:09+00:00

[ Upstream commit 05954511b73e748d0370549ad9dd9cd95297d97a ]

The ARM processor CPER record was added in UEFI v2.6 and remained
unchanged up to v2.10.

Yet, the original arm_event trace code added by

  e9279e83ad1f ("trace, ras: add ARM processor error trace event")

is incomplete, as it only traces some fields of UEFI 2.6 table N.16, not
exporting any information from tables N.17 to N.29 of the record.

This is not enough for the user to be able to figure out what has
exactly happened or to take appropriate action.

According to the UEFI v2.9 specification chapter N.2.4.4, the ARM
processor error section includes:

- several (ERR_INFO_NUM) ARM processor error information structures
  (Tables N.17 to N.20);
- several (CONTEXT_INFO_NUM) ARM processor context information
  structures (Tables N.21 to N.29);
- several vendor specific error information structures. The
  size is given by Section Length minus the size of the other
  fields.

In addition, it also exports two fields that are parsed by the GHES
driver when firmware reports them, namely:

- error severity
- CPU logical index

Report all of this information to userspace via the ARM tracepoint so
that userspace can properly record the error and make decisions about
CPU core isolation according to error severity and other info.

The updated ARM trace event now contains the following fields:

======================================  =============================
UEFI field on table N.16                ARM Processor trace fields
======================================  =============================
Validation                              handled when filling data for
                                        affinity MPIDR and running
                                        state.
ERR_INFO_NUM                            pei_len
CONTEXT_INFO_NUM                        ctx_len
Section Length                          indirectly reported by
                                        pei_len, ctx_len and oem_len
Error affinity level                    affinity
MPIDR_EL1                               mpidr
MIDR_EL1                                midr
Running State                           running_state
PSCI State                              psci_state
Processor Error Information Structure   pei_err - count at pei_len
Processor Context                       ctx_err - count at ctx_len
Vendor Specific Error Info              oem - count at oem_len
======================================  =============================
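
The relation between Section Length and the three reported lengths can be
sketched as below. This is an illustrative userspace sketch, not the code
from the patch: demo_split_lengths() and struct demo_lens are hypothetical
names, and the 40-byte header and 32-byte error information sizes are
assumptions based on a reading of the UEFI table layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, in bytes, of the fixed-size UEFI structures. */
#define DEMO_HDR_LEN		40u	/* table N.16 fixed header */
#define DEMO_ERR_INFO_LEN	32u	/* one error information entry */

/* Hypothetical container for the three lengths in the trace event. */
struct demo_lens {
	uint32_t pei_len;	/* processor error information bytes */
	uint32_t ctx_len;	/* processor context bytes */
	uint32_t oem_len;	/* vendor-specific bytes */
};

/*
 * Derive the three lengths from the header fields. ctx_len is passed
 * in because it depends on the per-entry sizes and has to be summed
 * while walking the context entries.
 * Returns 0 on success, -1 if the pieces exceed section_length.
 */
static int demo_split_lengths(uint32_t section_length,
			      uint16_t err_info_num,
			      uint32_t ctx_len,
			      struct demo_lens *out)
{
	uint64_t used;

	assert(out != NULL);

	out->pei_len = (uint32_t)err_info_num * DEMO_ERR_INFO_LEN;
	out->ctx_len = ctx_len;

	/* 64-bit sum so that oversized inputs cannot wrap around. */
	used = (uint64_t)DEMO_HDR_LEN + out->pei_len + ctx_len;
	if (used > section_length)
		return -1;

	/* Whatever is left after the known fields is vendor data. */
	out->oem_len = (uint32_t)(section_length - used);
	return 0;
}
```

For example, with section_length = 144, two error information entries and
24 bytes of context, the sketch yields pei_len = 64, ctx_len = 24 and
oem_len = 16; the fixed header accounts for the remaining 40 bytes.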

It should be noted that decoding of tables N.17 to N.29, if needed, will
be handled in userspace. That gives more flexibility, as there won't be
any need to flood the kernel with micro-architecture specific error
decoding.

Also, decoding the other fields requires complex logic, and should be
done for each of the several values inside the record field.  So, let
userspace daemons like rasdaemon decode them, parsing such tables and
providing vendor-specific, micro-architecture-specific decoders.

 [mchehab: modified description, solved merge conflicts and fixed coding style]

Signed-off-by: Jason Tian 
Co-developed-by: Shengwei Luo 
Signed-off-by: Shengwei Luo 
Signed-off-by: Mauro Carvalho Chehab 
Signed-off-by: Daniel Ferguson  # rebased
Reviewed-by: Jonathan Cameron 
Tested-by: Shiju Jose 
Acked-by: Borislav Petkov (AMD) 
Fixes: e9279e83ad1f ("trace, ras: add ARM processor error trace event")
Link: https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-section
Signed-off-by: Ard Biesheuvel 
Signed-off-by: Sasha Levin

RAS: Export log_non_standard_event() to drivers

2025-09-15T14:20:29+00:00

The function log_non_standard_event() is responsible for logging
platform-specific or vendor-defined RAS (Reliability, Availability, and
Serviceability) events. Currently, this function is only available within the
RAS subsystem, preventing external modules from leveraging its capabilities.

Export it to drivers to log non-standard RAS events via EDAC.

Signed-off-by: Shubhrajyoti Datta 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/20250908115649.22903-1-shubhrajyoti.datta@amd.com

Merge tag 'v6.15-rc5' into x86/cpu, to resolve conflicts

2025-05-06T08:00:58+00:00

 Conflicts:
	tools/arch/x86/include/asm/cpufeatures.h

Signed-off-by: Ingo Molnar

x86/platform/amd: Move the header to

2025-04-14T07:34:17+00:00

Collect AMD specific platform header files in .

Signed-off-by: Ingo Molnar 
Acked-by: Borislav Petkov (AMD) 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Mario Limonciello 
Link: https://lore.kernel.org/r/20250413084144.3746608-7-mingo@kernel.org

x86/platform/amd: Move the header to

2025-04-14T07:34:14+00:00

Collect AMD specific platform header files in .

Signed-off-by: Ingo Molnar 
Acked-by: Borislav Petkov (AMD) 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Mario Limonciello 
Link: https://lore.kernel.org/r/20250413084144.3746608-4-mingo@kernel.org

RAS/AMD/FMPM: Get masked address

2025-04-08T17:30:58+00:00

Some operations require checking, or ignoring, specific bits in an address
value. For example, this can be comparing address values to identify unique
structures.

Currently, the full address value is compared when filtering for duplicates.
This results in overcounting and the creation of extra records, which gives
the impression that more unique events occurred than did in reality.

Mask the address for physical rows on MI300.
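
The effect of masking on duplicate filtering can be sketched as follows.
This is a generic illustration only: DEMO_IGNORED_BITS and the demo_*
helpers are hypothetical, and the bits that actually need to be ignored
on MI300 are hardware-specific and not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask: treat the low 12 address bits as "don't care". */
#define DEMO_IGNORED_BITS	0xFFFULL

/* Clear the ignored bits so that only the significant part remains. */
static uint64_t demo_masked_addr(uint64_t addr)
{
	return addr & ~DEMO_IGNORED_BITS;
}

/* Two addresses name the same structure if they match after masking. */
static bool demo_same_record(uint64_t a, uint64_t b)
{
	return demo_masked_addr(a) == demo_masked_addr(b);
}

/*
 * Count the unique records in addrs[0..n), using the masked compare.
 * Comparing full addresses instead would count every distinct value
 * and overstate the number of unique events.
 */
static size_t demo_count_unique(const uint64_t *addrs, size_t n)
{
	size_t unique = 0;

	assert(addrs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		bool seen = false;

		for (size_t j = 0; j < i; j++) {
			if (demo_same_record(addrs[i], addrs[j])) {
				seen = true;
				break;
			}
		}
		if (!seen)
			unique++;
	}

	return unique;
}
```

With the hypothetical mask above, 0x1000, 0x1008 and 0x1ff0 collapse into
one record, whereas a full-address compare would treat them as three.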

  [ bp: Simplify. ]

Fixes: 6f15e617cc99 ("RAS: Introduce a FRU memory poison manager")
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Cc: stable@vger.kernel.org

RAS/AMD/ATL: Include row[13] bit in row retirement

2025-04-07T13:06:06+00:00

Based on feedback from hardware folks, row[13] is part of the variable
bits within a physical row (along with all column bits).

Only half the physical addresses affected by a row are calculated if
this bit is not included.

Add the row[13] bit to the row retirement flow.
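
Why a single missing bit halves the result can be sketched generically.
The code below is an illustration, not the ATL implementation:
demo_expand_row() is a hypothetical helper and the bit positions used
with it are made up. Every variable bit doubles the number of addresses
that belong to one physical row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Enumerate every address reachable from base by taking all
 * combinations of the bits listed in var_bits[0..nbits).
 * Returns the number of addresses written to out, or 0 if out_cap
 * is too small to hold all 2^nbits of them.
 */
static size_t demo_expand_row(uint64_t base,
			      const unsigned int *var_bits, size_t nbits,
			      uint64_t *out, size_t out_cap)
{
	size_t total;

	assert(var_bits != NULL && out != NULL);
	assert(nbits < 16);	/* keep the demo enumeration small */

	total = (size_t)1 << nbits;
	if (total > out_cap)
		return 0;

	for (size_t combo = 0; combo < total; combo++) {
		uint64_t addr = base;

		for (size_t i = 0; i < nbits; i++) {
			uint64_t bit = 1ULL << var_bits[i];

			/* Force the bit to the value chosen by combo. */
			addr &= ~bit;
			if (combo & ((size_t)1 << i))
				addr |= bit;
		}
		out[combo] = addr;
	}

	return total;
}
```

With two hypothetical variable bits the helper produces 4 addresses;
adding a third one (bit 13 in the test values) produces 8. Leaving that
bit out therefore covers only half of the affected addresses.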

Fixes: 3b566b30b414 ("RAS/AMD/ATL: Add MI300 row retirement support")
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250401-fix-fmpm-extra-records-v1-1-840bcf7a8ac5@amd.com

x86/amd_nb: Move SMN access code to a new amd_node driver

2025-01-08T09:59:44+00:00

SMN access was bolted into amd_nb mostly as a convenience.  This has
limitations, though, that require incurring tech debt to keep it working.

Move SMN access to the newly introduced AMD Node driver.

Signed-off-by: Mario Limonciello 
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Acked-by: Ilpo Järvinen  # pdx86
Acked-by: Shyam Sundar S K  # PMF, PMC
Link: https://lore.kernel.org/r/20241206161210.163701-11-yazen.ghannam@amd.com

RAS/AMD/ATL: Add debug prints for DF register reads

2024-10-22T16:55:57+00:00

The ATL will fail early if the DF register access fails due to missing
PCI IDs in the amd_nb code. There aren't any clear indicators of why the
ATL fails to load in this case.

Add a couple of debug print statements to highlight reasons for failure.

A common scenario is missing support for new hardware. If the ATL fails
to load on a system, and there is interest to support it, then dynamic
debugging can be enabled to help find the cause for failure. If there is
no interest in supporting ATL on a new system, then these failures will
be silent.

Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/20241021152158.2525669-1-yazen.ghannam@amd.com