linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c, branch v5.10.258

drm/amdgpu: Fix amdgpu_ras_eeprom_init()

2021-09-18T11:40:17+00:00

[ Upstream commit dce4400e6516d18313d23de45b5be8a18980b00e ]

No need to account for the 2 bytes of EEPROM
address--this is now well abstracted away by
the fixes the the lower layers.

Cc: Andrey Grodzovsky 
Cc: Alexander Deucher 
Signed-off-by: Luben Tuikov 
Acked-by: Alexander Deucher 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu: added RAS EEPROM device support check

2020-08-04T21:29:18+00:00

updated RAS EEPROM init/threshold sequences to check for device support

Reviewed-by: Guchun Chen 
Signed-off-by: John Clements 
Signed-off-by: Alex Deucher

drm/amdgpu: update eeprom once specifying one bigger threshold(v3)

2020-08-04T21:27:25+00:00

During driver's probe, when it hits bad gpu tag in eeprom i2c
init calling(the tag was set when reported bad page reaches
bad page threshold in last driver's working loop), there are
some strategys to deal with the cases:

1. when the module parameter amdgpu_bad_page_threshold = 0,
that means page retirement feature is disabled, so just resetting
the eeprom is fine.
2. When amdgpu_bad_page_threshold is not 0, and moreover, user
sets one bigger valid data in order to make current boot up
succeeds, correct eeprom header tag and do not break booting.
3. For other cases, driver's probe will be broken.

v2: Just update eeprom header tag instead of resetting the whole
    table header when user sets one bigger threshold data.

v3: Use dev_info/dev_err to print PCI device information, which
    helps in mGPU case.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: break GPU recovery once it's in bad state(v4)

2020-08-04T21:26:54+00:00

When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: schedule ras recovery when reaching bad page threshold(v2)

2020-08-04T21:26:46+00:00

Once the bad page saved to eeprom reaches the configured
threshold, ras recovery will be issued to notify user.

v2: Fix spelling typo.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: break driver init process when it's bad GPU(v5)

2020-08-04T21:26:31+00:00

When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: add bad gpu tag definition

2020-08-04T21:26:16+00:00

This tag will be hired for bad gpu detection in eeprom's access.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: validate bad page threshold in ras(v3)

2020-08-04T21:25:58+00:00

Bad page threshold value should be valid in the range between
-1 and max records length of eeprom. It could determine when
saved bad pages exceed threshold value, and proceed corresponding
actions.

v2: When using the default typical value, it should be min
value between typical value and eeprom max records length.

v3: drop the case of setting bad_page_cnt_threshold to be
    0xFFFFFFFF, as it confuses user.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device

2020-03-16T20:21:32+00:00

Puts the i2c adapter in common place for sharing by RAS
and upcoming data read from FRU EEPROM feature.

v2:
Move i2c adapter to amdgpu_pm and rename it.

v3: Move i2c adapter init to ASIC specific code and get rid
of the switch case in amdgpu_device

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: Add Arcturus D342 page retire support

2020-02-26T19:17:32+00:00

Check Arcturus SKU type to select I2C address of page retirement EEPROM

Reviewed-by: Hawking Zhang 
Signed-off-by: John Clements 
Signed-off-by: Alex Deucher