linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.12.80

drm/amdkfd: Unreserve bo if queue update failed

2026-03-25T10:08:30+00:00

[ Upstream commit 2ce75a0b7e1bfddbcb9bc8aeb2e5e7fa99971acf ]

Error handling path should unreserve bo then return failed.

Fixes: 305cd109b761 ("drm/amdkfd: Validate user queue update")
Signed-off-by: Philip Yang 
Reviewed-by: Alex Sierra 
Signed-off-by: Alex Deucher 
(cherry picked from commit c24afed7de9ecce341825d8ab55a43a254348b33)
Signed-off-by: Sasha Levin

drm/amdkfd: Fix out-of-bounds write in kfd_event_page_set()

2026-03-04T12:21:53+00:00

[ Upstream commit 8a70a26c9f34baea6c3199a9862ddaff4554a96d ]

The kfd_event_page_set() function writes KFD_SIGNAL_EVENT_LIMIT * 8
bytes via memset without checking the buffer size parameter. This allows
unprivileged userspace to trigger an out-of bounds kernel memory write
by passing a small buffer, leading to  potential privilege
escalation.

Signed-off-by: Sunday Clement 
Reviewed-by: Alexander Deucher 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

drm/amdkfd: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map()

2026-03-04T12:21:05+00:00

[ Upstream commit 6c160001661b6c4e20f5c31909c722741e14c2d8 ]

In svm_migrate_gart_map(), while migrating GART mapping, the number of
bytes copied for the GART table only accounts for CPU pages. On non-4K
systems, each CPU page can contain multiple GPU pages, and the GART
requires one 8-byte PTE per GPU page. As a result, an incorrect size was
passed to the DMA, causing only a partial update of the GART table.

Fix this function to work correctly on non-4K page-size systems by
accounting for the number of GPU pages per CPU page when calculating the
number of bytes to be copied.

Acked-by: Christian König 
Reviewed-by: Philip Yang 
Signed-off-by: Ritesh Harjani (IBM) 
Signed-off-by: Donet Tom 
Signed-off-by: Felix Kuehling 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdkfd: Relax size checking during queue buffer get

2026-03-04T12:21:05+00:00

[ Upstream commit 42ea9cf2f16b7131cb7302acb3dac510968f8bdc ]

HW-supported EOP buffer sizes are 4K and 32K. On systems that do not
use 4K pages, the minimum buffer object (BO) allocation size is
PAGE_SIZE (for example, 64K). During queue buffer acquisition, the driver
currently checks the allocated BO size against the supported EOP buffer
size. Since the allocated BO is larger than the expected size, this check
fails, preventing queue creation.

Relax the strict size validation and allow PAGE_SIZE-sized BOs to be used.
Only the required 4K region of the buffer will be used as the EOP buffer
and avoids queue creation failures on non-4K page systems.

Acked-by: Christian König 
Suggested-by: Philip Yang 
Signed-off-by: Donet Tom 
Signed-off-by: Felix Kuehling 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdkfd: Handle GPU reset and drain retry fault race

2026-03-04T12:20:59+00:00

[ Upstream commit 5b57c3c3f22336e8fd5edb7f0fef3c7823f8eac1 ]

Only check and drain IH1 ring if CAM is not enabled.

If GPU is under reset, don't access IH to drain retry fault.

Signed-off-by: Philip Yang 
Reviewed-by: Harish Kasiviswanathan 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdkfd: Fix watch_id bounds checking in debug address watch v2

2026-03-04T12:20:42+00:00

[ Upstream commit 5a19302cab5cec7ae7f1a60c619951e6c17d8742 ]

The address watch clear code receives watch_id as an unsigned value
(u32), but some helper functions were using a signed int and checked
bits by shifting with watch_id.

If a very large watch_id is passed from userspace, it can be converted
to a negative value.  This can cause invalid shifts and may access
memory outside the watch_points array.

drm/amdkfd: Fix watch_id bounds checking in debug address watch v2

Fix this by checking that watch_id is within MAX_WATCH_ADDRESSES before
using it.  Also use BIT(watch_id) to test and clear bits safely.

This keeps the behavior unchanged for valid watch IDs and avoids
undefined behavior for invalid ones.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_debug.c:448
kfd_dbg_trap_clear_dev_address_watch() error: buffer overflow
'pdd->watch_points' 4 <= u32max user_rl='0-3,2147483648-u32max' uncapped

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_debug.c
    433 int kfd_dbg_trap_clear_dev_address_watch(struct kfd_process_device *pdd,
    434                                         uint32_t watch_id)
    435 {
    436         int r;
    437
    438         if (!kfd_dbg_owns_dev_watch_id(pdd, watch_id))

kfd_dbg_owns_dev_watch_id() doesn't check for negative values so if
watch_id is larger than INT_MAX it leads to a buffer overflow.
(Negative shifts are undefined).

    439                 return -EINVAL;
    440
    441         if (!pdd->dev->kfd->shared_resources.enable_mes) {
    442                 r = debug_lock_and_unmap(pdd->dev->dqm);
    443                 if (r)
    444                         return r;
    445         }
    446
    447         amdgpu_gfx_off_ctrl(pdd->dev->adev, false);
--> 448         pdd->watch_points[watch_id] = pdd->dev->kfd2kgd->clear_address_watch(
    449                                                         pdd->dev->adev,
    450                                                         watch_id);

v2: (as per, Jonathan Kim)
 - Add early watch_id >= MAX_WATCH_ADDRESSES validation in the set path to
   match the clear path.
 - Drop the redundant bounds check in kfd_dbg_owns_dev_watch_id().

Fixes: e0f85f4690d0 ("drm/amdkfd: add debug set and clear address watch points operation")
Reported-by: Dan Carpenter 
Cc: Jonathan Kim 
Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: Christian König 
Signed-off-by: Srinivasan Shanmugam 
Reviewed-by: Jonathan Kim 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdkfd: Fix signal_eviction_fence() bool return value

2026-03-04T12:19:53+00:00

[ Upstream commit 31dc58adda9874420ab8fa5a2f9c43377745753a ]

signal_eviction_fence() is declared to return bool, but returns -EINVAL
when no eviction fence is present.  This makes the "no fence" or "the
NULL-fence" path evaluate to true and triggers a Smatch warning.

v2: Return true instead to explicitly indicate that there is no eviction
fence to signal and that eviction is already complete. This matches the
existing caller logic where a NULL fence means "nothing to do" and
allows restore handling to proceed normally. (Christian)

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:2099 signal_eviction_fence()
warn: '(-22)' is not bool

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c
    2090 static bool signal_eviction_fence(struct kfd_process *p)
                ^^^^

    2091 {
    2092         struct dma_fence *ef;
    2093         bool ret;
    2094
    2095         rcu_read_lock();
    2096         ef = dma_fence_get_rcu_safe(&p->ef);
    2097         rcu_read_unlock();
    2098         if (!ef)
--> 2099                 return -EINVAL;

		This should be either true or false.
		Probably true because presumably
		it has been tested?

    2100
    2101         ret = dma_fence_check_and_signal(ef);
    2102         dma_fence_put(ef);
    2103
    2104         return ret;
    2105 }

Fixes: 37865e02e6cc ("drm/amdkfd: Fix eviction fence handling")
Reported by: Dan Carpenter 
Cc: Philip Yang 
Cc: Gang BA 
Cc: Felix Kuehling 
Signed-off-by: Srinivasan Shanmugam 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdkfd: fix a memory leak in device_queue_manager_init()

2026-01-23T10:18:47+00:00

commit 80614c509810fc051312d1a7ccac8d0012d6b8d0 upstream.

If dqm->ops.initialize() fails, add deallocate_hiq_sdma_mqd()
to release the memory allocated by allocate_hiq_sdma_mqd().
Move deallocate_hiq_sdma_mqd() up to ensure proper function
visibility at the point of use.

Fixes: 11614c36bc8f ("drm/amdkfd: Allocate MQD trunk for HIQ and SDMA")
Signed-off-by: Haoxiang Li 
Signed-off-by: Felix Kuehling 
Reviewed-by: Oak Zeng 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
(cherry picked from commit b7cccc8286bb9919a0952c812872da1dcfe9d390)
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: Fix improper NULL termination of queue restore SMI event string

2026-01-17T15:31:28+00:00

[ Upstream commit 969faea4e9d01787c58bab4d945f7ad82dad222d ]

Pass character "0" rather than NULL terminator to properly format
queue restoration SMI events. Currently, the NULL terminator precedes
the newline character that is intended to delineate separate events
in the SMI event buffer, which can break userspace parsers.

Signed-off-by: Brian Kocoloski 
Reviewed-by: Philip Yang 
Signed-off-by: Alex Deucher 
(cherry picked from commit 6e7143e5e6e21f9d5572e0390f7089e6d53edf3c)
Signed-off-by: Sasha Levin

drm/amdkfd: Trap handler support for expert scheduling mode

2026-01-08T09:14:54+00:00

commit b7851f8c66191cd23a0a08bd484465ad74bbbb7d upstream.

The trap may be entered with dependency checking disabled.
Wait for dependency counters and save/restore scheduling mode.

v2:

Use ttmp1 instead of ttmp11. ttmp11 is not zero-initialized.
While the trap handler does zero this field before use, a user-mode
second-level trap handler could not rely on this being zero when
using an older kernel mode driver.

v3:

Use ttmp11 primarily but copy to ttmp1 before jumping to the
second level trap handler. ttmp1 is inspectable by a debugger.
Unexpected bits in the unused space may regress existing software.

Signed-off-by: Jay Cornwall 
Reviewed-by: Lancelot Six 
Signed-off-by: Alex Deucher 
(cherry picked from commit 423888879412e94725ca2bdccd89414887d98e31)
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman