diff options
| author | Dave Airlie <airlied@redhat.com> | 2023-05-29 06:21:50 +1000 |
|---|---|---|
| committer | Dave Airlie <airlied@redhat.com> | 2023-05-29 06:21:51 +1000 |
| commit | 85d712f033d23bb56a373e29465470c036532d46 (patch) | |
| tree | bc08335816b8113d9edee563eb92d2e2c677fb27 /drivers/gpu/drm/i915/i915_gpu_error.c | |
| parent | b8887e796e06b1de4db899f49d531d220f94f393 (diff) | |
| parent | 0fbcf57077c47b444e91b9ce8a243e6f7f53693d (diff) | |
| download | linux-85d712f033d23bb56a373e29465470c036532d46.tar.gz linux-85d712f033d23bb56a373e29465470c036532d46.tar.bz2 linux-85d712f033d23bb56a373e29465470c036532d46.zip | |
Merge tag 'drm-intel-gt-next-2023-05-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
UAPI Changes:
- New getparam for querying PXP support and load status
Cross-subsystem Changes:
- GSC/MEI proxy driver
Driver Changes:
Fixes/improvements/new stuff:
- Avoid clearing pre-allocated framebuffers with the TTM backend (Nirmoy Das)
- Implement framebuffer mmap support (Nirmoy Das)
- Disable sampler indirect state in bindless heap (Lionel Landwerlin)
- Avoid out-of-bounds access when loading HuC (Lucas De Marchi)
- Actually return an error if GuC version range check fails (John Harrison)
- Get mutex and rpm ref just once in hwm_power_max_write (Ashutosh Dixit)
- Disable PL1 power limit when loading GuC firmware (Ashutosh Dixit)
- Block in hwmon while waiting for GuC reset to complete (Ashutosh Dixit)
- Provide sysfs for SLPC efficient freq (Vinay Belgaumkar)
- Add support for total context runtime for GuC back-end (Umesh Nerlige Ramappa)
- Enable fdinfo for GuC backends (Umesh Nerlige Ramappa)
- Don't capture Gen8 regs on Xe devices (John Harrison)
- Fix error capture for virtual engines (John Harrison)
- Track patch level versions on reduced version firmware files (John Harrison)
- Decode another GuC load failure case (John Harrison)
- GuC loading and firmware table handling fixes (John Harrison)
- Fix confused register capture list creation (John Harrison)
- Dump error capture to kernel log (John Harrison)
- Dump error capture to dmesg on CTB error (John Harrison)
- Disable rps_boost debugfs when SLPC is used (Vinay Belgaumkar)
Future platform enablement:
- Disable stolen memory backed FB for A0 [mtl] (Nirmoy Das)
- Various refactors for multi-tile enablement (Andi Shyti, Tejas Upadhyay)
- Extend Wa_22011802037 to MTL A-step (Madhumitha Tolakanahalli Pradeep)
- WA to clear RDOP clock gating [mtl] (Haridhar Kalvala)
- Set has_llc=0 [mtl] (Fei Yang)
- Define MOCS and PAT tables for MTL (Madhumitha Tolakanahalli Pradeep)
- Add PTE encode function [mtl] (Fei Yang)
- fix mocs selftest [mtl] (Fei Yang)
- Workaround coherency issue for Media [mtl] (Fei Yang)
- Add workaround 14018778641 [mtl] (Tejas Upadhyay)
- Implement Wa_14019141245 [mtl] (Radhakrishna Sripada)
- Fix the wa number for Wa_22016670082 [mtl] (Radhakrishna Sripada)
- Use correct huge page manager for MTL (Jonathan Cavitt)
- GSC/MEI support for Meteorlake (Alexander Usyskin, Daniele Ceraolo Spurio)
- Define GuC firmware version for MTL (John Harrison)
- Drop FLAT CCS check [mtl] (Pallavi Mishra)
- Add MTL for remapping CCS FBs [mtl] (Clint Taylor)
- Meteorlake PXP enablement (Alan Previn)
- Do not enable render power-gating on MTL (Andrzej Hajda)
- Add MTL performance tuning changes (Radhakrishna Sripada)
- Extend Wa_16014892111 to MTL A-step (Radhakrishna Sripada)
- PMU multi-tile support (Tvrtko Ursulin)
- End support for set caching ioctl [mtl] (Fei Yang)
Driver refactors:
- Use i915 instead of dev_priv insied the file_priv structure (Andi Shyti)
- Use proper parameter naming in for_each_engine() (Andi Shyti)
- Use gt_err for GT info (Tejas Upadhyay)
- Consolidate duplicated capture list code (John Harrison)
- Capture list naming clean up (John Harrison)
- Use kernel-doc -Werror when CONFIG_DRM_I915_WERROR=y (Jani Nikula)
- Preparation for using PAT index (Fei Yang)
- Use pat_index instead of cache_level (Fei Yang)
Miscellaneous:
- Fix memory leaks in i915 selftests (Cong Liu)
- Record GT error for gt failure (Tejas Upadhyay)
- Migrate platform-dependent mock hugepage selftests to live (Jonathan Cavitt)
- Update the SLPC selftest (Vinay Belgaumkar)
- Throw out set() wrapper (Jani Nikula)
- Large driver kernel doc cleanup (Jani Nikula)
- Fix probe injection CI failures after recent change (John Harrison)
- Make unexpected firmware versions an error in debug builds (John Harrison)
- Silence UBSAN uninitialized bool variable warning (Ashutosh Dixit)
- Fix memory leaks in function live_nop_switch (Cong Liu)
Merges:
- Merge drm/drm-next into drm-intel-gt-next (Joonas Lahtinen)
Signed-off-by: Dave Airlie <airlied@redhat.com>
# Conflicts:
# drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZG5SxCWRSkZhTDtY@tursulin-desk
Diffstat (limited to 'drivers/gpu/drm/i915/i915_gpu_error.c')
| -rw-r--r-- | drivers/gpu/drm/i915/i915_gpu_error.c | 153 |
1 files changed, 147 insertions, 6 deletions
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index f020c0086fbc..ec368e700235 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -808,10 +808,15 @@ static void err_print_gt_engines(struct drm_i915_error_state_buf *m, for (ee = gt->engine; ee; ee = ee->next) { const struct i915_vma_coredump *vma; - if (ee->guc_capture_node) - intel_guc_capture_print_engine_node(m, ee); - else + if (gt->uc && gt->uc->guc.is_guc_capture) { + if (ee->guc_capture_node) + intel_guc_capture_print_engine_node(m, ee); + else + err_printf(m, " Missing GuC capture node for %s\n", + ee->engine->name); + } else { error_print_engine(m, ee); + } err_printf(m, " hung: %u\n", ee->hung); err_printf(m, " engine reset count: %u\n", ee->reset_count); @@ -1117,10 +1122,14 @@ i915_vma_coredump_create(const struct intel_gt *gt, mutex_lock(&ggtt->error_mutex); if (ggtt->vm.raw_insert_page) ggtt->vm.raw_insert_page(&ggtt->vm, dma, slot, - I915_CACHE_NONE, 0); + i915_gem_get_pat_index(gt->i915, + I915_CACHE_NONE), + 0); else ggtt->vm.insert_page(&ggtt->vm, dma, slot, - I915_CACHE_NONE, 0); + i915_gem_get_pat_index(gt->i915, + I915_CACHE_NONE), + 0); mb(); s = io_mapping_map_wc(&ggtt->iomap, slot, PAGE_SIZE); @@ -2162,7 +2171,7 @@ void i915_error_state_store(struct i915_gpu_coredump *error) * i915_capture_error_state - capture an error record for later analysis * @gt: intel_gt which originated the hang * @engine_mask: hung engines - * + * @dump_flags: dump flags * * Should be called when an error is detected (either a hang or an error * interrupt) to capture error state from the time of the error. Fills @@ -2219,3 +2228,135 @@ void i915_disable_error_state(struct drm_i915_private *i915, int err) i915->gpu_error.first_error = ERR_PTR(err); spin_unlock_irq(&i915->gpu_error.lock); } + +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM) +void intel_klog_error_capture(struct intel_gt *gt, + intel_engine_mask_t engine_mask) +{ + static int g_count; + struct drm_i915_private *i915 = gt->i915; + struct i915_gpu_coredump *error; + intel_wakeref_t wakeref; + size_t buf_size = PAGE_SIZE * 128; + size_t pos_err; + char *buf, *ptr, *next; + int l_count = g_count++; + int line = 0; + + /* Can't allocate memory during a reset */ + if (test_bit(I915_RESET_BACKOFF, >->reset.flags)) { + drm_err(>->i915->drm, "[Capture/%d.%d] Inside GT reset, skipping error capture :(\n", + l_count, line++); + return; + } + + error = READ_ONCE(i915->gpu_error.first_error); + if (error) { + drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error capture first...\n", + l_count, line++); + i915_reset_error_state(i915); + } + + with_intel_runtime_pm(&i915->runtime_pm, wakeref) + error = i915_gpu_coredump(gt, engine_mask, CORE_DUMP_FLAG_NONE); + + if (IS_ERR(error)) { + drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error capture: %ld!\n", + l_count, line++, PTR_ERR(error)); + return; + } + + buf = kvmalloc(buf_size, GFP_KERNEL); + if (!buf) { + drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer for error capture!\n", + l_count, line++); + i915_gpu_coredump_put(error); + return; + } + + drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture for %ps...\n", + l_count, line++, __builtin_return_address(0)); + + /* Largest string length safe to print via dmesg */ +# define MAX_CHUNK 800 + + pos_err = 0; + while (1) { + ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf, pos_err, buf_size - 1); + + if (got <= 0) + break; + + buf[got] = 0; + pos_err += got; + + ptr = buf; + while (got > 0) { + size_t count; + char tag[2]; + + next = strnchr(ptr, got, '\n'); + if (next) { + count = next - ptr; + *next = 0; + tag[0] = '>'; + tag[1] = '<'; + } else { + count = got; + tag[0] = '}'; + tag[1] = '{'; + } + + if (count > MAX_CHUNK) { + size_t pos; + char *ptr2 = ptr; + + for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) { + char chr = ptr[pos]; + + ptr[pos] = 0; + drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n", + l_count, line++, ptr2); + ptr[pos] = chr; + ptr2 = ptr + pos; + + /* + * If spewing large amounts of data via a serial console, + * this can be a very slow process. So be friendly and try + * not to cause 'softlockup on CPU' problems. + */ + cond_resched(); + } + + if (ptr2 < (ptr + count)) + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n", + l_count, line++, tag[0], ptr2, tag[1]); + else if (tag[0] == '>') + drm_info(&i915->drm, "[Capture/%d.%d] ><\n", + l_count, line++); + } else { + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n", + l_count, line++, tag[0], ptr, tag[1]); + } + + ptr = next; + got -= count; + if (next) { + ptr++; + got--; + } + + /* As above. */ + cond_resched(); + } + + if (got) + drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes remaining!\n", + l_count, line++, got); + } + + kvfree(buf); + + drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n", l_count, line++, pos_err); +} +#endif |
