summaryrefslogtreecommitdiff
path: root/drivers/accel/habanalabs
AgeCommit message (Collapse)AuthorFilesLines
2023-10-09accel/habanalabs: rename fd_list to hpriv_listKoby Elbaz1-24/+19
Every time an FD is returned to the user, the driver adds a corresponding private structure to the list. Yet, it's still a list of private structures rather than of FDs. Remove, as well, an unnecessary comment. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: call put_pid after hpriv list is updatedKoby Elbaz1-2/+2
Because we might still be using related resources, decrementing PID's reference count should be done at later stages of the device release. A good place is right after the representing private structure is removed from LKD's list. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: print return code when process termination failsKoby Elbaz1-3/+4
As part of driver teardown, we attempt to kill all user processes. It shouldn't fail, but if it does we want to print the error code that the kapi returned to us. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: fix standalone preboot descriptor requestfarah kassabri1-1/+2
The preboot used to statically allocate memory for the comms descriptor on the device memory when driver requested the descriptor information. Now preboot moved to dynamic memory allocation where it wants to check the size the driver expects vs. what the f/w expects. Note there are no backward compatibility issues as older f/w versions simply ignore this value. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: handle arc farm razwiDani Liberman1-3/+13
Implement razwi handling for arc farm and add it to arc farm sei event handler. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: stop fetching MME SBTE error causeOfir Bitton1-23/+8
Because in this case we have only a single possible cause, we can safely stop fetching the cause from firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: set device status 'malfunction' while in rmmodKoby Elbaz1-2/+4
hl_device_status() returns the status of an acquired device. If a device is going down (following an rmmod cmd), it should be marked as an unusable/malfunctioning device, and hence should not be acquired. However, since this was not the case so far (i.e., a device going down would inaccurately return 'in reset' status allowing the user to acquire the device) it introduced a bug where as part of a reset flow, the driver could not kill processes that have not run yet, and since those processes aren't blocked from reacquiring a device, we get eventually a new flow of a driver attempting to kill all processes in a list that can't be ever really empty. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: print task name upon creation of a user contextTomer Tayar1-2/+4
It is useful for debug to know which user process have acquired the device. Add this info to the relevant debug print, in addition to the already printed user context's ASID. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: print task name and request code upon ioctl failureTomer Tayar1-7/+17
When an ioctl fails, it is useful to know what is the task command name and the full ioctl request code, in addition to the task pid and the ioctl number. Add the additional information to the relevant debug error prints. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: notify user about undefined opcode eventOfir Bitton1-5/+137
In order for user to be aware of undefined opcode events, we must store all relevant information and notify user about the failure. The user will fetch the stored info via info ioctl. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: update pending reset flags with new reset requestsTomer Tayar1-1/+3
If hl_device_cond_reset() is called while a reset is already pending but hasn't started, the reset request will be dropped. If the flags of the new request are more severe, e.g. a hard reset while the pending reset is a compute reset, the eventual reset won't be suitable for the device status. To prevent such cases, update the pending reset flags with the new requests flags before the requests are dropped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W eventsTomer Tayar1-1/+10
When a H/W event is received while a user is registered to events, no immediate hard reset will happen, and instead the user will be notified and will have some time to handle it and eventually release the device, after which the reset will be done. If a user, as part of the handling and as part of the cleanup steps towards releasing the device, unregisters from receiving those events, and at that time an adjacent H/W event is received, it will be assumed that the user is not registered to events and thus an immediate hard reset is required. To prevent such an unwanted immediate reset, modify the driver to perform it if the user is not registered to events AND we don't already have a pending reset for a previous H/W event. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-07-20accel/habanalabs: add more debugfs stub helpersArnd Bergmann1-0/+9
Two functions got added with normal prototypes for debugfs, but not alternative when building without it: drivers/accel/habanalabs/common/device.c: In function 'hl_device_init': drivers/accel/habanalabs/common/device.c:2177:14: error: implicit declaration of function 'hl_debugfs_device_init'; did you mean 'hl_debugfs_init'? [-Werror=implicit-function-declaration] drivers/accel/habanalabs/common/device.c:2305:9: error: implicit declaration of function 'hl_debugfs_device_fini'; did you mean 'hl_debugfs_remove_file'? [-Werror=implicit-function-declaration] Add stubs for these as well. Fixes: 3b9abb4fa642 ("accel/habanalabs: expose debugfs files later") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20230609120636.3969045-1-arnd@kernel.org
2023-06-08accel/habanalabs: refactor error info resetDani Liberman3-4/+10
Moved error info reset code to single function for future use from other places in the driver. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: add event queue extra validationOfir Bitton2-1/+7
In order to increase reliability of the event queue interface, we apply to Gaudi2 the same mechanism we have in Gaudi1. The extra validation is basically checking that the received event index matches the expected index. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: unsecure TSB_CFG_MTRR regsOfir Bitton1-0/+4
In order to utilize Engine Barrier padding, user must have access to this register set. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: move ioctl error print to debug levelOded Gabbay1-3/+3
We don't want to allow users to spam the kernel log and sending ioctls with bad opcodes is a sure way to do it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08accel/habanalabs: fix bug of not fetching addr_dec infoOfir Bitton1-2/+6
addr_dec info should always be fetched, regardless of cause value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: remove sim codeOded Gabbay2-26/+7
There were a few places where simulator only code got into the upstream. Remove those places that can confuse other developers. Fixes: 2a0a839b6a28 ("habanalabs: extend fatal messages to contain PCI info") Cc: Moti Haimovski <mhaimovski@habana.ai> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: mask part of hmmu page fault captured addressDani Liberman1-3/+11
When receiving page fault from hmmu, the captured address is scrambled both by HW and by driver. The driver part is unscrambled but the HW part isn't getting unscrambled. To avoid declaring wrong address, the HW scrambled part will be masked. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: update state when loading boot fitKoby Elbaz1-15/+10
Any FW component we load must be followed by a corresponding state update. However, it seems that so far we skipped doing so for the bootfit case, so fix that. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: print qman data on error only for lower qmanTomer Tayar3-128/+31
By default, the upper QMANs are not used, and instead engines ARCs access the lower QMANs directly. Errors for upper QMANs are therefore not expected, and the debug print of the PQ entries is not needed. Modify the QMAN debug data print on errors to include only information for the lower QMAN. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: use lower QM in QM errors handlingTomer Tayar1-5/+5
The QMAN GLBL_ERR_STS_4 register has indications for errors also in the lower CQ and the ARC CQ, and not just for errors in the lower CP. Modify the relevant define/struct and the related print to use "lower QM" instead of "lower CP". Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: use binning info when handling razwiDani Liberman1-3/+14
When receiving sei interrupt from tpc or decoder, we need to check the binning mask because if the engine is binned, the razwi info won't be in the router of the binned engine, instead will be in the router of the substitute engine. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: remove support for mmu disableOfir Bitton11-232/+26
As mmu disable mode is only used for bring-up stages, let's remove this option and all code related to it. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: upon DMA errors, use FW-extracted error causeKoby Elbaz1-29/+8
Initially, the driver used to read the error cause data directly from the ASIC. However, the FW now clears it before the driver could read it. Therefore we should use the error cause data that is extracted by the FW. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: print max timeout value on CS stuckOded Gabbay1-11/+15
If a workload got stuck, we print an error to the kernel log about it. Add to that print the configured max timeout value, as that value is not fixed between ASICs and in addition it can be configured using a kernel module parameter. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08accel/habanalabs: align to latest firmware specsOded Gabbay2-43/+16
Update the firmware common interface files with the latest version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08accel/habanalabs: fix mem leak in capture user mappingsMoti Haimovski1-0/+2
This commit fixes a memory leak caused when clearing the user_mappings info when a new context is opened immediately after user_mapping is captured and a hard reset is performed. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: set unused bit as reservedOded Gabbay1-1/+1
Get latest f/w gaudi2 interface file which marks unused bist_need_iatu_config bit in cold_rst_data structure as reserved bit. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08accel/habanalabs: rename security functions related argumentsKoby Elbaz1-28/+29
Make the argument names specify the registers array represent registers that should be unsecured so the user can access them. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: fix gaudi2_get_tpc_idle_status() returnDan Carpenter1-1/+1
The gaudi2_get_tpc_idle_status() function returned the incorrect variable so it always returned true. Fixes: d85f0531b928 ("accel/habanalabs: break is_idle function into per-engine sub-routines") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: Fix some kernel-doc commentsYang Li1-2/+2
Make the description of @regs_range_array and @regs_range_array_size to @user_regs_range_array and @user_regs_range_array_size to silence the warnings: drivers/accel/habanalabs/common/security.c:506: warning: Function parameter or member 'user_regs_range_array' not described in 'hl_init_pb_ranges_single_dcore' drivers/accel/habanalabs/common/security.c:506: warning: Function parameter or member 'user_regs_range_array_size' not described in 'hl_init_pb_ranges_single_dcore' drivers/accel/habanalabs/common/security.c:506: warning: Excess function parameter 'regs_range_array' description in 'hl_init_pb_ranges_single_dcore' drivers/accel/habanalabs/common/security.c:506: warning: Excess function parameter 'regs_range_array_size' description in 'hl_init_pb_ranges_single_dcore' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=4940 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: always fetch pci addr_dec error infoOfir Bitton1-8/+6
Due to missing indication of address decode source (LBW/HBW bus), we should always try and fetch extended information. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: fix a static warning - 'dubious: x & !y'Koby Elbaz1-3/+3
Use a straight forward approach to get a conditional result. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: poll for device status update following WFE cmdKoby Elbaz1-5/+23
Currently, we rely on COMMS protocol's ack to verify that WFE command has been acknowledged by the FW. However, this does not guarantee that the device status has been updated. Although unlikely, this could trigger a race since the driver expects the device to be halted at that stage, but it might not be. Therefore, we increase WFE's robustness by polling on the status register that will be updated once the device is actually halted. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: expose debugfs files laterTomer Tayar3-44/+61
Currently the debugfs root folder and files for a device are created at an early step, before the device initialization and before the char device and sysfs files are exposed to user. As there is no real reason not to do it together with the device creation, postpone it to be done right afterwards. The initialization of the debugfs entry structure is left in its current position because it is used before creating the files. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: add pci health check during heartbeatOfir Bitton3-3/+16
Currently upon a heartbeat failure, we don't know if the failure is due to firmware hang or due to a bad PCI link. Hence, we are reading a PCI config space register with a known value (vendor ID) so we will know which of the two possibilities caused the heartbeat failure. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: add missing tpc interrupt infoDafna Hirschfeld1-1/+2
For some reason the last possible tpc interrupt cause in gaudi2_tpc_interrupts_cause is missing from the code. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: refactor abort of completions and waitsKoby Elbaz3-8/+14
Aborting CS completions should be in command_submission.c but aborting waiting for user interrupts should be in device.c. This separation is also for adding more abort operations in the future. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08accel/habanalabs: minimize encapsulation signal mutex lock timeKoby Elbaz1-2/+8
Sync Stream Encapsulated Signal Handlers can be managed from different contexts, and as such they are protected via a spin_lock. However, spin_lock was unnecessarily protecting a larger code section than really needed, covering a sleepable code section as well. Since spin_lock disables preemption, it could lead to sleeping in atomic context. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: call to HW/FW err returns 0 when no events existMoti Haimovski1-10/+10
This commit modifies the call to retrieve HW or FW error events to return success when no events are pending, as done in the calls to other events. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: unsecure TPC bias registersOfir Bitton1-0/+1
User needs to be able to perform downcast / upcast of fp8_143 dtype. Hence bias register needs to be accessed by the user. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: do soft-reset using cpucp packetDafna Hirschfeld4-10/+35
This is done depending on the FW version. The cpucp method is preferable and saves scratchpads resource. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: check fw version using sw versionDafna Hirschfeld2-3/+9
The fw inner version is less trustable, instead use the fw general sw release version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: extract and save the FW's SW major/minor/sub-minorDafna Hirschfeld2-6/+78
It is not always possible to know the FW's SW version from the inner FW version. Therefore we should extract the general SW version in addition to the FW version and use it in functions like 'hl_is_fw_ver_below_1_9' etc. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: rename fw_{major/minor}_version to fw_inner_{major/minor}_verDafna Hirschfeld2-9/+9
We later want to add fields for Firmware SW version. The current extracted FW version is the inner FW versioning so the new name is better and also better differentiate from the FW's SW version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: add helper to extract the FW major/minorDafna Hirschfeld1-24/+45
the helper is extract_u32_until_given_char and can later be used to also get the major/minor of the sw version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: fix bug in free scratchpad memoryMoti Haimovski1-2/+2
This commit fixes a bug in Gaudi2 when freeing the scratchpad memory in case software init fails. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05accel/habanalabs: remove commented code that won't be usedKoby Elbaz1-9/+0
Once it was decided that these security settings are to be done by FW rather than by the driver, there's no reason to keep them in the code. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>