Debugging ROCm test failures
Description of the project: The Debian ROCm Team operates ci.rocm.debian.net, a CI environment in which the ROCm software stack's unit tests are run on various AMD GPU architectures.
Some of the tests are failing. Failures can result from something as simple as test results deviating beyond some tolerated error, or as complex as amdgpu driver issues. These are often architecture- or environment-specific and may require remote access to specialized hardware.
The task is to analyze and report on the failures and, where possible, to fix them and/or submit patches.
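To illustrate the simplest class of failure mentioned above, here is a minimal sketch of a tolerance check. It is not taken from any ROCm test suite: the function names, the relative-error metric, and the sample values are all assumptions chosen for illustration, and real tests may use different metrics and thresholds.

```python
def max_relative_error(actual, expected):
    """Return the largest elementwise relative error between two sequences."""
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    worst = 0.0
    for a, e in zip(actual, expected):
        # Guard against division by zero for expected values at or near zero.
        denom = max(abs(e), 1e-30)
        worst = max(worst, abs(a - e) / denom)
    return worst


def within_tolerance(actual, expected, rel_tol):
    """Hypothetical pass/fail criterion: worst relative error must not exceed rel_tol."""
    return max_relative_error(actual, expected) <= rel_tol


if __name__ == "__main__":
    # Made-up values: one element differs from the reference by about 1e-4 (relative).
    expected = [1.0, 2.0, 4.0]
    actual = [1.0, 2.0002, 4.0]
    print(within_tolerance(actual, expected, rel_tol=1e-3))  # True
    print(within_tolerance(actual, expected, rel_tol=1e-5))  # False
```

The same result passes under a loose tolerance and fails under a strict one, so part of analyzing such a failure is deciding whether the computation or the threshold is at fault.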
Confirmed Mentor: Cordell Bloor
How to contact the mentor: cgmb@slerp.xyz
Confirmed co-mentors: Christian Kastner (ckk@debian.org), Mo Zhou
Difficulty level: High
Project size: Large (350h)
Deliverables of the project:
- Report on all identified test failures
- Report on individual analyses of test failures
- Bonus: Submitted fixes
Desirable skills:
- Proficiency with Linux-based operating systems (ideally Debian or Ubuntu)
- C++
- Debugging with gdb
- General understanding of how operating systems work and interact with hardware
What the intern will learn:
- Working with ROCm, the main challenger to CUDA
- Working with GPU hardware across multiple AMD architectures
Application tasks:
- Collect all failed tests in our CI
- Analyze each failed test, documenting findings
- Suggest fixes if possible
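
One way to start on the first two tasks is to group collected failures so that architecture-specific ones stand out. The following is only a sketch: the record layout (package, architecture, test name), the function names, and the sample records are hypothetical, and it makes no assumption about how results are actually exported from ci.rocm.debian.net.

```python
from collections import defaultdict


def group_failures(records):
    """Map each (package, test) pair to the sorted architectures on which it failed.

    `records` is an iterable of (package, architecture, test_name) tuples;
    this layout is an assumption made for the sketch.
    """
    grouped = defaultdict(set)
    for package, arch, test in records:
        grouped[(package, test)].add(arch)
    return {key: sorted(archs) for key, archs in grouped.items()}


def architecture_specific(grouped, all_archs):
    """Return only the failures that did not occur on every tested architecture."""
    everything = set(all_archs)
    return {key: archs for key, archs in grouped.items() if set(archs) != everything}


if __name__ == "__main__":
    # Made-up records for illustration only, not real CI results.
    records = [
        ("libfoo", "gfx900", "test_sum"),
        ("libfoo", "gfx1030", "test_sum"),
        ("libbar", "gfx1030", "test_sort"),
    ]
    grouped = group_failures(records)
    for key, archs in architecture_specific(grouped, ["gfx900", "gfx1030"]).items():
        print(key, archs)
```

A failure that appears on every architecture is more likely a problem in the package or the test itself, while one confined to a single architecture points toward the hardware, driver, or environment.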
