GPU and Multi-GPU Jobs on HPC
Author: Kiowa Scott-Hurley
Date: 19/10/2021
Migrating the average research computing workflow onto an HPC can be challenging - learning how to access GPU compute and run multi-GPU jobs adds further hurdles, especially for deep learning. These include learning how to request resources via an HPC scheduler, monitoring the efficiency of GPU jobs, using the CUDA and cuDNN installations already available on an HPC, testing that multi-GPU jobs work as expected in non-interactive environments, and understanding the best practices of whichever facility you happen to be using.
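As a starting point, it helps to confirm from inside a scheduled job that the GPUs you requested and the pre-installed CUDA and cuDNN are actually visible to your code. Below is a minimal sanity-check sketch assuming PyTorch - the equivalent checks exist in other frameworks, and the script itself is an illustration rather than part of our published examples.

```python
# Minimal sanity check to run inside a scheduled GPU job (assumes PyTorch).
import os
import torch

# GPUs granted by the scheduler are usually exposed via CUDA_VISIBLE_DEVICES.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

print("CUDA available:", torch.cuda.is_available())
print("CUDA version:  ", torch.version.cuda)
print("cuDNN version: ", torch.backends.cudnn.version())
print("GPUs visible:  ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  device {i}: {torch.cuda.get_device_name(i)}")
```

If this prints fewer devices than you asked the scheduler for, the problem is in the resource request or environment setup rather than in your training code.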
We’ve published some introductory documentation to help users transition their deep learning code onto the HPC, including comments on how to benchmark code for efficiency, example SLURM job submission scripts, and instructions on moving from interactive desktop environments into the non-interactive queue. These scripts are currently designed for MASSIVE users, with the aim of creating similar examples and instructions for other HPC facilities. I’d like to acknowledge that these instructions are adapted from the excellent documentation already provided by the Biowulf HPC facility. We also provide additional GPU-specific documentation for MASSIVE users on docs.massive.org.au here: https://docs.massive.org.au/M3/GPUs-on-M3.html
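On the benchmarking side, one simple approach is to time a single training step and record peak GPU memory, which gives a quick sense of whether a job is making good use of the hardware. The sketch below assumes PyTorch; `model`, `batch`, `labels`, `optimizer` and `loss_fn` are placeholders for whatever your own training script defines, not names from our published examples.

```python
# Rough timing of one training step - a starting point for benchmarking (assumes PyTorch).
import time
import torch

def time_training_step(model, batch, labels, optimizer, loss_fn):
    torch.cuda.synchronize()            # ensure previously queued GPU work has finished
    start = time.perf_counter()

    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()            # wait for this step to complete on the GPU
    elapsed = time.perf_counter() - start
    peak_mem = torch.cuda.max_memory_allocated() / 1024**2
    print(f"step time: {elapsed:.3f} s, peak GPU memory: {peak_mem:.0f} MiB")
```

Timing a handful of steps this way, in the non-interactive queue, is usually enough to spot a job that is badly under-utilising its GPUs before committing to a long run.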
Similarly, the University of Queensland has been developing more advanced training materials for running the multi-GPU and multi-node tool Horovod on their Wiener HPC facility - you can find the materials here: https://github.com/ML4AU/Horovod-on-HPC. We’re currently working to get Horovod running on MASSIVE too.
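For readers unfamiliar with Horovod, the core idea is to run one process per GPU and let Horovod average gradients across them. The following is a minimal sketch assuming Horovod’s PyTorch bindings; `build_model()` and `train_loader` are placeholders for your own code, and scaling the learning rate by the number of workers is a common convention rather than a requirement. For launch commands and SLURM setup, see the Wiener materials linked above.

```python
# Minimal Horovod + PyTorch skeleton: one process per GPU (illustrative sketch).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())   # pin each process to its own GPU

model = build_model().cuda()              # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all processes,
# and make sure every worker starts from the same initial state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for batch, labels in train_loader:        # train_loader is a placeholder
    batch, labels = batch.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch), labels)
    loss.backward()
    optimizer.step()
```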
Please feel free to reach out with feedback on either the GitHub repository of examples or our documentation. We’re also eager to collaborate with other facilities facing similar challenges - contact us and let’s have a chat!