Job Resource Monitoring on HPCs

A common challenge for researchers moving from a local workstation to an HPC system is the shift from interactive programming to job submission, which can feel like a black box. One thing that is hard to see once a job is submitted is which resources your machine learning code is actually using. This matters because it tells you whether your code is taking full advantage of resources such as GPUs, or spending a large share of its time reading in data.

To combat this lack of visibility, we have put together some scripts for monitoring the resources a Python job uses, such as GPUs and CPUs, and a Jupyter Notebook to analyse the results. These scripts have been tested on MASSIVE M3 and should make it relatively easy to gain visibility over the resources used by Python jobs.

*Output of the provided Jupyter Notebook, looking at GPU utilisation of a job. You can clearly see the GPU spike with each epoch of training.*

The scripts and notebook are available here: https://github.com/ML4AU/job-monitoring
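
To give a feel for what this kind of monitoring involves, here is a minimal sketch of a GPU sampler. It is not the code from the repository above. It assumes that `nvidia-smi` is available on the compute node, and the function names, CSV columns and log file name are all made up for illustration. It polls `nvidia-smi` at a fixed interval and writes GPU utilisation and memory use to a CSV file that you could later plot in a notebook.

```python
import csv
import subprocess
import time
from datetime import datetime

# Fields requested from nvidia-smi; both are standard --query-gpu fields.
QUERY = "utilization.gpu,memory.used"


def parse_gpu_sample(line):
    """Parse one 'utilisation, memory' line of nvidia-smi CSV output.

    Assumes numeric values; a GPU that reports '[N/A]' would need extra handling.
    """
    util, mem = (field.strip() for field in line.split(","))
    return {"gpu_util_percent": float(util), "gpu_mem_mib": float(mem)}


def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU visible to the job."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_sample(l) for l in result.stdout.splitlines() if l.strip()]


def monitor(log_path, interval_s=1.0, n_samples=10):
    """Hypothetical helper: append timestamped GPU samples to a CSV file."""
    with open(log_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "gpu", "gpu_util_percent", "gpu_mem_mib"])
        for _ in range(n_samples):
            now = datetime.now().isoformat()
            # 'gpu' is the position of the GPU in the nvidia-smi output.
            for position, sample in enumerate(sample_gpus()):
                writer.writerow(
                    [now, position, sample["gpu_util_percent"], sample["gpu_mem_mib"]]
                )
            f.flush()  # keep the log readable while the job is still running
            time.sleep(interval_s)
```

In a job script you might start something like `monitor("gpu_usage.csv", interval_s=1.0, n_samples=3600)` in the background before launching training, then load the CSV into a notebook afterwards to look for patterns such as utilisation spikes at each epoch.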
