Job Resource Monitoring on HPC Systems
One challenge researchers face when moving from a local workstation to HPC is the shift from interactive programming to the black box of job submission. A particular blind spot is which resources your machine learning code is actually using once the job is running. This matters because it shows whether your code is taking full advantage of resources like GPUs, or spending a large amount of time reading in data.
To combat this lack of visibility, we have put together some scripts for monitoring the resources a Python job uses, such as GPUs and CPUs, and a Jupyter Notebook to analyse the results. These scripts have been tested on MASSIVE M3 and should make it relatively easy to gain visibility over the resources used by Python jobs.
Find them here: https://github.com/ML4AU/job-monitoring
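To give a feel for what this kind of monitoring involves, here is a minimal sketch of the general idea. It is not the code from the repository above. The names ResourceSampler and sample_gpu are hypothetical, chosen for illustration, and the GPU part assumes an NVIDIA GPU with the nvidia-smi command on the PATH (it returns None otherwise). A background thread records the process's CPU time at a fixed interval while your own code runs.

```python
import os
import shutil
import subprocess
import threading
import time


class ResourceSampler:
    """Hypothetical helper: record this process's CPU time in a background thread."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.samples = []  # list of (elapsed_seconds, cpu_seconds) tuples
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        start = time.monotonic()
        while True:
            t = os.times()
            # user + system CPU time consumed by this process so far
            self.samples.append((time.monotonic() - start, t.user + t.system))
            # wait() returns True as soon as stop is requested
            if self._stop.wait(self.interval):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False


def sample_gpu():
    """Return [(utilisation_percent, memory_used_mib), ...], one tuple per GPU.

    Assumes nvidia-smi is installed; returns None if it is missing or fails.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=5
        ).stdout
        rows = []
        for line in out.strip().splitlines():
            util, mem = (field.strip() for field in line.split(","))
            rows.append((float(util), float(mem)))
        return rows
    except (subprocess.SubprocessError, OSError, ValueError):
        return None
```

Wrapping the work in "with ResourceSampler(interval=1.0) as sampler:" leaves the recorded samples in sampler.samples once the block exits. If CPU time grows much more slowly than elapsed time, the job is probably waiting on something else, such as reading data from disk.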