Job monitoring and efficiency

This section covers how to monitor your jobs and how to check whether they run efficiently.

Many of the relevant commands have already been discussed in previous parts. However, several others have either not been mentioned or only been covered briefly, including sacct, projinfo, sshare, and a number of centre-specific commands. We will look more closely at all of them here.

Why is a job inefficient?

There are several reasons why a job might be inefficient. Some of them are:

  • using more threads than the allocated number of cores
  • not using all the cores you have allocated (unless this is done on purpose, for example to get more memory)
  • inefficient use of the file system (many small files, open/close many files)
  • running a job on CPUs when it could run on GPUs
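
The first point, using more threads than allocated cores, can often be avoided by deriving the thread count from the allocation instead of hard-coding it. The following is a minimal sketch for a batch script, assuming an OpenMP program that reads OMP_NUM_THREADS; the helper name threads_for_job is hypothetical. Slurm sets SLURM_CPUS_PER_TASK when the job requests --cpus-per-task.

```shell
# Derive the thread count from the Slurm allocation.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested;
# fall back to 1 thread if it is unset (for example outside a job).
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Assumption: the program is threaded with OpenMP and honours OMP_NUM_THREADS.
export OMP_NUM_THREADS="$(threads_for_job)"
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```

Other threading libraries use other variables or command-line options, so the export line would need to be adapted to the software you run.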

Job monitoring is (also) about detecting signs that the job is not running efficiently. Many different commands can help with this.
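
One such sign is low CPU efficiency: the CPU time actually used divided by the wall-clock time multiplied by the number of allocated cores. The sketch below computes this from three values that sacct can report (the TotalCPU, Elapsed, and AllocCPUS fields). The function name cpu_efficiency is hypothetical and the example numbers are made up.

```shell
# cpu_efficiency TOTALCPU ELAPSED NCPUS
# Prints the CPU efficiency in percent: TotalCPU / (Elapsed * NCPUS).
# Time strings follow Slurm's formats: [DD-]HH:MM:SS or MM:SS.mmm.
cpu_efficiency() {
    awk -v used="$1" -v wall="$2" -v ncpus="$3" '
        # Convert a Slurm time string to seconds.
        function to_seconds(t,    d, p, n, days) {
            days = 0
            if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
            n = split(t, p, ":")
            if (n == 3) return days * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
            if (n == 2) return days * 86400 + p[1] * 60 + p[2]
            return days * 86400 + p[1]
        }
        BEGIN {
            alloc = to_seconds(wall) * ncpus
            if (alloc > 0) printf "%.1f\n", 100 * to_seconds(used) / alloc
            else print "n/a"
        }'
}

# Made-up example: 1 h of CPU time in a 30 min job on 4 cores.
cpu_efficiency 01:00:00 00:30:00 4   # prints 50.0
```

On a cluster, the three inputs could be taken from something like sacct -j JOBID -X -n -P -o TotalCPU,Elapsed,AllocCPUS, assuming the allocation line reports TotalCPU for the whole job. A value far below 100 suggests that some allocated cores were idle.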

Job monitoring

Now let us look at some of the commands that are generally available, as well as those that are specific to one or more centres.

Commands valid at all centres

| Command | What |
|---|---|
| scontrol show job JOBID | info about a job, including estimated start time |
| squeue --me --start | your running and queued jobs with estimated start time |
| sacct -l -j JOBID | info about a job; pipe to less -S to scroll sideways (the output is wide) |
| projinfo | usage of your project; adding -vd lists member usage |
| sshare -l -A <proj-account> | gives priority/fairshare (LevelFS) |

The most up-to-date project usage is shown on the project's SUPR page, linked from here: https://supr.naiss.se/project/
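
Because the full sacct -l output is wide, it can be easier to request only the columns you need and post-process them. As a sketch, the function below reads lines in the form JobID|MaxRSS (as produced by sacct with -n -P -o JobID,MaxRSS) and prints the largest memory use over all job steps in MiB. The function name max_rss_mib and the sample data are hypothetical, and treating a value without a unit suffix as bytes is an assumption.

```shell
# max_rss_mib: read "JobID|MaxRSS" lines on stdin and print the
# largest MaxRSS over all job steps, converted to MiB.
max_rss_mib() {
    awk -F'|' '
        # Convert a value such as 2048K, 1536M or 1.5G to MiB.
        function to_mib(v,    unit, num) {
            if (v == "") return 0
            unit = substr(v, length(v), 1)
            num = v + 0                      # numeric prefix of the string
            if (unit == "K") return num / 1024
            if (unit == "M") return num
            if (unit == "G") return num * 1024
            if (unit == "T") return num * 1024 * 1024
            return num / (1024 * 1024)       # assumption: no suffix means bytes
        }
        { m = to_mib($2); if (m > max) max = m }
        END { printf "%.1f\n", max + 0 }
    '
}

# Made-up sample in the shape of: sacct -j JOBID -n -P -o JobID,MaxRSS
printf '%s\n' '12345|' '12345.batch|2048K' '12345.0|1536M' | max_rss_mib   # prints 1536.0
```

Comparing this number with the memory you requested gives a rough idea of whether the job asked for much more memory than it used.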

Site-specific commands

| Command | What | Centre |
|---|---|---|
| jobinfo | wrapper around squeue | Pelle, Cosmos, Alvis |
| jobstats -p JOBID | CPU and memory use of a finished job (> 5 min) in a plot | Pelle |
| job_stats.py | link to Grafana dashboard with overview of your running jobs. Add JOBID for real-time usage of a job | Alvis |
| job-usage JOBID | Grafana graphs of resource use for a job (> few minutes) | Kebnekaise |
| jobload JOBID | show CPU and memory usage in a job | Tetralith |
| jobsh NODE | log in to node, run "top" | Tetralith |
| seff JOBID | displays memory and CPU usage from a job run | Tetralith, Dardel |
| lastjobs | lists the 10 most recent jobs from the last 30 days | Tetralith |
| https://pdc-web.eecs.kth.se/cluster_usage/ | Information about project usage | Dardel |
| https://grafana.c3se.chalmers.se/d/user-jobs/user-jobs | Grafana dashboard for user jobs | Alvis |
| https://www.nsc.liu.se/support/batch-jobs/tetralith/monitoring/ | Job monitoring | Tetralith |
| https://docs.uppmax.uu.se/software/jobstats/ | Job efficiency | Pelle |
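
Since these tools differ between centres, a script that is meant to work on several clusters can check which of them is installed. The following is a minimal sketch; the helper name first_available is hypothetical, and the candidate list is taken from the table above.

```shell
# first_available NAME...
# Prints the first of the given command names that exists on PATH;
# returns non-zero if none of them is found.
first_available() {
    for name in "$@"; do
        if command -v "$name" >/dev/null 2>&1; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

# Which of these exists depends on the centre you are logged in to.
if tool=$(first_available seff jobload job-usage jobstats); then
    echo "Job statistics tool on this system: $tool"
else
    echo "No site-specific job statistics tool found; fall back to sacct"
fi
```

Note that the tools take different arguments (for example jobstats -p JOBID versus seff JOBID), so the actual call still has to be adapted per tool.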