Job monitoring and efficiency¶
This section looks at how to monitor your job(s), including to see if they are efficient.
Many of the relevant commands have already been discussed in previous parts:
squeue: for viewing the state of the batch queue. More here: https://uppmax.github.io/NAISS_Slurm/slurm/#squeuescancel: to cancel a job. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#scancelsinfo: information about the partitions/queues. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#sinfoscontrol show job: lots of information about a job. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#scontrol__show__job
But there are several others that have either not been mentioned or only done so briefly, including sacct, projinfo, `sshare`` and a number of center specific commands. We will look more into all of them here.
Why is a job ineffective?¶
There are several reasons that a job might be ineffective. Some of those could be:
- using more threads than the allocated number of cores
- not using all the cores you have allocated (unless on purpose/for memory)
- inefficient use of the file system (many small files, open/close many files)
- running a job that could run on GPUs on CPUs instead
Job monitoring is (also) about detecting signs the job is not running efficiently. This can be done with many different commands.
Job monitoring¶
Now let us look at some of the commands that are genereally available, as well as those that are specific to one or more centres.
Commands valid at all centres¶
| Command | What |
|---|---|
scontrol show job JOBID |
info about a job, including estimated start time |
squeue --me --start |
your running and queued jobs with estimated start time |
sacct -l -j JOBID |
info about j ob, pipe to less -S for scrolling side-ways (it is a wide output) |
projinfo |
usage of your project, adding -vd lists member usage |
sshare -l -A <proj-account> |
gives priority/fairshare (LevelIFS) |
Most up-to-date project usage on a project’s SUPR page, linked from here: https://supr.naiss.se/project/
Site-specific commands¶
| Command | What | Centre |
|---|---|---|
jobinfo |
wrapper around squeue |
Pelle, Cosmos, Alvis |
jobstats -p JOBID |
CPU and memory use of finished job (> 5 min) in a plot | Pelle |
job_stats.py |
link to Grafana dashboard with overview of your running jobs. Add JOBID for real-time usage of a job |
Alvis |
job-usage JOBID |
grafana graphics of resource use for job (> few minutes) | Kebnekaise |
jobload JOBID |
show cpu and memory usage in a job | Tetralith |
jobsh NODE |
login to node, run “top” | Tetralith |
seff JOBID |
displays memory and CPU usage from job run | Tetralith, Dardel |
lastjobs |
lists 10 most recent job in recent 30 days | Tetralith |
| https://pdc-web.eecs.kth.se/cluster_usage/ | Information about project usage | Dardel |
| https://grafana.c3se.chalmers.se/d/user-jobs/user-jobs | Grafana dashboard for user jobs | Alvis |
| https://www.nsc.liu.se/support/batch-jobs/tetralith/monitoring/ | Job monitoring | Tetralith |
| https://docs.uppmax.uu.se/software/jobstats/ | Job efficiency | Pelle |