Job monitoring and efficiency

This section looks at how to monitor your jobs, including how to check whether they run efficiently.

Many of the relevant commands have already been discussed in previous parts:

squeue: for viewing the state of the batch queue. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#squeue
scancel: to cancel a job. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#scancel
sinfo: information about the partitions/queues. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#sinfo
scontrol show job: lots of information about a job. More info here: https://uppmax.github.io/NAISS_Slurm/slurm/#scontrol__show__job

But there are several others that have either not been mentioned or only mentioned briefly, including `sacct`, `projinfo`, `sshare`, and a number of centre-specific commands. We will look more closely at all of them here.

Why is a job inefficient?

There are several reasons why a job might be inefficient. Some common ones are:

using more threads than the allocated number of cores
not using all the cores you have allocated (unless done on purpose, for example to get more memory)
inefficient use of the file system (many small files, frequently opening and closing files)
running a job on CPUs when it could run on GPUs
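
The first two points usually come down to a mismatch between the number of threads the program starts and the number of cores Slurm has allocated. Below is a minimal sketch of one way to avoid this in a batch script. It assumes an OpenMP-style program that reads `OMP_NUM_THREADS` and a job submitted with `--cpus-per-task`; the helper name `threads_for_job` is made up for this illustration.

```shell
#!/bin/bash
# Sketch: derive the thread count from the allocation instead of hard-coding it.
# Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested,
# so fall back to a single thread otherwise.
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Assumption: the program honours OMP_NUM_THREADS (typical for OpenMP codes).
export OMP_NUM_THREADS="$(threads_for_job)"
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```

With `#SBATCH --cpus-per-task=8` in the job script this prints `Running with 8 thread(s)`. Without that option the variable is not set and the sketch falls back to one thread.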

Job monitoring is also about detecting signs that a job is not running efficiently. Several commands can help with this.

Job monitoring

Now let us look at some of the commands that are generally available, as well as those that are specific to one or more centres.

Commands valid at all centres

| Command | What |
| --- | --- |
| `scontrol show job JOBID` | info about a job, including estimated start time |
| `squeue --me --start` | your running and queued jobs with estimated start time |
| `sacct -l -j JOBID` | info about a job; pipe to `less -S` to scroll sideways (the output is wide) |
| `projinfo` | usage of your project; adding `-vd` lists member usage |
| `sshare -l -A <proj-account>` | gives priority/fairshare (`LevelFS`) |

The most up-to-date project usage is shown on the project's SUPR page, linked from here: https://supr.naiss.se/project/
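
Because the `sacct` output is wide, it is often easier to select only the fields of interest with `--format`. The following is a minimal sketch, not a replacement for the site tools. It assumes you have read the fields `Elapsed`, `TotalCPU` and `AllocCPUS` from `sacct` for a finished job, and it estimates CPU efficiency as used CPU time divided by wall-clock time multiplied by the number of allocated cores. The helper names `to_seconds` and `cpu_efficiency` are made up for this illustration.

```shell
#!/bin/bash
# Fields can be selected with, for example:
#   sacct -j JOBID --format=JobID,Elapsed,TotalCPU,AllocCPUS,MaxRSS,ReqMem,State

# Convert a Slurm time string ([DD-][HH:]MM:SS[.mmm]) to whole seconds.
to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $0
        # Split off an optional leading day count ("DD-").
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      s = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) s = p[1] * 60 + p[2]
        else             s = p[1]
        # %d drops any fractional seconds.
        printf "%d\n", days * 86400 + s
    }'
}

# Percentage of the allocated core time that was actually used.
# Arguments: TotalCPU, Elapsed, AllocCPUS (as printed by sacct).
cpu_efficiency() {
    eff_used=$(to_seconds "$1")
    eff_wall=$(to_seconds "$2")
    awk -v u="$eff_used" -v w="$eff_wall" -v n="$3" \
        'BEGIN { if (w * n > 0) printf "%.1f\n", 100 * u / (w * n); else print "n/a" }'
}

# Example: 2 h of CPU time during a 1 h run on 4 allocated cores.
cpu_efficiency "02:00:00" "01:00:00" 4   # prints 50.0
```

A value far below 100 suggests that some of the allocated cores sat idle for part of the run, which is one of the signs of inefficiency listed earlier.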

Site-specific commands

| Command or link | What | Cluster |
| --- | --- | --- |
| `jobinfo` | wrapper around `squeue` | Bianca, Cosmos, Alvis |
| `jobstats -p JOBID` | plot of CPU and memory use of a finished job (> 5 min) | Bianca |
| `job_stats.py` | link to Grafana dashboard with overview of your running jobs. Add `JOBID` for real-time usage of a job | Alvis |
| `job-usage JOBID` | Grafana graphics of resource use for a job (longer than a few minutes) | Kebnekaise |
| `jobload JOBID` | shows CPU and memory usage in a job | Tetralith |
| `jobsh NODE` | log in to a node and run `top` | Tetralith |
| `seff JOBID` | displays memory and CPU usage from a job run | Tetralith, Dardel |
| `lastjobs` | lists the 10 most recent jobs from the last 30 days | Tetralith |
| https://pdc-web.eecs.kth.se/cluster_usage/ | Information about project usage | Dardel |
| https://grafana.c3se.chalmers.se/d/user-jobs/user-jobs | Grafana dashboard for user jobs | Alvis |
| https://www.nsc.liu.se/support/batch-jobs/tetralith/monitoring/ | Job monitoring | Tetralith |
| https://docs.uppmax.uu.se/software/jobstats/ | Job efficiency | Bianca |