Interactive work on the compute nodes

Note

  • It is possible to run Python directly on the login (including ThinLinc) nodes.

  • This should only be done for short jobs or for jobs that do not use many resources, as the login nodes can otherwise become slow for all users.

  • If you want to work interactively with your code or data (for example, plotting graphs or developing) and it needs a lot of CPU or RAM, you should start an interactive session.

  • If you instead want to run a script that does not need any interactive user input while running, you can start a batch job; see the previous session.

Questions

  • How do I reach the compute/calculation nodes?

  • How do I proceed to work interactively?

Objectives

  • Show how to reach the compute/calculation nodes on UPPMAX and HPC2N

  • Test some commands on the compute/calculation nodes

General

  • Running interactively on a compute node involves either:

    • developing Python code, running it, and fixing bugs as they come up.

    • using a GUI, like Jupyter, and working interactively with data, and possibly plotting graphs.

      • Jupyter-notebook/lab are available on both UPPMAX and HPC2N.

  • You allocate a compute node in the SLURM system, using the same options as for batch jobs.

  • The way it works differs, however, between UPPMAX and HPC2N.

    • At UPPMAX, you actually are “physically” on the compute node.

      • You can also ask for the “devcore” partition with -p devcore if you will run for less than 1 hour. Then waiting times are short.

    • At HPC2N, you are not “physically” on the compute node, but you can see the output of the commands, which run in “batch mode”.

  • Running Jupyter and other graphical applications benefits from ThinLinc or other solutions closer to your own computer.
  • We will also deal with Jupyter in the next session about parallel computing.

Warning

(HPC2N) Do note that this is not real interactivity as you probably mean it, as you will have to run your code as a Python script instead of starting Python and giving commands inside it. The reason is that you are not actually logged into the compute node and only see the output of the commands you run.

Python “interactively” on the compute nodes

To run interactively, you need to allocate resources on the cluster first. You can use the command salloc to allow interactive use of resources allocated to your job. When the resources are allocated, you need to preface commands with srun in order to run on the allocated nodes instead of the login node.

  • First, you make a request for resources with interactive/salloc, like this:

$ interactive -n <tasks> --time=HHH:MM:SS -A naiss2023-22-1126
  • where <tasks> is the number of tasks (or cores, with the default of 1 task per core), time is given as hours:minutes:seconds (maximum 168 hours), and then you give the ID of your project (on UPPMAX this is naiss2023-22-1126 for this course; on HPC2N it is hpc2nXXXX-YYY).

  • Your request enters the job queue just like any other job, and interactive/salloc will tell you that it is waiting for the requested resources. When interactive/salloc tells you that your job has been allocated resources, you can interactively run programs on those resources with srun. The commands you run with srun are then executed on the resources your job has been allocated. NOTE: if you do not preface a command with srun, it runs on the login node!

  • You can now run Python scripts on the allocated resources directly instead of waiting for your batch job to return a result. This is an advantage if you want to test your Python script or perhaps figure out which parameters are best.
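As an illustration of that kind of quick testing, the hypothetical script below (not part of the course material) times a toy computation for a couple of problem sizes, so you can compare parameters immediately in the interactive session:

```python
import time

def compute(n):
    """Toy workload: the sum of squares below n."""
    return sum(i * i for i in range(n))

# Try a couple of problem sizes and time each one, to pick
# parameters that are fast enough for interactive testing.
for n in (10_000, 100_000):
    start = time.perf_counter()
    result = compute(n)
    elapsed = time.perf_counter() - start
    print("n={0}: result={1}, took {2:.4f} s".format(n, result, elapsed))
```

You would run it on the allocated resources with something like srun python timing-test.py (the file name is just an example).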

Example

Tip

Type along!

Requesting 4 cores for 10 minutes, then running Python

[bjornc@rackham2 ~]$ interactive -A naiss2023-22-1126 -p devcore -n 4 -t 10:00
You receive the high interactive priority.
There are free cores, so your job is expected to start at once.

Please, use no more than 6.4 GB of RAM.

Waiting for job 29556505 to start...
Starting job now -- you waited for 1 second.

[bjornc@r484 ~]$ module load python/3.9.5

Let us check that we actually run on the compute node:

[bjornc@r483 ~]$ srun hostname
r483.uppmax.uu.se
r483.uppmax.uu.se
r483.uppmax.uu.se
r483.uppmax.uu.se

We are. Notice that we got a response from all four cores we have allocated.
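The same check can also be done from Python. The sketch below (not part of the course material) assumes that SLURM sets the SLURM_PROCID environment variable for each task, which it normally does inside a job; outside a job the script falls back to task 0:

```python
import os
import socket

# SLURM_PROCID is the task's rank within the job;
# default to 0 when the variable is not set.
task_id = int(os.environ.get("SLURM_PROCID", "0"))

print("Task {0} runs on {1}".format(task_id, socket.gethostname()))
```

Saved as, say, whoami.py and started with srun python whoami.py, each of the four tasks prints its own line.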

I am going to use the following two Python programs for the examples:

Adding two numbers from user input (add2.py)

# This program will add two numbers that are provided by the user

# Get the numbers
a = int(input("Enter the first number: "))
b = int(input("Enter the second number: "))

# Add the two numbers together ("total" avoids shadowing the built-in sum)
total = a + b

# Output the result
print("The sum of {0} and {1} is {2}".format(a, b, total))

Adding two numbers given as arguments (sum-2args.py)

import sys

x = int(sys.argv[1])
y = int(sys.argv[2])

total = x + y  # "total" avoids shadowing the built-in sum

print("The sum of the two numbers is: {0}".format(total))

Now for running the examples:

  • Note that the commands are the same for both HPC2N and UPPMAX!

    1. Running a Python script in the allocation we made further up. Notice that since we asked for 4 cores, the script is run 4 times, as it is a serial script.

    $ srun python sum-2args.py 3 4
    The sum of the two numbers is: 7
    The sum of the two numbers is: 7
    The sum of the two numbers is: 7
    The sum of the two numbers is: 7
    b-an01 [~]$
    
    2. Running a Python script in the above allocation, but this time a script that expects input from you.

    $ srun python add2.py
    2
    3
    Enter the first number: Enter the second number: The sum of 2 and 3 is 5
    Enter the first number: Enter the second number: The sum of 2 and 3 is 5
    Enter the first number: Enter the second number: The sum of 2 and 3 is 5
    Enter the first number: Enter the second number: The sum of 2 and 3 is 5
    

    As you can see, it is possible, but the prompts and your input are not interleaved as they otherwise would be. This is how it would look on the login node:

    $ python add2.py
    Enter the first number: 2
    Enter the second number: 3
    The sum of 2 and 3 is 5
    

Exit

When you have finished using the allocation, either wait for it to end, or close it with exit:

[bjornc@r484 ~]$ exit

exit
[screen is terminating]
Connection to r484 closed.

[bjornc@rackham2 ~]$

Keypoints

  • Start an interactive session on a calculation node via a SLURM allocation

    • At HPC2N: salloc

    • At UPPMAX: interactive

  • Follow the same procedure as usual by loading the Python module and any prerequisites.

  • CPU-hours are more effectively used in “batch jobs”. Therefore:

    • Use “interactive” for testing and developing

    • Don’t book more cores/nodes than you need, and use the session efficiently while it is running.