High Performance Computing — HPC

Objectives

  • Let’s recap and go a little further into the UPPMAX hardware!

HPC, HTC and MTC

  • The buzzword is HPC, or High Performance Computing, but this term is rather narrow, focusing on fast calculation, i.e. processors and parallelism

  • Many of your projects focus more on high throughput, large memory demands and many tasks.

  • Here is a list of the three most common Computing paradigms:

  • HPC: High Performance Computing — Focus on floating point operations per second (FLOPS, flops or flop/s)

    • characterized as needing large amounts of computing power for short periods of time

  • HTC: High-Throughput Computing — Focus on operations or jobs per month or per year

    • more interested in how many jobs can be completed over a long period of time than in how fast each job runs.

    • independent, sequential jobs that can be individually scheduled on many different computing resources.

  • MTC: Many-Task Computing — emphasis on using many computing resources over short periods of time to accomplish many computational tasks

    • bridge the gap between HTC and HPC.

    • reminiscent of HTC, but including both dependent and independent tasks, where the primary metrics are measured in seconds (e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month.

    • high-performance computations comprising multiple distinct activities, coupled via file system operations.

What is a cluster?

  • A network of computers, each computer working as a node.

  • From small scale RaspberryPi cluster…

  • To supercomputers like Rackham.

  • Each node contains several processor cores, RAM and a local disk called scratch.

  • The user logs in to the login nodes over the Internet via SSH or ThinLinc.

    • Here, file management and lighter data analysis can be performed.

  • The calculation nodes have to be used for intense computing.

    • “Normal” software uses one core.

    • Parallelized software can utilize several cores or even several nodes. Keywords signalling this are e.g.:

      • “multi-threaded”, “MPI”, “distributed memory”, “OpenMP”, “shared memory”.

    • To let your software run on the calculation nodes

      • start an “interactive session” or

      • “submit a batch job”.

      • More about this in today’s introduction to jobs; a minimal sketch follows below.
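
A minimal sketch of both approaches on Rackham could look like the following (the project ID snic2022-22-123, the script name my_job.sh and the program my_program are hypothetical placeholders):

    # Start an interactive session on one core for one hour
    interactive -A snic2022-22-123 -n 1 -t 1:00:00

    # ...or submit a batch job described in a script
    sbatch my_job.sh

where my_job.sh could contain, for a multi-threaded program:

    #!/bin/bash -l
    #SBATCH -A snic2022-22-123           # project ID (placeholder)
    #SBATCH -p core                      # "core" partition: the job uses part of a node
    #SBATCH -n 4                         # number of cores, e.g. for multi-threaded software
    #SBATCH -t 01:00:00                  # requested wall time

    module load bioinfo-tools            # load the software modules you need
    ./my_program --threads 4 input.txt   # hypothetical multi-threaded program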

Storage basics

  • All nodes can access:

    • Your home directory on Domus or Castor

    • Your project directories on Crex or Castor

    • Its own local scratch disk (2-3 TB)

  • If you’re reading/writing a file once, use a directory on Crex or Castor

  • If you’re reading/writing a file many times…

    • Copy the file to “scratch”, the node-local disk:

    cp myFile $SNIC_TMP
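
In a batch job, a common pattern is to copy the input to scratch, run the analysis there, and copy the results back before the job ends, since scratch is cleaned when the job finishes. A minimal sketch (myFile, myProgram and the project path /proj/myproject are hypothetical placeholders):

    cp myFile $SNIC_TMP/              # copy the input to node-local scratch
    cd $SNIC_TMP
    myProgram myFile > results.txt    # hypothetical program doing many reads/writes
    cp results.txt /proj/myproject/   # copy the results back before the job ends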
    

The UPPMAX hardware

Clusters

  • We have a number of compute clusters:

    • Rackham, reserved for SNIC projects

    • Snowy, with GPUs and allowing long jobs, reserved for UPPMAX projects and education

    • Bianca, a part of SNIC-SENS

    • Miarka, reserved for Scilifelab production

    • UPPMAX cloud, a part of SNIC Science Cloud

  • User guides

  • Our storage systems provide a total volume of about 25 PB, the equivalent of 50,000 years of 128-bit encoded music. Read more on the storage systems page.

UPPMAX storage system names (projects & home directories)

  • Rackham storage: Crex & Domus

  • Bianca storage: Castor & Cygnus

  • NGI production system (Miarka): Vulpes

  • NGI delivery server: Grus

  • Off-load storage: Lutra

System usage

A little bit more about Snowy

  • User guide.

    • There is a local compute round for UU users applying for Snowy in SUPR.

    • GU (courses) applications (including GU GPU usage) are not done in SUPR, but are supposed to be routed through the service desk.

    • The details can be found at the Getting started page.

About Bianca?

  • Wait for it!

Summary about the three “common” UPPMAX clusters

|                       | Rackham         | Snowy                     | Bianca                              |
|-----------------------|-----------------|---------------------------|-------------------------------------|
| Purpose               | General-purpose | General-purpose           | Sensitive                           |
| # Nodes (Intel)       | 486 + 144       | 228 + 50 Nvidia T4 GPUs   | 288 + 10 nodes á 2 NVIDIA A100 GPUs |
| Cores per node        | 20/16           | 16                        | 16/64                               |
| Memory per node       | 128 GB          | 128 GB                    | 128 GB                              |
| Fat nodes             | 256 GB & 1 TB   | 256, 512 GB & 4 TB        | 256 & 512 GB                        |
| Local disk (scratch)  | 2/3 TB          | 4 TB                      | 4 TB                                |
| Login nodes           | Yes             | No (reached from Rackham) | Yes (2 cores and 15 GB)             |
| “Home” storage        | Domus           | Domus                     | Castor                              |
| “Project” storage     | Crex, Lutra     | Crex, Lutra               | Castor                              |

Overview of the UPPMAX systems

graph TB

  Node1 -- interactive --> SubGraph2Flow
  Node1 -- sbatch --> SubGraph2Flow
  subgraph "Snowy"
    SubGraph2Flow(calculation nodes)
  end

  thinlinc -- usr-sensXXX + 2FA ----> SubGraph1Flow
  Node1 -- usr-sensXXX + 2FA ----> SubGraph1Flow
  subgraph "Bianca"
    SubGraph1Flow(Bianca login) -- usr+passwd --> private(private cluster)
    private -- interactive --> calcB(calculation nodes)
    private -- sbatch --> calcB
  end

  subgraph "Rackham"
    Node1[Login] -- interactive --> Node2[calculation nodes]
    Node1 -- sbatch --> Node2
  end
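
As a concrete sketch of the login paths in the diagram, the commands could look like this (the user name myuser and the sens project ID are hypothetical placeholders):

    # Rackham: log in directly over the Internet
    ssh myuser@rackham.uppmax.uu.se

    # Bianca: log in as user-project with password + two-factor code,
    # which leads on to the project's own private cluster
    ssh myuser-sens2022123@bianca.uppmax.uu.se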

Keypoints

  • UPPMAX has several clusters

    • each with its own focus, limitations and possibilities

    • access is determined by the type of project