Guide 1: Working with the Snellius cluster

Authors: Phillip Lippe, Danilo de Goede

This tutorial explains how to work with the Snellius cluster for the Deep Learning course at the University of Amsterdam. Every student will receive an account to have resources for training deep neural networks and get familiar with working on a cluster. It is recommended to have listened to the presentation by the SURFsara team or the TA team before going through this tutorial. Further, this tutorial assumes that you are familiar with using the terminal in Linux. If not, a crash course can be found here.

The Snellius cluster

The following section gives a quick introduction to the Snellius cluster, and how it is build up. A detailed description of the system can be found in the SURFsara user guide.

What is a cluster computer?

(Disclaimer: the following paragraph is an adapted version of the HPC user guide. Credits: SURFsara team)

You can imagine a cluster computer as a collection of regular computers (known as nodes), tied together with network cables that are similar to the network cables in your home or office (see the figure below - credit: SURFsara team). Each node has its own CPU, memory and disk space, in addition to which they generally have access to a shared file system. On a cluster computer, you can run hundreds of computational tasks simultaneously.

a6d9678dab0a44649f0203a3349fb2a1

Interacting with a cluster computer is different from a normal computer. Normal computers are mostly used interactively, i.e. you type a command or click with your mouse, and your computer instantly responds by e.g. running a program. Cluster computers are mostly used non-interactively.

A cluster computer such as Snellius mainly has two types of nodes: login nodes and batch nodes. You connect to Snellius through the login node (see next section). This is an interactive node: similar to your own PC, it immediately responds to the commands you type. There are only a few login nodes on a cluster computer, and you only use them for light tasks: adjusting your code, preparing your input data, writing job scripts, etc. Since the login nodes are only meant for light tasks, many users can be on the same login node at the same time. To prevent users from over-using the login node, any command that takes longer than 15 minutes will be killed.

Your ‘big’ calculations such a neural network training will be done on the batch nodes. These perform what is known as batch jobs. A batch job is essentially a recipe of commands (put together in a job script) that you want the computer to execute. Calculations on the batch nodes are not performed right away. Instead, you submit your job script to the job queue. As soon as sufficient resources (i.e. batch nodes) are available for your job, the system will take your job from the queue, and send it to the batch nodes for execution.

Architecture of Snellius

A visual description of the Snellius architecture can be found below (figure credit - SURFsara team). You can connect to any login node of Snellius to interact with the system. Over the login nodes, you have access to the shared file system across nodes. The one you will mainly interact is /home where you can store your code, data, etc. You can access your files from any login node, as well as any compute node. You have a maximum disk space of 200GB which should be sufficient for the DL course.

You do not directly interact with any compute node. Instead, you can request computational resources with a job script, and Snellius will assign a compute node to this job using a SLURM job scheduler. If all computational resources are occupied, your job will be placed in a queue, and scheduled when resources are available.

Snellius has multiple sets of compute nodes with different computational resources, also called partitions. The one we will use for the Deep Learning course is called gpu, and provides us compute nodes with the following resources:

Processor

CPU Cores

RAM Memory

GPUs

Platinum 8360Y (2.4GHz)

72

480GB

4x NVIDIA A100, 40 GB HBM2

These computational resources are more than sufficient for the assignments in this course. For a job, we usually only use a single GPU from a compute node, meaning that we would also use 1/4th of the other resources (18 CPU cores, 120GB RAM). The scheduler of Snellius will assign multiple jobs to the same node if its computational resources are not exhausted yet, thus not wasting any if we only use a single GPU.

9b4c6fb9f5e2493c94d448996551f6fa

First steps

After discussing the general architecture of Snellius, we are ready to discuss the practical aspects of how to use the Snellius cluster.

REMINDER: When you first receive your login data for Snellius, make sure to go to the user portal and change the password.

How to connect to Snellius

You can login to Snellius’s login nodes using a secure shell (SSH):

ssh -X scur____@snellius.surf.nl

Replace scur___ by your username. You will be connected to one of its login nodes, and have the view of a standard Linux system in your home directory. Note that you should only use the login node as an interface, and not as compute unit. Do not run any trainings on this node, as it will be killed after 15 minutes, and slows down the communication with Snellius for everyone. Instead, Snellius uses a SLURM scheduler to handle computational expensive jobs (see below).

If you want to transfer files between Snellius and your local computer, you can use standard Unix commands such as scp or rsync, GitHub, or graphical interfaces such as FileZilla (use port 22 in FileZilla) or WinSCP (for Windows PC). Note that using GitHub on Snellius requires adding the SSH key from Snellius to your GitHub account.

A copy operation from Snellius to your local computer with rsync, started from your local computer, could look as follows:

rsync -av scur___@snellius.surf.nl:~/source destination

Replace scur___ by your username, source by the directory/file on Snellius you want to copy on your local machine, and destination by the directory/file it should be copied to. Note that source is referenced from your home directory on Snellius. If you want to copy a file from your local computer to Snellius, use:

rsync -av source scur___@snellius.surf.nl:~/destination

Again, replace source with the directory/file on your local computer you want to copy to Snellius, and destination by the directory/file it should be copied to.

Modules

Snellius uses modules to provide you various pre-installed software. This includes simple Python, but also the NVIDIA libraries CUDA and cuDNN that can be necessary to access GPUs. However, for our course, we only need the Anaconda module:

module load 2022
module load Anaconda3/2022.05

Note: Loading Anaconda3/2022.05 may result in a warning about the potential for corruption in the user environment due to mixing Conda and module environments. Since we are only loading the 2022 and Anaconda3/2022.05 modules, you can safely ignore this message. However, be aware that loading additional modules could cause issues.

Install the environment

To run the Deep Learning assignments and other code like the notebooks on Snellius, you need to install the provided environment for Snellius (dl2023_gpu.yml). You can either download it locally and copy it to your Snellius account via rsync or scp as described before, or simply clone the practicals github on Snellius:

git clone https://github.com/uvadlc/uvadlc_practicals_2023.git

Snellius provides an Anaconda module, which you can load via module load Anaconda3/2022.05 as mentioned before (remember to load the 2022 module beforehand). We recommend installing the package via a job file since the installation can take 20-30 minutes, and any command on the login node will be killed without warning after 15 minutes.

To do that, save the following content into a file called install_environment.job in your home directory:

#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --job-name=InstallEnvironment
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --time=04:00:00
#SBATCH --output=slurm_output_%A.out

module purge
module load 2022
module load Anaconda3/2022.05

cd $HOME/uvadlc_practicals_2023/
conda env create -f dl2023_gpu.yml

You can use e.g. nano to do that. If the environment file is not in the cloned repository or you haven’t cloned the repo, change the cd statement to the directory where it is stored. Once the file is saved, start the job with the command sbatch install_environment.job. The installation process is started on a compute node with a time limit of 4 hours, which should be sufficiently long. Let’s look at the next section to understand what we have actually done here with respect to ‘job files’.

Trouble shooting

If the installation via job file does not work, try to install the environment with the following command from the login node after navigating to the directory the environment file is in:

conda env create -f dl2023_gpu.yml

Note that the jobs on the login node on Snellius are limited to 15 minutes. This is often not enough to install the full environment. If the installation command is killed, you can simply restart it. If you get the error that a package is corrupted, go to /home/scur___/.conda/pkgs/ under your home directory and remove the directory of the corrupted package. If you get the error that the environment dl2023 already exists, go to /home/scur___/.conda/envs/, and remove the folder ‘dl2023’.

If you experience issues with the Anaconda module, you can also install Anaconda yourself (download link) or ask your TA for help.

Verifying the installation

When the installation process is completed, you can check if the process was successful by activating your environment on the login node via source activate dl2023 (remember to have loaded the anaconda module beforehand), and starting a python console with executing python. It should say Python 3.11.5 | packaged by conda-forge. If you see a different python version, you might not have activated the environment correctly.

In the python console, try to import pytorch via import torch and check the version: torch.__version__. It should say 2.1.0. Finally, check whether PyTorch can access the GPU: torch.cuda.is_available(). Note that in most cases, this will return False because the login nodes on Snellius do not have GPUs. However, when you run the same command on a compute node (i.e., when submitting a job with sbatch), it should return True. To do this, save the following content into a file called check_environment.job, and start the job using sbatch check_environment.job.

#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --job-name=CheckEnvironment
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --time=00:05:00
#SBATCH --output=slurm_output_%A.out

module purge
module load 2022
module load Anaconda3/2022.05

# Activate your environment
source activate dl2023
# Check whether the GPU is available
srun python -uc "import torch; print('GPU available?', torch.cuda.is_available())"

If the resulting slurm output contains the line GPU available? True, you should be all set.

The SLURM scheduler

Snellius relies on a SLURM scheduler to organize the jobs on the cluster. When logging into Snellius, you cannot just start a python script with your training, but instead submit a job to the scheduler. The scheduler will decide when and on which node to run your job, based on the number of nodes available and other jobs submitted.

Job files

We provide a template for a job file that you can use on Snellius. Create a file with any name you like, for example template.job, and start the job by executing the command sbatch template.job.

#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --job-name=ExampleJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --time=04:00:00
#SBATCH --output=slurm_output_%A.out

module purge
module load 2022
module load Anaconda3/2022.05

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate dl2023
# Run your code
srun python -u ...

Job arguments

You might have to change the #SBATCH arguments depending on your needs. We describe the arguments below:

  • partition: The partition of Snellius on which you want to run your job. As a student, you only have access to the partition gpu, which provides you nodes with NVIDIA A100 GPUs (40GB).

  • gpus: Number of GPUs you request from the nodes. You can select up to four GPUs with your account, but if you haven’t designed your code to explicitly run on multiple GPUs, please use only one GPU (so no need to change what we have above).

  • job-name: Name of the job to pop up when you list your jobs with squeue (see below).

  • ntasks: Number of tasks to run with the job. In our case, we will always use 1 task.

  • cpus-per-task: Number of CPUs you request from the nodes. The gpu partition restricts you to max. 18 CPUs per job/GPU.

  • time: Estimated time your job needs to finish. It is no problem if your job finishes earlier than the specified time. However, if your job takes longer, it will be instantaneously killed after the specified time. Still, don’t specify unnecessarily long times as this causes your job to be scheduled later (you need to wait longer in the queue if other people also want to use the cluster). A good rule of thumb is to specify ~20% more than what you would expect.

  • output: Output file to which the slurm output should be written. The tag “%A” is automatically replaced by the job ID. Note that if you specify the output file to be in a directory that does not exist, no output file will be created.

SLURM allows you to specify many more arguments, but the ones above are the important ones for us. If you are interested in a full list, see here.

Scratch

If you work with a lot of data, or a larger dataset, it is advised to copy your data to the /scratch-local/<username> directory of the node. Otherwise, the read/write operation might become a bottleneck of your job. To do this, simply use your copy operation of choice (cp, rsync, …), and copy the data to the directory $TMPDIR (which is set /scratch-local/<username> by default). You should add this command to your job file before calling srun .... Remember to point to this data when you are running your code. If you have a dataset that can be downloaded, you can also directly download it to the scratch (can sometimes be faster than copying). In case you also write something on the scratch, you need to copy it back to your home directory before finishing the job.

Edit Dec. 6, 2021: Due to internal changes to the SURF filesystems, it is required to use the scratch for any dataset such as CIFAR10. In PyTorch, the CIFAR10 dataset is structured into multiple large batches (usually 5), and only that batch is loaded which is currently needed. This is why during training, it requires a lot of reading operations on the disk which can slow down your training and constitutes a challenge when hundreds of students share the same filesystem. Hence, you have to use the scratch for such datasets. For most parts in the assignment, you can do this by specifying --data_dir $TMPDIR on the python command in your job file (check if the argument parser has this argument, otherwise you can add it yourself). This will download the dataset to the scratch and only load it from there. We recommend using this approach also for future course editions.

Starting and organizing jobs

To start a job, you simply have to run sbatch jobfile where you replace jobfile by the filename of the job. Note that no specific file postfix like .job is necessary for the job (you can use .txt or any other you prefer). After your job has been submitted, it will be first placed into a waiting queue. The SLURM scheduler decides when to start your job based on the time of your job, all other jobs currently running or waiting, and available nodes.

Besides sbatch, you can interact with the SLURM scheduler via the following commands:

  • squeue: Lists all jobs you have currently submitted to Snellius. See the slurm documentation for details.

  • scancel JOBID: Cancels and stops a job, independent of whether it is running or pending. The job ID can be found using squeue, and is printed when submitting the job via sbatch.

  • scontrol show job JOBID: Shows additional information of a specific job, like the estimated start time.

Interactive sessions

As alternative to running jobs via sbatch, you can request an interactive job session. In this format, you gain access to the node via the terminal/command line, and can interact with it as you would have connected via ssh directly to this compute node. Hence, you can debug your script much easier and have faster respond time. However, please keep in mind the following two disadvantages: 1. If you happen to disconnect from Snellius because of an instable connection or your computer going in stand-by mode, the job will be canceled. 2. Your job is not automatically terminated when your script has finished running, but you have to manually kill the job once you are done. If you forget it, you block the compute node for other students and waste credits that UvA payed for Hence, only use interactive sessions for short jobs/scripts, for example, if you want to debug whether your script starts running and train the model. Do not use interactive session to train a model for a long time.

In order to start an interactive session on Snellius, you can use srun with the same input parameters as for the job file. For example:

srun --partition=gpu --gpus=1 --ntasks=1 --cpus-per-task=18 --time=00:10:00 --pty bash -i

This will start an interactive session with a GPU for 10 minutes. Once resources have been allocated, you will see in your terminal that you are now on one of the compute nodes, for example, r32n5. As a first step, you need to load the modules (module load 2022; module load Anaconda3/2022.05) and activate your conda environment. Then, you are ready to run your script.

Troubleshooting

It can happen that you encounter some issues when interacting with Snellius. A short FAQ is provided on the SURFSara website, and here we provide a list of common questions/situations we have experienced from past students.

Snellius is refusing connection

It can occasionally happen that Snellius refuses the connection when you try to ssh into it. If this happens, you can try to use the Pulse Secure UvA VPN before connecting to Snellius. If this still does not work, then the connection issue is likely not on your side. The problem often resolves after 2-3 hours, and Snellius let’s you login after it again. If the problem doesn’t resolve after couple of hours, please contact your TA, and eventually the SURFSara helpdesk.

Slurm output file missing

If a job of yours is running, but no slurm output file is created, check whether the path to the output file specified in your job file actually exists. If the specified file points to a non-existing directory, no output file will be created. Note that this is not an issue by default, but you are running your job “blind” without seeing the stdout or stderr channels.

Slurm output file is empty for a long time

The slurm output file can lag behind in showing the outputs of your running job. If your job is running for couple of minutes and you would have expected a few print statements to have happened, try to flush your stdout stream (how to flush the output in python).

All my jobs are pending

With your student account, the SLURM scheduler restricts you to run only two jobs in parallel at a time. However, you can still queue more jobs that will run in sequence. This is done because with more than 200 students, Snellius could get crowded very fast if we don’t guarantee a fair share of resources. If all of your jobs are pending, you can check the reason for pending in the last column of squeue. All reasons are listed in the squeue documentation under JOB REASON CODES. The following ones are common:

  • Priority: There are other jobs on Snellius with a higher priority that are also waiting to be run. This means you just have to be patient.

  • QOSResourceLimit: The job is requesting more resources than allowed. Check your job file as you are only allowed to have at max. 4 GPUs, 72 CPU cores and 480GB RAM.

  • Resources: All nodes on Snellius are currently busy, yours will be scheduled soon.

You can also see the estimated start time of a job by running scontrol show job JOBID. However, note that this is the “worst case” scenario for the current number of submitted jobs, as in if all currently running jobs would need their maximum runtime. At the same time, if more people would submit their job with higher priority, yours can fall back in the queue and get a later start time.

PyTorch or other packages cannot be imported

If you run a job and see the python error message in the slurm output file that a package is missing although you have installed it in the environment, there are two things to check. Firstly, make sure to not have the environment activated on the login node when submitting the job. This can lead to an error in the anaconda module such that packages are not found on the compute node. Secondly, check that you activate the environment correctly. To verify that the correct python version is used, you can add the command which python before your training file. This prints out the path of the python that will be used, in which you should see the anaconda version in the dl2023 environment.

My job runs very slow

If your job executes your script much slower than you expect, check for two things: (1) have you requested a GPU and are you using it, and (2) are you using the scratch for your dataset? Not using the scratch for your dataset can create a significant communication bottleneck, especially if multiple students do it at the same time. Make sure to download or copy your dataset to the scratch and load it from there.

I am not able to use Snellius at all

If during the course, there will be major issues with Snellius (e.g. cluster goes into maintenance for a long time, issues with the filesystem, etc.), you can make use of GoogleColab, as we do already for all notebook tutorials here. All assignments do not necessarily require large amount of compute, and can often be comfortably trained on a GPU provided by GoogleColab. For an introduction to GoogleColab, see this tutorial.

Advanced topics

Password-less login

Typing your password every time you connect to Snellius can become annoying. To enable a safe, password-less connection, you can add your public ssh key to the SURFsara user portal. Next time you login from your machine to Snellius, it will only check the ssh key and not ask you for the password anymore.

Remote development with VSCode or PyCharm

The common workflow with clusters is that you first code locally and test your implementation on short runs on the CPU or evt. a local GPU, then sync your code to the cluster (e.g. via git), and finally run the full training/process on a compute node of the cluster. If you prefer to directly code on Snellius, you can do so via remote connections in tools like VSCode or PyCharm. Essentially, these IDEs can connect via SSH to Snellius, so that it looks to you like you code locally, but all code you are editing is directly saved on Snellius. Note though that to run your code, you still need to create a job script and submit it via SLURM. The login nodes, on which these IDEs connect you to, are not meant for debugging or running code. Any process that takes longer than 15 minutes will be killed (this can sometimes also include the SSH connection of VSCode/PyCharm). For more details on remote development and how to set it up, see the SURFsara documentation on VSCode and PyCharm.

Tracking GPU stats

If you are curious whether you use the GPU to its full capacity, you can monitor its utilization as follows. First, you submit your job and check its job ID via squeue or the ID that has been printed out after submitting the job via sbatch. Next, you can log into the node via slurm_jobmonitor [jobid] where you need your job ID. This gives you an interactive view on the node. Finally, you can run nvtop to track the GPU utilization. More details can be found here.

Job Arrays

You might come into the situation where you need to run a hyperparameter search over multiple values, and don’t want to write an endless number of job scripts. A much more elegant solution is a job array. Job arrays are created with two files: a job file, and a hyperparameter file. The job file will start multiple sub-jobs that each use a different set of hyperparameters, as specified in the hyperparameter file. In the job file, you need to add the argument #SBATCH --array=.... The argument specifies how many sub-jobs you want to start, how many to run in parallel (at maximum), and which lines to use from the hyperparameter file. For example, if we specify #SBATCH --array=1-16%8, this means that we start 16 jobs using the lines 1 to 16 in the hyperparameter file, and running at maximum 8 jobs in parallel at the same time. Note that the number of parallel jobs is there to limit yourself from blocking the whole cluster. However, with your student accounts, you will not be able to run more than 1 job in parallel anyways. The template job file array_job.job looks slightly different than the one we had before. The slurm output file is specified using %A and %a. %A is being automatically replaced with the job ID, while %a is the index of the job within the array (so 1 to 16 in our example above). Below, we also added a block for creating a checkpoint folder for the job array, and copying the job file including hyperparameters to that folder. This is good practice for ensuring reproducibility. Finally, in the training call, we specify the path checkpoint path (make sure to have implemented this argument in your argparse) with the addition of experiment_${SLURM_ARRAY_TASK_ID} which is a sub-folder in the checkpoint directory with the sub-job ID (1 to 16 in the example). The next line, $(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1), copies the N-th line of the hyperparameter file to this job file, and hence submits the hyperparameter arguments to the training file.

File array_job.job:

#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --job-name=ExampleArrayJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --time=04:00:00
#SBATCH --array=1-16%8
#SBATCH --output=slurm_array_testing_%A_%a.out

module purge
module load 2022
module load Anaconda3/2022.05

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate ...

# Good practice: define your directory where to save the models, and copy the job file to it
JOB_FILE=$HOME/.../array_job.job
HPARAMS_FILE=$HOME/.../array_job_hyperparameters.txt
CHECKPOINTDIR=$HOME/.../checkpoints/array_job_${SLURM_ARRAY_JOB_ID}

mkdir $CHECKPOINTDIR
rsync $HPARAMS_FILE $CHECKPOINTDIR/
rsync $JOB_FILE $CHECKPOINTDIR/

# Run your code
srun python -u train.py \
               --checkpoint_path $CHECKPOINTDIR/experiment_${SLURM_ARRAY_TASK_ID} \
               $(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1)

The hyperparameter file is nothing else than a text file in which each line denotes one set of hyperparameters for which you want to run an experiment. There is no specific order in which you need to put the lines, and you can extend the lines with as many hyperparameter arguments as you want.

File array_job_hyperparameters.txt:

--seed 42 --learning_rate 1e-3
--seed 43 --learning_rate 1e-3
--seed 44 --learning_rate 1e-3
--seed 45 --learning_rate 1e-3
--seed 42 --learning_rate 2e-3
--seed 43 --learning_rate 2e-3
--seed 44 --learning_rate 2e-3
--seed 45 --learning_rate 2e-3
--seed 42 --learning_rate 4e-3
--seed 43 --learning_rate 4e-3
--seed 44 --learning_rate 4e-3
--seed 45 --learning_rate 4e-3
--seed 42 --learning_rate 1e-2
--seed 43 --learning_rate 1e-2
--seed 44 --learning_rate 1e-2
--seed 45 --learning_rate 1e-2