Guide 1: Working with the Lisa cluster

This tutorial explains how to work with the Lisa cluster for the Deep Learning course at the University of Amsterdam. Every student will receive an account to have resources for training deep neural networks and get familiar with working on a cluster. It is recommended to have listened to the presentation by the SURFsara team or the TA team before going through this tutorial.

The Lisa cluster

The following section gives a quick introduction to the Lisa cluster and how it is built up. A detailed description of the system can be found in the SURFsara user guide.

What is a cluster computer?

(Disclaimer: the following paragraph is an adapted version of the Lisa user guide. Credits: SURFsara team)

You can imagine a cluster computer as a collection of regular computers (known as nodes), tied together with network cables that are similar to the network cables in your home or office (see the figure below - credit: SURFsara team). Each node has its own CPU, memory and disk space, in addition to which they generally have access to a shared file system. On a cluster computer, you can run hundreds of computational tasks simultaneously.

[Figure: a cluster computer as a collection of nodes connected by a network. Credit: SURFsara team]

Interacting with a cluster computer is different from a normal computer. Normal computers are mostly used interactively, i.e. you type a command or click with your mouse, and your computer instantly responds by e.g. running a program. Cluster computers are mostly used non-interactively.

A cluster computer such as Lisa mainly has two types of nodes: login nodes and batch nodes. You connect to Lisa through the login node (see next section). This is an interactive node: similar to your own PC, it immediately responds to the commands you type. There are only a few login nodes on a cluster computer, and you only use them for light tasks: adjusting your code, preparing your input data, writing job scripts, etc. Since the login nodes are only meant for light tasks, many users can be on the same login node at the same time. To prevent users from over-using the login node, any command that takes longer than 15 minutes will be killed.

Your ‘big’ calculations such as neural network training will be done on the batch nodes. These perform what are known as batch jobs. A batch job is essentially a recipe of commands (put together in a job script) that you want the computer to execute. Calculations on the batch nodes are not performed right away. Instead, you submit your job script to the job queue. As soon as sufficient resources (i.e. batch nodes) are available for your job, the system will take your job from the queue, and send it to the batch nodes for execution.

Architecture of Lisa

A visual description of the Lisa architecture can be found below (figure credit - SURFsara team). You can connect to any login node of Lisa to interact with the system. Through the login nodes, you have access to the file system that is shared across nodes. The part you will mainly interact with is /home, where you can store your code, data, etc. You can access your files from any login node, as well as from any compute node. You have a maximum disk space of 200GB, which should be sufficient for the DL course.

You do not directly interact with any compute node. Instead, you can request computational resources with a job script, and Lisa will assign a compute node to this job using a SLURM job scheduler. If all computational resources are occupied, your job will be placed in a queue, and scheduled when resources are available.

Lisa has multiple sets of compute nodes with different computational resources, also called partitions. The one we will use for the Deep Learning course is called gpu_shared_course, and provides us compute nodes with the following resources:

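| Processor            | CPU Cores | RAM Memory | GPUs                            |
|----------------------|-----------|------------|---------------------------------|
| Bronze 3104 (1.7GHz) | 12        | 256GB      | 4x GeForce 1080Ti, 11 GB GDDR5x |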

These computational resources are more than sufficient for the assignments in this course. A job usually uses only a single GPU of a compute node, and with it a quarter of the node's other resources (3 CPU cores, 64GB RAM). The scheduler of Lisa assigns multiple jobs to the same node as long as its computational resources are not exhausted, so nothing is wasted if we only use a single GPU.
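The share of a node that goes along with one GPU follows from dividing the node's resources by its four GPUs. A minimal sketch of that arithmetic, using the numbers from the table above (the constant and function names are made up for this illustration):

```python
# Resources of one gpu_shared_course node, taken from the table above.
NODE_CPU_CORES = 12
NODE_RAM_GB = 256
NODE_GPUS = 4


def share_per_gpu(num_gpus=1):
    """Return (CPU cores, RAM in GB) that go along with `num_gpus` GPUs."""
    cores = NODE_CPU_CORES * num_gpus // NODE_GPUS
    ram_gb = NODE_RAM_GB * num_gpus // NODE_GPUS
    return cores, ram_gb


print(share_per_gpu(1))  # (3, 64)
```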

[Figure: architecture of the Lisa cluster. Credit: SURFsara team]

First steps

After discussing the general architecture of Lisa, we are ready to discuss the practical aspects of how to use the Lisa cluster.

REMINDER: When you first receive your login data for Lisa, make sure to go to the user portal and change the password.

How to connect to Lisa

You can log in to Lisa’s login nodes using a secure shell (SSH):

ssh -X lcur___@lisa.surfsara.nl

Replace lcur___ by your username. You will be connected to one of the login nodes, and see a standard Linux system, starting in your home directory. Note that you should only use the login node as an interface, and not as a compute unit. Do not run any training on this node: it will be killed after 15 minutes, and it slows down the communication with Lisa for everyone. Instead, Lisa uses a SLURM scheduler to handle computationally expensive jobs (see below).

If you want to transfer files between Lisa and your local computer, you can use standard Unix commands such as scp or rsync, or graphical interfaces such as FileZilla (use port 22 in FileZilla) or WinSCP (for Windows PC). A copy operation from Lisa to your local computer with rsync, started from your local computer, could look as follows:

rsync -av lcur___@lisa.surfsara.nl:~/source destination

Replace lcur___ by your username, source by the directory/file on Lisa you want to copy to your local machine, and destination by the directory/file it should be copied to. Note that source is referenced from your home directory on Lisa. If you want to copy a file from your local computer to Lisa, use:

rsync -av source lcur___@lisa.surfsara.nl:~/destination

Again, replace source with the directory/file on your local computer you want to copy to Lisa, and destination by the directory/file it should be copied to.

Modules

Lisa uses modules to provide you with various pre-installed software. This includes plain Python, but also the NVIDIA libraries CUDA and cuDNN that can be necessary to access GPUs. However, for our course, we only need the Anaconda module:

module load 2021
module load Anaconda3/2021.05

The CUDA and cuDNN libraries are already taken care of by installing the cudatoolkit package in conda.

Install the environment

To run the Deep Learning assignments and other code like the notebooks on Lisa, you need to install the provided environment for Lisa (dl2021_gpu.yml). You can either download it locally and copy it to your Lisa account via rsync or scp as described before, or simply clone the practicals GitHub repository on Lisa:

git clone https://github.com/uvadlc/uvadlc_practicals_2021.git

Lisa provides an Anaconda module, which you can load via module load Anaconda3/2021.05 as mentioned before (remember to load the 2021 module beforehand). We recommend installing the environment via a job file since the installation can take 20-30 minutes, and any command on the login node will be killed without warning after 15 minutes.

To do that, save the following content into a file called install_environment.job in your home directory:

#!/bin/bash

#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:0
#SBATCH --job-name=InstallEnvironment
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out

module purge
module load 2021
module load Anaconda3/2021.05

cd $HOME/uvadlc_practicals_2021/
conda env create -f dl2021_gpu.yml

You can use e.g. nano to do that. If the environment file is not in the cloned repository or you haven’t cloned the repo, change the cd statement to the directory where it is stored. Once the file is saved, start the job with the command sbatch install_environment.job. The installation process is started on a compute node with a time limit of 4 hours, which should be sufficiently long. Let’s look at the next section to understand what we have actually done here with respect to ‘job files’.

Troubleshooting

If the installation via job file does not work, try to install the environment with the following command from the login node after navigating to the directory the environment file is in:

conda env create -f dl2021_gpu.yml

Note that commands on the login node on Lisa are limited to 15 minutes. This is often not enough to install the full environment. If the installation command is killed, you can simply restart it. If you get the error that a package is corrupted, go to /home/lcur___/.conda/pkgs/ and remove the directory of the corrupted package. If you get the error that the environment dl2021 already exists, go to /home/lcur___/.conda/envs/, and remove the folder ‘dl2021’.

If you experience issues with the Anaconda module, you can also install Anaconda yourself (download link) or ask your TA for help.

Verifying the installation

When the installation process is completed, you can check whether it was successful by activating your environment on the login node via source activate dl2021 (remember to load the Anaconda module beforehand), and starting a Python console by executing python. It should say Python 3.9.7 | packaged by conda-forge. If you see a different Python version, you might not have activated the environment correctly.

In the Python console, try to import PyTorch via import torch and check the version: torch.__version__. It should say 1.10.0. Finally, check whether PyTorch can access the GPU: torch.cuda.is_available(). Note that in most cases, this will return False because most login nodes on Lisa do not have GPUs. You can log in to a GPU node via ssh lcur___@login-gpu.lisa.surfsara.nl, and on this node, you should see that the command returns True. If that is the case, you should be all set.
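The same checks can be collected in a small script, which can be convenient to run on a GPU node. This is only a sketch: the function name check_environment is made up for this example, and the expected values in the final comment are the ones quoted above.

```python
import platform


def check_environment():
    """Collect the Python version, PyTorch version and GPU availability."""
    report = {"python": platform.python_version(), "torch": None, "cuda": False}
    try:
        import torch  # only importable once the dl2021 environment is active
    except ImportError:
        return report  # environment not activated, or installation failed
    report["torch"] = torch.__version__
    report["cuda"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
    # Expected with dl2021 active on a GPU node: python 3.9.7, torch 1.10.0, cuda True
```

If torch is reported as None, the environment is most likely not activated in the shell that started the script.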

The SLURM scheduler

Lisa relies on a SLURM scheduler to organize the jobs on the cluster. After logging into Lisa, you should not start a Python script with your training directly, but instead submit a job to the scheduler. The scheduler will decide when and on which node to run your job, based on the number of nodes available and the other jobs submitted.

Job files

We provide a template for a job file that you can use on Lisa. Create a file with any name you like, for example template.job, and start the job by executing the command sbatch template.job.

#!/bin/bash

#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out

module purge
module load 2021
module load Anaconda3/2021.05

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate dl2021
# Run your code
srun python -u ...

Job arguments

You might have to change the #SBATCH arguments depending on your needs. We describe the arguments below:

  • partition: The partition of Lisa on which you want to run your job. As a student, you only have access to the partition gpu_shared_course, which provides you nodes with NVIDIA GTX1080Ti GPUs (11GB).

  • gres: Generic resources include the GPU which is crucial for deep learning jobs. You can select up to two GPUs with your account, but if you haven’t designed your code to explicitly run on multiple GPUs, please use only one GPU (so no need to change what we have above).

  • job-name: Name under which the job shows up when you list your jobs with squeue (see below).

  • ntasks: Number of tasks to run with the job. In our case, we will always use 1 task.

  • cpus-per-task: Number of CPUs you request from the nodes. The gpu_shared_course partition restricts you to max. 3 CPUs per job/GPU.

  • time: Estimated time your job needs to finish. It is no problem if your job finishes earlier than the specified time. However, if your job takes longer, it will be killed as soon as the specified time is reached. Still, don’t specify unnecessarily long times as this causes your job to be scheduled later (you need to wait longer in the queue if other people also want to use the cluster). A good rule of thumb is to specify ~20% more than what you would expect.

  • mem: Amount of RAM you need from the node. Note that this is not the GPU memory, but the random access memory of the node. On gpu_shared_course, you are restricted to 64GB per job/GPU, which is more than you need for the assignments.

  • output: Output file to which the slurm output should be written. The tag “%A” is automatically replaced by the job ID. Note that if you specify the output file to be in a directory that does not exist, no output file will be created.

SLURM allows you to specify many more arguments, but the ones above are the important ones for us. If you are interested in a full list, see the SLURM sbatch documentation.
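The rule of thumb for the time argument (about 20% more than expected) can be turned into a small helper. This is a sketch with a made-up function name, slurm_time, that formats the padded runtime in the HH:MM:SS form used in the template above:

```python
def slurm_time(expected_minutes, margin_percent=20):
    """Format an expected runtime plus a safety margin as HH:MM:SS."""
    expected_seconds = expected_minutes * 60
    # Integer arithmetic with rounding up avoids floating-point surprises.
    padded = (expected_seconds * (100 + margin_percent) + 99) // 100
    hours, rest = divmod(padded, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# A training expected to take 200 minutes gets a limit of 4 hours.
print(f"#SBATCH --time={slurm_time(200)}")  # #SBATCH --time=04:00:00
```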

Scratch

If you work with a lot of data, or a larger dataset, it is advised to copy your data to the /scratch directory of the node. Otherwise, the read/write operation might become a bottleneck of your job. To do this, simply use your copy operation of choice (cp, rsync, …), and copy the data to the directory $TMPDIR. You should add this command to your job file before calling srun .... Remember to point to this data when you are running your code. If you have a dataset that can be downloaded, you can also directly download it to the scratch (can sometimes be faster than copying). In case you also write something on the scratch, you need to copy it back to your home directory before finishing the job.
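Instead of a copy command in the job file, the copy to the scratch can also be done at the start of the training script. The following is a minimal sketch under that assumption; stage_to_scratch is a hypothetical helper, and the dataset folder in the example is a dummy created only so that the snippet runs anywhere.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(source_dir, scratch_root=None):
    """Copy `source_dir` into the scratch directory and return the new path."""
    # The guide uses $TMPDIR as the scratch location; fall back to the
    # system temp folder when the variable is not set (e.g. on a laptop).
    root = Path(scratch_root or os.environ.get("TMPDIR", tempfile.gettempdir()))
    target = root / Path(source_dir).name
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Stand-in for a real dataset folder in your home directory.
    with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as scratch:
        dataset = Path(home) / "my_dataset"
        dataset.mkdir()
        (dataset / "sample.txt").write_text("dummy data")
        staged = stage_to_scratch(dataset, scratch_root=scratch)
        print("Load data from:", staged)
```

Results written to the scratch would need the reverse copy (scratch to home) before the job ends.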

Edit Dec. 6, 2021: Due to internal changes to the filesystem of Lisa, it is required to use the scratch for any dataset such as CIFAR10. In PyTorch, the CIFAR10 dataset is structured into multiple large batches (usually 5), and only the batch that is currently needed is loaded. During training, this requires a lot of read operations on the disk, which can slow down your training and is a challenge for the Lisa system when hundreds of students share the same filesystem. Hence, you have to use the scratch for such datasets. For most parts of the assignment, you can do this by passing --data_dir $TMPDIR to the python command in your job file (check if the argument parser has this argument, otherwise you can add it yourself). This will download the dataset to the scratch and only load it from there.
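If your training script does not have a --data_dir argument yet, adding it with argparse could look as follows. This is a sketch, not the parser from the assignment code; the default value and the example path /scratch/example are assumptions for illustration.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(description="Training script (sketch)")
    # Default to the scratch directory if $TMPDIR is set, else a local folder.
    parser.add_argument(
        "--data_dir",
        type=str,
        default=os.environ.get("TMPDIR", "data/"),
        help="Directory where the dataset is downloaded to and loaded from.",
    )
    return parser


# In the job file this corresponds to: python -u train.py --data_dir $TMPDIR
args = build_parser().parse_args(["--data_dir", "/scratch/example"])
print(args.data_dir)  # /scratch/example
```

The parsed value would then be handed to the dataset class as its root folder, for example torchvision.datasets.CIFAR10(root=args.data_dir, download=True).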

Starting and organizing jobs

To start a job, you simply have to run sbatch jobfile, where you replace jobfile by the filename of the job file. Note that no specific file extension like .job is necessary (you can use .txt or any other you prefer). After your job has been submitted, it is first placed into a waiting queue. The SLURM scheduler decides when to start your job based on the requested time of your job, all other jobs currently running or waiting, and the available nodes.

Besides sbatch, you can interact with the SLURM scheduler via the following commands:

  • squeue: Lists all jobs that are currently submitted to Lisa. This can be a lot of jobs as it includes all partitions. You can make it partition-specific using squeue -p gpu_shared_course, or only list the jobs of your account: squeue -u lcur___ (again, replace lcur___ by your username). See the slurm documentation for details.

  • scancel JOBID: Cancels and stops a job, independent of whether it is running or pending. The job ID can be found using squeue, and is printed when submitting the job via sbatch.

  • scontrol show job JOBID: Shows additional information of a specific job, like the estimated start time.
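When jobs are submitted from a script, the job ID can be read from the line that sbatch prints ("Submitted batch job" followed by the ID). A small sketch of that parsing step; the helper name parse_job_id and the ID in the example are made up:

```python
import re


def parse_job_id(sbatch_output):
    """Extract the job ID from the line sbatch prints after a submission."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"No job ID found in: {sbatch_output!r}")
    return match.group(1)


job_id = parse_job_id("Submitted batch job 1234567")  # made-up ID
print(f"scontrol show job {job_id}")
print(f"scancel {job_id}")
```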

Troubleshooting

It can happen that you encounter some issues when interacting with Lisa. A short FAQ is provided on the SURFsara website, and here we provide a list of common questions/situations that past students have run into.

Lisa is refusing connection

It can occasionally happen that Lisa refuses the connection when you try to ssh into it. If this happens, you can first try to log in to different login nodes. Specifically, try the following three login nodes:

ssh -X lcur___@login3.lisa.surfsara.nl
ssh -X lcur___@login4.lisa.surfsara.nl
ssh -X lcur___@login-gpu.lisa.surfsara.nl

If none of those work, you can try to use the Pulse Secure UvA VPN before connecting to Lisa. If this still does not work, then the connection issue is likely not on your side. The problem often resolves itself after 2-3 hours, and Lisa lets you log in again. If the problem doesn’t resolve after a couple of hours, please contact your TA, and eventually the SURFsara helpdesk.

Slurm output file missing

If a job of yours is running, but no slurm output file is created, check whether the path to the output file specified in your job file actually exists. If the specified file points to a non-existing directory, no output file will be created. This does not stop the job from running, but you are running it “blind”, without seeing the stdout or stderr channels.
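A quick check before submitting can catch this. The sketch below reads the --output line from the text of a job script and reports the directory it points to; the helper name output_directory and the logs/ folder in the example are made up for illustration.

```python
import os
import re


def output_directory(job_script_text):
    """Return the directory of the --output path in a job script ('' if none)."""
    match = re.search(r"^#SBATCH\s+--output=(\S+)", job_script_text, flags=re.MULTILINE)
    if match is None:
        return ""
    return os.path.dirname(match.group(1))


script = "#!/bin/bash\n#SBATCH --output=logs/slurm_output_%A.out\n"
log_dir = output_directory(script)
if log_dir and not os.path.isdir(log_dir):
    print(f"Create it first: mkdir -p {log_dir}")
```

An empty result means the output file is written to the directory from which sbatch is called, so nothing needs to be created.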

Slurm output file is empty for a long time

The slurm output file can lag behind in showing the outputs of your running job. If your job has been running for a couple of minutes and you would have expected a few print statements to have appeared, try to flush your stdout stream (how to flush the output in python).
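In Python, flushing can be done per print call or for the whole stream. A minimal sketch (the log helper is a made-up convenience function):

```python
import sys


def log(message):
    """Print a message and flush it right away so it reaches the output file."""
    print(message, flush=True)


for epoch in range(3):
    log(f"Epoch {epoch} done")

# Alternative: flush everything printed so far in one go.
sys.stdout.flush()
```

The -u flag in the python -u call of the job template has a similar effect, since it makes stdout and stderr unbuffered.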

All my jobs are pending

With your student account, the SLURM scheduler restricts you to running at most two jobs in parallel. However, you can still queue more jobs, which will then run one after another. This is done because with more than 200 students, Lisa could get crowded very fast if we don’t guarantee a fair share of resources. If all of your jobs are pending, you can check the reason for pending in the last column of squeue. All reasons are listed in the squeue documentation under JOB REASON CODES. The following ones are common:

  • Priority: There are other jobs on Lisa with a higher priority that are also waiting to be run. This means you just have to be patient.

  • QOSResourceLimit: The job is requesting more resources than allowed. Check your job file, as you are only allowed to use at most 2 GPUs, 6 CPU cores and 125GB RAM.

  • Resources: All nodes on Lisa are currently busy. Your job will be scheduled as soon as resources become available.

You can also see the estimated start time of a job by running scontrol show job JOBID. However, note that this is the “worst case” scenario for the current set of submitted jobs, i.e. it assumes that all currently running jobs need their maximum runtime. At the same time, if more people submit jobs with a higher priority, yours can fall back in the queue and get a later start time.

PyTorch or other packages cannot be imported

If you run a job and the slurm output file shows a Python error that a package is missing although you have installed it in the environment, there are two things to check. Firstly, make sure that the environment is not activated on the login node when you submit the job. This can lead to an error in the Anaconda module such that packages are not found on the compute node. Secondly, check that you activate the environment correctly in the job file. To verify that the correct Python is used, you can add the command which python before the line that runs your training script. This prints the path of the Python executable that will be used, which should point into the dl2021 environment.
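The same check can be made from inside the training script. The sketch below assumes that the environment lives in a folder named after it (as in /home/lcur___/.conda/envs/dl2021); the helper name in_conda_env is made up.

```python
import os
import sys


def in_conda_env(env_name, prefix=sys.prefix):
    """Check whether the interpreter prefix ends in the given environment name."""
    return os.path.basename(os.path.normpath(prefix)) == env_name


print("Interpreter:", sys.executable)
if not in_conda_env("dl2021"):
    print("Warning: the dl2021 environment does not seem to be active.", flush=True)
```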

Advanced topics

Password-less login

Typing your password every time you connect to Lisa can become annoying. To enable a safe, password-less connection, you can add your public ssh key to the SURFsara user portal. Next time you login from your machine to Lisa, it will only check the ssh key and not ask you for the password anymore.

Interactive jobs

An alternative to submitting job scripts with sbatch is to use interactive sessions via srun. This can be helpful if you have to debug your code on a GPU (for very small usage, you can also use the login-gpu nodes). However, it is not recommended to use srun for long trainings, because the training is canceled if your connection to Lisa is interrupted (which happens occasionally), and you need to track yourself when your training has finished. Otherwise, you block resources for other users without using them (and the university pays for those resources). Make sure to use job scripts where possible.

Job arrays

You might come into the situation where you need to run a hyperparameter search over multiple values, and don’t want to write an endless number of job scripts. A much more elegant solution is a job array. Job arrays are created with two files: a job file, and a hyperparameter file. The job file will start multiple sub-jobs that each use a different set of hyperparameters, as specified in the hyperparameter file.

In the job file, you need to add the argument #SBATCH --array=.... The argument specifies how many sub-jobs you want to start, how many to run in parallel (at maximum), and which lines to use from the hyperparameter file. For example, #SBATCH --array=1-16%8 starts 16 jobs using lines 1 to 16 of the hyperparameter file, and runs at most 8 of them in parallel. The limit on parallel jobs is there to keep you from blocking the whole cluster. However, your student account already restricts how many jobs you can run in parallel (see “All my jobs are pending”).

The template job file array_job.job looks slightly different from the one we had before. The slurm output file is specified using %A and %a. %A is automatically replaced with the job ID, while %a is the index of the job within the array (so 1 to 16 in our example above). Below, we also added a block for creating a checkpoint folder for the job array, and copying the job file and the hyperparameter file to that folder. This is good practice for ensuring reproducibility.

Finally, in the training call, we specify the checkpoint path (make sure to have implemented this argument in your argparse) with the addition of experiment_${SLURM_ARRAY_TASK_ID}, which is a sub-folder in the checkpoint directory named after the sub-job index (1 to 16 in the example). The next line, $(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1), inserts the N-th line of the hyperparameter file into the command, where N is the sub-job index, and hence passes the hyperparameter arguments to the training script.

File array_job.job:

#!/bin/bash

#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleArrayJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --array=1-16%8
#SBATCH --output=slurm_array_testing_%A_%a.out

module purge
module load 2021
module load Anaconda3/2021.05

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate ...

# Good practice: define your directory where to save the models, and copy the job file to it
JOB_FILE=$HOME/.../array_job.job
HPARAMS_FILE=$HOME/.../array_job_hyperparameters.txt
CHECKPOINTDIR=$HOME/.../checkpoints/array_job_${SLURM_ARRAY_JOB_ID}

# -p: create parent folders, and do not fail if another sub-job already created the folder
mkdir -p $CHECKPOINTDIR
rsync $HPARAMS_FILE $CHECKPOINTDIR/
rsync $JOB_FILE $CHECKPOINTDIR/

# Run your code
srun python -u train.py \
               --checkpoint_path $CHECKPOINTDIR/experiment_${SLURM_ARRAY_TASK_ID} \
               $(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1)
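The command substitution on the last line of the job file picks one line of the hyperparameter file and splits it into arguments. The same selection can be sketched in Python to check which arguments a given sub-job will receive; hparams_for_task is a made-up helper, not part of the course code.

```python
import shlex


def hparams_for_task(hparams_text, task_id):
    """Return the arguments on line `task_id` (1-based) of the hyperparameter file."""
    lines = hparams_text.splitlines()
    if not 1 <= task_id <= len(lines):
        raise IndexError(f"task {task_id} has no line in the hyperparameter file")
    return shlex.split(lines[task_id - 1])


example = "--seed 42 --learning_rate 1e-3\n--seed 43 --learning_rate 1e-3\n"
print(hparams_for_task(example, 2))  # ['--seed', '43', '--learning_rate', '1e-3']
```

Unlike this sketch, head piped into tail does not complain when the index is larger than the number of lines: it silently returns the last line. The range given to --array should therefore match the number of lines in the file.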

The hyperparameter file is nothing more than a text file in which each line denotes one set of hyperparameters for which you want to run an experiment. The lines do not need to be in any specific order, and you can extend each line with as many hyperparameter arguments as you want.

File array_job_hyperparameters.txt:

--seed 42 --learning_rate 1e-3
--seed 43 --learning_rate 1e-3
--seed 44 --learning_rate 1e-3
--seed 45 --learning_rate 1e-3
--seed 42 --learning_rate 2e-3
--seed 43 --learning_rate 2e-3
--seed 44 --learning_rate 2e-3
--seed 45 --learning_rate 2e-3
--seed 42 --learning_rate 4e-3
--seed 43 --learning_rate 4e-3
--seed 44 --learning_rate 4e-3
--seed 45 --learning_rate 4e-3
--seed 42 --learning_rate 1e-2
--seed 43 --learning_rate 1e-2
--seed 44 --learning_rate 1e-2
--seed 45 --learning_rate 1e-2
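For larger grids, writing such a file by hand becomes tedious. The file above can be generated with a few lines of Python; this is a sketch with a made-up helper name, and it writes to a temporary folder only so that the example runs anywhere.

```python
import tempfile
from itertools import product
from pathlib import Path


def make_hparam_lines(learning_rates, seeds):
    """One line per combination; learning rate varies slowest, seed fastest."""
    return [
        f"--seed {seed} --learning_rate {lr}"
        for lr, seed in product(learning_rates, seeds)
    ]


lines = make_hparam_lines(["1e-3", "2e-3", "4e-3", "1e-2"], [42, 43, 44, 45])

# Written to a temporary folder here; on Lisa, write next to the job file instead.
out_file = Path(tempfile.mkdtemp()) / "array_job_hyperparameters.txt"
out_file.write_text("\n".join(lines) + "\n")
print(f"Wrote {len(lines)} lines, use --array=1-{len(lines)}")
```

The learning rates are kept as strings so that they appear in the file exactly as written above.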