Guide 1: Working with the Lisa cluster¶
Author: Phillip Lippe
Note: The Lisa system is being deprecated in favor of the new supercomputer Snellius. Most aspects discussed in this guide will remain the same, but we will update it once it is clear which system the DL1 course will use from late 2023 onwards.
This tutorial explains how to work with the Lisa cluster for the Deep Learning course at the University of Amsterdam. Every student will receive an account to have resources for training deep neural networks and get familiar with working on a cluster. It is recommended to have listened to the presentation by the SURFsara team or the TA team before going through this tutorial. Further, this tutorial assumes that you are familiar with using the terminal in Linux. If not, a crash course can be found here.
The Lisa cluster¶
The following section gives a quick introduction to the Lisa cluster, and how it is built up. A detailed description of the system can be found in the SURFsara user guide.
What is a cluster computer?¶
(Disclaimer: the following paragraph is an adapted version of the Lisa user guide. Credits: SURFsara team)
You can imagine a cluster computer as a collection of regular computers (known as nodes), tied together with network cables that are similar to the network cables in your home or office (see the figure below - credit: SURFsara team). Each node has its own CPU, memory and disk space, in addition to which they generally have access to a shared file system. On a cluster computer, you can run hundreds of computational tasks simultaneously.
Interacting with a cluster computer is different from a normal computer. Normal computers are mostly used interactively, i.e. you type a command or click with your mouse, and your computer instantly responds by e.g. running a program. Cluster computers are mostly used non-interactively.
A cluster computer such as Lisa mainly has two types of nodes: login nodes and batch nodes. You connect to Lisa through the login node (see next section). This is an interactive node: similar to your own PC, it immediately responds to the commands you type. There are only a few login nodes on a cluster computer, and you only use them for light tasks: adjusting your code, preparing your input data, writing job scripts, etc. Since the login nodes are only meant for light tasks, many users can be on the same login node at the same time. To prevent users from over-using the login node, any command that takes longer than 15 minutes will be killed.
Your ‘big’ calculations such as neural network training will be done on the batch nodes. These perform what is known as batch jobs. A batch job is essentially a recipe of commands (put together in a job script) that you want the computer to execute. Calculations on the batch nodes are not performed right away. Instead, you submit your job script to the job queue. As soon as sufficient resources (i.e. batch nodes) are available for your job, the system will take your job from the queue, and send it to the batch nodes for execution.
Architecture of Lisa¶
A visual description of the Lisa architecture can be found below (figure credit - SURFsara team). You can connect to any login node of Lisa to interact with the system. Via the login nodes, you have access to the file system that is shared across nodes. The one you will mainly interact with is /home, where you can store your code, data, etc. You can access your files from any login node, as well as from any compute node. You have a maximum disk space of 200GB, which should be sufficient for the DL course.
You do not directly interact with any compute node. Instead, you can request computational resources with a job script, and Lisa will assign a compute node to this job using a SLURM job scheduler. If all computational resources are occupied, your job will be placed in a queue, and scheduled when resources are available.
Lisa has multiple sets of compute nodes with different computational resources, also called partitions. The one we will use for the Deep Learning course is called gpu_shared_course, which provides us with compute nodes with the following resources:
Processor | CPU Cores | RAM Memory | GPUs
---|---|---|---
Bronze 3104 (1.7GHz) | 12 | 256GB | 4x GeForce 1080Ti, 11 GB GDDR5x
These computational resources are more than sufficient for the assignments in this course. For a job, we usually only use a single GPU from a compute node, meaning that we would also use 1/4th of the other resources (3 CPU cores, 64GB RAM). The scheduler of Lisa will assign multiple jobs to the same node if its computational resources are not exhausted yet, so no resources are wasted if we only use a single GPU.
First steps¶
After discussing the general architecture of Lisa, we are ready to discuss the practical aspects of how to use the Lisa cluster.
REMINDER: When you first receive your login data for Lisa, make sure to go to the user portal and change the password.
How to connect to Lisa¶
You can login to Lisa’s login nodes using a secure shell (SSH):
ssh -X lcur____@lisa.surfsara.nl
Replace lcur___ by your username. You will be connected to one of its login nodes, and have the view of a standard Linux system in your home directory. Note that you should only use the login node as an interface, and not as a compute unit. Do not run any training on this node, as it will be killed after 15 minutes and slows down the communication with Lisa for everyone. Instead, Lisa uses a SLURM scheduler to handle computationally expensive jobs (see below).
If you want to transfer files between Lisa and your local computer, you can use standard Unix commands such as scp
or rsync
, or graphical interfaces such as FileZilla (use port 22 in FileZilla) or WinSCP (for Windows PCs). A copy operation from Lisa to your local computer with rsync
, started from your local computer, could look as follows:
rsync -av lcur___@lisa.surfsara.nl:~/source destination
Replace lcur___
by your username, source
by the directory/file on Lisa you want to copy on your local machine, and destination
by the directory/file it should be copied to. Note that source
is referenced from your home directory on Lisa. If you want to copy a file from your local computer to Lisa, use:
rsync -av source lcur___@lisa.surfsara.nl:~/destination
Again, replace source
with the directory/file on your local computer you want to copy to Lisa, and destination
by the directory/file it should be copied to.
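If you prefer scp over rsync, the equivalent copy operations could look as follows (same placeholders as above; the -r flag is needed when copying directories):
scp -r lcur___@lisa.surfsara.nl:~/source destination
scp -r source lcur___@lisa.surfsara.nl:~/destination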
Modules¶
Lisa uses modules to provide you with various pre-installed software. This includes a plain Python installation, but also the NVIDIA libraries CUDA and cuDNN that can be necessary to access GPUs. However, for our course, we only need the Anaconda module:
module load 2021
module load Anaconda3/2021.05
Note that there also exists a 2022 module with slightly newer package versions, but it is not functional at the moment. Hence, we stick with the 2021 module. The CUDA and cuDNN libraries are already taken care of by installing the cudatoolkit in conda.
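If you want to check which modules are available or currently loaded, the standard module commands can be used, for example:
module avail Anaconda3   # list the available Anaconda versions (load the 2021 module first)
module list              # show the modules currently loaded in your session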
Install the environment¶
To run the Deep Learning assignments and other code like the notebooks on Lisa, you need to install the provided environment for Lisa (dl2022_gpu.yml
). You can either download it locally and copy it to your Lisa account via rsync or scp as described before, or simply clone the practicals GitHub repository on Lisa:
git clone https://github.com/uvadlc/uvadlc_practicals_2022.git
Lisa provides an Anaconda module, which you can load via module load Anaconda3/2021.05
as mentioned before (remember to load the 2021
module beforehand). We recommend installing the environment via a job file since the installation can take 20-30 minutes, and any command on the login node will be killed without warning after 15 minutes.
To do that, save the following content into a file called install_environment.job
in your home directory:
#!/bin/bash
#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:0
#SBATCH --job-name=InstallEnvironment
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out
module purge
module load 2021
module load Anaconda3/2021.05
cd $HOME/uvadlc_practicals_2022/
conda env create -f dl2022_gpu.yml
You can use e.g. nano
to do that. If the environment file is not in the cloned repository or you haven’t cloned the repo, change the cd statement to the directory where it is stored. Once the file is saved, start the job with the command sbatch install_environment.job
. The installation process is started on a compute node with a time limit of 4 hours, which should be sufficiently long. Let’s look at the next section to understand what we have actually done here with respect to ‘job
files’.
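For reference, submitting the installation job and checking that it has been queued could look as follows (squeue is explained in more detail below):
sbatch install_environment.job   # submits the job and prints its job ID
squeue -u lcur___                # check whether the installation job is pending or running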
Troubleshooting¶
If the installation via job file does not work, try to install the environment with the following command from the login node after navigating to the directory the environment file is in:
conda env create -f dl2022_gpu.yml
Note that jobs on the Lisa login nodes are limited to 15 minutes. This is often not enough to install the full environment. If the installation command is killed, you can simply restart it. If you get the error that a package is corrupted, go to /home/lcur___/.conda/pkgs/ and remove the directory of the corrupted package. If you get the error that the environment dl2022 already exists, go to /home/lcur___/.conda/envs/ and remove the folder ‘dl2022’.
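As a sketch of these clean-up commands (the corrupted package directory name below is only a hypothetical example):
rm -rf ~/.conda/pkgs/some_corrupted_package   # hypothetical name of the corrupted package directory
conda env remove -n dl2022                    # removes a half-installed dl2022 environment
rm -rf ~/.conda/envs/dl2022                   # alternative: remove the environment folder directly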
If you experience issues with the Anaconda module, you can also install Anaconda yourself (download link) or ask your TA for help.
Verifying the installation¶
When the installation process is completed, you can check if the process was successful by activating your environment on the login node via source activate dl2022
(remember to have loaded the Anaconda module beforehand), and starting a Python console by executing python
. It should say Python 3.10.6 | packaged by conda-forge
. If you see a different python version, you might not have activated the environment correctly.
In the python console, try to import pytorch via import torch
and check the version: torch.__version__
. It should say 1.13.0
. Finally, check whether PyTorch can access the GPU: torch.cuda.is_available()
. Note that in most cases, this will return False
because most login nodes on Lisa do not have GPUs. You can log in to a GPU login node via ssh lcur___@login-gpu.lisa.surfsara.nl
, and on this node, you should see that the command returns True
. If that is the case, you should
be all set.
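If you prefer a quick one-line check, the following command (a minimal sketch; run it on the GPU login node with the module loaded and the environment activated) prints the PyTorch version and GPU availability:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"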
The SLURM scheduler¶
Lisa relies on a SLURM scheduler to organize the jobs on the cluster. When logging into Lisa, you cannot just start a python script with your training; instead, you have to submit a job to the scheduler. The scheduler will decide when and on which node to run your job, based on the number of nodes available and the other jobs submitted.
Job files¶
We provide a template for a job file that you can use on Lisa. Create a file with any name you like, for example template.job
, and start the job by executing the command sbatch template.job
.
#!/bin/bash
#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out
module purge
module load 2021
module load Anaconda3/2021.05
# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate dl2022
# Run your code
srun python -u ...
Note: On Snellius, replace --partition=gpu_shared_course
with --partition=gpu
and --gres=gpu:1
with --gpus=1
.
Job arguments¶
You might have to change the #SBATCH
arguments depending on your needs. We describe the arguments below:
- partition: The partition of Lisa on which you want to run your job. As a student, you only have access to the partition gpu_shared_course, which provides you with nodes with NVIDIA GTX1080Ti GPUs (11GB).
- gres: Generic resources include the GPU, which is crucial for deep learning jobs. You can select up to two GPUs with your account, but if you haven’t designed your code to explicitly run on multiple GPUs, please use only one GPU (so no need to change what we have above).
- job-name: Name of the job that pops up when you list your jobs with squeue (see below).
- ntasks: Number of tasks to run with the job. In our case, we will always use 1 task.
- cpus-per-task: Number of CPUs you request from the nodes. The gpu_shared_course partition restricts you to max. 3 CPUs per job/GPU.
- time: Estimated time your job needs to finish. It is no problem if your job finishes earlier than the specified time. However, if your job takes longer, it will be instantaneously killed after the specified time. Still, don’t specify unnecessarily long times as this causes your job to be scheduled later (you need to wait longer in the queue if other people also want to use the cluster). A good rule of thumb is to specify ~20% more than what you would expect.
- mem: RAM of the node you need. Note that this is not the GPU memory, but the random access memory of the node. On gpu_shared_course, you are restricted to 64GB per job/GPU, which is more than you need for the assignments.
- output: Output file to which the slurm output should be written. The tag “%A” is automatically replaced by the job ID. Note that if you specify the output file to be in a directory that does not exist, no output file will be created.
SLURM allows you to specify many more arguments, but the ones above are the important ones for us. If you are interested in a full list, see here.
Scratch¶
If you work with a lot of data, or a larger dataset, it is advised to copy your data to the /scratch
directory of the node. Otherwise, the read/write operation might become a bottleneck of your job. To do this, simply use your copy operation of choice (cp
, rsync
, …), and copy the data to the directory $TMPDIR
. You should add this command to your job file before calling srun ...
. Remember to point to this data when you are running your code. If you have a dataset that can be downloaded, you can also directly download it to the scratch (this can sometimes be faster than copying). In case you also write something to the scratch, you need to copy it back to your home directory before the job finishes.
Edit Dec. 6, 2021: Due to internal changes to the filesystem of Lisa, it is required to use the scratch for any dataset such as CIFAR10. In PyTorch, the CIFAR10 dataset is structured into multiple large batches (usually 5), and only the batch that is currently needed is loaded. Hence, training requires a lot of disk read operations, which can slow down your training and is a challenge for the Lisa system when hundreds of students share the same filesystem. Therefore, you have to use the scratch for such datasets. For most parts of the assignments, you can do this by specifying --data_dir $TMPDIR on the python command in your job file (check if the argument parser has this argument, otherwise you can add it yourself). This will download the dataset to the scratch and only load it from there. We recommend using this approach also in future course editions.
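As an illustration, the relevant part of a job file could look as follows (a sketch; the dataset and result folder names are hypothetical placeholders):
# Copy the dataset to the node-local scratch before training
cp -r $HOME/datasets/my_dataset $TMPDIR/
# Point your code to the scratch copy
srun python -u train.py --data_dir $TMPDIR/my_dataset
# If your code writes results to the scratch, copy them back afterwards
cp -r $TMPDIR/results $HOME/results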
Starting and organizing jobs¶
To start a job, you simply have to run sbatch jobfile
where you replace jobfile
by the filename of the job. Note that no specific file suffix like .job is necessary for the job (you can use .txt or any other suffix you prefer). After your job has been submitted, it will first be placed into a waiting queue. The SLURM scheduler decides when to start your job based on the requested time of your job, all other jobs currently running or waiting, and the available nodes.
Besides sbatch
, you can interact with the SLURM scheduler via the following commands:
- squeue: Lists all jobs that are currently submitted to Lisa. This can be a lot of jobs as it includes all partitions. You can make it partition-specific using squeue -p gpu_shared_course, or only list the jobs of your account: squeue -u lcur___ (again, replace lcur___ by your username). See the slurm documentation for details.
- scancel JOBID: Cancels and stops a job, independent of whether it is running or pending. The job ID can be found using squeue, and is printed when submitting the job via sbatch.
- scontrol show job JOBID: Shows additional information of a specific job, like the estimated start time.
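A typical sequence of commands could look like this (the job ID below is only an example):
sbatch template.job          # submits the job and prints its job ID
squeue -u lcur___            # check whether the job is pending or running
scontrol show job 1234567    # inspect details such as the estimated start time
scancel 1234567              # cancel the job if you no longer need it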
Interactive sessions¶
As an alternative to running jobs via sbatch, you can request an interactive job session. In this format, you gain access to the node via the terminal/command line, and can interact with it as if you had connected directly via ssh to this compute node. Hence, you can debug your script much more easily and have a faster response time. However, please keep in mind the following two disadvantages: 1. If you happen to disconnect from Lisa because of an unstable connection or your computer going into stand-by mode, the job will be canceled. 2. Your job is not automatically terminated when your script has finished running; you have to manually kill the job once you are done. If you forget it, you block the compute node for other students and waste credits that the UvA paid for. Hence, only use interactive sessions for short jobs/scripts, for example, if you want to debug whether your script starts running and trains the model. Do not use interactive sessions to train a model for a long time.
In order to start an interactive session on Lisa, you can use srun
with the same input parameters as for the job file. For example:
srun --partition=gpu_shared_course --gres=gpu:1 --mem=32000M --ntasks=1 --cpus-per-task=3 --time=00:10:00 --pty bash -i
This will start an interactive session with a GPU for 10 minutes. Once resources have been allocated, you will see in your terminal that you are now on one of the compute nodes, for example, r32n5
. As a first step, you need to load the modules (module load 2021; module load Anaconda3/2021.05
) and activate your conda environment. Then, you are ready to run your script.
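Inside the interactive session, the first steps could look as follows (a sketch; train.py is a hypothetical script name):
module load 2021
module load Anaconda3/2021.05
source activate dl2022
python -u train.py    # hypothetical training script, now running directly on the compute node
exit                  # remember to end the session once you are done to free the node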
Troubleshooting¶
It can happen that you encounter some issues when interacting with Lisa. A short FAQ is provided on the SURFsara website, and here we provide a list of common questions/situations we have seen from past students.
Lisa is refusing connection¶
It can occasionally happen that Lisa refuses the connection when you try to ssh into it. If this happens, you can first try to log in to a different login node. Specifically, try the following three login nodes:
ssh -X lcur____@login3.lisa.surfsara.nl
ssh -X lcur____@login4.lisa.surfsara.nl
ssh -X lcur____@login-gpu.lisa.surfsara.nl
If none of those work, you can try to use the Pulse Secure UvA VPN before connecting to Lisa. If this still does not work, then the connection issue is likely not on your side. The problem often resolves itself after 2-3 hours, and Lisa lets you log in again. If the problem does not resolve after a couple of hours, please contact your TA, and possibly the SURFsara helpdesk.
Slurm output file missing¶
If a job of yours is running, but no slurm output file is created, check whether the path to the output file specified in your job file actually exists. If the specified file points to a non-existing directory, no output file will be created. Note that your job will still run in this case, but you are running it “blind”, without seeing the stdout or stderr channels.
Slurm output file is empty for a long time¶
The slurm output file can lag behind in showing the outputs of your running job. If your job has been running for a couple of minutes and you would have expected a few print statements by now, try to flush your stdout stream (how to flush the output in python).
All my jobs are pending¶
With your student account, the SLURM scheduler restricts you to running only two jobs in parallel at a time. However, you can still queue more jobs that will run in sequence. This is done because, with more than 200 students, Lisa could get crowded very fast if we don’t guarantee a fair share of resources. If all of your jobs are pending, you can check the reason for pending in the last column of squeue
. All reasons are listed in the squeue
documentation under JOB REASON CODES. The following ones are common:
- Priority: There are other jobs on Lisa with a higher priority that are also waiting to be run. This means you just have to be patient.
- QOSResourceLimit: The job is requesting more resources than allowed. Check your job file, as you are only allowed to have at max. 2 GPUs, 6 CPU cores and 125GB RAM.
- Resources: All nodes on Lisa are currently busy, yours will be scheduled soon.
You can also see the estimated start time of a job by running scontrol show job JOBID
. However, note that this is the “worst case” estimate for the currently submitted jobs, i.e. assuming all currently running jobs use their maximum runtime. At the same time, if more people submit jobs with a higher priority, yours can fall back in the queue and get a later start time.
PyTorch or other packages cannot be imported¶
If you run a job and see the python error message in the slurm output file that a package is missing although you have installed it in the environment, there are two things to check. Firstly, make sure not to have the environment activated on the login node when submitting the job. This can lead to an error in the anaconda module such that packages are not found on the compute node. Secondly, check that you activate the environment correctly. To verify that the correct python version is used, you can add the command which python before the training command in your job file. This prints the path of the python executable that will be used, which should point to the dl2022 environment.
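For illustration, the relevant part of the job file could look like this (a sketch; train.py is a hypothetical script name):
source activate dl2022
which python               # should print a path like /home/lcur___/.conda/envs/dl2022/bin/python
srun python -u train.py    # hypothetical training script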
My job runs very slow¶
If your job executes your script much slower than you expect, check for two things: (1) have you requested a GPU and are you using it, and (2) are you using the scratch for your dataset? Not using the scratch for your dataset can create a significant communication bottleneck, especially if multiple students do it at the same time. Make sure to download or copy your dataset to the scratch and load it from there.
I am not able to use Lisa at all¶
If there are major issues with Lisa during the course (e.g. the cluster goes into maintenance for a long time, issues with the filesystem, etc.), you can make use of Google Colab, as we already do for all notebook tutorials here. The assignments do not necessarily require a large amount of compute, and the models can often be comfortably trained on a GPU provided by Google Colab. For an introduction to Google Colab, see this tutorial.
Advanced topics¶
Password-less login¶
Typing your password every time you connect to Lisa can become annoying. To enable a safe, password-less connection, you can add your public ssh key to the SURFsara user portal. The next time you log in from your machine to Lisa, it will only check the ssh key and not ask you for your password anymore.
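If you do not have an ssh key pair yet, you can generate one on your local machine and print the public part, for example (the key type and file name below are the defaults of ssh-keygen):
ssh-keygen -t ed25519        # generates a key pair under ~/.ssh/ (follow the prompts)
cat ~/.ssh/id_ed25519.pub    # prints the public key, which you can paste into the user portal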
Remote development with VSCode or PyCharm¶
The common workflow with clusters is that you first code locally and test your implementation on short runs on the CPU or possibly a local GPU, then sync your code to the cluster (e.g. via git), and finally run the full training/process on a compute node of the cluster. If you prefer to directly code on Lisa, you can do so via remote connections in tools like VSCode or PyCharm. Essentially, these IDEs can connect via SSH to Lisa, so that it looks to you like you are coding locally, but all code you are editing is directly saved on Lisa. Note though that to run your code, you still need to create a job script and submit it via SLURM. The login nodes, to which these IDEs connect you, are not meant for debugging or running code. Any process that takes longer than 15 minutes will be killed (this can sometimes also include the SSH connection of VSCode/PyCharm). For more details on remote development and how to set it up, see the SURFsara documentation on VSCode and PyCharm.
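To avoid typing the full hostname and username in these tools, you can add a host entry to the ~/.ssh/config file on your local machine (a sketch; the alias lisa is an arbitrary choice):
Host lisa
    HostName lisa.surfsara.nl
    User lcur___
After this, ssh lisa (and the remote host lisa in VSCode/PyCharm) connects with your username filled in automatically.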
Tracking GPU stats¶
If you are curious whether you use the GPU to its full capacity, you can monitor its utilization as follows. First, you submit your job and check its job ID via squeue -u [userid]
(with your user ID/name), or use the ID that was printed out after submitting the job via sbatch
. Next, you can log into the node via slurm_jobmonitor [jobid]
where you need your job ID. This gives you an interactive view on the node. Finally, you can run nvtop
to track the GPU utilization. More details
can be found here.
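Put together, monitoring a running job could look like this (the job ID is only an example):
squeue -u lcur___           # find the ID of your running job
slurm_jobmonitor 1234567    # open an interactive view on the node running the job
nvtop                       # track the GPU utilization from within that view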
Job Arrays¶
You might come into the situation where you need to run a hyperparameter search over multiple values, and don’t want to write an endless number of job scripts. A much more elegant solution is a job array. Job arrays are created with two files: a job file, and a hyperparameter file. The job file will start multiple sub-jobs that each use a different set of hyperparameters, as specified in the hyperparameter file. In the job file, you need to add the argument #SBATCH --array=.... The argument specifies how many sub-jobs you want to start, how many to run in parallel (at maximum), and which lines to use from the hyperparameter file. For example, if we specify #SBATCH --array=1-16%8, this means that we start 16 jobs using lines 1 to 16 of the hyperparameter file, and run at most 8 jobs in parallel at the same time. Note that the limit on parallel jobs is there to prevent you from blocking the whole cluster. However, with your student accounts, you will not be able to run more than 1 job in parallel anyway. The template job file array_job.job looks slightly different from the one we had before. The slurm output file is specified using %A and %a: %A is automatically replaced with the job ID, while %a is the index of the job within the array (so 1 to 16 in our example above). Below, we also added a block for creating a checkpoint folder for the job array, and copying the job file including the hyperparameters to that folder. This is good practice for ensuring reproducibility. Finally, in the training call, we specify the checkpoint path (make sure to have implemented this argument in your argparse) with the addition of experiment_${SLURM_ARRAY_TASK_ID}, which is a sub-folder in the checkpoint directory with the sub-job ID (1 to 16 in the example). The next line, $(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1), reads the N-th line of the hyperparameter file and hence passes those hyperparameter arguments to the training script.
File array_job.job
:
#!/bin/bash
#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleArrayJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --array=1-16%8
#SBATCH --output=slurm_array_testing_%A_%a.out
module purge
module load 2021
module load Anaconda3/2021.05
# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate ...
# Good practice: define your directory where to save the models, and copy the job file to it
JOB_FILE=$HOME/.../array_job.job
HPARAMS_FILE=$HOME/.../array_job_hyperparameters.txt
CHECKPOINTDIR=$HOME/.../checkpoints/array_job_${SLURM_ARRAY_JOB_ID}
mkdir -p $CHECKPOINTDIR
rsync $HPARAMS_FILE $CHECKPOINTDIR/
rsync $JOB_FILE $CHECKPOINTDIR/
# Run your code
srun python -u train.py \
--checkpoint_path $CHECKPOINTDIR/experiment_${SLURM_ARRAY_TASK_ID} \
$(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1)
The hyperparameter file is nothing more than a text file in which each line denotes one set of hyperparameters for which you want to run an experiment. There is no specific order in which you need to put the lines, and you can extend the lines with as many hyperparameter arguments as you want.
File array_job_hyperparameters.txt
:
--seed 42 --learning_rate 1e-3
--seed 43 --learning_rate 1e-3
--seed 44 --learning_rate 1e-3
--seed 45 --learning_rate 1e-3
--seed 42 --learning_rate 2e-3
--seed 43 --learning_rate 2e-3
--seed 44 --learning_rate 2e-3
--seed 45 --learning_rate 2e-3
--seed 42 --learning_rate 4e-3
--seed 43 --learning_rate 4e-3
--seed 44 --learning_rate 4e-3
--seed 45 --learning_rate 4e-3
--seed 42 --learning_rate 1e-2
--seed 43 --learning_rate 1e-2
--seed 44 --learning_rate 1e-2
--seed 45 --learning_rate 1e-2
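Instead of writing such a file by hand, you can also generate it with a small shell loop, for example (a sketch that reproduces the file above):
for lr in 1e-3 2e-3 4e-3 1e-2; do
    for seed in 42 43 44 45; do
        echo "--seed $seed --learning_rate $lr"
    done
done > array_job_hyperparameters.txt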
Additional partitions¶
During the Master’s program at the University of Amsterdam, you will likely receive Lisa accounts in other courses, as well as during your Master thesis (if performed at the university). In this case, you can get access to partitions besides gpu_shared_course, allowing you to run more computationally heavy experiments. A full list of partitions can be found here. Besides the nodes with the GeForce 1080Ti GPUs, the most commonly used partition for DL experiments is called gpu_titanrtx, which has nodes with the following configuration:
Processor | CPU Cores | RAM Memory | GPUs
---|---|---|---
Gold 5118 (2.3GHz) | 24 | 192GB | 4x Titan RTX, 24 GB GDDR6
The Titan RTX GPUs are faster and provide more GPU memory than the 1080Ti’s. However, these nodes cost more than twice as many credits (91.2 vs. 42.1 credits per full node per hour, see accounting), so use them only if needed. Furthermore, in many cases you do not need a full node and might only need a single GPU. Similar to the course partition, there exist gpu_titanrtx_shared and gpu_shared, which give you access to partial nodes, like 1 or 2 GPUs.
If you are in need of pure CPU-based jobs, you can use the shared
and normal
partitions.
Additional links¶
Many more details on Lisa, SLURM, etc. can be found on the SURFsara wiki, which also provides a different perspective on the aspects we have discussed in this tutorial. A (non-exhaustive) list of useful links: