Guide 2: Research projects with PyTorch
Based on some feedback I got, we will try to summarize tips and tricks on how to set up and structure large research projects in PyTorch, such as your Master's thesis.
Feel free to contribute yourself if you have good ideas.
Choosing the right framework can be essential. If your optimization loop is a standard one (a single forward pass that returns a loss), consider going with PyTorch Lightning. It reduces the code overhead a lot and makes it easy to scale your model to multiple GPUs and/or nodes if needed. Nonetheless, if you expect to change the default training procedure quite a bit, consider going with plain PyTorch and writing your own framework. It might take more time initially, but it makes edits to the optimization procedure easier.
If you write your own framework, the following can be used as an example setup:
general/
│   train.py
│   task.py
│   mutils.py
layers/
experiments/
│   task1/
│        train.py
│        task.py
│        eval.py
│        dataset.py
│   task2/
│        train.py
│        task.py
│        eval.py
│        dataset.py
The general/train.py file summarizes the default operations every model needs (training loop, loading/saving the model, setting up the model, etc.). If you use PyTorch Lightning, this reduces to a train file per task, which only needs to specify the trainer object.
The general/task.py file provides a template for the task-specific parts (training step, validation step, etc.). If you use PyTorch Lightning, this would be the definition of the Lightning Module.
The layers/ (or models/) folder contains the code for specifying the nn.Modules you use for setting up the model.
The experiments folder contains the task-specific code. Each task has its own train.py for specifying the argument parser, setting up the model, etc., while its task.py overrides the general task template. The eval.py file should take a checkpoint directory of a trained model as input and evaluate this model on the test dataset. Finally, the file dataset.py contains all parts you need for setting up the dataset.
Note that this template assumes that you might have multiple different tasks and multiple different models. If you have a simpler setup, you can shrink the template accordingly.
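As a rough illustration of the template idea, general/task.py could hold an abstract base class that every task in experiments/ subclasses. The class and method names below (TaskTemplate, training_step, validation_step, ToyRegressionTask) are hypothetical and not taken from this guide, and the sketch uses plain Python instead of PyTorch to stay self-contained.

```python
from abc import ABC, abstractmethod


class TaskTemplate(ABC):
    """Hypothetical base class that could live in general/task.py."""

    def __init__(self, model, hparams):
        self.model = model
        self.hparams = hparams

    @abstractmethod
    def training_step(self, batch):
        """Return the loss for one training batch."""

    @abstractmethod
    def validation_step(self, batch):
        """Return a dict of validation metrics for one batch."""


class ToyRegressionTask(TaskTemplate):
    """Example of what a task-specific task.py might override."""

    def training_step(self, batch):
        inputs, targets = batch
        preds = [self.model(x) for x in inputs]
        # Mean squared error over the batch
        return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

    def validation_step(self, batch):
        return {"mse": self.training_step(batch)}
```

The generic training loop then only calls training_step and validation_step, so adding a new task means writing a new subclass without touching the loop.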
It is good practice to use argument parsers for specifying hyperparameters. Argument parsers allow you to start a training run like
python train.py --learning ... --seed ... --hidden_size ...
If you have multiple models to choose from, you will have multiple sets of hyperparameters. A good summary of how to handle that can be found in the PyTorch Lightning documentation, and the approach works without Lightning too. In essence, you can define a static method for each model that returns a parser for its specific hyperparameters. This makes your code cleaner and makes it easier to define new tasks without copying the whole argument parser.
To ensure reproducibility (more details below), it is recommended to save the arguments as a json file or similar in your checkpoint folder.
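A minimal sketch of this pattern with the standard argparse and json modules could look as follows. The class MLPModel, the helper names, the flag names, and the file name args.json are assumptions made for illustration only.

```python
import argparse
import json
import os


class MLPModel:
    """Stand-in for a model class; only the parser logic matters here."""

    @staticmethod
    def add_model_args(parser):
        # Hyperparameters that only this model understands
        group = parser.add_argument_group("MLPModel")
        group.add_argument("--hidden_size", type=int, default=256)
        group.add_argument("--num_layers", type=int, default=2)
        return parser


def build_parser(model_cls):
    parser = argparse.ArgumentParser()
    # Arguments shared by all models and tasks
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--checkpoint_dir", type=str, default="checkpoints/debug")
    # Model-specific arguments are added by the model class itself
    return model_cls.add_model_args(parser)


def save_args(args, checkpoint_dir):
    """Store the parsed arguments as JSON next to the checkpoints."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "args.json")
    with open(path, "w") as f:
        json.dump(vars(args), f, indent=2)
    return path
```

In a train.py, you would call build_parser(MLPModel).parse_args() once and pass the result to save_args before training starts.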
In general, hyperparameter search is all about experience. Once you have trained a lot of models, it will become easier for you to pick reasonable first-guess hyperparameters.
The first approach to take is to look at work related to your model, and see what others have used as hyperparameters for similar models. This will help you to get started with a reasonable choice.
Hyperparameter search can be expensive. Thus, try to do the search on shallow models first before scaling them up.
Although a large grid search is the most thorough way to get the optimum out of your model, it is often too expensive to run. Try to group hyperparameters, and optimize each group one by one.
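The group-by-group idea can be sketched as a simple loop: search the grid of one group while all other hyperparameters stay fixed at their current best values, then move on to the next group. The function name groupwise_search and its arguments are hypothetical, and evaluate stands for whatever training-plus-validation routine you have.

```python
from itertools import product


def groupwise_search(groups, defaults, evaluate):
    """Tune one group of hyperparameters at a time, keeping the rest fixed.

    groups:   list of dicts mapping hyperparameter name -> candidate values
    defaults: dict with a starting value for every hyperparameter
    evaluate: callable(config) -> validation score, higher is better
    """
    best = dict(defaults)
    best_score = evaluate(best)
    for group in groups:
        names = list(group)
        # Full grid inside the group, everything else stays at its best value
        for values in product(*(group[n] for n in names)):
            candidate = {**best, **dict(zip(names, values))}
            score = evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```

The number of runs grows with the sum of the group grid sizes instead of their product. The price is that interactions between hyperparameters in different groups are not explored, so strongly coupled hyperparameters should go into the same group.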
PyTorch Lightning provides a lot of useful tools for hyperparameter search, such as:
Learning rate finder that plots the learning rate vs loss for a few initial batches, and helps you to choose a reasonable learning rate.
Autoscaling batch sizes which finds the largest possible batch size given your GPU (helpful if you have very deep, large models, and it is obvious you need the largest batch size possible).
For comparing multiple hyperparameter configurations, you can add them to TensorBoard. This is a clean way of comparing multiple runs. If interested, a blog on this can be found here.
There are multiple libraries that support you in automatic hyperparameter search. A good overview for those in PyTorch can be found here.
Everything is about reproducibility. Make sure you can reproduce any training you do with the same random values, batches, etc. You will come to a point where you have tried a lot of different approaches, but none were able to improve upon one of your previous runs. When you try to run the model again with the best hyperparameters, you don’t want to have a bad surprise (believe me, enough people have this issue, and it might also happen to you). Hence, before starting any grid search, make sure you are able to reproduce runs. Run two jobs in parallel on Lisa with the same hyperparams, seeds, etc., and if you don’t get the exact same results, stop and try to fix it before anything else.
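One ingredient of reproducible runs is seeding every random number generator before anything else happens. The helper below is a sketch, not a complete recipe: the name set_seed is hypothetical, NumPy and PyTorch are only seeded if they are installed, and some GPU operations may stay non-deterministic even with these settings.

```python
import random


def set_seed(seed):
    """Seed the random number generators a training run typically touches."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some speed for deterministic cuDNN convolutions
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed, nothing to seed
```

Call set_seed once at the very start of train.py, before the dataset and the model are created, and use the seed from your argument parser.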
Another aspect of reproducibility is that saving and loading a model must work without any problems. Before a long training run, make sure that you are able to load a saved model from disk and achieve the exact same test score as you had during training.
Print your hyperparameters into the SLURM output file (a simple print statement in Python). This will help you identify the runs, and you can easily check whether Lisa executes the job you intended. Further, hyperparameters should be stored in a separate file in your checkpoint directory, whether saved by PyTorch Lightning or by yourself.
When running a job, copy the job file automatically to your checkpoint folder. This improves reproducibility by ensuring you have the exact run command at hand.
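Both habits, printing the hyperparameters and keeping a copy of the job file, fit into one small helper. The function name log_run_setup is hypothetical, and how the job file path reaches your script (for example as an extra command-line argument set in the job file itself) is left as an assumption.

```python
import os
import shutil


def log_run_setup(hparams, job_file, checkpoint_dir):
    """Print the hyperparameters and copy the job file into the checkpoint folder."""
    # Printed to stdout, which SLURM redirects into the job's output file
    print("Hyperparameters:")
    for name in sorted(hparams):
        print(f"  {name}: {hparams[name]}")
    os.makedirs(checkpoint_dir, exist_ok=True)
    if job_file is not None and os.path.isfile(job_file):
        # Returns the path of the copy inside checkpoint_dir
        return shutil.copy(job_file, checkpoint_dir)
    return None
```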
Besides the SLURM output file, create an output file in which you store the best training, validation and test score. This helps you when you want to quickly compare multiple models or create statistics of your results.
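A possible layout for such a file is one small JSON document per run, which a second helper collects into a single table. The file name results.json, the function names, and the assumption that all run folders sit directly below one root directory are illustrative choices.

```python
import json
import os


def save_results(checkpoint_dir, train_score, val_score, test_score):
    """Write the best scores of a run to results.json in its checkpoint folder."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "results.json")
    with open(path, "w") as f:
        json.dump({"train": train_score, "val": val_score, "test": test_score}, f, indent=2)
    return path


def collect_results(root_dir):
    """Gather results.json from every run folder directly below root_dir."""
    table = {}
    for run in sorted(os.listdir(root_dir)):
        path = os.path.join(root_dir, run, "results.json")
        if os.path.isfile(path):
            with open(path) as f:
                table[run] = json.load(f)
    return table
```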
If you want to be on the safe side and use git, you can even print/save the hash of the git commit you are currently on, and any changes you have made to the files since then. An example of how to do this can be found here.
DL models are inherently noisy, and no two runs are the same if you don’t ensure a deterministic execution. Before running a grid search, try to get a feeling of how noisy your experiments might be. The more noise you expect compared to your result scale, the more versions of your model you need to run to get a statistically significant difference between settings.
After finishing the grid search, run another model of the best configuration with a new seed. If the score is still the best, take the model. If not, consider running a few more seeds for the top \(k\) models in your grid search. Otherwise, you risk taking a suboptimal model, which was just lucky to be the best for a specific seed.
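To compare configurations across seeds, it helps to look at the mean and the spread of the scores instead of the single best run. The sketch below uses the standard statistics module; the function names and the data layout (a dict from configuration name to a list of scores, one per seed) are assumptions.

```python
from statistics import mean, stdev


def summarize(scores_per_config):
    """Map each configuration to (mean score, standard deviation, number of seeds)."""
    summary = {}
    for name, scores in scores_per_config.items():
        # The sample standard deviation needs at least two seeds
        spread = stdev(scores) if len(scores) > 1 else float("nan")
        summary[name] = (mean(scores), spread, len(scores))
    return summary


def rank_configs(scores_per_config):
    """Configuration names sorted by mean score, best first."""
    summary = summarize(scores_per_config)
    return sorted(summary, key=lambda name: summary[name][0], reverse=True)
```

For example, a configuration with scores 0.90, 0.80 and 0.82 has the best single run, but a configuration with 0.85, 0.86 and 0.84 has the higher mean and a much smaller spread, so rank_configs would place the second one first.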
The learning rate is an important parameter, which depends on the optimizer, the model, and many other hyperparameters.
A good starting point is usually 0.1 for SGD, and 1e-3 for Adam.
The deeper the model is, the lower the learning rate usually should be. For instance, Transformer models usually apply learning rates of 1e-5 to 1e-4 for Adam.
Consider using the PyTorch Lightning learning rate finder toolkit for an initial good guess.
Similarly to the learning rate, the scheduler to apply again depends on the optimizer and model.
For image classifiers with SGD as optimizer, the multi-step LR scheduler has proven to be a good choice.
Models trained with Adam commonly use a smooth exponential decay in the learning rate or a cosine-like scheduler.
For Transformers: remember to use a learning rate warmup. The cosine scheduler is often used for decaying the learning rate afterwards, but can also be replaced by an exponential decay.
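The schedules mentioned above can be written as plain functions that return a multiplicative factor for the base learning rate. The following is a sketch with hypothetical function names; the linear shape of the warmup and the decay factor of 0.1 are common choices and not prescribed by this guide.

```python
import math


def multistep_factor(epoch, milestones, gamma=0.1):
    """Multiply the learning rate by gamma at every milestone epoch that has passed."""
    return gamma ** sum(epoch >= m for m in milestones)


def warmup_cosine_factor(step, warmup_steps, total_steps):
    """Linear warmup to 1.0, followed by a cosine decay to 0.0 at total_steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    # Fraction of the decay phase that has been completed, clipped to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch, a function of this kind can be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR, which multiplies the optimizer's initial learning rate by the returned factor.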
Regularization is important in networks if you see a significantly higher training performance than test performance.
The regularization parameters all interact with each other, and hence must be tuned together. The most commonly used regularization techniques are:
Dropout is usually a good idea, as it is applicable to most architectures and has been shown to reduce overfitting effectively.
If you want to use weight decay in Adam, remember to use
Domain specific regularization
There are a couple of regularization techniques that depend on your input data/domain. The most common include:
Computer Vision: image augmentation like horizontal flip, rotation, scale-and-crop, color distortion, Gaussian noise, etc.
NLP: input dropout of whole words.
Graphs: dropping edges, nodes, or part of the features of all nodes.
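Two of these techniques are simple enough to sketch in a few lines of plain Python. The function names are hypothetical, and replacing a dropped word by a placeholder token such as <unk> is only one possible variant of word dropout. Both functions should be applied during training only.

```python
import random


def word_dropout(tokens, p=0.1, unk_token="<unk>", rng=random):
    """Replace each token by unk_token with probability p."""
    return [unk_token if rng.random() < p else tok for tok in tokens]


def drop_edges(edges, p=0.1, rng=random):
    """Keep each edge of a graph, given as (source, target) pairs, with probability 1 - p."""
    return [e for e in edges if rng.random() >= p]
```

Passing a seeded random.Random instance as rng keeps the augmentation reproducible, in line with the reproducibility advice earlier in this guide.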
Grid search with SLURM
Job arrays allow you to start N jobs in parallel, each running with slightly different settings.
It is effectively the same as creating N job files and calling sbatch ... N times, but that quickly becomes tedious and messy.
Writing job arrays by hand can be tedious, so if you have to do this often, it is advisable to write a script that generates the hyperparameter files automatically (for instance, by repeating every hyperparameter configuration with 4 different seeds). However, if you are using PyTorch Lightning, you can directly create a job array file. The documentation for this can be found here.
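A generator script of this kind can be very short. The sketch below writes one line of command-line arguments per combination of configuration and seed; the function names, the flag names, and the file layout are assumptions. Each task of the job array would then read the line that matches its index, which SLURM exposes in the SLURM_ARRAY_TASK_ID environment variable.

```python
from itertools import product


def make_hparam_lines(grid, seeds):
    """One command-line argument string per (configuration, seed) pair."""
    names = list(grid)
    lines = []
    for values in product(*(grid[n] for n in names)):
        config = " ".join(f"--{n} {v}" for n, v in zip(names, values))
        # Repeat every configuration once per seed
        for seed in seeds:
            lines.append(f"{config} --seed {seed}")
    return lines


def write_hparam_file(path, grid, seeds):
    """Write the argument strings to a file and return the number of jobs."""
    lines = make_hparam_lines(grid, seeds)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

The return value of write_hparam_file tells you how many array tasks to request when submitting the job.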