Guide 3: Debugging in PyTorch

When you start learning PyTorch, you are expected to hit bugs and errors. To help you debug your code, we summarize the most common mistakes in this guide, explain why they happen, and show how you can solve them.

My model is underperforming

Situation: Your model is not reaching the performance it should, but PyTorch is not telling you why. These are the most annoying bugs since they can be hard to debug. Nonetheless, there are a couple of things you can check. If none of these solve the problem for you, one of us TAs will help you debug your code in more detail.

Softmax, CrossEntropyLoss and NLLLoss

The most common mistake is the mismatch between loss function and output activation function. The loss module nn.CrossEntropyLoss in PyTorch performs two operations: nn.LogSoftmax and nn.NLLLoss. Hence, the input to this loss module should be the output of your last linear layer. Do not apply a softmax before the Cross-Entropy loss. Otherwise, PyTorch will apply a log-softmax on your softmax outputs, which significantly worsens the performance and gives you headaches.

If you use the loss module nn.NLLLoss, you need to apply the log-softmax yourself. NLLLoss requires log-probabilities, not plain probabilities. Hence, make sure to apply nn.LogSoftmax or nn.functional.log_softmax, and not nn.Softmax.
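
To make the pairing concrete, here is a minimal sketch showing that both options compute the same loss (the shapes of 8 samples and 10 classes are just for illustration):

[ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 10)             # raw outputs of the last linear layer
labels = torch.randint(0, 10, (8,))

# Option 1: CrossEntropyLoss directly on the logits (no softmax!)
loss_ce = F.cross_entropy(logits, labels)

# Option 2: NLLLoss on log-probabilities (log-softmax, not softmax!)
loss_nll = F.nll_loss(F.log_softmax(logits, dim=-1), labels)

print(torch.allclose(loss_ce, loss_nll))  # True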

Softmax over the right dimension

Pay attention to the dimension you apply your softmax over. Usually, this is the last dimension of your output tensor, which you can specify with e.g. nn.Softmax(dim=-1). If you mix up the dimensions, your model ends up making random predictions.
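
As a sanity check, the probabilities should sum to one along the class dimension. A minimal sketch (sizes are assumptions):

[ ]:
out = torch.randn(64, 10)                      # [batch_size, num_classes]
probs = F.softmax(out, dim=-1)                 # normalize over the classes
print(probs.sum(dim=-1))                       # each row sums to 1
# With dim=0, you would normalize over the batch instead - a silent bug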

Categorical data and Embedding

Categorical data, such as language characters or the datasets you are given in assignment 2, requires special care. Data like the language characters ‘a’, ‘b’, ‘c’, etc. are usually represented as integers 0, 1, 2, etc. Do not use these integers as direct input for categorical data. If you enter those integers as inputs to the model, two problems arise.

  1. You bias the model to see relations where there are none. In the language example above, the model would think that ‘a’ is closer to ‘b’ than to ‘o’, although ‘a’ and ‘o’ are both vowels, and the closeness of ‘a’ and ‘b’ in the alphabet does not necessarily say anything about their usage.

  2. If you have many categories, your input values range from 0 to more than 50. The model will have a hard time separating all those categories without blending some of them together. Hence, the model loses a lot of information, although this is unnecessary.

The much better option in the case of categorical data is to use one-hot vectors or embeddings. A one-hot vector represents each category by a vector of 0s, with a single index being 1. This makes the model’s life much easier as it can distinguish between the categories in a very simple manner (if a feature is non-zero, the input belongs to a specific category). Alternatively, you can learn an embedding with the help of nn.Embedding. The inputs to this module are:

  • num_embeddings which is the number of different categories you have in your input data (in case of language characters, something like 26 as you have ‘a’ to ‘z’)

  • embedding_dim which is the number of features you want to represent each category with. If you use the embedding directly as input to an LSTM or RNN, a good rule of thumb is to use 1/4 - 1/2 of your hidden size inside the LSTM.

  • padding_idx which allows you to assign a specific index to the padding symbol; the embedding at this index does not receive gradient updates during training. Can be skipped if you do not use padding.

The embedding feature vectors are randomly initialized from \(\mathcal{N}(0,1)\). Do not overwrite this init with Kaiming, Xavier or similar. The standard deviation of 1 is used because initializations, activation functions, etc. have been designed for an input standard deviation of 1. Example usage of the embedding module:

[1]:
import torch
import torch.nn as nn
# Create 5 embedding vectors each with 32 features
embedding = nn.Embedding(num_embeddings=5,
                         embedding_dim=32)

# Example integer input
input_tensor = torch.LongTensor([[0, 4], [2, 3], [0, 1]])

# Get embeddings
embed_vectors = embedding(input_tensor)

print("Input shape:", input_tensor.shape)
print("Output shape:", embed_vectors.shape)
print("Example features:\n", embed_vectors[:,:,:2])
Input shape: torch.Size([3, 2])
Output shape: torch.Size([3, 2, 32])
Example features:
 tensor([[[ 0.0504, -0.2422],
         [ 0.0342,  0.2217]],

        [[ 2.7813, -0.3641],
         [-0.0981,  0.4069]],

        [[ 0.0504, -0.2422],
         [ 1.2092,  0.1760]]], grad_fn=<SliceBackward>)

The nn.Embedding object is a module like a linear layer or convolution. Thus, it needs to be defined in the __init__ function of your higher-level module. Do not create the Embedding module in the forward pass. Otherwise, you get newly initialized embeddings every time you run the model, and hence, your model is not able to learn.

Time dimension in nn.LSTM

By default, PyTorch’s nn.LSTM module expects the input to be shaped as [seq_len, batch_size, input_size]. Make sure that you do not confuse the sequence length and batch dimension. The LSTM would still run without an error, but would give you wrong results. If you want to change this behavior to accept an input shape of [batch_size, seq_len, input_size], you can specify the argument batch_first=True when creating the LSTM object. Have a closer look at the documentation for details.
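
A minimal sketch with assumed sizes, using batch_first=True:

[ ]:
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
x = torch.randn(8, 20, 32)                     # [batch_size, seq_len, input_size]
output, (h_n, c_n) = lstm(x)
print(output.shape)                            # torch.Size([8, 20, 64])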

Hidden shape mismatch

If you perform matrix multiplications and have a shape mismatch between two matrices, PyTorch will complain and throw an error. However, there are also situations where PyTorch does not throw an error because the misaligned dimensions have (unluckily) the same size. For instance, imagine you have a weight matrix of size \(d_{in}\times d_{out}\). If you take the input \(x\) of size \(B \times d_{in}\) (\(B\) being the batch dimension), and in your hyperparameter setting \(B=d_{in}\), you can end up performing the matrix multiplication over the wrong dimension without PyTorch detecting it. Test your code with multiple different batch sizes to prevent shape misalignments with the batch dimension.
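
The sketch below constructs such an unlucky case on purpose; both multiplications run without an error, but only one is correct:

[ ]:
B, d_in, d_out = 16, 16, 8                     # unlucky choice: B == d_in
W = torch.randn(d_in, d_out)
x = torch.randn(B, d_in)
y_correct = x @ W                              # [B, d_out], the intended product
y_wrong = x.T @ W                              # also runs since B == d_in, but wrong
print(y_correct.shape, y_wrong.shape)          # both torch.Size([16, 8])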

Training and Evaluation switch

In PyTorch, a module and/or neural network has two modes: training and evaluation. You switch between them using model.eval() and model.train(). The modes decide, for instance, whether to apply dropout and how to handle the forward pass of Batch Normalization. A common mistake is to forget to set your model back into training mode after evaluation. Make sure to set your model back to train mode after validation. In case your model does not contain dropout, BatchNorm or similar modules, this might not affect your performance.
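
A typical loop structure looks as follows (model, the data loaders, and num_epochs are assumed to be defined elsewhere):

[ ]:
for epoch in range(num_epochs):
    model.train()                              # training mode (dropout active, etc.)
    for batch in train_loader:
        ...                                    # training step
    model.eval()                               # evaluation mode
    with torch.no_grad():
        for batch in val_loader:
            ...                                # validation step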

Parameter handling

As you might know from the PyTorch Tutorial, PyTorch supports hierarchical usage of nn.Modules. One module can contain another module, which can again contain a module, and so on. When you call .parameters() on a module, PyTorch looks through all submodules and adds their parameters to the highest-level module’s parameter list. However, PyTorch does not detect parameters of modules inside lists, dicts or similar structures. If you have a list of modules, make sure to put them into a nn.ModuleList or nn.Sequential object. Parameters of modules inside those containers are detected. Similarly, for dictionaries, you can use nn.ModuleDict.
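
A minimal sketch with a hypothetical module, contrasting a plain Python list with nn.ModuleList:

[ ]:
class StackedNet(nn.Module):

    def __init__(self, num_layers=4, dims=64):
        super().__init__()
        # A plain Python list would hide the parameters from .parameters():
        # self.blocks = [nn.Linear(dims, dims) for _ in range(num_layers)]
        self.blocks = nn.ModuleList([nn.Linear(dims, dims)
                                     for _ in range(num_layers)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

print(len(list(StackedNet().parameters())))   # 8: weight and bias of 4 layers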

Parameters and “.to(device)”

To push your model and/or data to the GPU, you can use .to(device) where device is a device object or string (“cpu” for CPU-only machines, and “cuda”/“cuda:0” for GPUs). However, do not call .to(device) during parameter init. If you define a parameter like self.W = nn.Parameter(torch.Tensor(64, 128)).to(device), your model will not register the parameter on the GPU because the “.to” operator creates a new Tensor. Parameters, however, have to be leaf Tensors, hence your parameter will not be recognized (corresponding GitHub issue). It is much better practice to call .to(device) only once after finishing the init of the model, and not inside the model.
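
A minimal sketch of the wrong and the right pattern (the parameter shape is arbitrary):

[ ]:
class MyModule(nn.Module):

    def __init__(self):
        super().__init__()
        # Wrong: ".to(device)" returns a new, non-leaf tensor
        # self.W = nn.Parameter(torch.randn(64, 128)).to(device)
        # Right: register the parameter first, move the whole model later
        self.W = nn.Parameter(torch.randn(64, 128))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModule().to(device)                  # one call after init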

My model runs fine on CPU, but gets NaN loss on GPU

If this is the case, you have likely run into the parameters and .to(device) bug explained above.

Initialization

Initializing the parameters of your model correctly is very important (see Tutorial 4 for details on this). Initializing parameters with a standard normal distribution is not a good practice and often fails. It can occasionally work for very shallow networks, but don’t risk it! Think about your initialization, and use proper methods like Kaiming or Xavier.
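
For instance, a linear layer followed by a ReLU could be initialized like this:

[ ]:
layer = nn.Linear(128, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)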

Zero-grad in optimizers

Remember to call optimizer.zero_grad() before doing loss.backward(). If you do not reset the gradients for all parameters before performing backpropagation, your gradients will be added to those from the previous batch. Hence, your gradients end up not being the ones you intended.
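
The canonical ordering inside the training loop (model, loss_module, optimizer, and train_loader are assumed to be defined):

[ ]:
for batch, labels in train_loader:
    optimizer.zero_grad()                      # reset gradients of the previous batch
    preds = model(batch)
    loss = loss_module(preds, labels)
    loss.backward()
    optimizer.step()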

Weight decay and Adam

Adam is known to implement weight decay differently from what you would expect in many frameworks. Specifically, the weight decay is usually added to the gradients before determining the adaptive learning rate, which effectively scales up the weight decay for parameters with low gradient norms. Details on this problem, which is actually shared across most common DL frameworks, can be found here. In PyTorch, you can use the decoupled version of weight decay in Adam via torch.optim.AdamW (identical to torch.optim.Adam besides the weight decay implementation).
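
Switching is a one-line change (the learning rate and weight decay values below are just placeholders):

[ ]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)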

Check your metric calculation

This might sound a bit stupid, but check your metric calculation twice or more often before doubting yourself or your model. Metrics like accuracy are easy to calculate, but it is just as easy to introduce a bug into the code. For instance, check that you are averaging over the batch dimension and not accidentally over the class dimension or any other.
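
For example, accuracy for a classifier can be computed as follows; note the argmax over the class dimension and the mean over the batch:

[ ]:
preds = torch.randn(64, 10)                    # [batch_size, num_classes] logits
labels = torch.randint(0, 10, (64,))
acc = (preds.argmax(dim=-1) == labels).float().mean()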

My bits per dimension score is very low

If you obtain a very low bits per dimension score for likelihood-based generative models already after the first iteration, the calculation might not be fully correct. Specifically, the negative log likelihood input to the bpd-metric function is expected to be the sum of the individual pixels’ log likelihoods of an image, not the mean. The mean is taken inside the bpd function.
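
A hypothetical helper to illustrate the convention (the function name and signature are ours, not a PyTorch API):

[ ]:
import numpy as np

def bits_per_dim(nll_sum, img_shape):
    # nll_sum: negative log likelihood in nats, summed over all pixels
    # The mean over pixels is taken here, not beforehand
    return nll_sum * np.log2(np.e) / np.prod(img_shape)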


PyTorch throws an error

These errors are the easier bugs to correct because PyTorch actually tells you what is wrong. Until they are solved, you probably cannot train your model.

Trying to backward through the graph a second time, specify retain_graph=True

This error occurs if you re-use a tensor from the computation graph of the previous batch. This should usually not happen. Make sure not to keep tensors across batches if not strictly necessary. An example where this issue can occur: when implementing your own LSTM, make sure that the initial hidden state is a constant zero tensor, and not the last hidden state of the previous batch.
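
A sketch of the two valid options inside such a training loop (batch_size and hidden_size are hypothetical names):

[ ]:
# Option 1: start every batch from a constant zero state
h = torch.zeros(batch_size, hidden_size)
# Option 2: if you carry the state over on purpose, cut it from the old graph
h = h.detach()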

Size mismatch

This usually occurs if the dimension of the input to a module does not match the specified input dimension of its weight tensor, as in a linear layer. Make sure that you have specified the correct dimensions. Usually, a good way to debug this is to print the shape of the input tensor before every layer you call.
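
One way to automate the shape printing is a forward pre-hook on every submodule (model is assumed to be your network):

[ ]:
def print_shape_hook(module, inp):
    print(module.__class__.__name__, "input shape:", inp[0].shape)

for layer in model.modules():
    layer.register_forward_pre_hook(print_shape_hook)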

If this happens for a matrix multiplication you have implemented, print the shapes of both matrices, and try to figure out over which dimension the matrix multiplication should actually have been performed, and over which PyTorch currently does it.

Device mismatch

You might sometimes see an error such as: RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same. This error indicates that the input data is on the CPU, while your weights are on the GPU. Make sure that all data is on the same device. This is usually the GPU, as it supports acceleration for both training and testing.


Good practices

There are many good practices in PyTorch. We try to add a few below that might make your life easier. Another list of good practices can be found here.

Use nn.Sequential and nn.ModuleList

If you have a model with a lot of layers, you might want to summarize them in an nn.Sequential or nn.ModuleList object. In the forward pass, you then only need to call the sequential module, or iterate through the module list. An MLP can be implemented as follows:

[2]:
class MLP(nn.Module):

    def __init__(self, input_dims=64, hidden_dims=[128,256], output_dims=10):
        super().__init__()
        hidden_dims = [input_dims] + hidden_dims
        layers = []
        for idx in range(len(hidden_dims)-1):
            layers += [
                nn.Linear(hidden_dims[idx], hidden_dims[idx+1]),
                nn.ReLU(inplace=True)
            ]
        # Final layer mapping to the output dimension
        layers += [nn.Linear(hidden_dims[-1], output_dims)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

In-place activation functions

Some activation functions like nn.ReLU or nn.LeakyReLU have the argument inplace. By default it is False, but it is recommended to set it to True in neural networks. If set to True, the forward pass overwrites the original input values with the new output. This option is only available for activation functions where we do not need to know the original input for backpropagation. For instance, in ReLU, the values that are set to zero have a gradient of zero independent of their specific input values. In-place operations can save a bit of memory, especially if you have large feature maps.

Create modules for repeating blocks

In deep neural networks, you usually have blocks that are repeatedly added to the model. If those blocks require a more complex forward function than just x = layer(x), it is recommended to implement them in a separate module. For example, a ResNet consists of multiple ResNet blocks with a residual connection. Each block applies a small neural network and adds the output back to the input. It is better to implement this dynamic in a separate nn.Module class, as sketched below, to keep the main model class small and clear.
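
A minimal sketch of such a block (the layer choice inside the block is arbitrary):

[ ]:
class ResNetBlock(nn.Module):

    def __init__(self, c_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_hidden, c_hidden, kernel_size=3, padding=1)
        )

    def forward(self, x):
        return x + self.net(x)                 # residual connection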

Stack layers/weights with same input

If you have multiple linear layers or convolutions that take the same input, you can stack them together to increase efficiency. Suppose we apply two layers to \(x\): \(y_1 = W_1x+b_1\), \(y_2=W_2x+b_2\). While you could implement them as two separate linear layers, you can get the exact same result by stacking the two layers into one. The single layer is more efficient as it amounts to a single, larger matrix operation for the GPU instead of two smaller ones, and hence the computation can be better parallelized. An example is shown below:

[3]:
x = torch.randn(2, 10)

# Implementation of separate layers:
y1_layer = nn.Linear(10, 20)
y2_layer = nn.Linear(10, 30)
y1 = y1_layer(x)
y2 = y2_layer(x)

# Implementation of a stacked layer:
y_layer = nn.Linear(10, 50)
y = y_layer(x)
y1, y2 = y[:,:20], y[:,20:50]

If you implement the linear layer manually, you need to stack the weight and bias tensors accordingly. Note that you may have to change your initialization for the stacked case. If your initialization depends on the output size of a layer (as, for example, in Xavier), you would get a different standard deviation for initialization in the two implementations. However, if your initialization solely depends on the input dimension (e.g. Kaiming), no change is necessary. An example case where this stacking is beneficial is the LSTM, as all four gates use the exact same input.

Use loss functions on logits

Classification loss functions such as Binary Cross Entropy have two versions in PyTorch: with and without logits. It is good practice to use the loss functions on logits, as this is numerically more stable and prevents instabilities when your model is very wrong in its prediction. If you do not use the logit loss functions, you might run into problems when the model predicts very high or low values that are not correct. In BCE, you would then encounter a log over a value very close to 0. If you are lucky, you just get a very high number (and your model might still diverge because of it); if not, you end up with NaN values.
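
A minimal sketch of the preferred variant for binary classification (shapes are assumptions):

[ ]:
logits = torch.randn(8)                        # raw outputs, no sigmoid applied
targets = torch.randint(0, 2, (8,)).float()

loss = nn.BCEWithLogitsLoss()(logits, targets) # numerically stable
# Avoid: nn.BCELoss()(torch.sigmoid(logits), targets)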

Make use of torch.nn.functional

You do not always need modules. Many operations that do not have parameters are implemented as both modules and functions (e.g. log-softmax/softmax, binary cross entropy, etc.). If you need a softmax but do not have an nn.Sequential to add it to, the functional option F.softmax(..., dim=...) is cleaner than defining a separate module first.

Clip gradient norms

Another good training practice is to clip gradient norms. Even with a high threshold, it can stop your model from diverging when it encounters very high losses. While not strictly necessary in MLPs, RNNs, Transformers, and likelihood models can often benefit from gradient norm clipping. In PyTorch, you can apply it via torch.nn.utils.clip_grad_norm_(...) (remember to call it after loss.backward() but before optimizer.step()). In PyTorch Lightning, you can set the clipping norm via gradient_clip_val=... in the Trainer.
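
Inside the training loop, the call order looks as follows (the threshold of 10 is just an example):

[ ]:
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()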