Training Models with PyTorch

Download in pdf format

We consider a learning problem with input observations $\bbx\in\reals^n$ and output information $\bby\in\reals^m$. We use a linear learning parametrization that we want to train to predict outputs as $\hby=\bbH\bbx$ that are close to the real $\bby$. The comparison metric between $\bby$ and $\hby$ is the squared Euclidean error $\ell(\bby, \hby) = \|\bby – \hby\|^2$. We use a dataset with $Q$ samples of the form $(\bbx_q, \bby_q)$, which leads us to the empirical risk minimization (ERM) problem

\bbH^* \ =\ \text{argmin}_{\bbH\in\reals^{m\times n}}
\frac{1}{Q} \sum_{q=1}^Q
\ell(\bby, \hby)
\ =\ \text{argmin}_{\bbH\in\reals^{m\times n}}
\frac{1}{Q} \sum_{q=1}^Q
\frac{1}{2} \|\bby_q-\bbH\bbx_q\|^2_2 .

This is not a particularly good learning parametrization, but it is the simplest possible and therefore good to illustrate ideas. We introduced this problem in Lab 1.

A challenge in solving \eqref{eqn_ERM_linear} is that algorithms for ERM are finicky. The workhorse method is stochastic gradient descent (SGD) which can be quite sensitive to the choice of parameters such as step and batch sizes; see Videos 2.5 and 2.6 in Lecture 2.

A perhaps more daunting challenge is the computation of derivatives. The model in \eqref{eqn_ERM_linear} is simple and you can easily compute the derivative of the loss with respect to $\bbH$. But when we start looking at neural networks, with several layers and filters per layers, we don’t want to go through the trouble of computing derivatives by hand. We want them to be computed automatically.

A third challenge is taking full advantage of computational resources in your computer, which may entail offloading operations into your graphics processing unit (GPU).

For these reasons it is convenient to use packages that implement the minimization of \eqref{eqn_ERM_linear}. In this post we illustrate the use of Pytorch.

Classes and Objects

Using Pytorch is easy but it can look complicated because it requires that you either learn or remember that Python is an object oriented language. To implement an algorithm that solves \eqref{eqn_ERM_linear} it is not as easy as calling a function that performs the minimization. You have to create objects that instantiate classes where you specify the operations that are to be performed. This results in code that can look weird and complicated but that is easier to modify. And while it may look complicated, it is not, in reality complicated.

Classes: Attributes and Methods

The first concept to understand is the difference between a class and an object. A class in an abstract entity that contains attributes and methods and an object is a specific instance of a class. This is easiest explained with an example. Suppose that we are interested in linear functions of the form $\bby=\bbA\bbx$ with $\bbA$ being a matrix with $m$ rows and $n$ columns. We therefore create a class $\p{LinearFunction}$ defined as follows

class LinearFunction:

    def __init__(self, m, n, A)
        self.m = m
        self.n = n
        self.A = A
    def evaluate(self, x)
        y = np.matmul(x,A)
        return y

The class definition contains two methods. The method $\p{\_\_init\_\_}$ plays a special role in the creation of objects which we will explain soon. At this point, observe how it specifies the attributes that are part of the class. In this specific example, the class contains three attributes, the dimensions $\p{m}$ and $\p{n}$ and the matrix $\p{A.}$ When we define a class, the $\p{\_\_init\_\_}$ function has to be specified always and $\p{self}$ has to always be the first parameter of the $\p{\_\_init\_\_}$ method.

The other function $\p{evaluate}$ is a function proper, which in object oriented programming we call a method. This method takes a variable $\p{x}$ as an input and returns the matrix product $\p{y = np.matmul(x,A)}$ as an output. Observe that the matrix $\p{A}$ is not an input to this function. The matrix $\p{A}$ is an attribute that belongs to the class. Further notice that $\p{self}$ is the first parameter of the $\p{evaluate}$ method. Any method that is defined in a class has to take $\p{self}$ as the first parameter.

Objects: Concrete Instances of Abstract Classes

The class is an abstract object with methods that specifies how to manipulate its attributes. If we want to actually process data, we create a specific instance. This is an object. For example, if we want a linear transformation specified by a matrix $\p{A}$ with 42 rows, 71 columns, and random binary entries that are equally likely to be 0 or 1, we create the object $\p{BernoulliMap}$ as an instance of the class $\p{LinearFunction}$,

m = 42
n = 71
A = np.random.binomial(n=1, p=0.5, size=(m, n))
BernoulliMap = LinearFunction(m, n, A)

When we create the object $\p{BernoulliMap}$ we implicitly call the method $\p{LinearFunction.\_\_init\_\_.}$ In doing so we instantiate the attributes that belong to the object. Whenever the object $\p{BernoulliMap}$ is referenced in the code, we are referring to the linear transformation associated with the specific matrix $\p{A}$ that we passed during the creation of the object. If we wanted to have a different random Bernoulli map, we could do so by instantiating another object of the $\p{LinearFunction}$ class,

AnotherA = np.random.binomial(n=1, p=0.5, size=(m, n))
AnotherBernoulliMap = LinearFunction(m, n, AnotherA)

If we now want to implement the products of matrices $\p{A}$ and $\p{AnotherA}$ with a vector $\p{x}$ we use the method $\p{evaluate}$ that we defined in the class $\p{LinearFunction}$. This method is separately instantiated in each of the two objects $\p{BernoulliMap}$ and $\p{AnotherBernoulliMap}$. Assuming that we generate $\p{x}$ with a Bernoulli distribution, we have

x = np.random.binomial(n=1, p=0.5, size=(n, 1))
y = BernoulliMap.evaluate(x)
Anothery = AnotherBernoulliMap.evaluate(x)

If this looks like a lot of trouble for a matrix product it is because it is a lot of trouble; indeed. However, suppose that you now find a more efficient algorithm for implementing matrix computations. Perhaps because you have decided to take advantage of a GPU. You go into the definition of the $\p{LinearFunction}$ class and update the $\p{evaluate}$ method. The change is now implemented in the hundreds of places in your code where you had used matrix multiplication.


A third concept of object oriented programming we have to introduce is inheritance. This is the possibility of defining a “child” subclass that inherits methods from a “parent” class. As an example, suppose that you intend to create several random Bernoulli maps. Instead of generating the matrix and passing it as an argument in the creation of the object, it is more convenient to encapsulate the generation of the Bernoulli matrix inside of an object. To do that, create a class $\p{LinearBernoulliFunction}$ which we define as a child of the $\p{LinearFunction}$ class,

class LinearBernoulliFunction(LinearFunction):

    def __init__(self, m, n)
        self.m = m
        self.n = n
        self.A = np.random.binomial(n=1, p=0.5, size=(m, n)) 

Observe that the specification of $\p{LinearBernoulliFunction}$ as a child of $\p{LinearFunction}$ is done by making the latter an argument in the $\p{class}$ statement. The use of inheritance allows us to reuse our hard work in the creation of the $\p{LinearFunction}$ class. We do not need to specify the $\p{evaluate}$ function for the $\p{LinearBernoulliFunction}$ because we are reusing from the parent class $\p{LinearFunction}$. We are inheriting, to use the more technical term. If at some point in the future we update the $\p{evaluate}$ method in the $\p{LinearFunction}$ class, that updated method is automatically inherited by the child class.

With this new class, the creation of random Bernoulli maps simplifies to the code

m = 42
n = 71
BernoulliMap = LinearBinomialFunction(m, n)
AnotherBernoulliMap = LinearBernoulliFunction(m, n)

The code for the evaluation of the linear functions is still the same because it has been inherited. The most important advantage of defining a new class is that modifications to the class will now propagate to all the places where a Bernoulli map is defined. If, say, we decide that a probability $p=0.3$ for drawing ones is more appropriate, it’s just a matter of changing the definition of the $\p{LinearBernoulliFunction.\_\_init\_\_}$ method. The change will propagate to all the places where we instantiate an object belonging to the $\p{LinearBernoulliFunction}$ class.

A Simple Training Loop

The reason why training with Pytorch may look complicated is that part of the operations are encapsulated in an object that inherits methods from a parent class. Having developed an understanding of the encapsulation of operations inside of objects, it is now easy to understand how to write a training loop in Pytorch.

In this section we focus on the problem in \eqref{eqn_ERM_linear}. In which the loss associated to individual observations is the mean squared cost $\ell(\bby, \hby) = \|\bby – \hby\|^2$ and the learning parametrization is the linear function $\hby=\bbH\bbx$.

The Parametrization Class

Our first task is to specify the learning parametrization that we will use. We do that by creating a class — which we will instantiate later — that we will call $\p{Parametrization}$. This class must have an $\p{\_\_init\_\_}$ method, as all classes do, and a method called $\p{forward}$. Most importantly, the class must inherit from the $\p{Module}$ class that is part of the $\p{torch.nn}$ library. This is what will allow its use in a training loop. To describe this in more detail, here is a minimal code that define a $\p{Parametrization}$ class for estimates $\hby=\bbH\bbx$,

import torch
import torch.nn as nn

class Parametrization(nn.Module):

    def __init__(self, n, m):
        self.H = nn.parameter.Parameter(torch.zeros(n, m))
    def forward(self, x):
        yHat = torch.matmul(x,H)
        return yHat

The definition of the class $\p{Parametrization}$ specifies that $\p{nn.Module}$ is a parent of the class. This allows $\p{Parametrization}$ to inherit methods from $\p{nn.Module}$. Most notable among these inherited methods are those related to the computation of gradients, which we will use in the training loop below.

Asides from that, we specify the $\p{\_\_init\_\_}$ method and the $\p{forward}$ method. The $\p{\_\_init\_\_}$ method is mostly formulaic. The first line of the method initializes the parent and the second line of the method specifies that the variable $\p{self.H}$ is a parameter. This means exactly what you think it means. Is is indicating that $\p{self.H}$ is a variable that we will train. A fact that has to be specified for gradients to be computed correctly. The $\p{Parametrization}$ class could include other parameters that are not trained. The specification of $\p{self.H}$ further states that this variable is a $\p{torch.Tensor}$ with $\p{m}$ rows and $\p{n}$ columns. This is just a specification of a matrix.

The $\p{forward}$ method is where the parametrization is specified. It says that when given an input $\p{x}$, the $\p{Parametrization}$ class is to produce estimates according to $\p{yHat = torch.matmul(x,H)}$. This is the line of the code that we have to change if we want to use a different parametrization. To illustrate ideas, suppose we want to change the parametrization to the perceptron $\bby=\sigma(\bbH\bbx)$, with $\sigma(\bbH\bbx)=\max({\bbH\bbx,\bbzero})$ representing a rectified linear unit (relu). We can do this by simply redefining the forward function as follows,

    def forward(self, x):
        z = torch.matmul(x,H)
        sigma = nn.ReLU
        yHat = sigma(z)
        return yHat

This is a good moment to appreciate the advantage of using objects. The parametrization is encapsulated inside of the $\p{Parametrization}$ class. Once we write a training loop, this training loop can be used for any parametrization. We can experiment with different versions of the $\p{forward}$ function without having to meddle with the training loop.

The Training Loop

The training loop is going to contain the instructions you expect it to contain. We read the dataset, we compute gradients and we update the parameters. The computation of the gradients is going to have a form that may look a little strange and that is the part we will explain here.

The following code trains the linear model that we encapsulated in the $\p{Parametrization}$ class,

import torch
import torch.optim as optim

estimator = Parametrization(n, m)
optimizer = optim.SGD(estimator.parameters(), lr=eps, momentum=0)

iter = 0
while iter < nIters
    x, y = getBatch(batchSize, xTrain, yTrain)
    y = estimator.forward(x)
    loss = torch.mean((yHat-y)**2)
    iter += 1

The first line after the import commands in the code above instantiates $\p{estimator}$ as an object belonging to class $\p{Parametrization}$. This object is essentially a matrix. It is not really a matrix. It is an object of class $\p{Parametrization}$, which inherits from class $\p{nn.Module}$. This endows it with methods which allow the computation of gradients. But this is transparent to us. All that matters is that $\p{estimator}$ is a matrix that we are learning. If we want to access the actual matrix we have to call $\p{estimator.H}$.

The second line specifies the optimization method, which we choose to be SGD. In the specification of the optimization method, the important argument is the passing of $\p{estimator.parameters().}$ This is telling the optimizer which are the variables that have to be updated, or trained. If we recall the definition of the $\p{Parametrization}$ class, we see that it contains a command in the $\p{\_\_init\_\_}$ method that specifies the trainable parameters. That command says that the trainable parameter is the attribute $\p{H}$. When we pass $\p{estimator.parameters()}$ to the optimizer, we are therefore telling the optimizer that it has to update the variable $\p{H.}$

The loop iterates for $\p{nIters}$ iterations. In each iteration there are three separate actions: (i) We access a batch using the $\p{getBatch(batchSize)}$ function (ii) we compute stochastic gradients. (iii) We perform a SGD step by calling $\p{optimizer.step()}$. The computation of gradients is undertaken by the 4 commands that begin and end in a row that is highlighted in red.

The command $\p{\_grad()}$ acts in tandem with the command $\p{loss.backward()}$. Their combined effect is to compute the gradient of all the operations that are contained within. In this particular case, the instruction $\p{yHat = estimator.forward(x)}$ calls the forward method of the $\p{estimator}$ object — which we defined in the $\p{Parametrization}$ class — and applies it to the input $\p{x}$ we read from the batch. The instruction $\p{loss = torch.mean((yHat-y)**2)}$ compute the mean squared loss. Although the operations are somewhat hidden, the resulting effect of these two operations is to compute the loss averaged over the training set. Making an indefensible, yet didactic use of mathematical notation, the combined effect of these two commands is to perform the operations

= \frac{1}{\text{$\p{batchSize}$}}
\big \| \text{$\p{y[i,:]}$}
– \text{$\p{H*y[i,:]}$} \|^2_2

When we call the function $\p{loss.backward()}$ we evaluate the gradient of $\p{loss}$ with respect to the parameters of the $\p{estimator}$ object. This a quantity that is stored within the $\p{estimator}$ object and that optimizer can therefore access when we invoke $\p{optimizer.step()}$ to perform an SGD step.

The mechanics of how gradients are computed are fascinating and worth learning. But you don’t need to know them to run training loops. Beginners don’t even need to modify the training loop. They just modify the $\p{Parametrization}$ class. That suffices to train a different system. The explanations here are enough to make us Intermediate users. These are the facts we have learned:

(L1) We have to instantiate a $\p{Parametrization}$ class that inherits from $\p{nn.Module}$. These creates an $\p{estimator}$ object.

(L2) We instantiate an optimizer from the $\p{optim.SGD}$ class. This optimizer needs access to the trainable parameters of $\p{estimator}$, which we pass by invoking the $\p{parameters()}$ method of the $\p{Parametrization}$ class.

(L3) The $\p{zero.grad()}$ method of the $\p{estimator}$ object in conjunction with the $\p{backward}$ method of the $\p{loss}$ object implement the computation of gradients. When we call $\p{loss.backward()}$ we initiate a backward chain of gradient computations that stops at the most recent call of $\p{\_grad()}$.

(L4) The $\p{loss.backward()}$ call computes gradients with respect to all the objects that are involved. These gradients are stored in the proper object. Among these gradients, we calculate the gradient with respect to $\p{estimator.parameters()}$.

(L5) This gradient is accessed by the optimizer object to update the values of $\p{estimator.parameters()}$.

We will revisit these learned facts by discussing the training of a Neural Network in the next section.

Training a Neural Network

Suppose that we are interested in another type of parameterization. To be concrete, suppose you want to use a fully connected neural networks (NN) with two layers. In this case the learning parametrization is replaced by the following composition of linear transformations and pointwise nonlinearities

\hby \ =\ \bbH_2 \bbz
\ =\ \bbH_2 \Big (\, \sigma \big(\,\bbH_1\bbx\,\big) \,\Big) ,

where the matrix $\bbH_1$ is $h \times n$ and the matrix $\bbH_2$ is $m \times h$. The scalar constant $h$ is the number of hidden units.

If we keep using the squared Euclidean error loss $\ell(\bby, \hby) = \|\bby – \hby\|^2$, the ERM problem we want to solve is obtained by substituting the linear parametrization $\hby=\bbH\bbx$ used in
\eqref{eqn_ERM_linear} by the NN parametrization in \eqref{eqn_nn}. This yields the ERM problem

\bbH^* \ =\ \arg\min_{\bbH\in\reals^{m\times n}}
\frac{1}{Q} \sum_{q=1}^Q \,
\frac{1}{2} \,
\left \| \,
\,\right \|^2_2 .

To implement SGD we need to compute gradients of the summands of \eqref{eqn_ERM_nn} with respect to $\bbH_1$ and $\bbH_2$. This is painful. Lucky for us, we have access to automatic differentiation in Pytorch.

To use pytorch to train \eqref{eqn_ERM_nn} the training loop doesn’t have to change. All we have to do is replace the $\p{Parametrization}$ class by the class $\p{TwoLayerNN}$ that implements the parametrization in $\eqref{eqn_nn}$. This class has an $\p{\_\_init\_\_}$ method and a $\p{forward}$ method and is defined as follows,

import torch
import torch.nn as nn

class TwoLayerNN(nn.Module):

    def __init__(self, n, m, h):
        self.H1 = nn.parameter.Parameter(torch.rand(n, h))
        self.H2 = nn.parameter.Parameter(torch.rand(h, m))

    def forward(self, x):
        sigma = nn.ReLU()
        z = sigma(torch.matmul(x,self.H1))
        yHat = torch.matmul(z,self.H2)
        return yHat

The $\p{\_\_init\_\_}$ method initializes two different parameters $\p{H1}$ and $\p{H2}$. This is because we have two parameters defining the NN. We also initialize them at random. Just to illustrate a different way of initializing parameters. The $\p{forward}$ method is a straightforward implementation of the NN parametrization in \eqref{eqn_nn}. We compute the intermediate output as $\p{z = sigma(torch.matmul(x,self.H1))}$ and we follow with the NN output computation $\p{ yHat = torch.matmul(z,self.H2).}$

We have said that the training loop does not change. This is true except that when we instantiate the $\p{estimator}$ object we need to instantiate it as member of the $\p{TwoLayerNN}$ class. For completeness, we rewrite the training loop here with that modification highlighted,

import torch
import torch.optim as optim

estimator = TwoLayerNN(n, m, h)
optimizer = optim.SGD(estimator.parameters(), lr=eps, momentum=0)

iter = 0
while iter < nIters
    x, y = getBatch(batchSize, xTrain, yTrain)
    y = estimator.forward(x)
    loss = torch.mean((yHat-y)**2)
    iter += 1

The only difference between this training loop and the training loop for linear parametrizations is the use of a different class for the estimator object. In this loop, the combined calls to $\p{\_grad()}$ and $\p{loss.backward()}$ result on computations of gradients with respect to the NN parameters $\p{H1}$ and $\p{H2}$. The call of the step $\p{optimizer.step()}$ results in a stochastic gradient update of these parameters. These changes are implemented by the expedient action of replacing the definition of the $\p{estimator}$ object. If we want to train a graph neural network, we just need to define a proper class and instantiate a proper object. The training loop remains unchanged.

Code links

The implementation of the basic training loop with the linear parametrization can be found in the folder This folder contains the following files:

$\p{main\}$: This is the main script, which implements the training loop for a simple linear parametrization.

$\p{}$: This script contains the function $\p{dataLab1}$, which generates the data samples, and the function $\p{getBatch}$, which splits the data in batches.

$\p{}$: This script contains the $\p{Parametrization}$ class, which specifies a simple linear parametrization.

The implementation of the basic training loop with a two-layer fully connected neural network can be found in the folder This folder contains the following files:

$\p{main\}$: This is the main script, which implements the training loop for a simple linear parametrization.

$\p{}$: The same file used in the linear parametrization case.

$\p{}$: This script contains the $\p{TwoLayerNN}$ class, which specifies a two-layer fully connected neural network.

A More Comprehensive Learning Loop

So far, we have focused on a basic training loop. In this section, we introduce and discuss techniques that can be used to improve it in order to learn better representations.


Before we train a model, there are a number of hyperparameters — i.e., fixed value parameters — that we need to set. For instance, in a fully connected neural network, it is necessary to define the number of layers and the number of hidden units at each layer. Since hyperparameters are not learned, they have to be chosen carefully because a poor choice of hyperparameters causes models to be over or underparametrized. This, in turn, can lead to overfitting or underfitting the training data. These issues can be addressed with validation.

Validation consists of setting aside a portion of the training samples (usually 10-20%) to evaluate the model every few training steps. Therefore, validation samples are not used for training. The main benefit of validation is that it allows comparing the training and validation losses to assess the suitability of the hyperparameters. E.g., a training loss that is much smaller than the validation loss indicates that the model is overfitting the training data and, thus, that the number of learnable parameters should be decreased.

Another advantage of validation is allowing to keep track of the best model to date at any point of the training process, which is done by monitoring the smallest loss realized in the validation set and storing the parameters of the model that achieved that loss. This is helpful because it reduces the uncertainty around the number of training steps necessary for convergence, i.e., it allows setting the number of training steps to a very large number as we are certain to always keep the best model. In these situations, validation can also be used for early stopping, which consists of halting the training process when the validation loss has not decreased after a certain number of validation steps.


The objective of learning is to obtain a model that generalizes well to unseen input data. The ability of a model to generalize is measured by the generalization error, which is the expectation of the error realized by the model on an unseen input. In order to approximate it, we need to observe the error realized on data drawn from the same distribution as the training data, but which is not used for training. This is the test set.

Unlike the validation set used to tune the hyperparameters and keep track of the best model, the samples in the test set are only accessed once the training loop is over. The learned model is run on these samples to compute the test error, which provides a measure of how well the model generalizes to unseen data. In particular, for a good model the gap between the training and the test error should be small. A large gap usually indicates that the model has overfitted the training data.

Splitting the Dataset

In Lab 1, the input data is generated synthetically by sampling a normal distribution to obtain i.i.d. samples. Therefore, the training and test data can be generated separately and are automatically randomized. In contrast, in real-world scenarios we are usually given a chunk of data consisting of all the available samples, which then have to be split between the training and test sets.

In most train-test splits, the largest portion of the data (80-90%) is used for training and the rest for testing. The validation set is obtained by setting aside a small fraction of the training data. Before splitting the data between the training and test sets, the samples are randomized. This is an important step because in real-world scenarios we don’t usually know whether the available samples are random or ordered in some way. In practice, randomizing the samples is also necessary to assess the quality of the model parametrization independently of the quality of a particular train-test split. This is done by running Monte-Carlo experiments, where estimators are trained on multiple train-test splits to compute the average test error realized by models with a given parametrization.

Epochs and Batches

In the basic training loop, the samples of a batch are selected at random from the training set in each training step. If the number of training steps is large enough, this is not an issue as it is highly likely that all training samples have been included in a batch — and therefore used to train the model — at least once. However, the randomness of this approach might make it so that some samples are selected multiple times before the dataset is considered in full. This affects the gradient descent path and, if the number of training steps is not chosen judiciously, it can have a negative effect on the resulting model.

To address this, we can train the model in epochs, which are full passes over the dataset. In each epoch, the samples are permuted and partitioned in fixed-size batches covering the entire dataset in order to use every training sample an equal number of times. Training in epochs is helpful because epochs are more interpretable than training steps — it makes more sense to specify the number of full passes over the data than the total number of steps. Given a certain number of epochs and the size of a batch, the number of training steps is calculated as the number of epochs multiplied by the number of batches necessary to cover the training set.

Code links

Implementations of the more comprehensive training loops described in this section with a simple linear parametrization can be found in the folder This folder contains the following files:

$\p{main\}$: This is the main script modified to include validation.

$\p{main\}$: This is the main script modified to include validation and testing.

$\p{main\_random\}$: This is the main script modified to include validation, testing and a random split of the data.

$\p{main\}$: This is the main script modified to perform training in epochs and include validation, testing and a random split of the data.

$\p{}$: This script contains the function $\p{dataLab1}$, which generates the data samples, and the function $\p{getBatch}$, which splits the data in batches.

$\p{}$: This script contains the $\p{Parametrization}$ class, which specifies a simple linear parametrization.

To use these more comprehensive loops to train different architectures, it suffices to instantiate the estimator object from a different class.