
Lab 3: Recommendation Systems (10/2-10/16)

In a recommendation system, we want to predict the ratings that customers would give to a certain product using the product's rating history and the ratings that these customers have given to similar products. Collaborative filtering solutions build a graph of product similarities using past ratings and consider the ratings of individual customers as graph signals supported on the nodes of the product graph. The underlying assumption is that there exists an underlying set of true ratings or scores, but that we only observe a subset of them. The scores that are not observed can be estimated from the scores that have been observed. This problem can thus be seen as an ERM problem, and our goal will be to compare the ability of several learning parametrizations to solve it.

Download assignment.

1.1-1.2 Loading and cleaning-up the data

To illustrate the problem of recommendation systems with a specific numerical example, we use the MovieLens-100k dataset. This dataset’s zip folder can be downloaded from the class website and a description of the files is provided here. The MovieLens-100k dataset consists of 100,000 ratings given by $U = 943$ users to $M=1682$ movies. The existing movie ratings are integer values between 1 and 5.
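
Each line of the $\p{u.data}$ file is a tab-separated (user ID, movie ID, rating, timestamp) tuple. As a quick way to inspect the raw data before any clean-up, one can extract the archive and print a few entries (a minimal sketch, assuming $\p{ml-100k.zip}$ sits in the current working directory):

import os
import zipfile

# Extract the archive (if not done already) and peek at the first few ratings
with zipfile.ZipFile(os.path.join(os.getcwd(), 'ml-100k.zip')) as zipObject:
    zipObject.extractall(os.getcwd())

with open(os.path.join(os.getcwd(), 'ml-100k', 'u.data')) as rawData:
    for _ in range(3):
        print(rawData.readline().rstrip('\n').split('\t'))  # [userID, movieID, rating, timestamp]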

To load the data and remove any movies with fewer than 150 ratings, we implement the Python function $\p{load\_data}$:

import numpy as np
import os
import zipfile # To handle zip files
import torch

def load_data(movie, min_ratings):
    
    # Extract all from zip file
    dataDir=os.getcwd()
    zipObject = zipfile.ZipFile(os.path.join(dataDir,'ml-100k.zip'))
    zipObject.extractall(dataDir)
    zipObject.close()
    
    rawDataFilename = os.path.join(dataDir,'ml-100k','u.data')
    
    # Initialize rating matrix
    rawMatrix = np.empty([0, 0]) 
    
    # From each row of u.data, extract userID, movieID and rating
    with open(rawDataFilename, 'r') as rawData:
        for dataLine in rawData:
            dataLineSplit = dataLine.rstrip('\n').split('\t')
            userID = int(dataLineSplit[0])
            movieID = int(dataLineSplit[1])
            rating = int(dataLineSplit[2])
            if userID > rawMatrix.shape[0]:
                rowDiff = userID - rawMatrix.shape[0]
                zeroPadRows = np.zeros([rowDiff, rawMatrix.shape[1]])
                rawMatrix = np.concatenate((rawMatrix, zeroPadRows),
                                           axis = 0)
            if movieID > rawMatrix.shape[1]:
                colDiff = movieID - rawMatrix.shape[1]
                zeroPadCols = np.zeros([rawMatrix.shape[0], colDiff])
                rawMatrix = np.concatenate((rawMatrix, zeroPadCols),
                                           axis = 1)
                
            # Assign rating to rating matrix
            rawMatrix[userID - 1, movieID - 1] = rating
          
    # Define X
    X = rawMatrix
    
    # Count number of ratings per column, i.e., per movie
    nbRatingsCols = np.sum(X>0,axis=0)
    
    # Mask to identify movies with at least min_ratings
    mask = nbRatingsCols >= min_ratings
    
    # Save new index of the input argument "movie"
    idxMovie = np.sum(mask[0:movie])
    
    # Remove matrix columns
    idx = np.argwhere(mask>0).squeeze()
    X = X[:,idx.squeeze()]
    
    # Make sure there are no rows of all zeros
    nbRatingsRows = np.sum(X>0,axis=1)
    idx = np.argwhere(nbRatingsRows>0).squeeze()
    X=X[idx,:]
    
    # Return cleaned-up X and new index of input argument "movie"
    return X, idxMovie

This function takes in the index of the movie $\p{movie}$ whose ratings we want to predict and the minimum number of ratings per column, and returns the cleaned-up rating matrix and the new index of $\p{movie}$.

To generate the matrix $\p{X}$ with at least 150 ratings per movie and record the new index of the movie “Contact” in this matrix, invoke:

X, idxContact = data.load_data(movie=257, min_ratings=150)
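
As a quick sanity check (illustrative; the exact numbers depend on the ratings kept), one can inspect the dimensions of the cleaned-up matrix and the new index of Contact:

# Rows: users with at least one rating among the kept movies
# Columns: movies with at least 150 ratings
print(X.shape)
print(idxContact)  # new column index of "Contact" in the cleaned-up matrix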

1.3-1.4 Creating and sparsifying the graph

The first step to building our recommendation system is to use the rating history of all movies to compute a graph of movie similarities, in which edges represent a similarity score between different movies. Recall that $x_{um}$ is the rating that user $u$ gives to movie $m$. Typically, movie $m$ has been rated by a subset of users which we denote $\ccalU_m$. We consider the sets of users $\ccalU_{\ell m}=\ccalU_{\ell} \cap \ccalU_{m}$ that have rated movies $\ell$ and $m$ and compute correlations

\begin{align}\label{eqn_reco_systems_weights} \sigma_{\ell m} = \frac{1}{|\ccalU_{\ell m}|} \sum_{ u \in \ccalU_{\ell m}} (x_{u\ell}-\mu_{\ell m})(x_{u m}-\mu_{m \ell}), \end{align}

where we use the average ratings $\mu_{\ell m}=(1/|\ccalU_{\ell m}|)\sum_{u\in \ccalU_{\ell m}}x_{u \ell}$ and $\mu_{m \ell}=(1/|\ccalU_{\ell m}|)\sum_{u\in \ccalU_{\ell m}}x_{um}$. The movie graph used in collaborative filtering is the one with normalized weights

\begin{align}\label{eqn_reco_systems_graph_weights} w_{\ell m} \ = \ \sigma_{\ell m} \,\Big/\, \sqrt{\sigma_{\ell \ell}\sigma_{mm}}\ . \end{align}

An important point to note is that the movie similarity graph should only be built from training data. In order to do this, we split the users between a training and a test set and only use the ratings of the users in the training set to calculate the correlations $\sigma_{\ell m}$ following \eqref{eqn_reco_systems_weights}.
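
For intuition, the following is a minimal unvectorized sketch of \eqref{eqn_reco_systems_weights} and \eqref{eqn_reco_systems_graph_weights} for a single pair of movies, written directly from the definitions (the function $\p{create\_graph}$ below performs the same computation for all pairs at once and additionally handles empty intersections, numerical tolerances, and sparsification):

def movie_pair_weight(Xtr, l, m):
    # Xtr: rating matrix (users x movies) restricted to the training users,
    # e.g., Xtr = X[idxTrain, :]; zero entries denote missing ratings.
    def sigma(a, b):
        # Correlation over the users that rated both movie a and movie b
        common = (Xtr[:, a] > 0) & (Xtr[:, b] > 0)
        xa, xb = Xtr[common, a], Xtr[common, b]
        return np.mean((xa - xa.mean()) * (xb - xb.mean()))
    # Normalized weight w_lm = sigma_lm / sqrt(sigma_ll * sigma_mm)
    return sigma(l, m) / np.sqrt(sigma(l, l) * sigma(m, m))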

To generate the graph adjacency $\p{W}$ from the rating matrix $\p{X}$, we implement the following Python function:

def create_graph(X, idxTrain, knn):
    
    # Everything below 1e-9 is considered zero
    zeroTolerance = 1e-9
    
    # Number of nodes is equal to the number of columns (movies)
    N = X.shape[1]
    
    # Isolating users used for training
    XTrain = np.transpose(X[idxTrain,:])
    
    # Calculating correlation matrix
    binaryTemplate = (XTrain > 0).astype(XTrain.dtype)
    sumMatrix = XTrain.dot(binaryTemplate.T)
    countMatrix = binaryTemplate.dot(binaryTemplate.T)
    countMatrix[countMatrix == 0] = 1
    avgMatrix = sumMatrix / countMatrix
    sqSumMatrix = XTrain.dot(XTrain.T)
    correlationMatrix = sqSumMatrix / countMatrix - avgMatrix * avgMatrix.T 
    
    # Normalizing by diagonal weights
    sqrtDiagonal = np.sqrt(np.diag(correlationMatrix))
    nonzeroSqrtDiagonalIndex = (sqrtDiagonal > zeroTolerance)\
                                                 .astype(sqrtDiagonal.dtype)
    sqrtDiagonal[sqrtDiagonal < zeroTolerance] = 1.
    invSqrtDiagonal = 1/sqrtDiagonal
    invSqrtDiagonal = invSqrtDiagonal * nonzeroSqrtDiagonalIndex
    normalizationMatrix = np.diag(invSqrtDiagonal)
    
    # Zero-ing the diagonal
    normalizedMatrix = normalizationMatrix.dot(
                            correlationMatrix.dot(normalizationMatrix)) \
                            - np.eye(correlationMatrix.shape[0])

    # Keeping only edges with weights above the zero tolerance
    normalizedMatrix[np.abs(normalizedMatrix) < zeroTolerance] = 0.
    W = normalizedMatrix
    
    # Sparsifying the graph
    WSorted = np.sort(W,axis=1)
    threshold = WSorted[:,-knn].squeeze()
    thresholdMatrix = (np.tile(threshold,(N,1))).transpose()
    W[W<thresholdMatrix] = 0
    
    # Normalizing by eigenvalue with largest magnitude
    E, V = np.linalg.eig(W)
    W = W/np.max(np.abs(E))
    
    return W

The input arguments are $\p{X}$, the rating matrix; $\p{idxTrain}$, the indices of the users in the training set; and $\p{knn}$, the number of neighbors to keep when sparsifying the graph. To generate the graph using a training set with $90\%$ of the users selected at random and $40$ nearest neighbors per node, invoke:

# Creating and sparsifying the graph

nTotal = X.shape[0] # total number of users (samples)
permutation = np.random.permutation(nTotal)
nTrain = int(np.ceil(0.9*nTotal)) # number of training samples
idxTrain = permutation[0:nTrain] # indices of training samples
nTest = nTotal-nTrain # number of test samples
idxTest=permutation[nTrain:nTotal] # indices of test samples

W = data.create_graph(X=X, idxTrain=idxTrain, knn=40)
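
A couple of optional checks (illustrative) confirm that the returned graph shift operator has the expected properties, namely a zero diagonal and unit spectral norm after the final normalization:

# Illustrative sanity checks on the sparsified, normalized adjacency matrix
print(W.shape)                               # (number of movies, number of movies)
print(np.allclose(np.diag(W), 0.))           # diagonal was zeroed out
print(np.max(np.abs(np.linalg.eigvals(W))))  # approximately 1 after normalization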

1.5-1.6 Generating training and test sets

The next step is to build the input-output samples of the training and test sets. Define the vector $\bbx_u=[x_{u1};\ldots;x_{uN}]$ where $x_{um}$ is the rating that user $u$ gave to movie $m$, if available, or $x_{um}=0$ otherwise. Further denote as $\ccalM_u$ the set of movies rated by user $u$. Let $m\in\ccalM_u$ be a movie rated by user $u$ and define the sparse vector $\bby_{um}$ whose unique nonzero entry is $[\bby_{um}]_m = x_{um}$. With these definitions we construct the training set

\begin{align}\label{eqn_ch8_reco_system_training_set} \ccalT = \bigcup_{u, m \in\ccalM_u} \big \{ (\bbx_{um}, \bby_{um}) \,:\, \bbx_{um} = \bbx_u - \bby_{um} \big\}. \end{align}

In this lab, we focus on predicting the ratings given to the movie Contact. Therefore, we only consider graph signals $\bbx_u$ corresponding to users $u$ who have rated Contact.

To generate the input and output samples for a given movie and split them between the training and test set, we implement the function $\p{split\_data}$:

def split_data(X, idxTrain, idxTest, idxMovie):  
    
    N = X.shape[1]
    
    xTrain = X[idxTrain,:]
    idx = np.argwhere(xTrain[:,idxMovie]>0).squeeze()
    xTrain = xTrain[idx,:]
    yTrain = np.zeros(xTrain.shape)
    yTrain[:,idxMovie] = xTrain[:,idxMovie]
    xTrain[:,idxMovie] = 0
    
    xTrain = torch.tensor(xTrain)
    xTrain = xTrain.reshape([-1,1,N])
    yTrain = torch.tensor(yTrain)
    yTrain = yTrain.reshape([-1,1,N])
    
    xTest = X[idxTest,:]
    idx = np.argwhere(xTest[:,idxMovie]>0).squeeze()
    xTest = xTest[idx,:]
    yTest = np.zeros(xTest.shape)
    yTest[:,idxMovie] = xTest[:,idxMovie]
    xTest[:,idxMovie] = 0
    
    xTest = torch.tensor(xTest)
    xTest = xTest.reshape([-1,1,N])
    yTest = torch.tensor(yTest)
    yTest = yTest.reshape([-1,1,N])
    
    return xTrain, yTrain, xTest, yTest

We then call it with input arguments $\p{X}$, $\p{idxTrain}$, $\p{idxTest}$ and $\p{idxContact}$:

xTrain, yTrain, xTest, yTest = data.split_data(X, idxTrain, idxTest, idxContact)
nTrain = xTrain.shape[0]
nTest = xTest.shape[0]
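
The resulting tensors follow the [batch size, number of features, number of nodes] convention used throughout this lab, which can be verified with a quick (illustrative) check:

# Each sample is a graph signal with a single feature per node
print(xTrain.shape)  # expected: torch.Size([nTrain, 1, N])
print(yTrain.shape)  # expected: torch.Size([nTrain, 1, N])
print(xTest.shape)   # expected: torch.Size([nTest, 1, N])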

2.1 Loss function

Our goal is to learn a map that will produce outputs $\bby_{um}$ when presented with inputs $\bbx_{um}$. To do that we define the loss function

\begin{align}\label{eqn_ch8_reco_system_training_set_arbitrary_linear} \ell \big(\Phi(\bbx_{um}; \ccalH), \bby_{um} \big) \ = \ \frac{1}{2} \Big(\bbe_m^T \Phi(\bbx_{um}; \ccalH) - \bbe_m^T\bby_{um} \Big)^2, \end{align}

where the vector $\bbe_m$ is the $m$th vector of the canonical basis of $\reals^N$. Since multiplying with $\bbe_m^T$ extracts the $m$th component of a vector, the loss in \eqref{eqn_ch8_reco_system_training_set_arbitrary_linear} compares the predicted rating $\bbe_m^T \Phi(\bbx_{um}; \ccalH) = [\Phi(\bbx_{um}; \ccalH)]_m$ with the observed rating $\bbe_m^T\bby_{um}=[\bby_{um}]_m = x_{um}$. At execution time, this map can be used to predict ratings of unrated movies from the ratings of rated movies. As we have already seen, successful rating predictions depend on the choice of parametrization. Thus, similarly to what we did in Lab 2, we will implement several parametrizations of $\Phi(\bbx_{um}; \ccalH)$ and compare their ability to accurately predict ratings for the movie Contact.

To compute the loss defined in equation \eqref{eqn_ch8_reco_system_training_set_arbitrary_linear}, we implement the following Python function:

import torch.nn as nn

def movieMSELoss(yHat,y,idxMovie):
    mse = nn.MSELoss()
    return mse(yHat[:,:,idxMovie].reshape([-1,1]), y[:,:,idxMovie].reshape([-1,1]))

Besides having the rating predictions $\p{yHat}$ and the true ratings $\p{y}$ as input arguments, this function also needs to take in $\p{idxMovie}$, the index of the movie (or movies) for which we compute the loss. The $\p{reshape}$ function merges the batch and feature dimensions to fit the dimensions required for the input arguments of the PyTorch loss $\p{nn.MSELoss}$.

2.2 Linear parametrization vs. graph filter

The simplest way to generate a generic linear parametrization is to instantiate the class $\p{torch.nn.Linear}$ like we did in Lab 2. Explicitly,

N = X.shape[1] # number of movies, i.e., number of nodes in the graph
linearParam = torch.nn.Linear(N,N)

For the graph filter, we will use the class $\p{LocalGNN}$ from the Alelab GNN library. First, let us have a look at the syntax of this class, which can be found under $\p{Modules/architectures.py}$:

LocalGNN(dimNodeSignals, nFilterTaps, bias, # Graph filtering
         nonlinearity, # Nonlinearity
         nSelectedNodes, poolingFunction, poolingSize, # Pooling
         dimReadout, # Local readout layer
         GSO, order = None) # Structure

The arguments $\p{dimNodeSignals}$ and $\p{nFilterTaps}$ are lists corresponding to the number of features and filter taps in each layer respectively. In particular, the first element of the list $\p{dimNodeSignals}$ should be the number of input features $F_0$. The flag $\p{bias}$ indicates whether we want to consider a bias term in the graph filtering layers, and the argument $\p{nonlinearity}$ specifies which nonlinearity to use in the GNN. The nonlinearity can be any PyTorch function, or any of the activation functions in the file $\p{Utils/graphML.py}$. Because graph filters don’t have nonlinearities, we will use the function $\p{graphML.NoActivation}$.

The arguments $\p{nSelectedNodes}$, $\p{poolingFunction}$, and $\p{poolingSize}$ are specific to pooling. The list $\p{nSelectedNodes}$ contains the number of nodes of the graph after each pooling layer. Since we don’t use pooling in this lab, the number of nodes is always $\p{N}$. The $\p{poolingFunction}$ can be any of the functions in $\p{Utils/graphML.py}$. We use $\p{graphML.NoPool}$. Finally, $\p{poolingSize}$ is a list specifying the size of the neighborhoods around each pooled node from which we summarize information. When using $\p{graphML.NoPool}$, the value of this parameter doesn’t matter, but we will specify the pooling sizes at each layer as 1.

To conclude, $\p{dimReadout}$ is a list specifying the number of features at the output of each readout layer, $\p{GSO}$ is the graph shift operator, and $\p{order}$ (which does not have to be specified, as it has default value $\p{None}$) is the ordering of the graph nodes. We implement the graph filter from 2.2 by instantiating $\p{LocalGNN}$ as:

# Assuming the Alelab GNN library root is on the Python path
import Modules.architectures as archit
import Utils.graphML as gml

graphFilter = archit.LocalGNN([1, 64], [5], True, gml.NoActivation, [N],
                              gml.NoPool, [1], [1], W)
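
Any of the parametrizations in this lab can be trained by minimizing the loss from 2.1 over the training set. The following is a minimal training sketch for illustration only; the optimizer choice, learning rate, number of epochs, and batch size are assumptions, and the full training loop linked below also handles validation and multiple data splits:

import torch.optim as optim

model = linearParam  # the same loop applies to graphFilter and the GNNs below
optimizer = optim.Adam(model.parameters(), lr=0.005)  # illustrative hyperparameters
nEpochs = 40
batchSize = 20

xTrainF = xTrain.float()  # cast to float32 to match the default parameter dtype
yTrainF = yTrain.float()

for epoch in range(nEpochs):
    perm = torch.randperm(nTrain)
    for b in range(0, nTrain, batchSize):
        idxBatch = perm[b:b+batchSize]
        yHat = model(xTrainF[idxBatch])
        loss = movieMSELoss(yHat, yTrainF[idxBatch], idxContact)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Test RMSE for the movie Contact
with torch.no_grad():
    yHatTest = model(xTest.float())
rmseTest = np.sqrt(movieMSELoss(yHatTest, yTest.float(), idxContact).item())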

2.3 FCNN vs. GNN

To implement the fully connected neural network, we use $\p{torch.nn.Sequential}$ to stack linear layers and ReLUs, i.e.:

fcNet = nn.Sequential(torch.nn.Linear(N,64), nn.ReLU(), torch.nn.Linear(64,N), nn.ReLU())

To implement the 1-layer GNN, we instantiate an object from the $\p{LocalGNN}$ class with the same input arguments as $\p{graphFilter}$, except for $\p{nonlinearity}$, which is now $\p{nn.ReLU}$:

GNN1Ly = archit.LocalGNN([1, 64], [5], True, nn.ReLU, [N], 
                              gml.NoPool, [1], [1], W)

2.4 Graph filter vs. GNNs

To implement the 2-layer GNN, invoke:

GNN2Ly = archit.LocalGNN([1, 64, 32], [5, 5], True, nn.ReLU, [N, N], 
                              gml.NoPool, [1, 1], [1], W)

A full training loop implementing the comparisons from items 2.1-2.3 can be found here. The average test RMSEs obtained over 10 realizations of the train-test split for each model are presented below:

Linear    FCNN      Graph filter    GNN 1lyr    GNN 2lyr
1.6314    1.2914    1.0029          0.9972      0.9953

3.1-3.5 Transferability

In most practical scenarios in which recommendation systems are used, the product portfolio is constantly changing. An example is the Netflix movie catalog, to which new movies and TV shows are added every month. As discussed in Lab 2, when graphs are dynamic it is sometimes impractical to re-train the GNN every time the graph changes. If the product portfolio is too large, it may also be too costly to train the GNN on the full product graph. In general, we want to be able to train GNNs on moderately sized graphs and apply them to similar graphs that are large and/or dynamic without much performance degradation. It turns out that this is possible in graphs such as the movie graphs of this lab where, even if nodes are added or removed, a certain overall structure is preserved. This transference property of GNNs, also known as transferability, has already been illustrated in the synthetic source localization example of Lab 2. Here, we analyze how it holds up in the more realistic scenario of movie recommendation.

To analyze transferability to the large graph, the first step is to generate the data for movies with at least 50 ratings:

# Loading and cleaning up the data

X2, idxContact2 = data.load_data(movie=257, min_ratings=50)
N2 = X2.shape[1]

# Creating and sparsifying the graph

W2 = data.create_graph(X=X2, idxTrain=idxTrain, knn=40)

# Creating the training and test sets

_, _, xTest2, yTest2 = data.split_data(X2, idxTrain, idxTest, idxContact2)

Then, we use the function $\p{changeGSO}$ of $\p{LocalGNN}$ to change the GSO of the 1-layer and the 2-layer GNN. This function has the following syntax:

changeGSO(S, nSelectedNodes = [], poolingSize = [])

where $\p{S}$ is the new GSO, $\p{nSelectedNodes}$ is a list containing the number of selected nodes after pooling in each layer, and $\p{poolingSize}$ is a list specifying the corresponding pooling sizes, i.e., the sizes of the neighborhoods over which we compute the summarizing operation. To change the GSO of $\p{GNN1Ly}$ and $\p{GNN2Ly}$, invoke:

GNN1Ly.changeGSO(W2, [N2], [1])
GNN2Ly.changeGSO(W2, [N2, N2], [1, 1])

Note that we need to update $\p{nSelectedNodes}$ to change the number of selected nodes at each layer from $\p{N}$ to $\p{N2}$. Finally, to test these models using the new data, write

lossTest2 = dict()
costTest2 = dict()

with torch.no_grad():
    yHatTest2 = GNN1Ly(xTest2)
lossTest2['GNN 1 layer'] = movieMSELoss(yHatTest2, yTest2, idxContact2)
costTest2['GNN 1 layer'] = np.sqrt(lossTest2['GNN 1 layer'].item())

with torch.no_grad():
    yHatTest2 = GNN2Ly(xTest2)
lossTest2['GNN 2 layer'] = movieMSELoss(yHatTest2, yTest2, idxContact2)
costTest2['GNN 2 layer'] = np.sqrt(lossTest2['GNN 2 layer'].item())
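
The relative RMSE difference reported in the tables below can then be obtained by comparing these costs with the test RMSEs computed on the original graph. A minimal sketch, assuming a dictionary $\p{costTest}$ holding the original-graph test RMSEs (as computed in the full training loop), is:

# Relative RMSE difference (in %) with respect to the original graph
# costTest[...] is assumed to store the test RMSEs obtained on the original graph
for key in ['GNN 1 layer', 'GNN 2 layer']:
    relDiff = 100 * np.abs(costTest2[key] - costTest[key]) / costTest[key]
    print(key, relDiff)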

This solves items 3.1-3.3. To analyze transferability to an even larger graph (items 3.4 and 3.5), simply repeat the steps above for $\p{min\_ratings}=10$. A full training loop for this lab, including the transferability experiments, can be found here. The transferability results over 5 random data splits for both the large and the larger graph are shown below:

Architecture    Original graph    Large graph    Relative RMSE difference (%)
GNN 1lyr        1.033             1.040          1.72
GNN 2lyr        0.998             1.484          30.48

Architecture    Original graph    Larger graph    Relative RMSE difference (%)
GNN 1lyr        0.950             0.958           0.84
GNN 2lyr        0.935             1.157           15.73

Note that transferability decreases with the number of layers, and increases with the size of the graph to which the GNN is transferred.