Lab 1: Empirical Risk Minimization (9/3 – 9/13)

We formulate Artificial Intelligence (AI) as the extraction of information from observations. In nature, observations and information are related by a probability distribution. An AI is a function that when given an input makes a prediction about the value that is likely to be the one that was generated by nature according to the distribution. Mathematically, this is formulated as the minimization of a risk that measures the difference between natural outputs and the output predicted by the AI. These risks can be statistical, when models are available, or empirical, when data is available but models are unknown.

Empirical risk minimization seems to do away with models. This is true to some extent, but not as true as we would like it to be.

• Download assignment.

1.1 Learning with Models: Statistical Estimation

In this part of the lab we are exploring learning when models are available. Most people would call this statistical estimation, not learning. But we are trying to draw a distinction between learning with models and learning with data. Thus, please bear with us and agree to call this learning. Whatever we call it, the lab assignment describes “nature” as an engine that produces an input-output relationship of the form

\begin{align}\label{eqn_lab_10_linear_model}
\bby=\bbA \bbx + \bbw.
\end{align}

In \eqref{eqn_lab_10_linear_model}, the vector $\bbx$ is the input and the vector $\bby$ is the output. The input-output relationship os determined by the matrix $\bbA$ and the noise vector $\bbw$ which we assume has zero mean.

We are tasked with identifying an Artificial Intelligence (AI) function $\Phi(\bbx)$ that tries to predict an output $\bby$ that is consistent with the given model. The specific meaning of consistency in this sentence is that we want the function $\Phi(\bbx)$ to minimize the expected loss

\begin{align}\label{eqn_lab1_SRM_linear_model}
\Phi^*_{\text{S}} =
\arg \min_{\Phi}
\mbE_{p(\bbx,\bby)}
\left[ \,
\frac{1}{2} \big\| \bby – \Phi(\bbx) \big\|_2^2
\, \right].
\end{align}

In general, this is an optimization problem and optimization problems can be solved numerically. In this particular case, however, it is quite elementary to find an analytic expression. To do so, observe that in \eqref{eqn_lab1_SRM_linear_model} the function $\Phi$ plays the role of a variable. This is weird but inconsequential. Taking gradients with respect to this variable and setting it to zero, yields the expression

\begin{align}\label{eqn_lab1_SRM_linear_model_derivative}
\frac{\partial}{\partial \Phi (x)}
\mbE_{p(\bbx,\bby)}
\left[ \,
\frac{1}{2} \big\| \bby – \Phi(\bbx) \big\|_2^2
\, \right]
\ = \
\mbE_{p(\bbx,\bby)}
\Big[ \,
\bby – \Phi(\bbx)
\, \Big]
\ = \
\bbzero,
\end{align}

which the optimal function $\Phi^*_{\text{S}}$ must satisfy. In the second equality in \eqref{eqn_lab1_SRM_linear_model_derivative} we exchanged the order of derivation and expectation and took gradients of the squared norm. We should point out that sending a derivative inside of an expectation requires care or luck for it to work. In our case we’re counting on our luck.

To finalize the derivation we use the model in \eqref{eqn_lab_10_linear_model}. Substituting this model into \eqref{eqn_lab1_SRM_linear_model_derivative} yields the expression

\begin{align}\label{eqn_lab1_SRM_linear_model_derivative_processed}
\mbE_{p(\bbx,\bby)}
\Big[ \,
\bbA \bbx + \bbw – \Phi(\bbx)
\, \Big]
\ = \
\bbzero .
\end{align}

In this expression the only random term is the noise $\bbw$. We have been told that it has zero mean, which makes it vanish from \eqref{eqn_lab1_SRM_linear_model_derivative_processed}. We therefore end up with the optimal classifier

\begin{align}\label{eqn_lab1_SRM_linear_model_optimal_solution}
\Phi^*_{\text{S}}(\bbx) = \bbA \bbx .
\end{align}

This is a rather obvious solution to \eqref{eqn_lab_10_linear_model}. We are just providing a clarifying example of what it means to learn when a model is available. The AI imitates nature.

1.2 – 1.4 Sample Generator Functions

In real life, data is acquired. In this lab we want to illustrate the importance of choosing a learning parametrizations that matches the structure of the data. For that purpose we will write a function that generates fake data that follows the model in \eqref{eqn_lab_10_linear_model} as well a function that generates fake data according to the model

\begin{align}\label{eqn_lab_10_sign_model}
\bby = \text{sign}\Big(\bbA \bbx + \bbw\Big).
\end{align}

This is the same as \eqref{eqn_lab_10_linear_model} except that we postprocess the output with a sign function. Generating these functions is busy work. There is nothing interesting about them.

The first task is to generate a matrix $\bbA$ with independent and identically distributed random binary entries $A_{ij} \in \{ 0, 1 \}$. This means that each entry of $\bbA$ must be Bernoulli. We will further use $\alpha$ to denote the probability of drawing a one so that entries of follow the distribution,

\begin{align}
A_{ij} \sim \text{Ber}(\alpha)
\quad \equiv \quad
\text{Pr}(A_{ij}=1) = \alpha.
\end{align}

We are further told that all the vectors involved are zero mean and that the energy of the input vector $\bbx$ is $\mbE(\|\bbx\|^2)=1/2$ and the energy of the noise vector is also $\mbE(\|\bbw\|^2)=1/2$. We are tasked with finding the probability $\alpha$ that will make the output vector $\bby$ have unit energy, $\mbE(\|\bby\|^2]=1$. This probability is just

\begin{align}
\alpha = 1 \, / \, m.
\end{align}

The Python code to obtain the matrix $\bbA$ follows by using the random.binomial function in Numpy library. The arguments of the function are the dimensions of the input and output vectors $\bbx$ and $\bby$. The output is the matrix $\bbA$ for which the output energy is $\mbE(\|\bby\|^2]=1$ when the input and noise energies are $\mbE(\|\bbx\|^2)=\mbE(\|\bbw\|^2)=1/2$. This is the code:

import numpy as np
def SampleGenerator(m,n):
    return np.random.binomial(1,1/(m),(m,n))

The second task is to use this $\bbA$ matrix in \eqref{eqn_lab_10_linear_model} to generate a given number $Q$ of samples. The third task is to use it in \eqref{eqn_lab_10_sign_model} to generate samples according to this alternative model.

For the case of the linear model sample generator function, we will create two functions, one that generates samples according to (M1) and (M2) and the second one that computes the linear product of them. There are several ways to solve this problem, but in order to use some of the libraries that Numpy has, we can resort to a normal vector,

def InputsGenerator(n):
    return np.random.normal(np.zeros(n), ((np.sqrt(0.5 / n))) * np.ones(n))

Now, we just need to obtain $Q$ samples following the linear model (2),

def TrainingSetGenerator(A,q):
    [m,n]=A.shape
    X = InputsGenerator(n)
    Y = (H @ X + InputsGenerator(m))

    for i in range(q - 1):
        X = np.column_stack((X, InputsGenerator(n)))
        Y = np.column_stack((Y, A @ X[:, -1] + InputsGenerator(m)))

    X = X.T
    Y = Y.T
    return X,Y

For the case of the sign model sample generator function, we will be pragmatic and just re-use the function we obtained previouly but adding the sign function,

def TrainingSetGenerator(A,q):
    [m,n]=A.shape
    X = InputsGenerator(n)
    Y = (H @ X + InputsGenerator(m))

    for i in range(q - 1):
        X = np.column_stack((X, InputsGenerator(n)))
        Y = np.column_stack((Y, A @ X[:, -1] + InputsGenerator(m)))

    X = X.T
    Y = np.sign(Y.T)
    return X,Y

2.1 – 2.2 Empirical Risk Minimization

In part 1 we assumed that the model was known, in this part we will take a more realistic approach. The model will not be known albeit we will have samples of the model. Hence, we need to minimize an empirical risk over a dataset consisting of $Q$ samples. We do not have any restriction on our function $\Phi$, so we are free to construct the function to our own will.

\begin{align}\label{eqn_ERM_no_parametrization} \Phi_{\text{E}}^* = \arg \min_{\Phi} \frac{1}{Q} \sum_{q=1}^Q \ell(\bby_q,\Phi(\bbx_q)). \end{align}

As long as $Q$ is large, problems $\eqref{eqn_ERM_no_parametrization}$ and $\eqref{eqn_lab1_SRM_linear_model}$ become similar, however, the solutions to them are fundamentaly different. Notice that the non-negative loss $\ell(\mathbf{y},\mathbf{\hat{y}})$ in equation $\eqref{eqn_ERM_no_parametrization}$ has a minimum value of zero that is attained when $\bby=\hat{\bby}$. Hence, the only condition that $\Phi$ needs to satisfy in order to achieve the minimum of the Empirical Risk $\eqref{eqn_ERM_no_parametrization}$ is,

\begin{align} \Phi(\mathbf x_q)= \mathcal y_q , \forall q =1,\dots,Q \end{align}

In order to construct a function over the whole domain, we need to decide the value of our function $\Phi$ outside the training set. One possible solution could be to pick $0$ as the value of out function outside the training set. Hence, one function $\Phi^*_{\text{E}}$ that minimizes the ERM $\eqref{eqn_ERM_no_parametrization}$ can be expressed as,

\begin{align} \Phi^*_{\text{E}}(\bbx) =\Big \{ \begin{array}{lr} \bby_q, & \text{ if } x=\bbx_q, \bbx_q \in \ccalT \\ 0, & \text{otherwise } \end{array} \end{align}

Notice that this function will output $0$ outside the training set thus being useless when used to predict a new sample. Solving the empirical risk minimization without restricting $\Phi$ to a particular type of function, yields no information whatsoever about observations that are outside the training set.

In short, we have learned nothing about what goes on outside of the training set.

3.1 – 3.4 Parametric Empirical Risk Minimization

In this part we are asked to obtain a close form solution of problem ERM $\eqref{eqn_ERM_no_parametrization}$ for the particular case of a linear parametrization $\eqref{eqn_lab_10_linear_model}$ and using the squared loss function. In the first case, the model to obtain samples will be linear, we will use the previously obtain function TrainingSetGenerator. To obtain the close form solution, we need to take the derivate of the ERM with respect to matrix \bbH and solve equal to $0$. Hence,the solution is,

\begin{align} \bbH^*=\bbY^T \bbX (\bbX^T\bbX)^{-1}, \label{sol:LeastSquares} \end{align}

Recall that in this example, we have used the function TrainingSetGenerator which gives us matrixes $\bbY \in \reals^{Q\times m}$ and $\bbx \in \reals ^{Q \times n}$. The error measured on the training set $\ccalT$ and an unseen set $\ccalT^{‘}$ is $0.45$ and $0.55$ respectively. This makes sense, as $0.5$ is the energy of the noise $\bbw$ which is the smallest error we can aim to achieve. The fact that the error on both the training set and an unseen set is similar and around the same as the energy of the noise is due to the fact that the amount of samples is large enough for us to learn the model precisely.

In order to obtain the solution, we can resort to the $numpy$ function $inv$ which calculates the inverse of the matrix. Hence the code for this section is:

m=10**2
n=10**2
Q=10**3
A=SampleGenerator(m,n)
T_x,T_y=TrainingSetGenerator(A,Q)
T_x_prime,T_y_prime=TrainingSetGenerator(A,Q)

H=T_y.T@T_x@np.linalg.inv(T_x.T@T_x)

print("The MSE on the training set is 
print("The MSE on the unseen set is 
print("The total energy of the samples is

In the second part of the problem (3.2), we are asked to generate data according to the sign data and to learn a linear model. The solution to this problem is the same as in the previous example, namely $\eqref{sol:LeastSquares}$. In this case, the parameterization fails to learn the model. The energy of the samples is $100$, and the MSE over the training set is $67$ and the MSE over the unseen data is $83$. This result makes sense as we are trying to approximate a non-linear model with a linear parameterization, in short, our parameterization is inaccurate. Note that the energy of each vector is $100$ as the energy of each individual component of the vector is $1$ and the vector is of lenght $100$.

In the case of (3.3), we need to repeat the steps we went through in problem (3.1), but now the dimension of the observation and the information are way larger, $m=n=10^4$. If you happen to experience numerical issues on your computer, you can decrease $Q=10^2$ while keeping $m=n=10^2$ and see the same phenomena. In this case, we will fail categorically when evaluating in the unseen set $\ccalT^{‘}$, as we do not have the amount of data necessary to learn the model. However, the error on the training set $\ccalT$ will be ridiculously small. The takeaway in this exercise is that even though we have the correct parameterization, we will not be able to learn the model unless the correct amount of data is provided.

4.2 – 4.2 Stochastic Gradient Descent

In this section we are asked to derive the gradient step for the case of a linear parameterization and a norm squared loss. In order to do so, we need to compute the derivative of the loss function with respect to the parameters, namely:

\begin{align}\label{eqn_erm_gradients_norm2} \nabla L(\bbH) = \frac{1}{Q} \sum_{q=1}^Q \nabla_{\bbH} \frac{1}{2}||\bby_q – \bbH \bbx_q||^2, \end{align}

where $\bbH \in \reals^{m\times n}$. Taking derivatives with respect to the matrix $\bbH$, we can express the gradient step for one sample $q$ as,

\begin{align}\label{eqn_erm_gradients} \nabla_{\bbH} \frac{1}{2}||\bby_q – \bbH \bbx_q||^2= -(\bby_q – \bbH \bbx_q)\bbx_q^T. \end{align}

We can now plug this result on the iterative process, to obtain the SGD,

\begin{align}\label{eqn_erm_gradient_descent} \bbH_{t+1} \ = \ \bbH_t – \eps \nabla L(\bbH) \ = \ \bbH_t – \frac{\eps}{Q} \sum_{q=1}^Q -(\bby_q – \bbH \bbx_q)\bbx_q^T. \end{align}

We can create function $grad$ in Python by implementing the equation $\eqref{eqn_erm_gradients}$,

m=10**2
def Grad(X,Y,H):
    matricial=(1/X.shape[0] ) * (-Y.T+H@X.T)@(X)
    return matricial

In the second section we are asked to implement SGD on python, we need to generate a loop in which first the gradient according to $\eqref{eqn_erm_gradients}$, then the learning step \eqref{eqn_erm_gradient_descent} is implemented and finally, the MSE error is computed, an example of it can be:

n=10**2
m=10**2
Q=10**3

b_size=10 #Batch Size
stepsize=1 #Stepsize, e.g. epsilon

A=SampleGenerator(m,n)
T_x,T_y=TrainingSetGenerator(A,Q)
Test_x,Test_y=TrainingSetGenerator(A,Q)

H=np.zeros((m,n))
error_train=[]
error_test=[]

for i in range(Q):
    permutation = np.random.permutation(Q) #Get an with the first Q integers in random order
    mb_x= np.array([T_x[index] for index in permutation[0:b_size] ])  # Sub Sample Training Set
    mb_y =np.array( [T_y[index] for index in permutation[0:b_size] ])  # Sub Sample Training Set

    grad=Grad(mb_x,mb_y,H) #Compute Gradient
    H=H-stepsize*grad #Compute the learning step


    error_train=error_train+[np.linalg.norm(T_y.T-H@T_x.T)**2/Q]
    error_test=error_test+[np.linalg.norm(Test_y.T-H@Test_x.T)**2/Q]


plt.plot(range(len(error_train)),error_train,'*r') #Plot
plt.plot(range(len(error_train)),error_test,'*b') #Plot

plt.title('MSE using Gradient Descent')
plt.xlabel('Iterations')
plt.ylabel('MSE')
plt.savefig('MSE_computing_grad.pdf', bbox_inches='tight', dpi=150)
plt.show()