Lecture 2: Empirical Risk Minimization (9/6 – 9/11)

In Lecture 1 we saw that our interest in graph neural networks (GNNs) stems from their use in artificial intelligence and machine learning problems that involve graph signals. Before we talk more about GNNs we need to be more specific about what we mean by machine learning (ML) and artificial intelligence (AI). Our specific meaning is that ML and AI are synonyms of statistical and empirical risk minimization. These are the subjects we will study in this lecture. We have to disclaim that this is a somewhat narrow definition of ML and AI, but it is one that will be sufficient for our goals. In addition to studying statistical and empirical risk minimization we will also have a brief discussion about stochastic gradient descent. Throughout we will emphasize the importance of choosing appropriate learning parametrizations for successful learning.

• Handout.

• Script.

• Access full lecture playlist.

Video 2.1 – Artificial Intelligence as Statistical Learning

Conceptually, an artificial intelligence (AI) is a system that extracts information from observations. Nature associates observations with information according to a distribution, and the role of an AI is to mimic nature. The mathematical formulation of this process of mimicry is a statistical risk minimization problem.
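
The statistical risk minimization problem described above can be written compactly. The notation here is an assumption made for illustration and may differ from the handout: x is an observation, y is the information nature associates with it, p(x, y) is nature's distribution, \Phi is the AI, and \ell is a loss that measures how far the AI's output \Phi(x) is from y.

```latex
\Phi^{*} \;=\; \operatorname*{argmin}_{\Phi} \;
\mathbb{E}_{p(x,y)} \Big[ \, \ell\big( y, \Phi(x) \big) \Big]
```

The expectation averages the loss over everything nature can produce, which is why solving this problem requires access to p(x, y).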

• Covers Slides 1-6 in the handout.

Video 2.2 – A Word on Models

We have rewritten AI as the solution of a statistical risk minimization (SRM) problem. But solving SRM requires that we have access to the model relating observations to information. If solving SRM needs a model, the pertinent question is: Where does this model come from? There are three possible answers to this question: systems modeling, systems identification, and machine learning proper.

• Covers Slides 7-11 in the handout.

Video 2.3 – Empirical Risk Minimization (ERM)

We began with a definition of learning in terms of statistical risk minimization, but we have evolved into a definition in terms of what we will now see is empirical risk minimization (ERM). This is a form of learning that bypasses models by trying to imitate observations, as opposed to imitating models. ERM differs from SRM in that losses are not averaged over a distribution but over a dataset. We will see that if care is not exercised, we are bound to end up with a nonsensical formulation of learning.
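
As a sketch of the difference, ERM replaces the average over a distribution with an average over a dataset. The notation is assumed for illustration: the training set holds Q pairs (x_q, y_q), \Phi is the AI, and \ell is the loss.

```latex
\Phi^{*} \;=\; \operatorname*{argmin}_{\Phi} \;
\frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big)
```

If \Phi is allowed to be any function, and assuming a loss that vanishes when its two arguments coincide, setting \Phi(x_q) = y_q on the training samples drives the objective to zero while saying nothing about \Phi(x) at any other x. This is the sense in which the formulation is nonsensical unless care is exercised.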

• Covers Slides 12-15 in the handout.

Video 2.4 – ERM with Learning Parametrizations

Learning with data produced a mathematical formulation in terms of empirical risk minimization (ERM). Alas, it produced a problem that does not make sense because its solution cannot generalize outside of the training set. The search for a problem that makes sense brings us to the concept of learning parametrizations. To obtain a sensible ERM problem we need to introduce a function class. Instead of searching for an optimal AI over the space of all possible functions, we search over the functions that belong to this class.
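
One way to write the restricted search is to index the function class by a parameter. The symbols are assumptions for illustration: H is the parameter, \mathcal{H} is the set of admissible parameters, \Phi(x; H) is the member of the class selected by H, and (x_q, y_q), q = 1, ..., Q, is the training set.

```latex
H^{*} \;=\; \operatorname*{argmin}_{H \in \mathcal{H}} \;
\frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q; H) \big)
```

The optimization variable is now the parameter H rather than an arbitrary function, so the learned \Phi(x; H^{*}) is defined for inputs x outside the training set as well.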

• Covers Slides 16-20 in the handout.

Video 2.5 – Stochastic Gradient Descent (SGD)

Empirical risk minimization (ERM) entails the solution of an optimization problem. Stochastic gradient descent (SGD) is the customary method used for the minimization of the empirical risk. SGD is designed to avoid the elevated computational cost of computing gradients of the empirical risk. It does so by drawing a batch of samples at each iteration and averaging pointwise gradients over the batch, as opposed to averaging gradients over the full training set.
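
The batch-averaging idea can be sketched in a few lines of plain Python. Everything in this example is an assumption made for illustration and is not taken from the handout: a scalar linear parametrization Phi(x; h) = h * x, a squared loss, synthetic data with slope 3, and the hypothetical function names `make_data` and `sgd_linear`.

```python
import random

def make_data(n=200, slope=3.0, noise=0.1, seed=1):
    """Synthetic training set (assumed for illustration): y = slope * x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, slope * x + rng.gauss(0.0, noise)))
    return data

def sgd_linear(data, step=0.05, batch_size=8, iters=2000, seed=0):
    """Minimize (1/Q) * sum_q 0.5 * (h * x_q - y_q)**2 over the scalar h with SGD."""
    rng = random.Random(seed)
    h = 0.0
    for _ in range(iters):
        # Draw a batch of samples instead of using the full training set.
        batch = rng.sample(data, batch_size)
        # Stochastic gradient: average of pointwise gradients over the batch.
        grad = sum((h * x - y) * x for x, y in batch) / batch_size
        h -= step * grad
    return h

data = make_data()
h_sgd = sgd_linear(data)
# Closed-form minimizer of the same empirical risk, for comparison.
h_erm = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
print(f"SGD estimate: {h_sgd:.3f}, exact ERM solution: {h_erm:.3f}")
```

Each iteration touches `batch_size` samples rather than all of them, which is where the computational saving comes from. The step size and batch size are arbitrary choices that happen to work for this toy problem.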

• Covers Slides 21-26 in the handout.

Video 2.6 – SGD Memorabilia

Our coverage of stochastic gradient descent (SGD) has been brief and incomplete. But some points are worth remembering: (i) Gradient descent converges because negative gradients point towards the minimum. (ii) SGD converges because negative stochastic gradients point in the right direction on average. (iii) The cost of computing stochastic gradients of the empirical risk is much smaller than the cost of computing full gradients of the empirical risk. We also discuss convergence properties of SGD and its use in the optimization of functions that are not convex.

• Covers Slides 27-31 in the handout.

Video 2.7 – The Importance of Learning Parametrizations

We close this lecture with a discussion on the importance of selecting the right learning parametrization. We have seen that artificial intelligence reduces to empirical risk minimization and that in ERM all we have to do is choose a learning parametrization. We will illustrate with some examples that this is not an easy choice. The parametrization controls generalization outside of the training set and it can make or break an AI system. When all is said and done, the parametrization is a model of how outputs are related to inputs. And, as is always the case with models, it has to be an accurate representation of nature.
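
A toy experiment can make the point concrete. This is a hypothetical illustration and not one of the examples from the lecture: nature is assumed to follow y = 2x^2 plus small noise, training inputs lie in [0, 1], test inputs lie in [1, 2], and two one-parameter classes are compared, Phi(x; h) = h * x and Phi(x; h) = h * x^2. All function names are invented for this sketch.

```python
import random

def sample(n, lo, hi, rng):
    """Hypothetical 'nature': y = 2 * x**2 plus small Gaussian noise."""
    out = []
    for _ in range(n):
        x = rng.uniform(lo, hi)
        out.append((x, 2.0 * x * x + rng.gauss(0.0, 0.1)))
    return out

def fit_one_parameter(features, data):
    """Closed-form ERM for squared loss with Phi(x; h) = h * features(x)."""
    num = sum(features(x) * y for x, y in data)
    den = sum(features(x) ** 2 for x, _ in data)
    return num / den

def empirical_risk(features, h, data):
    """Average squared loss of Phi(x; h) = h * features(x) over a dataset."""
    return sum((h * features(x) - y) ** 2 for x, y in data) / len(data)

def linear(x):
    return x

def quadratic(x):
    return x * x

rng = random.Random(0)
train = sample(100, 0.0, 1.0, rng)  # inputs seen during training
test = sample(100, 1.0, 2.0, rng)   # inputs outside the training range

h_lin = fit_one_parameter(linear, train)
h_quad = fit_one_parameter(quadratic, train)
risk_lin = empirical_risk(linear, h_lin, test)
risk_quad = empirical_risk(quadratic, h_quad, test)
print(f"test risk, linear class:    {risk_lin:.3f}")
print(f"test risk, quadratic class: {risk_quad:.3f}")
```

Both classes are trained on the same data with the same loss. Only the class whose form matches the assumed model of nature keeps a small loss on inputs outside the training range.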

• Covers Slides 32-39 in the handout.