Learning Systems

The learning process in a learning system can be summarized in several steps:

  1. The learner has a sample of observations: a random set of objects or instances, each of which has a set of features (the feature vector \(\mathbf{X}\)) and a label/outcome (\(y\)). We call this sequence of pairs a training set: \(S=\left(\left(\mathbf{X}_{1}, y_{1}\right), \ldots,\left(\mathbf{X}_{m}, y_{m}\right)\right)\).
  2. We ask the learner to produce a prediction rule (a predictor or a classifier), so that we can use it to predict the outcome of new observations (instances).
  3. We assume that the training dataset \(S\) is generated by a data-generating model (DGM) or a labeling function, \(f(x)\). The learner does not know \(f(x)\); in fact, we ask the learner to discover it.
  4. The learner will come up with a prediction rule, \(\hat{f}(x)\), by using \(S\), which will be different from \(f(x)\). Hence, we can measure the learning system’s performance with a loss function, \(L_{(S, f)}(\hat{f})\), a rule (or function) that quantifies the difference between \(\hat{f}(x)\) and \(f(x)\). This is also called the generalization error or the risk.
  5. The goal of the algorithm is to find \(\hat{f}(x)\) that minimizes the difference from the unknown \(f(x)\). The key point here is that, since the learner does not know \(f(x)\), it cannot quantify this gap directly. Instead, it calculates the prediction error, also called the empirical error or the empirical risk, which quantifies the difference between \(\hat{f}(x_i)\) and \(y_i\) over the training set.
  6. Hence, the learning process can be defined as finding a predictor \(\hat{f}(x)\) that minimizes the empirical error. This process is called Empirical Risk Minimization (ERM).
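The steps above can be sketched in code. This is a minimal, hypothetical example: the labeling function \(f(x) = 2x + 1\), the squared-error loss, and the grid of candidate lines are all illustrative choices, not part of any particular algorithm.

```python
# A minimal sketch of Empirical Risk Minimization (ERM) over a tiny,
# hypothetical hypothesis class of lines y = a*x + b.

def empirical_risk(predictor, sample):
    """Average squared-error loss of `predictor` over the training set S."""
    return sum((predictor(x) - y) ** 2 for x, y in sample) / len(sample)

def erm(sample, hypothesis_class):
    """Return the predictor in the class with the smallest empirical risk."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, sample))

# Training set S generated by the labeling function f(x) = 2x + 1,
# which is unknown to the learner.
f = lambda x: 2 * x + 1
S = [(x, f(x)) for x in range(10)]

# Hypothesis class: lines y = a*x + b over a coarse grid of (a, b).
H = [lambda x, a=a, b=b: a * x + b
     for a in range(-3, 4) for b in range(-3, 4)]

f_hat = erm(S, H)
print(empirical_risk(f_hat, S))  # 0.0: the grid happens to contain f itself
```

Because the true line \(a=2, b=1\) lies inside the grid, ERM recovers it exactly here; with noisy labels or a class that excludes \(f\), the minimized empirical risk would be positive.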

The question now becomes: under what conditions does ERM perform well or poorly?

If we use only the training data (in-sample data points) to minimize the empirical risk, the process can drive the empirical error to zero while the generalization error \(L_{(S, f)}(\hat{f})\) remains large. This problem is called overfitting, and a way to rectify it is to restrict the set of predictors the learning model can choose from. The common practice is to “train” the model over a subsection of the data (training or in-sample data points) and evaluate the resulting predictor on held-out test data (out-of-sample data points). Since this process restricts the learning model, such a restriction is also called an inductive bias in the process of learning.

There are always two “universes” in a statistical analysis: the population and the sample. The population is usually unknown or inaccessible to us. We consider the sample a random subset of the population. Whatever statistical analysis we apply, it almost always uses that sample dataset, which could be very large or very small. Although the sample is randomly drawn from the population, it may not be representative of it: there is always some risk that the sampled data happen to be very unrepresentative of the population. Intuitively, the sample is a window through which we have partial information about the population. We use the sample to estimate an unknown parameter of the population, which is the main task of inferential statistics. In predictive systems, we also use the sample, but we develop a prediction rule to predict unknown population outcomes.
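A small simulation makes the sample-versus-population point concrete. The population below is hypothetical (100,000 draws from a Gaussian); the sketch shows that each random sample gives a slightly different estimate of the population mean, and that small samples can be noticeably unrepresentative.

```python
# Sampling sketch: estimating a population mean from random samples
# of different sizes (population and sizes are illustrative choices).
import random

random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

est_small = sum(random.sample(population, 10)) / 10        # n = 10
est_large = sum(random.sample(population, 1_000)) / 1_000  # n = 1,000

print(round(est_small - pop_mean, 2))  # estimation error with a small sample
print(round(est_large - pop_mean, 2))  # typically much closer with a large sample
```

The larger sample is, on average, a wider “window” onto the population, so its estimate of the mean tends to sit closer to the true value.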

Can we use an estimator as a predictor? Could a “good” estimator also be a “good” predictor? The simulations in the previous chapter showed that the best estimator could be the worst predictor. Why? In this section we delve deeper into these questions to find answers.

The starting point will be to define these two distinct but related processes.