Learning Systems

In learning problems, the task for a learning system can be summarized in several steps:

  1. The learner has a sample of observations: an arbitrary (random) set of objects or instances, each of which has a vector of features (\(\mathbf{X}\), the feature vector) and a label or outcome (\(y\)). We call this sequence of pairs a training set: \(S=\left(\left(\mathbf{X}_{1}, y_{1}\right), \ldots,\left(\mathbf{X}_{m}, y_{m}\right)\right)\).
  2. We ask the learner to produce a prediction rule (a predictor or a classifier model) that we can use to predict the outcome of new domain points (observations/instances).
  3. We assume that the training dataset \(S\) is generated by a data-generating model (DGM): the instances are drawn from some unknown distribution \(\mathcal{D}\) and labeled by some “correct” labeling function, \(f(x)\). The learner does not know \(f(x)\); in fact, we ask the learner to discover it.
  4. The learner will come up with a prediction rule, \(\hat{f}(x)\), by using \(S\), and this rule will generally be different from \(f(x)\). Hence, we can measure the learning system’s performance by a loss function, \(L_{(\mathcal{D}, f)}(\hat{f})\), which quantifies the difference between \(\hat{f}(x)\) and \(f(x)\) over the whole distribution. This is also called the generalization error or the risk.
  5. The goal of the algorithm is to find the \(\hat{f}(x)\) that minimizes this error with respect to the unknown \(f(x)\). The key point here is that, since the learner knows neither \(\mathcal{D}\) nor \(f(x)\), it cannot calculate the loss function. What it can calculate is the training error, also called the empirical error or the empirical risk, which measures the difference between \(\hat{f}(\mathbf{X}_i)\) and \(y_i\) on the training set: \(L_{S}(\hat{f})=\frac{1}{m} \sum_{i=1}^{m} \ell\big(\hat{f}(\mathbf{X}_{i}), y_{i}\big)\), where \(\ell\) is a per-observation loss such as the squared error.
  6. Hence, the learning process can be defined as coming up with a predictor \(\hat{f}(x)\) that minimizes the empirical error. This process is called Empirical Risk Minimization (ERM); a minimal simulation of it follows this list.
  7. Now the question becomes: what sort of conditions would lead to bad or good ERM?
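
To make steps 4–6 concrete, here is a minimal sketch of ERM, assuming a squared loss and a polynomial hypothesis class; these choices, and the particular labeling function `f`, are illustrative assumptions rather than part of the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "correct" labeling function of the DGM; the learner never sees it
    return np.sin(2 * np.pi * x)

# Training set S = ((X_1, y_1), ..., (X_m, y_m)) drawn from the DGM
m = 30
x_train = rng.uniform(0, 1, m)
y_train = f(x_train) + rng.normal(0, 0.2, m)  # noisy labels

# ERM over degree-3 polynomials with squared loss:
# np.polyfit returns the coefficients that minimize the training error
f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=3))

# Empirical risk L_S(f_hat): average squared error on the training pairs
empirical_risk = np.mean((f_hat(x_train) - y_train) ** 2)

# The true risk is unknowable to the learner; we can approximate it on a
# large fresh sample only because we, the simulators, know f
x_new = rng.uniform(0, 1, 100_000)
true_risk = np.mean((f_hat(x_new) - f(x_new)) ** 2)

print(f"empirical risk:    {empirical_risk:.4f}")
print(f"approx. true risk: {true_risk:.4f}")
```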

If we use the training data (in-sample data points) to minimize the empirical risk, the process can drive the training error all the way to \(L_{S}(\hat{f}) = 0\): the predictor fits the sample perfectly yet may perform poorly on new data. This problem is called overfitting. The common way to guard against it is to “train” the model over a subsection of the data (“seen” or in-sample data points) and to evaluate its error on the test data (“unseen” or out-of-sample data points). Another remedy is to restrict the learning model, for example by limiting the number of features it may use; because such a restriction builds prior structure into the learner, it is called an inductive bias.
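
A short simulation of this overfitting pattern, under the same illustrative polynomial setup as above (the degrees, sample sizes, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)

# "Seen" (in-sample) and "unseen" (out-of-sample) points from the same DGM
x_train = rng.uniform(0, 1, 25)
y_train = f(x_train) + rng.normal(0, 0.3, 25)
x_test = rng.uniform(0, 1, 1_000)
y_test = f(x_test) + rng.normal(0, 0.3, 1_000)

# As the hypothesis class grows richer (higher-degree polynomials), the
# training error keeps falling while the test error eventually turns up
for deg in (1, 3, 9):
    f_hat = np.poly1d(np.polyfit(x_train, y_train, deg))
    train_mse = np.mean((f_hat(x_train) - y_train) ** 2)
    test_mse = np.mean((f_hat(x_test) - y_test) ** 2)
    print(f"degree {deg}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```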

There are always two “universes” in a statistical analysis: the population and the sample. The population is usually unknown or inaccessible to us, so we treat the sample as a random subset of it. Almost any statistical analysis we apply uses that sample, which could be very large or very small. Although the sample is randomly drawn from the population, it may not be representative of it: there is always some risk that the sampled data happen to be highly unrepresentative of the population. Intuitively, the sample is a window through which we obtain partial information about the population. We use the sample either to estimate an unknown parameter of the population, which is the main task of inferential statistics, or to develop a prediction rule for predicting unknown population outcomes.
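
This sampling risk is easy to see in a small simulation; the normal population and the sample size below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# A "population" that, in practice, we would never observe in full
population = rng.normal(loc=50, scale=10, size=1_000_000)
print(f"population mean: {population.mean():.2f}")

# Each random sample is one "window" on the population: the sample mean
# estimates the population mean, but it varies from draw to draw, and an
# unlucky draw can be quite unrepresentative
for _ in range(5):
    sample = rng.choice(population, size=100, replace=False)
    print(f"sample mean: {sample.mean():.2f}")
```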

When we have a numeric (non-binary) outcome, the loss function, which can be expressed as the mean squared error (MSE), \(\mathrm{MSE}(\hat{f}) = \mathbb{E}\big[(y - \hat{f}(x))^{2}\big]\), assesses the quality of a predictor or an estimator. Note that we call \(\hat{f}(x)\) a predictor or an estimator. Can we use an estimator as a predictor? Could a “good” estimator also be a “good” predictor? The simulations in the previous chapter showed that the best estimator could be the worst predictor. Why? In this section we will try to delve deeper into these questions to find answers.
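
As one hedged illustration of this tension, the sketch below compares OLS, the best linear unbiased estimator of the coefficients, with a ridge-style shrinkage fit; the design, sample sizes, and penalty are arbitrary assumptions, and the previous chapter’s own simulations may differ:

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 40, 30                            # few observations, many features
beta = rng.normal(0, 1, p)               # true coefficients of the DGM

X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(0, 3, n)

X_new = rng.normal(size=(5_000, p))      # unseen data for prediction
y_new = X_new @ beta + rng.normal(0, 3, 5_000)

def fit(lam):
    # Closed-form ridge estimator; lam = 0 gives unbiased OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 10.0):
    b_hat = fit(lam)
    coef_mse = np.mean((b_hat - beta) ** 2)           # quality as an estimator
    pred_mse = np.mean((X_new @ b_hat - y_new) ** 2)  # quality as a predictor
    print(f"lambda = {lam:4.1f}: coef MSE = {coef_mse:.3f}, "
          f"prediction MSE = {pred_mse:.3f}")
```

With a sample this small relative to the number of features, the biased shrinkage fit typically delivers a lower out-of-sample MSE than unbiased OLS, which is the essence of the estimator-versus-predictor distinction.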

The starting point will be to define these two similar but distinct processes.