What is generalization?
If the training loss does decrease as expected, that alone doesn't mean that whatever the model has learned is also useful. This is where the validation loss comes into play. Things look good if the validation loss decreases alongside the training loss; in that case, the learned patterns seem to generalize to the unseen validation data. The validation loss will typically stay somewhat higher than the training loss, however, since not all patterns generalize, as you can see in the following graphic.
Figure: If the validation loss decreases as well, the learned patterns seem to generalize.
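In code, that check can be as simple as evaluating the same loss on a held-out split. Here is a minimal sketch with scikit-learn; the model, data, and split size are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=1000)

# Hold out part of the data purely for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Training loss: how well the model fits the data it has seen.
print("train loss:", mean_squared_error(y_train, model.predict(X_train)))
# Validation loss: whether the learned patterns carry over to unseen data.
print("val loss:  ", mean_squared_error(y_val, model.predict(X_val)))
```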
Bias
Bias is the difference between the model's average prediction and the true values; its square is what enters the usual error decomposition. It's a measure of how well your model fits the data. Zero bias would mean that the model captures the true data-generating process perfectly. Even then, your training and validation loss would not drop all the way to zero, because real data is almost always noisy; that remaining noise floor is called the irreducible error.
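For reference, the standard bias-variance decomposition of the expected squared error ties these pieces together. Here f-hat is the fitted model, f the true function, and sigma squared the noise variance, i.e. the irreducible error:

```latex
\operatorname{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)

\mathbb{E}\Big[\big(y - \hat{f}(x)\big)^2\Big]
  = \operatorname{Bias}\big[\hat{f}(x)\big]^2
  + \operatorname{Var}\big[\hat{f}(x)\big]
  + \sigma^2
```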
Anyway, if losses do not decrease as expected, it probably signals that the model is not a good fit for the data. That would happen, for example, if you tried to fit an exponential relationship with a linear model; it simply cannot capture that relationship adequately. In that case, try a different, more flexible model.
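A small sketch of what that mismatch looks like in practice. The data and the log-transform workaround are just illustrative choices, not the only possible fix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Illustrative data with an exponential relationship plus a little noise.
x = np.linspace(0, 4, 200).reshape(-1, 1)
y = np.exp(x).ravel() + rng.normal(scale=0.5, size=200)

# A straight line cannot capture the curvature, so its error stays high.
linear = LinearRegression().fit(x, y)
print("linear MSE:    ", mean_squared_error(y, linear.predict(x)))

# A more flexible choice, e.g. fitting the log of the target, does much better.
log_linear = LinearRegression().fit(x, np.log(np.clip(y, 1e-3, None)))
preds = np.exp(log_linear.predict(x))
print("log-linear MSE:", mean_squared_error(y, preds))
```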
You may also call this underfitting, though with a slightly different connotation. Unlike bias, underfitting implies that the model still has the capacity to learn, so you would simply train for more iterations or collect more data.
Importantly, biases may also hide in the training data itself, which is easy to overlook. Your training loss may decrease as usual in that case; only testing on real-world data can reveal such a bias.
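A toy sketch of that situation, with made-up sampling ranges and a hypothetical ground-truth function: the training data only covers part of the input range, so the training loss looks fine while the loss on more representative data does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def true_process(x):
    # Hypothetical ground truth; nearly linear on [0, 2], clearly nonlinear beyond.
    return 0.5 * x ** 2

# Biased collection: the training data only covers x in [0, 2].
x_train = rng.uniform(0, 2, size=(300, 1))
y_train = true_process(x_train).ravel() + rng.normal(scale=0.1, size=300)

# "Real" data covers the full range x in [0, 10].
x_real = rng.uniform(0, 10, size=(300, 1))
y_real = true_process(x_real).ravel() + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(x_train, y_train)
print("loss on biased training data:", mean_squared_error(y_train, model.predict(x_train)))
print("loss on real data:           ", mean_squared_error(y_real, model.predict(x_real)))
```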
Variance
A model is said to have high variance if its predictions are sensitive to small changes in the input. In other words, you can think of the fitted function between the data points as not being smooth but very wiggly. That is usually not what you want. High variance often means overfitting, because the model seems to have captured random noise or outliers.
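To make that concrete, here is a small sketch (the degrees, noise level, and evaluation point are arbitrary) comparing how much the predictions of a modest and a very flexible polynomial move when the input is nudged slightly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth underlying function.
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)

# Fit a modest and a very flexible polynomial to the same points.
smooth = np.polynomial.Polynomial.fit(x, y, deg=3)
wiggly = np.polynomial.Polynomial.fit(x, y, deg=15)

# Nudge an input slightly and see how much each prediction moves.
x0, eps = 0.43, 0.01
print("degree 3 change: ", abs(smooth(x0 + eps) - smooth(x0)))
print("degree 15 change:", abs(wiggly(x0 + eps) - wiggly(x0)))
```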
As with high bias and underfitting, high variance and overfitting are closely related, but they are not entirely equivalent in meaning.
Overfitting
At
some point during the training of a model, the validation loss usually levels
out (and sometimes even starts to increase again) while the training loss
continues to decrease. That signals overfitting. In other words, the model is still learning patterns, but they do not generalize beyond the training set (see graphic below). Overfitting is particularly common for models with a large number of parameters, like deep neural networks.
Figure: Overfitting can happen after a certain number of training iterations.
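One way to see this is to sweep model capacity on the same data. In this sketch, polynomial degree stands in for "more parameters"; the degrees and sample sizes are arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.5, random_state=0)

# Higher capacity keeps lowering the training error; at some degree the
# validation error typically starts to climb again: overfitting.
for degree in (1, 3, 9, 15):
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    val_mse = np.mean((poly(x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, val MSE {val_mse:.3f}")
```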
A
large gap between training and validation loss is a hint that the model does
not generalize well and you may want to try to narrow that gap (graphic below).
The simplest remedy for overfitting is early stopping, that is, stopping the training loop as soon as the validation loss begins to level off.
Alternatively, regularization may help (see below). Underfitting, on the other
hand, may happen if you stop too early.
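Here is a minimal sketch of early stopping with a patience counter. The model, iteration budget, and patience value are arbitrary choices; libraries such as Keras ship ready-made early-stopping callbacks, and scikit-learn's MLPRegressor has an early_stopping option of its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# warm_start=True makes each fit() call continue training for another max_iter steps.
model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=20,
                     warm_start=True, random_state=0)

best_val, patience, bad_rounds = np.inf, 3, 0
for round_ in range(100):
    model.fit(X_train, y_train)
    val = mean_squared_error(y_val, model.predict(X_val))
    if val < best_val:
        best_val, bad_rounds = val, 0
    else:
        bad_rounds += 1
    if bad_rounds >= patience:  # validation loss stopped improving: stop early
        print(f"stopping after {(round_ + 1) * 20} iterations, best val MSE {best_val:.3f}")
        break
```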
Regularization
Regularization is a method to avoid high variance and overfitting and to improve generalization. Without getting into details, regularization aims to keep coefficients close to zero. Intuitively, the function the model represents then becomes simpler and less erratic, so predictions are smoother and overfitting is less likely (graphic below). Regularization can be as simple as shrinking or penalizing large coefficients, which is often called weight decay. L1 and L2 regularization are two widely used methods, but you may also encounter other forms, such as dropout regularization in neural networks.
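As a rough sketch with scikit-learn's Ridge (L2 penalty) and Lasso (L1 penalty) estimators (the alpha values and data are arbitrary), you can watch the penalty pull coefficients toward zero compared to plain least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)

# Twenty features, but only the first two actually matter.
X = rng.normal(size=(50, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

for name, model in [("plain     ", LinearRegression()),
                    ("ridge (L2)", Ridge(alpha=10.0)),
                    ("lasso (L1)", Lasso(alpha=0.1))]:
    coefs = model.fit(X, y).coef_
    print(f"{name}  mean |coef|: {np.abs(coefs).mean():.3f}, "
          f"coefs exactly zero: {int((coefs == 0).sum())}")
```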
To sum it all up, learning is well and good, but generalization is what we really want. To that end, a good model should have both low bias and low variance, and both overfitting and underfitting should be avoided. Regularization may be part of the solution to all of that.