A common example of a low-bias, high-variance model is a high-degree polynomial fit applied to a non-linear data set. The parameter of the model (the degree of the polynomial)
can be adjusted to fit the model very precisely (i.e. achieve low bias on the training data), but the addition of new points would almost certainly force the model to change its polynomial degree to fit the new data. This makes it a very high-variance model on the in-sample data. Such a model would likely have very poor predictive or inferential capability on out-of-sample data.
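The behaviour above can be sketched numerically. This is a minimal illustration, not taken from the text: all data, seeds and degree choices are assumptions. Fitting polynomials of rising degree to noisy non-linear data drives the training error steadily down, which is exactly the low-bias side of the trade-off:

```python
import numpy as np

# Synthetic non-linear data: a noisy sine curve (illustrative values only)
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Least-squares polynomial fits of increasing degree
for degree in (1, 3, 10):
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree={degree:2d}  training MSE={mse:.4f}")
```

The training MSE shrinks as the degree rises, but the higher-degree fit will swing wildly between and beyond the sample points, so its error on new data is typically far worse.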
Overfitting can also manifest itself in the trading strategy itself, not just the statistical model. For instance, we could optimise the Sharpe ratio by varying entry and exit threshold parameters. While this may improve profitability in the backtest (or reduce risk substantially), such behaviour would likely not be replicated when the strategy is deployed live, since we may simply have fitted these optimisations to noise in the historical data.
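A small sketch makes the danger concrete. The series below is pure random noise, so any Sharpe ratio "improvement" found by sweeping the entry threshold is fitted entirely to that noise; the trading rule, threshold grid and all names are hypothetical, not a strategy from the text:

```python
import numpy as np

# Purely random daily "returns": there is no genuine signal to exploit here
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, 500)

def sharpe(strategy_returns):
    # Annualised Sharpe ratio over 252 trading days, zero risk-free rate
    sd = strategy_returns.std()
    return 0.0 if sd == 0.0 else np.sqrt(252) * strategy_returns.mean() / sd

# Sweep the entry threshold and keep the best in-sample Sharpe
best_sharpe, best_entry = -np.inf, None
for entry in np.linspace(0.0, 0.01, 11):
    # Toy rule: long today if yesterday's return exceeded the threshold
    position = (returns[:-1] > entry).astype(float)
    strat_returns = position * returns[1:]
    s = sharpe(strat_returns)
    if s > best_sharpe:
        best_sharpe, best_entry = s, entry

print(f"best in-sample Sharpe: {best_sharpe:.2f} at threshold {best_entry:.4f}")
```

Whatever threshold "wins" here cannot carry any real predictive power forward, which is precisely the failure mode a live deployment would expose.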
We will discuss techniques below to minimise overfitting as much as possible. However, one has to be aware that it is an ever-present danger in both algorithmic trading and statistical analysis in general.
16.2 Model Selection
In this section we are going to consider how to optimise the statistical model that will underlie a trading strategy. In the field of statistics and machine learning this is known as Model Selection.
While I won’t present an exhaustive discussion of the various model selection techniques, I will describe some of the basic mechanisms, such as Cross Validation and Grid Search, that work well for trading strategies.
16.2.1 Cross Validation
Cross Validation is a technique used to assess how a statistical model will generalise to new data
that it has not been exposed to before. Such a technique is usually used on predictive models,
such as the aforementioned supervised classifiers used to predict the sign of the next day's returns of an asset price series. Fundamentally, the goal of cross validation is to minimise error on out-of-sample data without producing an overfit model.
In this section we will describe the training/test split and k-fold cross validation, as well as use techniques within Scikit-Learn to automatically carry out these procedures on statistical models we have already developed.
Train/Test Split
The simplest example of cross validation is known as a training/test split, or 2-fold cross validation. Once a prior historical data set is assembled (such as a daily time series of asset prices), it is split into two components. The ratio of the split is usually varied between 0.5 and 0.8. In the latter case 80% of the data is used for training and 20% is used for testing.
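Scikit-Learn automates this split. The sketch below is illustrative: the lagged-returns feature matrix and the sign labels are assumptions standing in for whichever predictors the model actually uses. Note `shuffle=False`, which preserves time ordering so that the test set follows the training set chronologically, as a financial backtest requires:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative features (two lagged returns) and labels (sign of next return)
rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.01, 250)
X = np.column_stack([returns[1:-1], returns[:-2]])
y = np.sign(returns[2:])

# An 80%/20% training/test split that respects the time ordering of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.shape, X_test.shape)
```

The model is then fitted on `X_train`/`y_train` only, with `X_test`/`y_test` held back to estimate out-of-sample error.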