Some data scientists divide their data into two portions: training data and test data.
✓ The train, validation, and test split ratio depends on the total number of observations in the
data and on the model you are building.
✓ If your model has no hyperparameters, or has hyperparameters that are difficult to tune,
you may not need a validation dataset at all.
✓ 70:10:20 is a frequently used train, validation, and test split ratio.
Caveats:
1) The test data must be large enough to yield statistically significant results.
2) The test data must be representative of the data set as a whole.
3) Never train on the test data.
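A minimal sketch of the 70:10:20 split, using only the Python standard library. The function name `train_val_test_split`, the seed, and the 1000 stand-in records are assumptions made for illustration; they are not part of the slides.

```python
import random

def train_val_test_split(records, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle the records and cut them into train, validation, and test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)               # work on a copy of the input
    random.Random(seed).shuffle(shuffled)  # the seed value is arbitrary
    n_train = round(ratios[0] * len(shuffled))
    n_val = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the test set takes whatever remains
    return train, val, test

records = list(range(1000))  # stand-in for 1000 observations
train, val, test = train_val_test_split(records)
print(len(train), len(val), len(test))  # 700 100 200
```

The three lists are disjoint slices of one shuffled copy, so no test record can end up in the training data (caveat 3).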
2. Training & Test Data - continued
Model Comparison
19-Jan-20
84

✓ Validation techniques in machine learning are used to estimate the error rate of
the ML model so that the estimate is as close as possible to the true error rate
on the population.
✓ We work with samples of data that may not be truly representative of
the population, which is why validation techniques are required.
✓ Validation techniques:
1. Re-substitution
2. Hold-out
3. K-fold cross-validation
4. LOOCV
5. Random sub-sampling
6. Bootstrapping
3. Principle of Model validation

➢ Validation techniques:
1. Re-substitution: All of the data is used for training the model, and the error rate is evaluated by comparing the predicted outcome with the actual value on the same training data set. This error is called the re-substitution error.
2. Hold-out: The data is split into two separate data sets (train and test), each with a similar distribution of the different classes. Because the model is evaluated on data it was not trained on, this avoids the re-substitution error.
3. K-fold cross-validation: The data is divided into k folds; k - 1 folds are used for training and the remaining one is used for testing. This is repeated so that each fold serves as the test set once, and the error rate is the average of the error rates across iterations.
Figure: K-fold cross-validation
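The first three techniques can be compared side by side in a short sketch. Everything here is an assumption made for illustration: the 1-nearest-neighbour model, the synthetic two-class data, the 70/30 hold-out ratio, and k = 5. The hold-out split is a plain shuffled split and is not stratified by class.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training record whose feature is closest to x."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def error_rate(train, test):
    """Fraction of test records misclassified by a 1-NN model built on train."""
    wrong = sum(predict_1nn(train, x) != y for x, y in test)
    return wrong / len(test)

def k_fold_error(data, k):
    """Average the test error over k folds; each fold is the test set once."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        # Train on every fold except fold i, then test on fold i.
        train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        errors.append(error_rate(train, folds[i]))
    return sum(errors) / k

# Synthetic data (hypothetical): two overlapping classes on a single feature.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(1.5, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

resub = error_rate(data, data)                # train and test on the same data
holdout = error_rate(data[:140], data[140:])  # 70% train, 30% test
kfold = k_fold_error(data, k=5)
print(resub, holdout, kfold)
```

With a 1-NN model, every training record is its own nearest neighbour, so the re-substitution error comes out as zero even though the classes overlap. That is the optimism that hold-out and K-fold are meant to remove.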

4. LOOCV: The acronym stands for Leave-One-Out Cross-Validation. In this
technique, all of the data except one record is used for training, and
that one record is used for testing. This process is repeated n times,
where n is the number of observations. The error rate of the model
is the average of the error rates across iterations.
Figure: LOOCV technique
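A small sketch of LOOCV under the same kind of assumptions: the 1-nearest-neighbour model and the six toy records are hypothetical and serve only to make the loop concrete.

```python
def predict_1nn(train, x):
    """Return the label of the training record whose feature is closest to x."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def loocv_error(data):
    """Leave each record out once, train on the rest, and average the errors."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every record except record i
        errors.append(0.0 if predict_1nn(train, x) == y else 1.0)
    return sum(errors) / len(data)       # n iterations, one per observation

# Toy data (hypothetical): one feature value and a class label per record.
data = [(0, 'a'), (1, 'a'), (2, 'a'), (4, 'b'), (10, 'b'), (11, 'b')]
loocv = loocv_error(data)
print(loocv)  # only (4, 'b') is misclassified, so the error is 1/6
```

Each iteration has a test set of exactly one record, so the per-iteration error is either 0 or 1 and the average is simply the fraction of records misclassified.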

5. Random sub-sampling: Multiple sets of records are randomly chosen
from the dataset and combined to form the test dataset. The remaining
data forms the training dataset.
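A sketch of random sub-sampling. It assumes the random split is repeated several times and the errors are averaged, which is common for this technique but is not stated above. The 1-nearest-neighbour model, the toy data, and the parameter names `test_fraction` and `repeats` are likewise hypothetical.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training record whose feature is closest to x."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def random_subsampling_error(data, test_fraction=0.3, repeats=10, seed=0):
    """Repeat a random train/test split and average the test error."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(data)))
    errors = []
    for _ in range(repeats):
        # Draw the test records at random, without replacement.
        test_idx = set(rng.sample(range(len(data)), n_test))
        test = [data[i] for i in test_idx]
        # Everything that was not drawn forms the training set.
        train = [rec for i, rec in enumerate(data) if i not in test_idx]
        wrong = sum(predict_1nn(train, x) != y for x, y in test)
        errors.append(wrong / len(test))
    return sum(errors) / repeats

# Toy data (hypothetical): two well-separated classes on one feature.
data = [(0, 'a'), (1, 'a'), (2, 'a'), (10, 'b'), (11, 'b'), (12, 'b')]
subsampling = random_subsampling_error(data)
print(subsampling)
```

Unlike K-fold cross-validation, nothing guarantees that every record is tested exactly once: a record may land in the test set several times or never.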

6. Bootstrapping: The training dataset is randomly selected with
replacement. The records that were not selected are used for testing.
The number of distinct training records, and hence the size of the test set,
changes from iteration to iteration. The error rate of the model is the
average of the error rates across iterations.
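A sketch of bootstrap validation. It assumes each training sample has the same number of draws as the dataset has records, and that the left-out records form the test set. The 1-nearest-neighbour model, the toy data, and the parameter name `iterations` are hypothetical.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training record whose feature is closest to x."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def bootstrap_error(data, iterations=20, seed=0):
    """Average the error on left-out records over several bootstrap samples."""
    rng = random.Random(seed)
    n = len(data)
    errors = []
    for _ in range(iterations):
        picked = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        chosen = set(picked)
        train = [data[i] for i in picked]              # may contain duplicates
        test = [data[i] for i in range(n) if i not in chosen]
        if not test:   # every record was drawn, so nothing is left to test on
            continue
        wrong = sum(predict_1nn(train, x) != y for x, y in test)
        errors.append(wrong / len(test))
    if not errors:
        raise ValueError("no iteration left any record out for testing")
    return sum(errors) / len(errors)

# Toy data (hypothetical): two well-separated classes on one feature.
data = [(0, 'a'), (1, 'a'), (2, 'a'), (10, 'b'), (11, 'b'), (12, 'b')]
boot = bootstrap_error(data)
print(boot)
```

Because the draws are random, the number of left-out records differs between iterations, which is why each iteration's error is divided by its own test set size before averaging.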

