Train validation and test data split ratio depends on total number of


Model Comparison (19-Jan-20)
2. Training & Test Data - continued

Some data scientists divide their data into only two portions: training data and testing data. The train, validation and test split ratio depends on the total number of observations in the data and on the model being built. If the model has no hyperparameters, or has hyperparameters that are difficult to tune, a validation dataset may not be needed. 70:10:20 is a frequently used train:validation:test split ratio.

Caveats:
1) The test data must be large enough to yield statistically significant results.
2) The test data must be representative of the data set as a whole.
3) Never train on the test data.
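As a rough illustration of the 70:10:20 ratio, the sketch below shuffles a list of records and cuts it into three parts. The function name `train_val_test_split`, the `ratios` and `seed` parameters, and the use of plain Python lists are assumptions made for this example; they do not come from the slides.

```python
import random

def train_val_test_split(records, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle records and cut them into train, validation and test lists.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # shuffle so each part is a random sample
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, so no record is dropped
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test portion is never touched during training or tuning; only the validation portion is used to compare hyperparameter settings.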
3. Principle of Model validation

Validation techniques in machine learning are used to estimate the error rate of the ML model in a way that can be considered close to the true error rate on the population. They are needed because we work with samples of data that may not be truly representative of the population.

Validation techniques:
1. Re-substitution
2. Hold-out
3. K-fold cross validation
4. LOOCV
5. Random sub-sampling
6. Bootstrapping
Validation techniques:
1. Re-substitution: All of the data is used for training the model, and the error rate is evaluated by comparing predicted outcomes with actual values on that same training data set. This error is called the re-substitution error.
2. Hold-out: The data is split into two separate data sets (train and test), each with the same distribution of the different classes of data. Because the model is evaluated on records it was not trained on, this avoids the re-substitution error.
3. K-fold cross validation: The data is divided into k folds; k - 1 folds are used for training and the remaining fold is used for testing. This is repeated k times so that each fold serves as the test set once. The error rate is the average of the error rates of the k iterations.
[Figure: K-fold cross validation]
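The k-fold procedure can be sketched in plain Python as follows. The majority-class "model" is a hypothetical stand-in chosen only to keep the example self-contained, and the helper names `k_fold_indices`, `majority_fit` and `k_fold_error` are assumptions, not part of the slides. The sketch also assumes k is no larger than the number of records, so that no fold is empty.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def majority_fit(labels):
    """Hypothetical stand-in model: remember the most common training label."""
    return Counter(labels).most_common(1)[0][0]

def k_fold_error(labels, k=5, seed=0):
    """Average the per-fold error rates of the majority-class model."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        predicted = majority_fit([labels[j] for j in train_idx])
        wrong = sum(labels[j] != predicted for j in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k

labels = ["a"] * 70 + ["b"] * 30
print(k_fold_error(labels, k=5))
```

With a real model, `majority_fit` would be replaced by training on the k - 1 folds and predicting each record in the held-out fold.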
4. LOOCV: The acronym stands for Leave One Out Cross Validation. All of the data except one record is used for training, and that one record is used for testing. This process is repeated n times, where n is the number of observations, so that each record is held out once. The error rate of the model is the average of the error rates of the n iterations.
[Figure: LOOCV technique]
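A minimal sketch of LOOCV, assuming a toy one-dimensional data set and a 1-nearest-neighbour classifier as a hypothetical stand-in model (neither appears in the slides). Each iteration tests a single record, so its error is either 0 or 1, and the average over n iterations is the fraction of misclassified records.

```python
def nearest_neighbor_predict(train, x):
    """Hypothetical stand-in model: label of the closest training point (1-NN)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def loocv_error(data):
    """Hold each record out once, train on the other n - 1, average the errors."""
    n = len(data)
    errors = 0
    for i in range(n):
        held_out_x, held_out_y = data[i]
        train = data[:i] + data[i + 1:]  # every record except record i
        if nearest_neighbor_predict(train, held_out_x) != held_out_y:
            errors += 1
    return errors / n

# Toy (feature, label) records, made up for illustration
data = [(1.0, "low"), (1.2, "low"), (1.4, "low"),
        (3.0, "high"), (3.1, "high"), (2.0, "high")]
print(loocv_error(data))
```

Because the model is refit n times, LOOCV becomes expensive on large data sets; k-fold with a small k is the usual cheaper alternative.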
5. Random sub-sampling: Multiple sets of records are randomly chosen from the dataset and combined to form the test dataset. The remaining data forms the training dataset.
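One way to sketch random sub-sampling is to draw the test indices at random without replacement and keep the rest for training. The helper name `random_subsample` and its parameters are assumptions for this example. Repeating the draw with several seeds, as done at the end, is also an assumption (it is commonly done so that the error rates can be averaged), not something the slide states.

```python
import random

def random_subsample(n, test_size, seed):
    """Draw test_size record indices at random (no repeats); the rest are training."""
    rng = random.Random(seed)
    test_idx = sorted(rng.sample(range(n), test_size))
    chosen = set(test_idx)
    train_idx = [i for i in range(n) if i not in chosen]
    return train_idx, test_idx

# Assumption: repeat the random draw a few times with different seeds
splits = [random_subsample(n=20, test_size=5, seed=s) for s in range(3)]
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Unlike k-fold cross validation, nothing here guarantees that every record ends up in a test set, since each draw is independent.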
6. Bootstrapping: The training dataset is randomly selected with replacement. The records that were not selected are used for testing. Because the sampling is random, the number of distinct records in the training data, and therefore the size of the test data, changes from iteration to iteration. The error rate of the model is the average of the error rates of the iterations.
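The bootstrap split can be sketched as below. Drawing exactly n indices per iteration is an assumption (the slide does not state the sample size), and the name `bootstrap_split` is hypothetical. Records that are never drawn form the test set, so its size varies with the random draw.

```python
import random

def bootstrap_split(n, seed):
    """Sample n training indices with replacement; unselected indices form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]  # duplicates are allowed
    selected = set(train_idx)
    test_idx = [i for i in range(n) if i not in selected]  # records never drawn
    return train_idx, test_idx

for seed in range(3):
    train_idx, test_idx = bootstrap_split(50, seed)
    # The count of distinct training records and the test size differ per iteration
    print(len(set(train_idx)), len(test_idx))
```

A model would be trained on `train_idx` and scored on `test_idx` in each iteration, and the resulting error rates averaged.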