underrepresented in the data and the majority vote might be too strict. In our budget example, only 57 of 610 budget shifts count as punctuations. Therefore, the random chance for any observation to be a punctuation is about 10 percent. If our model predicts a 40 percent probability that an observation has the value "Punctuated," this is four times higher than the random chance. Still, the majority vote would classify the observation as "Incremental," because this decreases the classification error (Chen, Liaw, and Breiman 2004). If we instead change the ensemble rule (say, every observation is labeled "Punctuated" if 30 percent or more of the single decision trees classify it as a major budget shift), the model will predict more punctuations, which will lead to more correctly classified punctuations. But, of course, this will weaken the overall classification rate because more incremental changes will be wrongly labeled as "Punctuated." Whether this is desirable depends on the objectives of the model. If the model is meant to be the most accurate classifier, changing the majority vote usually decreases performance. But if the model is meant as a "detector" for rare events, it can be useful to increase the number of correctly detected punctuations even at the cost of accuracy.
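The following sketch illustrates this idea in Python with scikit-learn. The simulated data, variable names, and the 30 percent threshold are illustrative assumptions, not the original analysis; lowering the threshold trades overall accuracy for a higher detection rate of the rare class.

```python
# Minimal sketch: replacing the majority vote of a random forest with a
# lower voting threshold so that more rare "Punctuated" cases are detected.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data with roughly 10 percent "Punctuated" observations (label 1),
# standing in for the 610 budget shifts discussed in the text.
X, y = make_classification(n_samples=610, n_features=5,
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)

# Predicted probability of the "Punctuated" class for each test observation
# (roughly the share of trees voting for that class).
vote_share = forest.predict_proba(X_test)[:, 1]

majority_vote = (vote_share >= 0.5).astype(int)   # default ensemble rule
detector_rule = (vote_share >= 0.3).astype(int)   # relaxed 30 percent rule

print("Punctuations predicted (majority vote):", majority_vote.sum())
print("Punctuations predicted (30% threshold):", detector_rule.sum())
```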
Cross-validation
As discussed earlier, overfitting is a serious issue with machine learning. The algorithms are sometimes very accurate on the dataset the model is fitted to but perform poorly on new data. Random forest increases its robustness by means of the ensemble approach. Nevertheless, overfitting remains an issue. The state-of-the-art procedure to deal with this situation is cross-validation. The idea is to build the model on one dataset and test it on a different one:
Ideally, there would be two random samples from the same population. One would be a training data set, and one would be a testing data set. […] Often, there is only a single data set. An alternative strategy is to split the data up into several randomly chosen, nonoverlapping parts. (Berk 2006, 277)
For cross-validation, the dataset is split randomly into a training set containing, for example, two thirds of the data and a test set with the remaining one third. The final model is fitted on the training data only, and the predictions for the test data are evaluated. This validation set approach, in principle, should prevent overfitting. An advantage of this method is that it is easy to apply, but there are two potential drawbacks that should be kept in mind:
1. The validation-set approach can lead to quite different results, depending on the actual division of training and test set. In practice, splitting the data should always be done with a "frozen" random number generator13 so that others are able to reproduce the results (see the sketch after this list).
2. "Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set" (James et al. 2013, 178). Splitting the data into a training set and a test set therefore tends to understate the accuracy the model would achieve if it were fitted on the entire dataset.
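A minimal Python sketch of the validation-set approach follows; the `random_state` argument in scikit-learn plays the role of the "frozen" random number generator mentioned above, and the simulated data again stands in for the budget dataset as an illustrative assumption. As an alternative, the split into "several randomly chosen, nonoverlapping parts" described by Berk (2006) corresponds to k-fold cross-validation.

```python
# Minimal sketch: reproducible validation-set approach and k-fold
# cross-validation with a fixed ("frozen") random number generator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data with ~10 percent rare-event observations.
X, y = make_classification(n_samples=610, n_features=5,
                           weights=[0.9, 0.1], random_state=42)

# Validation-set approach: two thirds training data, one third test data;
# random_state freezes the split so others can reproduce it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)
print("Accuracy on the held-out test set:", forest.score(X_test, y_test))

# Alternative: five nonoverlapping folds (k-fold cross-validation).
scores = cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=42), X, y, cv=5)
print("5-fold cross-validation accuracy:", scores.mean())
```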