Random forest, an algorithm developed by Breiman (2001), adds one other very clever feature to the bagging procedure. At each split of a tree, the set of available predictors is reduced to a random sample of all predictors. Typically, the size of this random sample is the square root of the total number of predictors (e.g., three out of nine). “In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors” (James et al. 2013, 320). This seems counterintuitive, because the procedure inevitably reduces the explanatory power of each single tree. Suppose there is one predictor (like Year in the above example) that explains a lot of the variance in the data. If this predictor is left out at many splits, each single tree will get worse. On the other hand, limiting the predictors at each split has a very advantageous effect on the ensemble of trees. Especially when one predictor is stronger than the others, all trees would otherwise automatically choose this predictor as a starting point, with the result that all trees look more or less the same and are highly correlated. But “averaging highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities” (James et al. 2013, 320). As long as the number of trees is high enough to guarantee that all predictors get their chance and every data point is evaluated by enough decision trees, random forest is a very robust algorithm with high accuracy.

The bootstrapping has an additional advantage. Every tree is built on only a sample of the data. The other data points are out of bag (OOB), that is, they are not used to fit this particular tree. Because several hundred trees are grown, every observation will be OOB in some of the models. For every data point, we can take only those trees for which the point has been OOB and use these models to predict its class by averaging their predictions. This way we obtain a more or less unbiased performance indicator (the OOB error), because only trees that did not use the data point in question for model building are evaluated.

The OOB error can be used to find a good value for the number of trees grown in a random forest. In general, random forests are very robust against overfitting, so the number of trees should simply be sufficiently high (see Hastie, Tibshirani, and Friedman 2009, 596–597 for a critical analysis). A standard value is 500, but especially for huge datasets this value might be too low. A plot of the OOB error against the number of trees is a good way to see whether more trees might lead to better results. Random forests are, however, very sensitive to the number of predictors tried at each split; testing different values here might improve the model considerably.12

The evaluation criterion for the splits depends on the type of problem (classification or regression). Besides the Gini index described above, cross-entropy is often used for classification. Regression trees normally rely on the residual sum of squares (RSS). Especially for the study of rare events, changing the ensemble rule is very interesting (Hastie et al. 2009, 622). As discussed above, random forest normally takes the majority vote of the classifications of all trees to predict the class of an observation. The problem here is that extreme events are very rare by nature. Therefore, they are always outvoted by the far more common class.
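To make the mechanics described above concrete, the following minimal sketch (not taken from the original study) fits a random forest in Python with scikit-learn. The feature matrix X, the label vector y, and the simulated data are purely illustrative assumptions; in practice they would be replaced by the policy dataset at hand. The sketch restricts each split to roughly the square root of the predictors and reports the OOB error as a rough, built-in performance estimate.

```python
# Illustrative sketch only: hypothetical data standing in for a real policy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Simulated data with nine predictors, mirroring the "three out of nine" example.
X, y = make_classification(n_samples=1000, n_features=9, random_state=0)

# max_features="sqrt": at each split only a random subset of sqrt(p) predictors
# (here three of nine) is considered, which decorrelates the individual trees.
# oob_score=True: each observation is evaluated only by trees for which it was
# out of bag, giving a more or less unbiased accuracy estimate without a test set.
rf = RandomForestClassifier(
    n_estimators=500,       # the standard value discussed in the text
    max_features="sqrt",
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)

print("OOB error:", 1 - rf.oob_score_)
```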
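The two tuning checks mentioned above, plotting the OOB error against the number of trees and testing different numbers of predictors per split, could be sketched as follows. This is again only an illustration under assumptions: it reuses the hypothetical X and y from the previous block and treats scikit-learn's max_features parameter as a stand-in for the number of predictors tried at each split.

```python
# Illustrative tuning sketch; X and y are assumed to exist from the previous block.
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# (1) OOB error as the forest grows. warm_start=True keeps the trees already
# grown and only adds new ones each time n_estimators is increased.
rf = RandomForestClassifier(max_features="sqrt", oob_score=True,
                            warm_start=True, random_state=0)
tree_counts = list(range(50, 1001, 50))
oob_errors = []
for n in tree_counts:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_errors.append(1 - rf.oob_score_)

plt.plot(tree_counts, oob_errors)
plt.xlabel("number of trees")
plt.ylabel("OOB error")
plt.show()

# (2) Sensitivity to the number of predictors considered at each split.
for m in (2, 3, 5, 9):
    rf_m = RandomForestClassifier(n_estimators=500, max_features=m,
                                  oob_score=True, random_state=0)
    rf_m.fit(X, y)
    print(f"predictors per split = {m}: OOB error = {1 - rf_m.oob_score_:.3f}")
```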
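Finally, one simple way to change the ensemble rule for rare events is to lower the vote share required to predict the rare class instead of relying on the default majority vote. The sketch below is only one possible illustration of that idea: the imbalanced simulated data and the 10 percent threshold are hypothetical choices, not values from the original study.

```python
# Illustrative sketch of an alternative ensemble rule for rare events.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, strongly imbalanced data (about 5 percent "events", class 1).
X, y = make_classification(n_samples=2000, n_features=9,
                           weights=[0.95, 0.05], random_state=0)

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)

# OOB vote share for the rare class: the fraction of out-of-bag trees that
# classify each observation as an event.
event_share = rf.oob_decision_function_[:, 1]

majority_vote = (event_share > 0.5).astype(int)      # the default rule
lowered_threshold = (event_share > 0.1).astype(int)  # an alternative rule

print("events predicted by majority vote:    ", majority_vote.sum())
print("events predicted with a 0.1 threshold:", lowered_threshold.sum())
```

Lowering the threshold trades precision for recall: more true events are flagged, at the cost of more false alarms, which is often the preferable trade-off when missing a rare event is costly.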