Random forest, an algorithm
developed by Breiman (2001), adds another very clever feature to the bagging
procedure. At each split of the tree, the
number of available predictors is reduced
to a random sample of all predictors.
Typically, the size of this random sample
is the square root of the total number of
predictors (e.g., three out of nine). “In
other words, in building a random forest,
at each split in the tree, the algorithm is not
even allowed to consider a majority of the
available predictors” (James et al. 2013,
320). This seems counterintuitive, because the procedure deliberately reduces the explanatory power of each single tree. Suppose, for example, that one predictor (like Year in the above example) can explain a lot of the variance in the data. If this predictor is excluded from many splits, each individual tree will get worse. On the other
hand, the limitation of predictors at each
split has a very advantageous effect on the
ensemble of trees. In particular, when one predictor is much stronger than the others, all trees would otherwise automatically choose this predictor as a starting point, with the result that all trees look more or less the same and are highly correlated. But
“averaging highly correlated quantities
does not lead to as large of a reduction in
variance as averaging many uncorrelated
quantities” (James et al. 2013, 320). As
long as the number of trees is sufficiently
high to guarantee that all predictors
get their chance and all data points are
evaluated by enough decision trees,
random forest is a very robust algorithm
with high accuracy.
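To make the predictor-sampling step concrete, the following short sketch uses Python's scikit-learn with a simulated dataset of nine predictors (the data and variable names are illustrative, not those of the study); max_features="sqrt" restricts each split to √9 = 3 randomly drawn predictors.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset with nine predictors, mirroring the "three out of nine" example.
X, y = make_classification(n_samples=1000, n_features=9, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # number of bootstrapped trees
    max_features="sqrt",   # consider sqrt(9) = 3 randomly chosen predictors per split
    random_state=0,
)
forest.fit(X, y)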
The bootstrapping has an
additional advantage. Every tree is built
only on a sample of the data. The other
data points are out of bag (OOB), that is,
they are not used to fit this particular tree. Because several hundred trees are grown,
every observation will be OOB in some of
the models. For every data point, we can
take only those trees where the point has
been OOB and use these models to predict
the class of the data point by averaging
the predictions. This way, we obtain a more or less unbiased performance indicator (the OOB error), because only predictions from trees that did not use the data point in question for building the model are evaluated. The OOB error can be used to
find a good value for the number of trees
that are grown in a random forest. In
general, random forests are very robust
against overfitting, so the number of trees should be sufficiently high (see Hastie, Tibshirani, and Friedman 2009, 596–597 for a critical analysis). A standard value is
500, but especially in huge datasets, this
value might be too low. A plot of the OOB error against the number of trees is a good way to see whether more trees might lead to better results.
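Such a plot can be produced along the following lines (again a scikit-learn sketch, reusing the illustrative X and y from above; the OOB error is computed as 1 minus the OOB accuracy reported by the library).

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

tree_counts = [50, 100, 200, 300, 500, 750, 1000]
oob_errors = []
for n in tree_counts:
    rf = RandomForestClassifier(n_estimators=n, max_features="sqrt",
                                oob_score=True, random_state=0)
    rf.fit(X, y)                          # X, y as in the previous sketch
    oob_errors.append(1 - rf.oob_score_)  # OOB error = 1 - OOB accuracy

plt.plot(tree_counts, oob_errors, marker="o")
plt.xlabel("Number of trees")
plt.ylabel("OOB error")
plt.show()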
Random forests are very sensitive
to the number of predictors tried at
each split. Testing different values here
can substantially improve the model.12 The
evaluation criterion for the splits depends
on the type of problem (classification or
regression). Besides the described Gini
index, cross-entropy is often used for
classification. Regression trees normally
rely on the residual sum of squares (RSS).
In particular, for the study of rare events,
changing the ensemble rule is very
interesting (Hastie et al. 2009, 622).
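How sensitive the forest is to these settings can be checked with a simple cross-validated grid search; the sketch below (hypothetical grid values, scikit-learn API) varies the number of predictors tried per split and switches between the Gini index and cross-entropy as the split criterion. For regression problems, the analogous estimator would be RandomForestRegressor with a squared-error criterion, corresponding to the RSS.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features": [2, 3, 5, 9],        # predictors tried at each split
    "criterion": ["gini", "entropy"],    # Gini index vs. cross-entropy
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)            # X, y as in the earlier sketches
print(search.best_params_)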
As discussed above, random
forest normally takes the majority vote over the classifications of all trees to predict
the class of an observation. The problem
here is that extreme events are very rare
by nature. Therefore, they are always