3. Do the same with all other predictors.
4. Choose the predictor and the corresponding cutpoint that reduces the criterion most.
5. Split the data in two parts according to the selected cutpoint.
6. Repeat the procedure for both parts of the data until a stopping criterion is reached; for instance, until no region contains more than five observations (a code sketch of these steps follows below).
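To make steps 3 to 5 concrete, the following Python sketch searches every predictor and every candidate cutpoint for the split that reduces the node impurity the most. It is only an illustration, not the implementation used here, and it assumes the Gini index as the splitting criterion.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search over all predictors (step 3) and cutpoints for the
    split that reduces the weighted impurity the most (step 4)."""
    best_j, best_cut, best_impurity = None, None, np.inf
    for j in range(X.shape[1]):                # every predictor
        for cut in np.unique(X[:, j]):         # every observed cutpoint
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            if len(left) == 0 or len(right) == 0:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if impurity < best_impurity:
                best_j, best_cut, best_impurity = j, cut, impurity
    return best_j, best_cut                    # step 5 splits on this cutpoint
```

Step 6 would then apply best_split recursively to the two resulting parts until, for example, no part contains more than five observations.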
Decision trees are very prone to overfitting. In an extreme case, we could divide the predictor space into as many regions as there are data points. The result would be a perfect prediction of the training data (unless two cases with different classes share exactly the same position). But such an overly complex tree would perform poorly on new data (i.e., it would not be robust). Conversely, keeping the regions as large as possible (few splits) increases the robustness of the decision tree because these large regions will probably be suitable for new data points. But the tradeoff then is a higher classification error.
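This tradeoff can be illustrated with scikit-learn (a hypothetical example, not part of the original analysis): growing the tree until every region is tiny yields near-perfect in-sample accuracy but weaker out-of-sample accuracy, whereas requiring larger regions lowers the in-sample fit but tends to generalize better.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simulated data; any binary classification dataset would do.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# min_samples_leaf is the stopping criterion: the smallest region allowed.
for min_leaf in (1, 5, 50):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=1)
    tree.fit(X_train, y_train)
    print(min_leaf,
          round(tree.score(X_train, y_train), 2),  # accuracy on the training data
          round(tree.score(X_test, y_test), 2))    # accuracy on new data
```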
Random forest is an upgrade of
the decision tree method that overcomes
this problem.
Random Forest
The problem with decision trees
is that they suffer from high variance.
This means that slightly different data
might lead to very different decision
trees. Calculating the mean is a common
way to reduce the variance. In a set "of n independent observations Z₁, …, Zₙ, each with the variance σ², the variance of the mean Z̄ of the observations is given by σ²/n. In other words, averaging a set of observations reduces variance" (James et
al. 2013, 306). So, if we ran the decision
tree algorithm on multiple training sets,
we could average the models and come up
with one low-variance machine learning
algorithm. The problem is, of course,
that we (normally) do not have multiple
training sets. Splitting our data into different sets does not help because every model built on a subset would be strongly biased. The solution is bootstrapping.9 The
procedure is quite simple. We can create
multiple datasets from the original data by
sampling with replacement (Mooney 1996;
Shikano 2006). The dataset is treated like
a bag from which every observation can
be drawn and added to the bootstrapped
dataset. Then this observation is returned to the bag, so it can be drawn again.
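As a rough illustration of this resampling step (a hypothetical sketch, not code from the article), a bootstrapped dataset of the same size as the original can be drawn with replacement in a single line:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
data = np.arange(10)  # stand-in for the observations of an original dataset

# Sampling with replacement: some observations appear more than once,
# others (on average roughly a third) are left out entirely.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```

Repeating this draw many times yields the multiple training sets needed for averaging the models.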