European Policy Analysis Volume 2, Number 1, Spring 2016

Decision Trees and Random Forests: Machine Learning Techniques to Classify Rare Events

Simon Hegelich
Technical University of Munich / Bavarian School of Public Policy, Munich, Germany

doi: 10.18278/epa.2.1.7

The article introduces machine learning algorithms for political scientists. These approaches should not be seen as a new method for old problems. Rather, it is important to understand the different logic of the machine learning approach. Here, data is analyzed without theoretical assumptions about possible causalities. Models are optimized according to their accuracy and robustness. While the computer can do this work more or less alone, it is the researcher’s duty to make sense of these models afterward. Visualization of machine learning results, therefore, becomes very important and is the focus of this paper. The methods that are presented and compared are decision trees, bagging, and random forests. Bagging and random forests are more advanced versions of decision trees, relying on bootstrapping procedures. To demonstrate these methods, extreme shifts in the US budget and their connection to the attention of political actors are analyzed. The paper presents a comparison of the accuracy of different models based on ROC curves and shows how to interpret random forest models with the help of visualizations. The aim of the paper is to provide an example of how these methods can be used in political science and to highlight possible pitfalls as well as advantages of machine learning.

Keywords: Machine learning, methods, punctuated equilibrium, statistics for the 21st century

Introduction

Machine learning, the use of computer algorithms that change their performance with new data, is a new tool for political scientists that can be very useful, especially in analyzing “unusual” settings such as extreme events, big data problems, or classification of rare events. The main difference between machine learning and classical statistics is the way problems are formulated. Traditional approaches in political science start with the formulation of hypotheses, proceed to the creation of formal models that represent the underlying causalities, and then test these models on the available data. Machine learning starts with data, tries to find hidden patterns, and then comes up with formal models that can “explain” additional cases. So, both approaches follow a quite different
