Journal on Policy & Complex Systems Volume 3, Issue 2 | Page 129

al., 2013). When feature selection is performed using univariate statistical analysis, features that have epistatic interactions will most likely not be selected for use in subsequent multivariate models. This should be of particular concern with respect to the triatomine vectors of Chagas disease, because infestation is inherently associated with epistatic interactions when the system is viewed through the lens of ecological niche modeling. To survive in and infest a house, the vector requires, at a minimum, a source of shelter and a readily available food source. Other system features may also be important (e.g., the attractiveness of the house to the vector, including initial entry or passive modes of transportation into the house). Household infestation by triatomine vectors is therefore a complex nonlinear system with epistatic relationships between features that are often included as potential risk factors in models of triatomine infestation.
Given the large number of potential risk factors associated with triatomine infestation, it is natural for studies to reduce the number of model features a priori, because including large numbers of features makes an exhaustive search of all possible models prohibitively expensive or impossible. In addition, a priori feature reduction may help remove noisy features that could lead to overfitting in the multivariate models. As a result, Bustamante et al. (2015) held a workshop in order to pre-select 25 features for multivariate modeling of T. dimidiata. Features were selected because previous studies showed that they increase the odds of infestation; thus, the selected features are inherently biased toward previous univariate model selection (i.e., a selection that is not able to search the datasets for epistatic interactions).
In addition, missing data challenge many statistical methods. Removing an entire observation (e.g., a house) because one or two of its features are missing discards the many other features recorded for that observation, which may contain important information, and can change the individual feature distributions (Bustamante et al., 2015).
Another challenge associated with the statistical modeling of triatomine vector infestation is the use of mixed data types. Not all statistical methods can accept real, ordinal, and nominal input data together; as a result, real-valued features are often converted to bins that each represent a range. For instance, the number of chickens might be binned into categories (e.g., 0, 1–3, 4–10, etc.), when in reality the number of features needed to represent all possible ranges for the number of chickens could be very large. As a result, Bustamante et al. (2015) used expert knowledge to create four binned classes to represent the ranges of chickens in the house. While there is nothing wrong with employing expert knowledge, especially when there is no other reasonable alternative, a posteriori tinkering with features can reinforce preconceived researcher hypotheses.
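The concern that univariate screening can miss epistatic features can be illustrated with a small synthetic sketch. It is not drawn from any of the cited studies: the two binary features (feature_a, feature_b), the balanced design of 100 houses per combination, and the purely interactive (XOR-type) outcome are all hypothetical assumptions chosen to show the extreme case in which neither feature has any marginal effect.

```python
from itertools import product

def odds_ratio(feature, outcome):
    """Univariate odds ratio from a 2x2 table (0.5 added to every cell)."""
    cells = {(x, y): 0.5 for x, y in product((0, 1), repeat=2)}
    for x, y in zip(feature, outcome):
        cells[(x, y)] += 1
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])

# Hypothetical balanced design: 100 houses for each combination of two
# binary features (assumption made purely for illustration).
houses = [(a, b) for a, b in product((0, 1), repeat=2) for _ in range(100)]
feature_a = [a for a, _ in houses]
feature_b = [b for _, b in houses]

# Purely interactive outcome: a house is "infested" only when exactly one
# of the two features is present (XOR), so neither has a marginal effect.
infested = [a ^ b for a, b in houses]

# Univariate screening: each feature is examined on its own.
or_a = odds_ratio(feature_a, infested)
or_b = odds_ratio(feature_b, infested)

# Joint view: infestation rate within each combination of the two features.
joint_rate = {
    combo: sum(y for h, y in zip(houses, infested) if h == combo)
    / houses.count(combo)
    for combo in product((0, 1), repeat=2)
}

print("univariate odds ratios:", or_a, or_b)
print("joint infestation rates:", joint_rate)
```

In this constructed case both univariate odds ratios equal 1, so a univariate filter would discard both features, even though the pair together determines the outcome completely. Real survey data are rarely this extreme, but the same mechanism applies whenever an effect is carried mainly by the interaction.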
Finally, there is evidence of heterogeneity in modeling infestation with triatomine vectors of Chagas disease, since both Bustamante, De Urioste-Stone, Juárez, and Pennington (2014) and Bustamante et al. (2015) found no statistical support for a single best model of infestation. Thus, any successful statistical modeling tool would need to account for model heterogeneity. The goals of this manuscript are twofold. The first is to call attention to this neglected disease and the large number of risk factors that contribute to its transmission. The second is to present a preliminary proof of concept of a recently developed evolutionary algorithm (Hanley, Eppstein, Buzas, & Rizzo, 2016; Hanley, Rizzo, Buzas, & Eppstein, 2017) designed to mine “Big” survey data for the most important risk factors associated with T. dimidiata infestation using georeferenced