DYNAMISM(E) - Biannual Student Magazine June-2017 | Page 8

How to Handle Missing Data?

There are many ways to handle missing data. The process of estimating missing values in a data set is called “imputation”. Data imputation is well known and well studied in the literature, and this blog is probably not enough to discuss all of the methods. For more information on missing value imputation, please look at some of the following articles/presentations:

http://www.stat.columbia.edu/~gelman/arm/missing.pdf
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
http://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf
http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf

Methods for data imputation

Data imputation may include the following methods (not exhaustive):

Deletion: delete observations where any of the variables is missing
Use a global constant to fill in the missing value
Use the attribute (feature) mean/mode/median for all samples of the same class to fill in the missing value
KNN (k-Nearest Neighbors) imputation
Expectation-maximization (EM) algorithm
Inference-based methods such as regression or Bayesian methods
Decision tree based methods like Random Forest, CART etc.

In summary:

Many methods exist to overcome missing value problems, but choosing the particular method is critical. Each method has pros and cons. We need to understand the type of data, the source of the data, and the questions we want to answer with the data (kindly look at my previous blog on “Data, big data and data driven decision making strategy: Part 1”). Data imputation is a good way to get clean data, but we should not introduce “bias” in the process of doing it.
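
To make the three simplest methods named above concrete (deletion, a global constant, and the mean of samples of the same class), here is a minimal sketch in plain Python. The toy data, the function names, and the choice of None as the missing-value marker are all assumptions made for illustration; they do not come from the article, and a real project would more likely use a data analysis library.

```python
from statistics import mean

# Hypothetical toy data: each row is (class label, feature value).
# None marks a missing value (an assumption of this sketch).
rows = [
    ("a", 1.0),
    ("a", None),
    ("a", 3.0),
    ("b", 10.0),
    ("b", None),
]


def delete_missing(rows):
    """Deletion: drop every row in which any field is missing."""
    return [r for r in rows if all(v is not None for v in r)]


def fill_constant(rows, constant=0.0):
    """Global constant: replace each missing value with the same constant."""
    return [(label, constant if value is None else value) for label, value in rows]


def fill_class_mean(rows):
    """Class-wise mean: replace each missing value with the mean of the
    observed values that share its class label.

    Note: a class with no observed values at all would raise a KeyError here;
    handling that case is left out to keep the sketch short.
    """
    observed = {}
    for label, value in rows:
        if value is not None:
            observed.setdefault(label, []).append(value)
    class_means = {label: mean(values) for label, values in observed.items()}
    return [
        (label, class_means[label] if value is None else value)
        for label, value in rows
    ]


deleted = delete_missing(rows)
constant_filled = fill_constant(rows, constant=-1.0)
mean_filled = fill_class_mean(rows)
```

On this toy data, deletion keeps three of the five rows, while the two fill methods keep all five. Each choice carries its own risk of the “bias” mentioned above: deletion discards information, and filling with a constant or a mean tends to make the data look less variable than it really is.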