How to Handle Missing Data?
There are many ways to handle missing data. The process of estimating missing values in a data set is called “imputation”. Data imputation is well known and well studied in the literature, and this blog post is probably not enough to discuss all of the methods. For more information on missing value imputation, please look at some of the following articles and presentations:
http://www.stat.columbia.edu/~gelman/arm/missing.pdf
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
Use a global constant to fill in the missing value
Use the attribute (feature) mean/mode/median for all samples of the same class to fill in the missing value
KNN (k-Nearest Neighbors) imputation
Expectation–maximization (EM) algorithm
Inference-based methods such as regression
Bayesian methods
Decision tree based methods such as Random Forest, CART, etc.
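As a rough illustration of the first two items in this list, here is a minimal sketch in plain Python. The toy data, the function names `fill_constant` and `fill_class_mean`, and the use of `None` as the missing-value marker are all assumptions made for this example, not part of any particular library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (class label, feature value); None marks a missing value.
rows = [
    ("a", 1.0),
    ("a", None),
    ("a", 3.0),
    ("b", 10.0),
    ("b", None),
]

def fill_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(label, constant if value is None else value) for label, value in rows]

def fill_class_mean(rows):
    """Replace each missing value with the mean of the observed values in its class."""
    observed = defaultdict(list)
    for label, value in rows:
        if value is not None:
            observed[label].append(value)
    class_means = {label: mean(values) for label, values in observed.items()}
    return [
        (label, class_means[label] if value is None else value)
        for label, value in rows
    ]

constant_filled = fill_constant(rows, -1.0)
mean_filled = fill_class_mean(rows)
```

Swapping `mean` for `median` or `mode` from the same `statistics` module gives the other two variants. Note that this sketch raises a `KeyError` if a class has no observed values at all; a real pipeline would need a fallback for that case.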
http://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf
Many methods exist to overcome missing value problems, but choosing the right one is critical. Each method has pros and cons. To choose well, we need to understand the type of data, the source of the data, and the questions we want to answer with the data (kindly look at my previous blog post, “Data, big data and data driven decision making strategy: Part 1”). Data imputation is a good way to get a clean data set, but we should not introduce “bias” in the process of doing it.
http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf
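One concrete way imputation can introduce bias: filling every gap with the mean makes the data look less spread out than the observed values suggest, because each filled-in value sits exactly at the mean. The sketch below uses made-up numbers and `None` as the missing-value marker; both are assumptions for illustration only.

```python
from statistics import mean, pvariance

# Hypothetical sample; None marks a missing value.
observed = [4.0, None, 6.0, 8.0, None]

# Mean imputation: fill every gap with the mean of the observed values.
observed_values = [v for v in observed if v is not None]
fill = mean(observed_values)
imputed = [fill if v is None else v for v in observed]

# Compare the spread before and after imputation.
variance_observed = pvariance(observed_values)
variance_imputed = pvariance(imputed)
```

Here `variance_imputed` comes out smaller than `variance_observed`, since the imputed points add nothing to the sum of squared deviations but still count in the denominator.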
Methods for data imputation
Data imputation may include the following methods (the list is not exhaustive):
Deletion: Delete observations where any of the variables is missing
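A minimal sketch of this kind of deletion (often called listwise or complete-case deletion) in plain Python follows. The records, the field names, and the helper `drop_incomplete` are hypothetical and exist only for this example.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "city": "Austin"},
    {"age": None, "income": 61000, "city": "Boston"},
    {"age": 45, "income": None, "city": None},
    {"age": 29, "income": 48000, "city": "Denver"},
]

def drop_incomplete(records):
    """Keep only the observations in which no variable is missing."""
    return [r for r in records if all(v is not None for v in r.values())]

complete_cases = drop_incomplete(records)
```

Only the first and last records survive here. This shows the cost of deletion: half of this small sample is lost, including values that were actually observed in the dropped records.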
In summary: