MISSING VALUES
Figure 1: Distribution of variable “age” in a customer database.
Thus, the field “date of birth” is missing in
the customer database of her phone provider, and we can assume she is not the
only customer with a missing value.
If an analyst now looks at the distribution of variable “age” in this customer
database, he might get a histogram as
shown in Figure 1. Additionally, he will
see that 9.1 percent of the values are missing. The question is how to treat these
missing values.
• Should the mean be used as the imputation
value?
• Should different imputation values be
sampled from the actual distribution?
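The two options in these questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ages are made up, `None` stands for a missing date of birth, and the function names `impute_mean` and `impute_sampled` are hypothetical helpers, not part of any library.

```python
import random
import statistics

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def impute_sampled(values, seed=0):
    """Replace each missing value (None) with a random draw from the observed values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical ages; None marks a customer without a date of birth.
ages = [23, 35, None, 41, 58, None, 29, 47]

mean_filled = impute_mean(ages)
sampled_filled = impute_sampled(ages)

# Spread of the observed values versus the mean-imputed values.
observed_spread = statistics.pstdev([a for a in ages if a is not None])
mean_filled_spread = statistics.pstdev(mean_filled)
```

Mean imputation keeps the overall mean unchanged but shrinks the spread (`mean_filled_spread` is smaller than `observed_spread`), while sampling preserves the shape of the observed distribution. Both variants implicitly assume that the missing values behave like the observed ones.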
ANALYTICS-MAGAZINE.ORG
In our case, we can assume that the
true age value for Aunt Susanne and her
friends is not distributed over the whole
range of values. After a certain year it was
mandatory to provide the date of birth with
new contracts. So the missing values will
mostly occur for a certain age segment
(the older customers) and probably also for
a certain behavior segment (those who did
not change their contract type).
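One simple way to check for such a pattern is to compare the share of missing values across segments. The sketch below is a minimal illustration: the records, the field names `contract_type` and `date_of_birth`, and the helper `missing_rate_by_group` are all assumptions made for this example.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, field):
    """Return the share of records with a missing `field` for each value of `group_key`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical customer records; None marks a missing date of birth.
records = [
    {"contract_type": "legacy", "date_of_birth": None},
    {"contract_type": "legacy", "date_of_birth": None},
    {"contract_type": "legacy", "date_of_birth": "1948-03-02"},
    {"contract_type": "current", "date_of_birth": "1985-07-19"},
    {"contract_type": "current", "date_of_birth": "1992-11-30"},
    {"contract_type": "current", "date_of_birth": None},
    {"contract_type": "current", "date_of_birth": "1979-01-12"},
]

rates = missing_rate_by_group(records, "contract_type", "date_of_birth")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} missing")
```

If the rates differ strongly between segments, as they do in this made-up data, the missing values are unlikely to be random.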
In the Figure 2 histogram, the true distribution of the unknown age values is shown
in red. We would make a
wrong assumption if we treated the missing values as random, because
there is a systematic pattern behind them.
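The effect can be reproduced with a small, entirely hypothetical data set. The sketch assumes, for illustration only, that the date of birth became mandatory for contracts signed from the year 2000 on; the cutoff year, ages, and contract years are invented.

```python
import statistics

# Assumed cutoff year (illustration only): date of birth mandatory from here on.
MANDATORY_FROM = 2000

# Hypothetical customers as (true_age, contract_year).
customers = [
    (72, 1994), (68, 1996), (65, 1998), (61, 1999),
    (45, 2003), (38, 2006), (33, 2010), (29, 2014), (24, 2018), (41, 2008),
]

# The recorded age is missing (None) for contracts signed before the cutoff.
recorded = [age if year >= MANDATORY_FROM else None for age, year in customers]

observed = [a for a in recorded if a is not None]
true_missing = [age for (age, _), r in zip(customers, recorded) if r is None]

# Value that mean imputation would fill in for every missing age.
observed_mean = statistics.mean(observed)
# Average of the ages that are actually hidden behind the missing values.
true_missing_mean = statistics.mean(true_missing)

print(f"imputed value: {observed_mean}, true mean of missing ages: {true_missing_mean}")
```

In this made-up example the observed mean is 35, while the customers with missing values are 66.5 years old on average, so an imputation that treats the values as randomly missing would place the older customers in the wrong age segment.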
To assess such a situation correctly,
business and process knowledge is needed. This know-how is also important to
formulate an adequate imputation rule as
the imputation values should be from the
WWW.INFORMS.ORG