Figure 2: True distribution of unknown age values shown in red.
interval 55 to 95 years rather than covering
the whole age range.
Calculating only the proportion of
missing values per variable does not help to uncover such situations. In
this case we would simply have seen 9.1
percent missing in the age variable. Such
an analysis only tells us which characteristics are most commonly infected by the
“missing value disease” in the data. For
our purpose, we need a method that
uncovers the relationship of the missing
values to other variables or features of
the customer.
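To make the limitation concrete, here is a minimal sketch of the per-variable calculation. The records, the variable names and the use of None as the missing marker are hypothetical and chosen only for illustration. The result is one number per variable and says nothing about which customers are affected.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "gender": "F", "data_traffic": 120.0},
    {"age": None, "gender": "M", "data_traffic": None},
    {"age": 71, "gender": None, "data_traffic": 0.0},
    {"age": None, "gender": "F", "data_traffic": 15.5},
]


def missing_proportion(rows, variable):
    """Share of rows in which `variable` is missing (None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[variable] is None) / len(rows)


# One proportion per variable, keyed by variable name.
proportions = {name: missing_proportion(records, name) for name in records[0]}

for name, share in proportions.items():
    print(f"{name}: {share:.1%} missing")
```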
One method is to create an indicator variable “age missing YES/NO” and
compare the distribution of other variables between these two groups. So we
might see that the missing age values
occur with “old contract types” or with a
specific phone behavior (Aunt Susanne
does not make international phone calls
or use data traffic; she just phones
her friends from time to time).
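The indicator-variable comparison might be sketched as follows. The customer records, the field names (contract, intl_calls) and the helper compare_by_missing are assumptions made for this example, not part of any real data set or library. The sketch uses only the Python standard library.

```python
from collections import Counter, defaultdict

# Hypothetical customer records; None marks a missing age.
customers = [
    {"age": 34, "contract": "new", "intl_calls": 12},
    {"age": None, "contract": "old", "intl_calls": 0},
    {"age": 52, "contract": "new", "intl_calls": 3},
    {"age": None, "contract": "old", "intl_calls": 0},
    {"age": 47, "contract": "old", "intl_calls": 5},
]


def compare_by_missing(rows, target, other):
    """Relative frequencies of `other` within the groups
    "target missing YES" and "target missing NO"."""
    counts = defaultdict(Counter)
    for row in rows:
        # The indicator variable: is the target value missing?
        indicator = "YES" if row[target] is None else "NO"
        counts[indicator][row[other]] += 1
    return {
        group: {value: n / sum(counter.values()) for value, n in counter.items()}
        for group, counter in counts.items()
    }


shares = compare_by_missing(customers, "age", "contract")
for group in ("YES", "NO"):
    print(f"age missing {group}: {shares[group]}")
```

In this toy data every record with a missing age has an old contract, while the records with a known age are mixed. A difference of this kind between the two groups is what the comparison is meant to reveal.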
Another method that can efficiently
be used to uncover the structure of the
missing values is to analyze the “missing
value patterns” and show these patterns
in tile charts. For each record in the data
a string of 0s and 1s is created: “1” indicates a missing value for the respective
variable, “0” a non-missing value. The first character of this string could stand for variable
AGE, the next for variable GENDER, the
third for variable DATA TRAFFIC. Thus a
string of “101” would indicate that for this
ANALYTICS | JANUARY/FEBRUARY 2014 | 61