Analytics Magazine, January/February 2014 | Page 61

Figure 2: True distribution of unknown age values shown in red.

interval 55 to 95 years rather than covering the whole age range. Calculating just the proportion of missing values per variable does little to uncover such situations. In this case we would simply have seen 9.1 percent missing in the age variable. Such an analysis only tells us which characteristics are most commonly infected by the “missing value disease” in the data.

For our purpose, we need a method that uncovers the relationship of the missing values to other variables or features of the customer. One method is to create an indicator variable “age missing YES/NO” and compare the distribution of other variables between these two groups. So we might see that the missing age values occur with “old contract types” or go along with a specific phone behavior (Aunt Susanne does not make international phone calls or use data traffic; she just phones her friends from time to time).

Another method that can efficiently uncover the structure of the missing values is to analyze the “missing value patterns” and show these patterns in tile charts. For each record in the data a string of 0s and 1s is created: “1” indicates a missing value for the respective variable, “0” a non-missing value. The first character of this string could stand for variable AGE, the next for variable GENDER, the third for variable DATA TRAFFIC. Thus a string of “101” would indicate that for this