Analytics Magazine, January/February 2014 | Page 60

MISSING VALUES

Figure 1: Distribution of variable “age” in a customer database.

Thus, the field “date of birth” is missing in the customer database of her phone provider, and we can assume she is not the only customer with a missing value. If an analyst now looks at the distribution of the variable “age” in this customer database, he might get a histogram as shown in Figure 1. Additionally, he will see that 9.1 percent of the values are missing.

The question is how to treat these missing values.
• Should the mean be used as the imputation value?
• Should different imputation values be sampled from the actual distribution?

In our case, we can assume that the true age values for Aunt Susanne and her friends are not distributed over the whole range of values. After a certain year, it became mandatory to provide the date of birth with new contracts. So the missing values will mostly occur in a certain age segment (the older customers) and probably also in a certain behavior segment (those who did not change their contract type). In the Figure 2 histogram, the true distribution of the unknown age values is shown in red.

We realize that we would make a wrong assumption if we treated the missing values as random, because there is a systematic pattern behind them. In order to assess such a situation correctly, business and process knowledge is needed. This know-how is also important for formulating an adequate imputation rule, as the imputation values should be from the