WORKING WITH DATA
Missing values
The origin, detection, treatment and consequences of
missing values in analytics.
BY GERHARD SVOLBA
Missing values, and the question of how to
deal with them, are an inevitable problem for statisticians,
data miners and anyone else working with analytical data. Missing values in
the data create uncertainty for the analyst
and the information consumer because decisions need to be made without having the
full picture. Missing values should trigger a
discussion about randomness and systematic patterns, as they might introduce more
fuzziness and/or bias into the picture.
Missing values can also reduce the
number of usable records for the analysis,
or force analysts to eliminate variables from
the analysis. This happens for a technical
reason, since many analytical methods
ANALYTICS-MAGAZINE.ORG
such as regression techniques, neural networks or cluster algorithms cannot deal
with missing values, as they require numeric values to be used in a mathematical
equation. Consequently, if an observation
has a missing value in any of the required
variables, the whole observation (data record) needs to be omitted from the analysis.
Other options are to exclude the affected
variable from the analysis as a whole or to insert imputed values for the missing data
points. (In descriptive statistics or
with decision trees, however, missing values can simply be represented as a separate category.)
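The options named above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not part of the article: the toy records, the variable names, the use of `None` to mark a missing value, and the choice of mean imputation as one simple way to "insert imputed values".

```python
from statistics import mean

# Hypothetical toy data; None marks a missing value.
records = [
    {"age": 34,   "income": 52000, "region": "north"},
    {"age": None, "income": 61000, "region": "south"},
    {"age": 45,   "income": None,  "region": None},
    {"age": 29,   "income": 48000, "region": "north"},
]
numeric_vars = ["age", "income"]


def complete_cases(rows, variables):
    """Omit every record that has a missing value in any required variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def drop_variable(rows, variable):
    """Exclude one variable from the analysis as a whole."""
    return [{k: v for k, v in r.items() if k != variable} for r in rows]


def impute_mean(rows, variable):
    """Replace missing values with the mean of the observed values (assumed method)."""
    fill = mean(r[variable] for r in rows if r[variable] is not None)
    return [{**r, variable: fill if r[variable] is None else r[variable]} for r in rows]


def missing_as_category(rows, variable, label="MISSING"):
    """Represent missing values of a categorical variable as a separate category."""
    return [{**r, variable: label if r[variable] is None else r[variable]} for r in rows]


usable = complete_cases(records, numeric_vars)
print(len(records), "records,", len(usable), "usable after omitting incomplete ones")
```

In this toy data set, two missing values spread over two records remove half of the observations, which is the loss of usable records described earlier. Mean imputation keeps all records but is only one of many possible imputation methods.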
Along with the technical consideration
of what to do with missing values, it is also
important, from a business point of view, to
WWW.INFORMS.ORG