Data Analysis - Precautions
Gautam Banerjee
[email protected]
The combination of large data sets and observational data means that data analysis is often at risk of
drawing misleading inferences. In this blog we describe five of these dangers. Within the operations research and
analytics community these problems are well understood. Unfortunately, the solutions are not easy: they often
temper enthusiasm and require accepting that rather more complex models are necessary.
Hidden nuances of data collection - Often the analyst is not aware of how the data were collected or of
the hurdles and bottlenecks involved in collecting them. A detailed understanding of the collection process
makes the analysis more complete and helps in handling observations that might otherwise lead to
diametrically opposite inferences. Suppose, for example, that we are studying telescope observations of
galaxies to estimate the distance of space objects from Earth. One of the primary assumptions here
is that the farther an object is from Earth, the dimmer it looks through the telescope. However,
interstellar and intergalactic gas and dust clouds may attenuate the radiation, so an object can look dimmer
than its distance alone would imply, which violates the primary assumption. Of course, this phenomenon is
known to scientists, but things get messier in situations where the data collection process is not so well
understood or, even worse, where the possibility of such effects is not considered.
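As a rough illustration of the telescope example, the sketch below uses the standard distance modulus
relation, m - M = 5 log10(d) - 5 + A, where d is the distance in parsecs and A is the dimming caused by dust,
in magnitudes. The function name and every numerical value are assumptions chosen for illustration; none of
them come from a real observation.

```python
def distance_parsecs(apparent_mag, absolute_mag, extinction_mag=0.0):
    """Solve m - M = 5*log10(d) - 5 + A for the distance d in parsecs."""
    return 10 ** ((apparent_mag - absolute_mag - extinction_mag + 5) / 5)


# Hypothetical object: absolute magnitude 0, observed apparent magnitude 11,
# of which 1 magnitude of dimming is assumed to come from dust.
naive = distance_parsecs(11.0, 0.0)           # dust ignored
corrected = distance_parsecs(11.0, 0.0, 1.0)  # dust accounted for

print(f"dust ignored:   {naive:.0f} pc")
print(f"dust corrected: {corrected:.0f} pc")
```

With these made-up numbers, ignoring one magnitude of dimming inflates the estimated distance by a factor of
10^0.2, or roughly 58 percent. The error arises from the collection process alone, not from the analysis.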
Process change or Nonstationarity - Inferences from data analysis are not valid if there is any
change in the underlying process or pattern of events. For example, a model of herbivorous animals'
browsing propensity on grasslands will be quite useless when the animal population declines rapidly.
Any historical data on browsing wil