Phase 1: Questions
Are we asking the right questions? Questions
are critical before we do any type of data
analysis or analytics. Mistaking the type of question
being considered is the most common error in
data analysis. Please see:
http://science.sciencemag.org/content/early/2015/02/25/science.aaa6146.full
In this article, Jeff Leek and Roger D. Peng argue
for the importance of asking the right questions.
Here, I summarize some of the key questions
we might ask during our (big) data analytics
process.
11 key questions before you start analysing your
data
Data Source: What was the source of your data,
and how was the data collected? You can read more
on this topic here:
https://hbr.org/2015/10/the-two-questions-you-need-to-ask-your-data-analysts
Error structure: What types of error can you
expect at this stage? Did you consider the errors
associated with the data, such as human
error, sampling error or technical error?
Right Data: Is it the right data for your analysis?
There has already been a lot of discussion of this
aspect. Please see:
http://www.mckinsey.com/business-functions/business-technology/our-insights/three-keys-to-building-a-data-driven-strategy
Sample representation: How well do the sample
data represent the population? This is sometimes
hard to assess, but keeping it in the back of your
mind helps with downstream analysis.
Number of samples/objects/individuals (n): How
many samples (individuals/objects) are there?
The number of samples affects further
statistical analysis. We will discuss this later
on under “power analysis”.
Number of features/variables (p): How many
features are in the data set?
Number of samples (n) vs. number of features
(p): What is the size of your data set? Is it a p>n or
an n>p situation? If p>n (assuming features are in
columns and rows represent samples), we call the
data wide; otherwise we call it tall.
Features/Variables: Are you interested in finding
out which features, or groups of similar features,
are important for prediction? Do you have
dependent and/or independent variable(s)?
Samples: How are the samples related? Do you want
to find structure or patterns in the samples?
Software: Which software are you going to use?
R, Python or another scripting language, or an
object-oriented programming language like Java or C++?
Data types: What is the type of your data? Is it
qualitative or quantitative?
All of the above questions help you decide on the
right method(s) for the downstream data analytics
process.
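A few of these questions (n, p, wide vs. tall, and
data types) can be checked mechanically. Here is a
minimal Python sketch, assuming the data is held as a
list of dictionaries with one dictionary per sample.
The toy records and the helper names describe_shape
and column_kind are made up for illustration, and the
rule "all values numeric means quantitative" is a
simplifying assumption (numeric codes for categories
would fool it).

```python
# Hypothetical toy data set: each dict is one sample (row),
# each key is one feature (column).
rows = [
    {"age": 34, "height_cm": 172.5, "smoker": "no"},
    {"age": 51, "height_cm": 160.0, "smoker": "yes"},
    {"age": 29, "height_cm": 181.2, "smoker": "no"},
    {"age": 45, "height_cm": 168.4, "smoker": "yes"},
]


def describe_shape(rows):
    """Return (n, p, label); label is 'wide' if p > n, otherwise 'tall'."""
    n = len(rows)                      # number of samples
    p = len(rows[0]) if rows else 0    # number of features
    label = "wide" if p > n else "tall"
    return n, p, label


def column_kind(rows, name):
    """Label a column 'quantitative' if every value is a number,
    otherwise 'qualitative' (a simplifying assumption)."""
    values = [row[name] for row in rows]
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )
    return "quantitative" if is_numeric else "qualitative"


n, p, label = describe_shape(rows)
kinds = {name: column_kind(rows, name) for name in rows[0]}
print(n, p, label)
print(kinds)
```

With real data you would replace the toy records with
your own and judge borderline columns (IDs, coded
categories) by hand.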
Please share your views on this aspect. In the
next post, I will discuss phase 2 of the
data analytics process.
In this part, I will discuss initial
pre-processing steps, mainly “missing
value” treatment of the data set.
Why Data Pre-processing?
Data in the real world is dirty: it is noisy and
incomplete, lacking attribute values or lacking
certain attributes of interest. It has been said that
80% of the time in data analysis (data analytics) is
spent on cleaning and preparing the data.
So, it is important to make the data as error-free
as possible.
We will discuss some of the pre-processing /
data cleaning steps.
Missing value:
Missing values are a very common problem in data
analysis (or data analytics). Data can go missing
during a) data extraction, b) data collection
or c) data documentation.
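Before treating missing values, it helps to know how
many there are and where. The following is a minimal
Python sketch, assuming the data is a list of
dictionaries and that None, blank strings, NaN and
absent keys all count as missing. The records and the
helper names is_missing and missing_counts are
hypothetical; your own data may encode missing values
differently (for example as -999 or "NA").

```python
import math

# Hypothetical records; None, "" and NaN all stand in for a missing value.
records = [
    {"id": 1, "weight_kg": 70.2, "city": "Oslo"},
    {"id": 2, "weight_kg": None, "city": ""},
    {"id": 3, "weight_kg": float("nan"), "city": "Lima"},
    {"id": 4, "weight_kg": 82.0},  # "city" was never recorded
]


def is_missing(value):
    """True for None, empty or blank strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return isinstance(value, float) and math.isnan(value)


def missing_counts(records, columns):
    """Count missing entries per column; an absent key counts as missing."""
    return {
        col: sum(is_missing(rec.get(col)) for rec in records)
        for col in columns
    }


counts = missing_counts(records, ["id", "weight_kg", "city"])
print(counts)
```

A per-column count like this shows which features are
affected and how badly, which is useful to know before
choosing any treatment.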