and are widely used for prediction and
forecasting.
The most common regression methods, such as "ordinary least squares" and "maximum likelihood estimation," require that the number of variables be less than the number of observations. In a big data environment, where new data sets are continually being incorporated, the number of independent variables available often greatly exceeds the number of observations.
of observations. A case in point is the
study of genes, where the different types
of genes are the independent variables
and the number of patients in a study is
the observations. Another good example
is texture classification of images where
the variables are the pixels and observations are the number of images available
for observation.
In addition, the analyst has to address some important questions. For example, do the new variables really improve the accuracy of the prediction? In general, not all variables contribute to improved model accuracy; typically, only a few of the many potentially influential factors account for most of the variation.
To handle the variable-selection complexity brought about by the growing number of data sets available for analysis through big data techniques, a few methods have gained attention and adoption, such as subset selection for regression, penalized regression, Biglm, Revolution R, Distributed R (Vertica), and the split-and-conquer approach.
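As a concrete illustration of penalized regression in this "more variables than observations" setting, the short Python sketch below fits a lasso model to synthetic data. The use of scikit-learn, the synthetic data, and the parameter values are illustrative assumptions, not tooling prescribed by this article; the point is that the L1 penalty drives most coefficients to exactly zero, selecting the few influential variables.

# Sketch: lasso (penalized) regression where variables outnumber observations.
# scikit-learn, the synthetic data, and alpha=0.1 are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_obs, n_vars = 50, 500               # far more variables than observations
X = rng.standard_normal((n_obs, n_vars))

# Only a handful of variables actually drive the response,
# mirroring the claim that a few factors explain most variation.
true_coef = np.zeros(n_vars)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ true_coef + 0.1 * rng.standard_normal(n_obs)

# The L1 penalty shrinks most coefficients exactly to zero,
# performing variable selection where OLS would be underdetermined.
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(model.coef_))

On data like this, the fitted model typically retains only the few truly influential coefficients, which is why penalized methods remain usable in settings where ordinary least squares cannot even be computed.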
ANALYTICS TECHNIQUE:
CLUSTERING. Segmentation, using
clustering techniques, is a common method used to reveal the natural structure of data.
Cluster analysis involves dividing the data into useful and meaningful groups, where objects in one group (called a cluster) are more similar to each other than to those in other groups.
In general, a clustering technique should have the following characteristics to be suitable for use in a big data environment: it should be able to capture clusters of various shapes and sizes, treat outliers effectively, and execute efficiently on large data sets.
Most partitional and hierarchical methods that rely on centroid-based approaches to clustering do not work well on large data sets where the underlying data contains clusters of different sizes and geometries.
Techniques such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [1] can help find clusters with arbitrary shapes. It works by determining dense regions of points: any point whose neighborhood within a given radius contains at least a minimum number of points seeds a cluster, clusters grow outward through such dense neighborhoods, and points in sparse regions are labeled as noise.
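To make the contrast with centroid-based methods concrete, the Python sketch below runs DBSCAN on synthetic crescent-shaped data. scikit-learn, the two-moons data set, and the eps/min_samples values are illustrative assumptions rather than anything specified in this article.

# Sketch: DBSCAN recovering arbitrarily shaped clusters and flagging noise.
# scikit-learn, make_moons, and the parameter values are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that a centroid-based method such as
# k-means would split incorrectly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the number of points a
# neighborhood must contain to count as dense.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN assigns the label -1 to noise points (outliers).
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))

With these settings DBSCAN typically finds the two crescents as separate clusters, which is exactly the kind of arbitrary-shape structure that centroid-based partitioning misses.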