
and are widely used for prediction and forecasting. Most common methods of regression, such as "ordinary least squares" and "maximum likelihood estimation," require that the number of variables be less than the number of observations. In a big data environment, where new data sets are continually being incorporated, the number of independent variables available often greatly exceeds the number of observations. A case in point is the study of genes, where the genes are the independent variables and the patients in the study are the observations. Another good example is texture classification of images, where the variables are the pixels and the observations are the images available for analysis.

In addition, the analyst has to address some very important questions. For example, do the new variables really help improve the accuracy of the prediction? In general, not all variables contribute to improved model accuracy; typically, only a few of the large number of potentially influential factors account for most of the variation. To handle the complexity of variable selection brought about by the growing number of data sets that big data techniques make available, a few methods have gained attention and adoption: subset selection for regression, penalized regression, Biglm, Revolution R and Distributed-R Vertica, and the split-and-conquer approach. (Illustrative sketches of two of these methods appear at the end of this section.)

ANALYTICS TECHNIQUE: CLUSTERING

Segmentation, using clustering techniques, is a common method for revealing the natural structure of data. Cluster analysis involves dividing the data into groups that are both useful and meaningful, where objects in one group (called a cluster) are more similar to each other than to those in other groups. In general, a clustering technique should have the following characteristics to be suitable for a big data environment: it should capture clusters of various shapes and sizes, treat outliers effectively, and execute efficiently on large data sets. Most partitional and hierarchical methods that rely on centroid-based approaches to clustering do not work very well on large data sets where the underlying data supports clusters of different sizes and geometries. Techniques such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [1] can help find clusters with arbitrary shapes. It works by determining dense regions of points: a point whose neighborhood of a given radius (eps) contains at least a minimum number of points (MinPts) becomes a core point, clusters are grown by linking core points that fall within one another's neighborhoods, and points belonging to no dense region are labeled as noise.
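A minimal sketch of this idea in R follows, using the dbscan function from the fpc package. The package choice, the simulated crescent-shaped data and the eps and MinPts values are illustrative assumptions, not details from the article; on real data these parameters would have to be tuned.

    # DBSCAN on two crescent-shaped clusters plus scattered noise,
    # a geometry that centroid-based methods handle poorly.
    library(fpc)

    set.seed(42)
    n <- 200
    theta <- runif(n, 0, pi)
    upper <- cbind(cos(theta), sin(theta)) +
      matrix(rnorm(2 * n, sd = 0.1), n, 2)
    lower <- cbind(1 - cos(theta), 0.4 - sin(theta)) +
      matrix(rnorm(2 * n, sd = 0.1), n, 2)
    pts <- rbind(upper, lower, matrix(runif(40, -1.5, 2.5), ncol = 2))

    db <- dbscan(pts, eps = 0.15, MinPts = 5)   # illustrative parameter values
    table(db$cluster)   # cluster 0 collects the points DBSCAN labels as noise
    plot(pts, col = db$cluster + 1L, pch = 19)

A centroid-based method such as k-means would typically split each crescent or merge the two; the density-based approach can recover both shapes while setting aside the scattered points as noise.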
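Returning to the regression discussion above, the sketch below illustrates penalized regression, one of the listed variable-selection methods, in a setting where the variables far outnumber the observations. The glmnet package and the simulated data are assumptions for illustration; alpha = 1 selects the lasso penalty, one common choice.

    # Lasso regression with 100 observations and 1,000 candidate variables,
    # of which only the first 5 truly influence the response.
    library(glmnet)

    set.seed(1)
    n <- 100; p <- 1000
    x <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(2, 5), rep(0, p - 5))
    y <- drop(x %*% beta + rnorm(n))

    cvfit <- cv.glmnet(x, y, alpha = 1)   # lambda chosen by cross-validation
    nonzero <- which(coef(cvfit, s = "lambda.min")[-1] != 0)
    length(nonzero)   # typically close to 5: the penalty discards the rest

This mirrors the point made earlier: of a large number of potentially influential factors, the penalty retains only the few that account for most of the variation.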
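Finally, the Biglm approach mentioned above fits a linear model in bounded memory by processing the data in chunks and updating the fit incrementally. A minimal sketch, with simulated chunks standing in for reads from a file or database:

    # Chunk-wise linear regression with the biglm package.
    library(biglm)

    make_chunk <- function(n) {   # stand-in for reading the next block of rows
      x1 <- rnorm(n); x2 <- rnorm(n)
      data.frame(x1, x2, y = 1 + 2 * x1 - x2 + rnorm(n))
    }

    fit <- biglm(y ~ x1 + x2, data = make_chunk(10000))   # fit on the first chunk
    fit <- update(fit, make_chunk(10000))                 # fold in the next chunk
    fit <- update(fit, make_chunk(10000))                 # ...and so on
    summary(fit)   # coefficients near the simulated values 1, 2 and -1

Because each chunk is discarded after it is folded in, the full data set never needs to fit in memory at once, which is what makes this style of fitting attractive in a big data environment.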