High-Performance Analytics Organization
It is useful to think of big data technologies in four dimensions.
1. Structured data management:
Tools for managing high-volume structured data (for instance, clickstream data
or machine or sensor data) are an important part of any big data technology stack.
2. Unstructured data management:
The explosion in data volumes has to a large extent been a result of the rise in human information, which typically comprises social media data, videos, pictures and even text data from customer support logs. Tools and technologies to manage, analyze and make sense of this data stream are critical to building understanding and to correlating it with other forms of structured data.
3. Analytics environment: Combining both structured and unstructured data, at scale, requires specialized tools and technologies to merge these data sets and to run analytical algorithms on them (see the sketch after this list). Concepts such as in-database and in-memory analytics have greatly enhanced the ability to use large data sets for analysis at near real-time speeds and to embed the analytics environment within, for example, structured data management tools.
4. Visualization: Intuitive representation of data and of analysis results is a critical final component of the big data technology stack. This speeds up how quickly results are understood and insights are derived. Tools and technologies
that allow for quick drill-down and investigative analysis are now pervasive and easily integrated into the analytics stack [2].
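To make the third dimension concrete, the following is a minimal sketch of merging a structured clickstream table with unstructured support-log text and running a simple aggregation. The column names and sample records are hypothetical, not taken from the article.

```python
# A minimal sketch of combining structured and unstructured data for analysis.
# All field names and sample records are hypothetical, for illustration only.
import pandas as pd

# Structured clickstream data: one row per page view.
clicks = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "page": ["home", "pricing", "home", "support"],
    "duration_sec": [12, 45, 8, 30],
})

# Unstructured data: free-text customer support logs.
logs = pd.DataFrame({
    "customer_id": [101, 103],
    "note": ["cannot find pricing details", "checkout keeps failing"],
})

# Derive a simple structured feature from the unstructured text ...
logs["mentions_pricing"] = logs["note"].str.contains("pricing")

# ... then merge both sources on a common key and aggregate.
merged = clicks.merge(logs, on="customer_id", how="left")
print(merged.groupby("page")["duration_sec"].mean())
```

The same pattern, deriving structured features from free text and joining on a common key, is what in-database and in-memory analytics environments carry out at far larger scale.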
Most tools designed for data mining or conventional statistical analysis are not optimal for large data sets. A common hurdle for analytics organizations trying to leverage big data analytics is the availability of big data technologies and platforms. Organizations usually start by using open source technologies to gain experience and expertise, and the big data analytics space, thankfully, provides many open source options.
For example, Hadoop is a good starting place for managing large data sets at scale. Combining it with a NoSQL database such as HBase provides a good first step toward getting a feel for handling large data sets. Hadoop ecosystem tools such as Hive, Pig and Sqoop also let data scientists get a feel for querying and analyzing large data sets. R is an open source programming language and software environment designed for statistical computing and visualization [3]. For visualization, tools like d3.js allow for creative and varied visualizations that help data scientists present results in an intuitive way.
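As a first hands-on exercise with Hadoop itself, one common entry point (not prescribed by the article) is a Hadoop Streaming job whose mapper and reducer are plain Python scripts reading standard input. The sketch below counts hits per page in a clickstream log; the assumption that the page URL sits in the first tab-separated field is hypothetical.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal Hadoop Streaming mapper (a sketch; the log layout,
# with the page URL as the first tab-separated field, is an assumption).
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    page = line.split("\t")[0]
    # Emit "page<TAB>1"; Hadoop sorts these by key before the reducer runs.
    print(f"{page}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per page. Hadoop Streaming delivers mapper
# output sorted by key, so a running total per key is sufficient.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page != current_page and current_page is not None:
        print(f"{current_page}\t{count}")
        count = 0
    current_page = page
    count += int(value)
if current_page is not None:
    print(f"{current_page}\t{count}")
```

Because both scripts only read stdin and write stdout, they can be tested locally before submitting to a cluster, for example with: cat clicks.tsv | python3 mapper.py | sort | python3 reducer.py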
The challenge with using open source technologies, though, is twofold. One, integrating these with a legacy enterprise