High-Performance Analytics Organization
It is useful to think of big data technologies in four dimensions.
1. Structured data management:
Tools for managing high-volume structured data (for instance, clickstream data
or machine or sensor data) are an important part of any big data technology stack.
2. Unstructured data management:
The explosion in data volumes has to a large extent been a result of the rise in human information, which typically comprises social media data, videos, pictures and even text data from customer support logs. Tools and technologies to manage, analyze and make sense of this data stream are critical to building understanding and to correlating it with other forms of structured data.
3. Analytics environment: Combining both structured and unstructured data, at scale, requires specialized tools and technologies to merge these data sets and to run analytical algorithms on them (see the sketch after this list). Concepts such as in-database and in-memory analytics have greatly enhanced the ability to use large data sets for analysis at near real-time speeds and to embed the analytics environment within, for example, structured data management tools.
4. Visualization: Intuitive representation of data and of analysis results is a critical final component of the big data technology stack. This speeds up how quickly results are understood and insights are derived. Tools and technologies
that allow for quick drill-down and investigative analysis are now pervasive and easily integrated into the analytics stack [2].
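To make the third dimension concrete, the following is a minimal sketch of merging a structured clickstream table with unstructured support-log text and running a simple aggregation. The column names and sample records are hypothetical, not taken from the article.

```python
# A minimal sketch of combining structured and unstructured data for analysis.
# All field names and sample records are hypothetical, for illustration only.
import pandas as pd

# Structured clickstream data: one row per page view.
clicks = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "page": ["home", "pricing", "home", "support"],
    "duration_sec": [12, 45, 8, 30],
})

# Unstructured data: free-text customer support logs.
logs = pd.DataFrame({
    "customer_id": [101, 103],
    "note": ["cannot find pricing details", "checkout keeps failing"],
})

# Derive a simple structured feature from the unstructured text ...
logs["mentions_pricing"] = logs["note"].str.contains("pricing")

# ... then merge both sources on a common key and aggregate.
merged = clicks.merge(logs, on="customer_id", how="left")
print(merged.groupby("page")["duration_sec"].mean())
```

The same pattern, deriving structured features from free text and joining on a common key, is what in-database and in-memory analytics environments carry out at far larger scale.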
Most tools designed for data mining or conventional statistical analysis are not optimal for large data sets. A common hurdle for analytics organizations trying to leverage big data analytics is the availability of big data technologies and platforms. Organizations usually start by using open source technologies to gain experience and expertise, and the big data analytics space, thankfully, provides many open source options.
For example, Hadoop is a good starting place for managing large data sets at scale. Combining it with a NoSQL database such as HBase provides a good first step toward getting a feel for handling large data sets. Hadoop ecosystem tools such as Hive, Pig and Sqoop also let data scientists get a feel for querying and analyzing large data sets. R is an open source programming language and software environment designed for statistical computing and visualization [3]. For visualization, tools like d3.js allow for creative and varied visualizations that help data scientists present results in an intuitive way.
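As a first hands-on exercise with Hadoop itself, one common entry point (not prescribed by the article) is a Hadoop Streaming job whose mapper and reducer are plain Python scripts reading standard input. The sketch below counts hits per page in a clickstream log; the assumption that the page URL sits in the first tab-separated field is hypothetical.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal Hadoop Streaming mapper (a sketch; the log layout,
# with the page URL as the first tab-separated field, is an assumption).
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    page = line.split("\t")[0]
    # Emit "page<TAB>1"; Hadoop sorts these by key before the reducer runs.
    print(f"{page}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per page. Hadoop Streaming delivers mapper
# output sorted by key, so a running total per key is sufficient.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page != current_page and current_page is not None:
        print(f"{current_page}\t{count}")
        count = 0
    current_page = page
    count += int(value)
if current_page is not None:
    print(f"{current_page}\t{count}")
```

Because both scripts only read stdin and write stdout, they can be tested locally before submitting to a cluster, for example with: cat clicks.tsv | python3 mapper.py | sort | python3 reducer.py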
The challenge with using open source technologies, though, is twofold. One, integrating these with a legacy enterprise