Goal-Driven Analytics
organizing, extracting and even visualizing rapid streams of data is essentially
a cost center activity. Only when content
of value is operationalized into active decisioning and measured for impact will
big data’s liability be converted into an
intelligence asset. Big data’s recovery up
the Hype Cycle [1] “Slope of Enlightenment” will come in the form of actionable
analytics for automated decision-making
at the operational level and proactive
recommendations at the strategic level.
Size and Success Don’t Correlate
Big data enthusiasts are finding that
the more data they collect, the harder
it becomes to understand just what the
data is telling them. Most practitioners are surprised to learn how little
data is required to build a highly effective goal-driven model. What matters
is not having a lot of data, but having a valid sample of data to support
the target objective.
For advanced analytics, it is far more
important for a database to be wide with
attributes or variables than long in transactions. Thanks to big data innovations,
more variables are being collected than
ever before. In fact, data dictionaries are
starting to be turned on their side to allow
vertical scrolling through a growing number of attributes.
analytics-magazine.org
Only variables that have no relationship to the target objective should be
excluded. A development model will automatically rank the limited set of variables
that have predictive value toward the objective. The remainder can be eliminated
from the final model and potentially from
the analytic sandbox.
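
The ranking step described above can be illustrated with a minimal sketch. It assumes a numeric target and uses the absolute Pearson correlation as a simple stand-in for whatever ranking a real development model would produce. The function names, the `min_strength` cutoff and the synthetic variables (`spend`, `tenure`, `noise`) are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_variables(columns, target, min_strength=0.1):
    """Rank variables by |correlation| with the target.

    Returns (kept, dropped): kept is a list of (name, score) sorted from
    strongest to weakest; dropped lists the names below min_strength.
    The cutoff is an illustrative assumption, not a standard value.
    """
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(name, score) for name, score in ranked if score >= min_strength]
    dropped = [name for name, score in ranked if score < min_strength]
    return kept, dropped


# Synthetic data: the target depends on spend and tenure, but not on noise.
rng = random.Random(0)
n_records = 2000
spend = [rng.gauss(0, 1) for _ in range(n_records)]
tenure = [rng.gauss(0, 1) for _ in range(n_records)]
noise = [rng.gauss(0, 1) for _ in range(n_records)]
target = [2.0 * s + 0.5 * t + rng.gauss(0, 1) for s, t in zip(spend, tenure)]

kept, dropped = rank_variables(
    {"spend": spend, "tenure": tenure, "noise": noise}, target
)
for name, score in kept:
    print(f"{name}: {score:.3f}")
print("candidates for removal:", dropped)
```

A correlation screen only captures linear, one-variable-at-a-time relationships, so in practice the ranking would more likely come from the model itself. The sketch is meant only to show the keep-or-eliminate decision.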
Only enough transactional data to
adequately represent the solution space
for the application at hand is required to
develop the model. There are standard
rules of thumb based on the final number
of attributes or dimensionality of the final
model that suggest the number of records
or transactions needed to derive the train,
test and validation data sets for model development. Most often, the required
number falls between 5,000 and 250,000 records – a mere quark
in the vast universe of big data.
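
As a rough sketch of the sizing idea above, the following draws a small random sample from a much larger store and splits it into train, test and validation sets. The article does not state a specific rule of thumb, so the `records_per_attribute` multiplier and the 60/20/20 split are assumptions made for illustration. Only the 5,000 and 250,000 bounds come from the text.

```python
import random


def required_records(n_attributes, records_per_attribute=1000,
                     floor=5_000, ceiling=250_000):
    """Estimate how many records to sample for model development.

    records_per_attribute is a hypothetical multiplier; the floor and
    ceiling mirror the 5,000 to 250,000 range quoted in the article.
    """
    return max(floor, min(ceiling, n_attributes * records_per_attribute))


def sample_and_split(record_ids, n_needed, fractions=(0.6, 0.2, 0.2), seed=42):
    """Draw n_needed records at random, then split into train/test/validation."""
    rng = random.Random(seed)
    n_needed = min(n_needed, len(record_ids))
    sample = rng.sample(record_ids, n_needed)
    n_train = round(n_needed * fractions[0])
    n_test = round(n_needed * fractions[1])
    train = sample[:n_train]
    test = sample[n_train:n_train + n_test]
    validation = sample[n_train + n_test:]  # whatever remains
    return train, test, validation


# One million stored transactions, of which only a small sample is needed.
all_ids = range(1_000_000)
needed = required_records(n_attributes=20)
train, test, validation = sample_and_split(all_ids, needed)
print(needed, len(train), len(test), len(validation))
```

Sampling record identifiers rather than full rows keeps the sketch cheap. The same identifiers could then be used to pull only the targeted rows into an analytic sandbox.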
But without a use plan for data, companies feel at risk if they do not harvest all possible
data. This digital hoarding overwhelms
analysis and motivates strategies for deriving streamlined analytic sandboxes.
The sandboxes draw targeted data for
goal-driven model development from the
vast stores of useless “dark data.”
Another way to limit data for more streamlined analytics is
to start with available structured data. In
most organizations, structured data holds
far more predictive value and requires far
less preparation labor than open text.
www.informs.org