REAL-TIME TEXT ANALYTICS
are mentioned in such media need to
remain current on relevant discussions
and be able to track the sentiment of every employee, customer and investor. To
address this challenge, a cloud-based
real-time ecosystem was created for analyzing comments, reviews and opinions
mined from Twitter. In addition, tracking
trending themes in the customer space
and the evolution of these trends over
time was incorporated.
TEXT MINING ALGORITHMS
Topic modeling. Topic models are statistical techniques that analyze words/phrases in textual data to understand the main themes running through them. This model algorithm is based on LDA (latent Dirichlet allocation) and uses the observed words in tweets (extracted from Twitter) to infer the hidden topic structure. LDA is more easily understood by its generative process. This generative process defines a joint probability distribution over the observed (the words) and hidden (the topics) random variables. This joint distribution is used to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is called the posterior distribution.
A topic is assumed to be a collection of words with different probabilities of occurrence. An individual tweet can be assumed as generated from multiple topics in different proportions. Now every word generated in a tweet can be randomly chosen in a two-step process:
• First, a topic is randomly selected from the distribution of topics.
• Second, the chosen word is randomly selected from the distribution of words over that topic.
So, the joint probability distribution of word W and topic T is:
Probability (W, T) = Probability (T) * Probability (W | T)
Now when the individual probability of occurrence of a word is known (because it has already occurred in the tweet), the posterior distribution is calculated as follows:
Probability (T | W) = Probability (W, T) / Probability (W)
Given the probabilities of observed words, latent information like the vocabulary distribution of a topic and the distribution of topics over the tweet are thus inferred.
Sentiment analysis. A holistic lexicon-based algorithm is used to analyze individual feature-level sentiments as well as cumulative sentiments over tweets.
Aggregating opinions for a feature: The algorithm parses one tweet at a time identifying the features present. A set of opinion words for each feature is identified using a lexicon. An orientation score
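
The two-step process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic names, vocabularies and probabilities are invented for the example, and the topic proportions of the tweet are fixed by hand, whereas full LDA would draw them from a Dirichlet prior.

```python
import random

# Hypothetical topic proportions for one tweet (illustrative values only).
TOPIC_PROBS = {"service": 0.7, "pricing": 0.3}

# Hypothetical distribution of words over each topic (illustrative values only).
WORD_PROBS = {
    "service": {"support": 0.5, "delay": 0.3, "refund": 0.2},
    "pricing": {"cheap": 0.4, "refund": 0.4, "fee": 0.2},
}


def draw(distribution, rng):
    """Draw one outcome from a {outcome: probability} mapping."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_word(rng):
    """Step 1: select a topic. Step 2: select a word from that topic."""
    topic = draw(TOPIC_PROBS, rng)
    word = draw(WORD_PROBS[topic], rng)
    return topic, word


def generate_tweet(n_words, seed=0):
    """Generate n_words (topic, word) pairs with a seeded generator."""
    rng = random.Random(seed)
    return [generate_word(rng) for _ in range(n_words)]
```

Calling `generate_tweet(5)` returns five (topic, word) pairs. In practice only the words are observed; the topic behind each word is the hidden variable that the model has to infer.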
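
As a small worked illustration of the posterior calculation above, the sketch below computes Probability (T | W) for a single observed word. The topics and probabilities are hypothetical, and Probability (W) is obtained by summing the joint probability over all topics, which is the usual marginalization step rather than something spelled out in the text.

```python
# Hypothetical topic proportions and word distributions (illustrative only).
TOPIC_PROBS = {"service": 0.7, "pricing": 0.3}
WORD_GIVEN_TOPIC = {
    "service": {"support": 0.5, "delay": 0.3, "refund": 0.2},
    "pricing": {"cheap": 0.4, "refund": 0.4, "fee": 0.2},
}


def joint(word, topic):
    """Probability(W, T) = Probability(T) * Probability(W | T)."""
    return TOPIC_PROBS[topic] * WORD_GIVEN_TOPIC[topic].get(word, 0.0)


def posterior(word):
    """Probability(T | W) = Probability(W, T) / Probability(W) for every topic."""
    joints = {topic: joint(word, topic) for topic in TOPIC_PROBS}
    p_word = sum(joints.values())  # Probability(W), summed over all topics
    if p_word == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every topic")
    return {topic: value / p_word for topic, value in joints.items()}
```

For the word "refund", which both hypothetical topics can produce, the joint probabilities are 0.7 * 0.2 = 0.14 and 0.3 * 0.4 = 0.12, so `posterior("refund")` assigns 0.14 / 0.26 to "service" and 0.12 / 0.26 to "pricing".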
36 | ANALYTICS-MAGAZINE.ORG | WWW.INFORMS.ORG