Analytics Magazine, July/August 2014 | Page 36

REAL-TIME TEXT ANALYTICS

are mentioned in such media need to remain current on relevant discussions and be able to track the sentiment of every employee, customer and investor. To address this challenge, a cloud-based real-time ecosystem was created for analyzing comments, reviews and opinions mined from Twitter. In addition, tracking of trending themes in the customer space and the evolution of these trends over time was incorporated.

TEXT MINING ALGORITHMS

Topic modeling. Topic models are statistical techniques that analyze words/phrases in textual data to understand the main themes running through them. The algorithm used here is based on LDA (latent Dirichlet allocation) and uses the observed words in tweets (extracted from Twitter) to infer the hidden topic structure.

LDA is more easily understood through its generative process. This generative process defines a joint probability distribution over the observed (the words) and hidden (the topics) random variables. This joint distribution is used to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is called the posterior distribution.

A topic is assumed to be a collection of words with different probabilities of occurrence. An individual tweet can be assumed to be generated from multiple topics in different proportions. Every word in a tweet can then be chosen randomly in a two-step process:
• First, a topic is randomly selected from the distribution of topics.
• Second, the word is randomly selected from the distribution of words over that topic.

So the joint probability distribution of word W and topic T is:

Probability (W, T) = Probability (T) * Probability (W | T)

When the individual probability of occurrence of a word is known (because it has already occurred in the tweet), the posterior distribution is calculated as follows:

Probability (T | W) = Probability (W, T) / Probability (W)

Given the probabilities of observed words, latent information such as the vocabulary distribution of a topic and the distribution of topics over the tweet is thus inferred.

Sentiment analysis. A holistic lexicon-based algorithm is used to analyze individual feature-level sentiments as well as cumulative sentiments over tweets.

Aggregating opinions for a feature: The algorithm parses one tweet at a time, identifying the features present. A set of opinion words for each feature is identified using a lexicon. An orientation score
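
As a rough illustration of the two-step generative process and the posterior calculation described under topic modeling, the Python sketch below draws a word by first picking a topic and then picking a word from that topic, and then applies Probability (T | W) = Probability (W, T) / Probability (W). The topic names, the words and every probability in it are hypothetical values chosen for illustration; none of them come from the article. The sketch also assumes both distributions are already known, whereas a full LDA implementation has to estimate them from the tweets themselves.

```python
import random

# Hypothetical toy distributions (illustrative assumptions, not from the article).
# Probability(T): how likely each topic is within a tweet.
topic_probs = {"service": 0.6, "pricing": 0.4}

# Probability(W | T): how likely each word is within a given topic.
word_probs = {
    "service": {"support": 0.5, "slow": 0.3, "cheap": 0.2},
    "pricing": {"support": 0.1, "slow": 0.1, "cheap": 0.8},
}


def generate_word(rng):
    """Two-step draw: select a topic, then select a word from that topic."""
    topic = rng.choices(list(topic_probs), weights=list(topic_probs.values()))[0]
    words = word_probs[topic]
    word = rng.choices(list(words), weights=list(words.values()))[0]
    return topic, word


def joint(word, topic):
    """Probability(W, T) = Probability(T) * Probability(W | T)."""
    return topic_probs[topic] * word_probs[topic][word]


def posterior(word):
    """Probability(T | W) = Probability(W, T) / Probability(W) for every topic."""
    # Probability(W) is the joint probability summed over all topics.
    p_word = sum(joint(word, t) for t in topic_probs)
    return {t: joint(word, t) / p_word for t in topic_probs}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws
    print([generate_word(rng) for _ in range(5)])
    print(posterior("cheap"))
```

With these made-up numbers, observing the word "cheap" shifts most of the posterior weight to the "pricing" topic, because that topic assigns the word a much higher probability even though the topic itself is less common.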