Topic Modeling using NMF and LDA with sklearn
Author: Shravan Kuchkula
Introduction
Topic modeling involves extracting features from document terms and using mathematical structures such as matrix factorization and SVD to generate clusters of terms that are distinguishable from each other; these clusters of words form topics or concepts. These concepts can be used to interpret the main themes of a corpus and to make semantic connections among words that frequently co-occur across documents. There are various frameworks and algorithms for building topic models. Here, I will explore two:
- Non-negative matrix factorization
- Latent Dirichlet Allocation
The pros and cons of using NMF and LDA are discussed in the context of analyzing 1500 movie reviews extracted from IMDB. Shown below is the high-level Topic Modeling workflow:
- 1500 movie reviews are sent through the NLP pipeline with the goal of normalizing the text.
- The normalized corpus is then fed into a term-frequency (count) vectorizer or a tf-idf vectorizer, depending on the algorithm.
- Topic modeling is performed using NMF and LDA.
- The topic modeling results are evaluated and visualized using pyLDAvis.
End Result
Shown below are the results of topic modeling with both NMF and LDA. These results show that there is some positive sentiment associated with James Bond movies. I will discuss this further down in the post.
NMF results
LDA results
Interactive plot showing results of K-means clustering, LDA topic modeling and Sentiment Analysis
By combining the results of Clustering, Topic Modeling and Sentiment Analysis, we can subjectively gauge how well our Topic Modeling has worked.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from normalization2 import *
pd.options.display.max_colwidth=500
import warnings
warnings.filterwarnings('ignore')
#some ipython magic to show the matplotlib plots inline
%matplotlib inline
Data Gathering and Normalization
A positive, a negative, and a neutral movie review were extracted for each of the top 500 Thriller movies on the IMDb website, so our corpus contains 1500 reviews. Each movie review in the corpus (a document) is sent through the following NLP pipeline to normalize the text:
- remove_hypens
- tokenize_text
- remove_special_characters
- convert to lower case
- remove stopwords
- lemmatize the token
- remove short tokens
- keep only words in wordnet
After sending each review through this pipeline, we have a list of normalized reviews that can be used for further analysis.
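As a quick illustration (a minimal sketch I added, not part of the original analysis), the cleanTextBooks helper from normalization2.py (listed in the Appendix) can be run on a single made-up review to see what the pipeline produces:
# toy example: run one made-up review through the normalization pipeline
sample = ["This isn't the best thriller I've seen, but the acting was really good!"]
print(cleanTextBooks(sample))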
# get the collection of reviews
# reviews for the first 250 movies
df = pd.read_pickle('userReviews4.pkl')
# reviews for the next 250 movies
df1 = pd.read_pickle('userReviews3.pkl')
# dataframe containing all 1500 reviews
df_reviews = pd.concat([df, df1], ignore_index=True)
# display the first few rows of df_reviews
display(df_reviews.head())
# user reviews
reviews = list(df_reviews.user_review)
 | movie | user_review_permalink | user_review | sentiment
---|---|---|---|---
0 | The Dark Knight | https://www.imdb.com/review/rw2081858/ | I'm so relieved that I'm not the only one who doesn't think this movie is great. I saw this first in the theater, and to be honest, probably would've given it a bit higher rating...if I would've reviewed it then. But........I just watched it again on cable and it doesn't hold up as well outside of the movie theater with it's huge screen, dark room and loud speakers. First off...what I liked. Heath Leadger was fantastic as the Joker and could've done so much more with it if the writing and pl... | neutral |
1 | The Dark Knight | https://www.imdb.com/review/rw1908115/ | We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... | positive |
2 | The Dark Knight | https://www.imdb.com/review/rw1922862/ | Jim Kunsler is not a regular film critic. He is an accomplished published author and highly educated in arts and literature. I subscribe to his blog, usually for his political writing.There are hardly any film critics around. The newspaper, magazine, internetz,IMDb & TV folks are just reviewers as opposed to critics. They mumble about good or bad, thumbs up, how much money is involved and other nonsense in what is an ignorant parody of Consumer Reports, as if the art of film were a microwave... | negative |
3 | Inception | https://www.imdb.com/review/rw2286063/ | I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten... | neutral |
4 | Inception | https://www.imdb.com/review/rw2276780/ | What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... | positive |
# apply the NLP pipeline
clean_reviews = cleanTextBooks(reviews)
# rejoin the tokens to form strings which will be used to vectorize
clean_reviews_text = [' '.join(item) for item in clean_reviews]
Vectorize the reviews
Since our goal is to explore LDA and NMF and see how each performs on the corpus, we will use both raw term frequencies (used by LDA) and term frequency–inverse document frequency weights (used by NMF). Thus, we will use CountVectorizer and TfidfVectorizer respectively.
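Before vectorizing the real corpus, here is a toy illustration (my own sketch, not from the original notebook) of the difference between the two weighting schemes: raw counts treat every term equally, while tf-idf down-weights terms that appear in most documents.
# toy illustration: raw counts vs. tf-idf weights on two tiny documents
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
toy_docs = ["bond action action", "bond story character"]
print(CountVectorizer().fit_transform(toy_docs).toarray())   # raw term counts
print(TfidfVectorizer().fit_transform(toy_docs).toarray())   # 'bond' is down-weighted since it appears in both docs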
# for LDA
from sklearn.feature_extraction.text import CountVectorizer
# for NMF
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorize the corpus
count_vectorizer = CountVectorizer(min_df=10, max_df=0.95, ngram_range=(1,1), stop_words='english')
tfidf_vectorizer = TfidfVectorizer(min_df=10, max_df=0.95, ngram_range=(1,1), stop_words='english')
# calculate the feature matrix
feature_matrix = count_vectorizer.fit_transform(clean_reviews_text)
tfidf_feature_matrix = tfidf_vectorizer.fit_transform(clean_reviews_text)
display(feature_matrix.shape)
display(tfidf_feature_matrix.shape)
(1500, 2516)
(1500, 2516)
Displaying the shape of the feature matrices indicates that there are a total of 2516 unique features in the corpus of 1500 documents.
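To get a feel for what these features look like, we can peek at a small slice of the learned vocabulary (a quick check I added, assuming the fitted vectorizers above):
# inspect a few entries of the shared vocabulary
print(count_vectorizer.get_feature_names()[:10])
print(tfidf_vectorizer.get_feature_names()[-10:])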
Topic Modeling
Build NMF model using sklearn
Non-Negative Matrix Factorization (NMF): The goal of NMF is to find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation or topic extraction. We will be using sklearn’s implementation of NMF.
from sklearn.decomposition import NMF
nmf = NMF(n_components=2, random_state=43, alpha=0.1, l1_ratio=0.5)
nmf_output = nmf.fit_transform(tfidf_feature_matrix)
nmf_feature_names = tfidf_vectorizer.get_feature_names()
nmf_weights = nmf.components_
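Under the hood, nmf_output is the document-topic matrix W and nmf.components_ is the topic-term matrix H, and their product approximates the tf-idf matrix X. A quick sanity check on the shapes (my addition, assuming the objects fitted above):
# W (documents x topics) times H (topics x terms) approximates X (documents x terms)
W = nmf_output
H = nmf.components_
print(W.shape, H.shape, np.dot(W, H).shape)   # expect (1500, 2), (2, 2516), (1500, 2516)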
#####################################
## Utility functions to help with NMF
# Code adapted from Sarkar text book
#####################################
# get topics with their terms and weights
def get_topics_terms_weights(weights, feature_names):
    feature_names = np.array(feature_names)
    sorted_indices = np.array([list(row[::-1]) for row in np.argsort(np.abs(weights))])
    sorted_weights = np.array([list(wt[index]) for wt, index in zip(weights, sorted_indices)])
    sorted_terms = np.array([list(feature_names[row]) for row in sorted_indices])
    topics = [np.vstack((terms.T, term_weights.T)).T for terms, term_weights in zip(sorted_terms, sorted_weights)]
    return topics
# prints components of all the topics
# obtained from topic modeling
def print_topics_udf(topics, total_topics=1,
                     weight_threshold=0.0001,
                     display_weights=False,
                     num_terms=None):
    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        #print(topic)
        topic = [(word, round(wt, 2))
                 for word, wt in topic
                 if abs(wt) >= weight_threshold]
        if display_weights:
            print('Topic #' + str(index + 1) + ' with weights')
            print(topic[:num_terms] if num_terms else topic)
        else:
            print('Topic #' + str(index + 1) + ' without weights')
            tw = [term for term, wt in topic]
            print(tw[:num_terms] if num_terms else tw)
# returns the top terms and weights of all the topics
# obtained from topic modeling
def get_topics_udf(topics, total_topics=1,
                   weight_threshold=0.0001,
                   num_terms=None):
    topic_terms = []
    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        #print(topic)
        topic = [(word, round(wt, 2))
                 for word, wt in topic
                 if abs(wt) >= weight_threshold]
        topic_terms.append(topic[:num_terms] if num_terms else topic)
    return topic_terms
def getTermsAndSizes(topic_display_list_item):
    terms = []
    sizes = []
    for term, size in topic_display_list_item:
        terms.append(term)
        sizes.append(size)
    return terms, sizes
Important terms in each Topic
topics = get_topics_terms_weights(nmf_weights, nmf_feature_names)
print_topics_udf(topics, total_topics=2, num_terms=30, display_weights=True)
Topic #1 with weights
[('like', 0.6), ('make', 0.57), ('character', 0.55), ('time', 0.52), ('story', 0.51), ('good', 0.5), ('really', 0.42), ('scene', 0.4), ('action', 0.4), ('great', 0.38), ('people', 0.38), ('film', 0.38), ('watch', 0.38), ('movie', 0.37), ('know', 0.37), ('plot', 0.34), ('best', 0.33), ('think', 0.31), ('thing', 0.3), ('performance', 0.29), ('want', 0.29), ('life', 0.29), ('actor', 0.28), ('play', 0.28), ('love', 0.27), ('work', 0.27), ('come', 0.27), ('acting', 0.26), ('year', 0.25), ('better', 0.24)]
Topic #2 with weights
[('bond', 1.93), ('james', 0.61), ('action', 0.21), ('love', 0.18), ('girl', 0.11), ('moore', 0.11), ('agent', 0.11), ('favorite', 0.11), ('villain', 0.1), ('best', 0.09), ('woman', 0.09), ('fight', 0.08), ('russia', 0.08), ('series', 0.07), ('kill', 0.07), ('soviet', 0.07), ('grant', 0.07), ('beautiful', 0.07), ('number', 0.06), ('pierce', 0.06), ('daniel', 0.06), ('russian', 0.05), ('bomb', 0.05), ('secret', 0.05), ('excellent', 0.04), ('train', 0.04), ('course', 0.04), ('sequence', 0.04), ('death', 0.04), ('film', 0.04)]
topics_display_list = get_topics_udf(topics, total_topics=2, num_terms=30)
Visualize NMF topics
Instead of using a word cloud, I have used matplotlib to display each topic's terms, sized by their weights.
terms, sizes = getTermsAndSizes(topics_display_list[0])
num_top_words = 30
fontsize_base = 30 / np.max(sizes) # font size for word with largest share in corpus
num_topics = 1
for t in range(num_topics):
    fig, ax = plt.subplots(1, num_topics, figsize=(6, 12))
    plt.ylim(0, num_top_words + 1.0)
    plt.xticks([])
    plt.yticks([])
    plt.title('Topic #{}'.format(t))
    for i, (word, share) in enumerate(zip(terms, sizes)):
        word = word + " (" + str(share) + ")"
        plt.text(0.3, num_top_words - i - 1.0, word, fontsize=fontsize_base * share)
plt.tight_layout()
terms, sizes = getTermsAndSizes(topics_display_list[1])
num_top_words = 30
fontsize_base = 160 / (np.max(sizes))*0.8 # font size for word with largest share in corpus
num_topics = 1
for t in range(num_topics):
    fig, ax = plt.subplots(1, num_topics, figsize=(16, 30))
    plt.ylim(0, num_top_words + 1.0)
    plt.xticks([])
    plt.yticks([])
    plt.title('Topic #{}'.format(t + 1))
    for i, (word, share) in enumerate(zip(terms, sizes)):
        word = word + " (" + str(share) + ")"
        plt.text(0.3, num_top_words - i - .5, word, fontsize=fontsize_base * share)
plt.tight_layout()
NMF results summary
There are three fundamental questions to answer when subjectively evaluating the NMF results:
- What is the meaning of each topic? This question seeks to tackle topic coherence.
- How prevalent (common) is each topic in the overall corpus? This question seeks to understand the topic distribution across the corpus.
- How do the topics relate to each other? This deals with inter-topic distance.
The NMF results show that the model identified the distinctive features of Topic 1 (the James Bond topic) with a lot of confidence. However, it failed to provide coherence for Topic 0. Because NMF is a deterministic model, we don't have a way to adjust term probabilities to see how the key terms vary within each topic. For better topic coherence, we can try a probabilistic model like LDA.
Why LDA?
- Latent Dirichlet Allocation learns the relationships between words, topics, and documents by assuming documents are generated by a particular probabilistic model.
- A topic in LDA is a multinomial distribution over the (typically thousands of) terms in the vocabulary of the corpus.
- To interpret a topic, one typically examines a ranked list of the most probable terms in that topic, using anywhere from three to thirty terms in the list. The problem with interpreting topics this way is that common terms in the corpus often appear near the top of such lists for multiple topics, making it hard to differentiate the meanings of these topics.
- LDA allows ranking the terms of a given topic by both the frequency of the term under that topic and the term's exclusivity to the topic, which accounts for the degree to which it appears in that particular topic to the exclusion of others. By varying the lambda value, we have the flexibility to rank terms in order of usefulness for interpreting topics (a rough sketch of this relevance score follows this list).
- Thus, for applications in which a human end-user will interact with learned topics, the flexibility and coherence advantages of LDA warrant strong consideration.
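The lambda-weighted ranking mentioned above is the relevance score popularized by pyLDAvis: relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log( p(w|t) / p(w) ). The sketch below is my own illustration of that idea, not code from the original post, and the function name and arguments are hypothetical:
# rough sketch of ranking a topic's terms by lambda-weighted relevance
def top_relevant_terms(lda_model, doc_term_matrix, feature_names, topic_idx, lam=0.6, n_terms=10):
    topic_dist = lda_model.components_[topic_idx]
    p_w_given_t = topic_dist / topic_dist.sum()                  # p(w | topic)
    term_counts = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    p_w = term_counts / term_counts.sum()                        # p(w) over the whole corpus
    relevance = lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)
    top = np.argsort(relevance)[::-1][:n_terms]
    return [feature_names[i] for i in top]
# once the LDA model below is fit, this could be called as, e.g.:
# top_relevant_terms(best_lda_model, feature_matrix, count_vectorizer.get_feature_names(), topic_idx=1)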
Build LDA model using sklearn
from sklearn.decomposition import LatentDirichletAllocation
# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=2, max_iter=100, learning_method='online', random_state=43,
batch_size=128, evaluate_every=-1, n_jobs=-1)
# fit transform the feature matrix
lda_output = lda_model.fit_transform(feature_matrix)
# display the lda_output and its shape
display(lda_output)
display(lda_output.shape)
array([[0.29220839, 0.70779161],
[0.81608747, 0.18391253],
[0.08911543, 0.91088457],
...,
[0.68707845, 0.31292155],
[0.54749114, 0.45250886],
[0.11440662, 0.88559338]])
(1500, 2)
Diagnose model performance using perplexity and log-likelihood
A model with higher log-likelihood and lower perplexity is preferred.
# print log-likelihood
print("Log likelihood: ", lda_model.score(feature_matrix))
Log likelihood: -969519.4325965648
# print perplexity
print("Perplexity: ", lda_model.perplexity(feature_matrix))
Perplexity: 1329.485684613842
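These two numbers are directly related: sklearn computes perplexity as the exponentiated negative per-word likelihood bound, so (as a quick check based on my reading of sklearn's implementation, not something from the original post) the value below should come out very close to the perplexity reported above.
# perplexity is (approximately) exp( -log_likelihood / total token count in the corpus )
print(np.exp(-lda_model.score(feature_matrix) / feature_matrix.sum()))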
GridSearch the best LDA model
from sklearn.model_selection import GridSearchCV
# Define Search Param
search_params = {'n_components': [2, 3, 4, 5, 10, 15, 20, 25], 'learning_decay': [.5, .7, .9]}
# Init the model
lda = LatentDirichletAllocation()
# Init Grid Search class
model = GridSearchCV(lda, search_params)
model.fit(feature_matrix)
best_lda_model = model.best_estimator_
print("Best model's params: ", model.best_params_)
print("Best log likelihood score: ", model.best_score_)
print("Model perplexity: ", best_lda_model.perplexity(feature_matrix))
Best model's params: {'learning_decay': 0.5, 'n_components': 2}
Best log likelihood score: -331679.02058652363
Model perplexity: 1337.0782969491522
Compare the LDA performance score
df_cv_results = pd.DataFrame(model.cv_results_)
df_cv_results.to_csv("LDAGridSearchResults.csv", header=True, index=False, encoding='utf-8')
import seaborn as sns
sns.pointplot(x="param_n_components", y="mean_test_score", hue="param_learning_decay", data=df_cv_results)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b9831d0>
Dominant Topic in each document
# Take the best model
best_lda_model
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.5,
learning_method='batch', learning_offset=10.0,
max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
n_components=2, n_jobs=None, n_topics=None, perp_tol=0.1,
random_state=None, topic_word_prior=None,
total_samples=1000000.0, verbose=0)
# Create a document to topic matrix
lda_output = best_lda_model.transform(feature_matrix)
# column names
topicnames = ['Topic_' + str(i) for i in range(best_lda_model.n_components)]
# index names
docnames = ['Doc_' + str(i) for i in range(len(clean_reviews_text))]
# create a dataframe
df_document_topic = pd.DataFrame(np.round(lda_output,2), columns=topicnames, index=docnames)
df_document_topic.head()
 | Topic_0 | Topic_1
---|---|---
Doc_0 | 0.99 | 0.01 |
Doc_1 | 0.64 | 0.36 |
Doc_2 | 0.62 | 0.38 |
Doc_3 | 0.99 | 0.01 |
Doc_4 | 0.46 | 0.54 |
# dominant topic
df_document_topic['dominant_topic'] = np.argmax(df_document_topic.values, axis=1)
df_document_topic.head()
 | Topic_0 | Topic_1 | dominant_topic
---|---|---|---
Doc_0 | 0.99 | 0.01 | 0 |
Doc_1 | 0.64 | 0.36 | 0 |
Doc_2 | 0.62 | 0.38 | 0 |
Doc_3 | 0.99 | 0.01 | 0 |
Doc_4 | 0.46 | 0.54 | 1 |
sns.countplot(df_document_topic.dominant_topic)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2c028898>
Topic 1 is more dominant in the entire corpus.
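A quick numeric check of the same claim (my addition, using the dataframe built above):
# count how many documents fall under each dominant topic
print(df_document_topic['dominant_topic'].value_counts())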
Visualize using pyLDAvis
pyLDAvis visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic.
import pyLDAvis.sklearn
panel = pyLDAvis.sklearn.prepare(best_lda_model, feature_matrix, count_vectorizer, mds='tsne')
pyLDAvis.display(panel)
panel = pyLDAvis.sklearn.prepare(best_lda_model, feature_matrix, count_vectorizer, mds='PCoA')
pyLDAvis.display(panel)
Get each topic’s keywords
# components_ contains the topic-to-word matrix
best_lda_model.components_.shape
(2, 2516)
# check the shape
feature_matrix.shape
(1500, 2516)
# Topic - Keyword matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# assign column and index
df_topic_keywords.columns = count_vectorizer.get_feature_names()
df_topic_keywords.index = topicnames
# display the first 10 term columns of the topic-keyword matrix
df_topic_keywords.iloc[:,:10]
 | abandon | ability | able | absence | absolute | absolutely | absorb | absurd | abuse | academy
---|---|---|---|---|---|---|---|---|---|---
Topic_0 | 2.854504 | 20.251906 | 67.314308 | 3.665994 | 21.150957 | 89.982076 | 6.728738 | 6.750981 | 3.943679 | 25.818699 |
Topic_1 | 16.145496 | 26.748094 | 47.685692 | 11.334006 | 18.849043 | 31.017924 | 10.271262 | 13.249019 | 30.056321 | 10.181301 |
Get the top 20 keywords from each topic
# Show top n keywords for each topic
def show_topics(vectorizer=count_vectorizer, lda_model=best_lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(count_vectorizer, best_lda_model, 20)
topic_keywords
[array(['like', 'make', 'time', 'good', 'action', 'character', 'really',
'story', 'movie', 'scene', 'great', 'watch', 'best', 'love',
'plot', 'people', 'know', 'film', 'think', 'thing'], dtype='<U15'),
array(['character', 'make', 'like', 'story', 'time', 'film', 'life',
'know', 'scene', 'people', 'bond', 'play', 'world', 'work', 'year',
'come', 'best', 'performance', 'director', 'good'], dtype='<U15')]
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
 | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 | Word 11 | Word 12 | Word 13 | Word 14 | Word 15 | Word 16 | Word 17 | Word 18 | Word 19
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Topic 0 | like | make | time | good | action | character | really | story | movie | scene | great | watch | best | love | plot | people | know | film | think | thing |
Topic 1 | character | make | like | story | time | film | life | know | scene | people | bond | play | world | work | year | come | best | performance | director | good |
Predict topics for new text
test_corpus = ['''As a lifelong James Bond enthusiast who has been extremely disappointed with the franchise's
latest efforts (with the exception of Casino Royale), I was extremely pleased with this film. It strayed away
from the storyline of the previous two films and I couldn't have been happier after the mediocrity of Quantum
of Solace. This film has all the constituents from the Bond films that have preceded it. Big explosions, ridiculous
stunts that not a single person in the history of humanity can survive, and let's not forget to mention the beautiful
women that would make both genders stop and stare. So what does Skyfall have that the other Bond films don't? For the
first time, we get a glimpse into our mysterious hero's dark past. Where he came from and what made him the person
he is today. ''',
'''When I watched this for the first time in over 30 years, I was surprised how little action there was
since I had remembered this as some intense horror movie. Of course, I was young and more impressionable so I guess I
just remembered those few dramatic, sensational scenes such as Janet Leigh murdered in the shower and the quick other
murder at the top of the stairs. Basically, that was about it, action-wise, BUT I have no complaints because the more
I watch this film, the more I like it. It has become my favorite Alfred Hitchcock movie, along with Rear Window.''']
# normalize the corpus
clean_test_corpus = cleanTextBooks(test_corpus)
clean_test_corpus = [' '.join(text) for text in clean_test_corpus]
# vectorize the corpus
test_feature_matrix = count_vectorizer.transform(clean_test_corpus)
# check the shape; it should have the same 2516 features as the training matrix
test_feature_matrix.shape
(2, 2516)
test_lda_output = best_lda_model.transform(test_feature_matrix)
# column names
test_topicnames = ['Topic_' + str(i) for i in range(best_lda_model.n_components)]
# index names
test_docnames = ['Doc_' + str(i) for i in range(len(clean_test_corpus))]
# create a dataframe
test_df_document_topic = pd.DataFrame(np.round(test_lda_output,2), columns=test_topicnames, index=test_docnames)
# dominant topic
test_df_document_topic['dominant_topic'] = np.argmax(test_df_document_topic.values, axis=1)
test_df_document_topic.head()
 | Topic_0 | Topic_1 | dominant_topic
---|---|---|---
Doc_0 | 0.34 | 0.66 | 1 |
Doc_1 | 0.63 | 0.37 | 0 |
Interactive Data Visualization showing relation between Clustering, Sentiment and Topics
df_clusters = pd.read_csv('sentiment_clustering.csv')
df_clusters.head()
 | movie | user_review | sentiment | predicted_sentiment | k2 | cluster | sentiment_score | pc1 | pc2
---|---|---|---|---|---|---|---|---|---
0 | The Dark Knight | We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... | positive | positive | 1 | cluster1 | 0.017 | 0.224582 | -0.059699 |
1 | The Dark Knight | Jim Kunsler is not a regular film critic. He is an accomplished published author and highly educated in arts and literature. I subscribe to his blog, usually for his political writing.There are hardly any film critics around. The newspaper, magazine, internetz,IMDb & TV folks are just reviewers as opposed to critics. They mumble about good or bad, thumbs up, how much money is involved and other nonsense in what is an ignorant parody of Consumer Reports, as if the art of film were a microwave... | negative | negative | 0 | cluster0 | 0.009 | -0.085530 | 0.022846 |
2 | Inception | What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... | positive | positive | 1 | cluster1 | 0.021 | 0.114715 | -0.112303 |
3 | Inception | This is the worst film I've seen in a long time. I think I can imagine what other people who are raving about the film like; but I can guarantee that the rating of this film will plummet in a year when the slight novelty of the special effects wears off. The "story" here is the story of a role-playing video game, where you get trapped into deeper and deeper levels without knowing what to expect. As you go, the rules change. This is convenient for the writers, who simply make it all up as the... | negative | negative | 0 | cluster0 | -0.005 | -0.077007 | -0.028335 |
4 | The Usual Suspects | Ah, the Usual Suspects. My personal favorite movie of all time. Don't let my bias be a fool. Perhaps it's not THE best movie ever, but it's one that I never get tired of.If you like flash and bikinis and breath-taking camera angles, you won't find them here. Usual Suspects is not an "epic," and it doesn't pretend to be. It's a modestly-budgeted piece by a fresh director (who later went on to do the X-Men movies, a FAR departure).A great, gritty script, beautifully-acted characters, and what ... | positive | positive | 1 | cluster1 | 0.069 | 0.028520 | -0.033267 |
# get movie, user_review, sentiment from df_reviews
df_document_topic['movie'] = df_reviews.movie.tolist()
df_document_topic['user_review'] = df_reviews.user_review.tolist()
df_document_topic['sentiment'] = df_reviews.sentiment.tolist()
# filter out neutral
df_document_topic = df_document_topic[df_document_topic.sentiment != 'neutral']
# now append clusters
df_document_topic['cluster'] = df_clusters['cluster'].tolist()
# now append predicted sentiment
df_document_topic['predicted_sentiment'] = df_clusters['predicted_sentiment'].tolist()
# overall sentiment score
df_document_topic['sentiment_score'] = df_clusters['sentiment_score'].tolist()
# pc1 and pc2
df_document_topic['pc1'] = df_clusters['pc1'].tolist()
df_document_topic['pc2'] = df_clusters['pc2'].tolist()
# map dominant_topic values 0, 1 to the labels topic0, topic1 (Bokeh's categorical color mapper needs string categories)
df_document_topic['dominant_topic'] = np.where(df_document_topic['dominant_topic'] == 0, 'topic0', 'topic1')
df_document_topic.head()
 | Topic_0 | Topic_1 | dominant_topic | movie | user_review | sentiment | cluster | predicted_sentiment | sentiment_score | pc1 | pc2
---|---|---|---|---|---|---|---|---|---|---|---
Doc_1 | 0.64 | 0.36 | topic0 | The Dark Knight | We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... | positive | cluster1 | positive | 0.017 | 0.224582 | -0.059699 |
Doc_2 | 0.62 | 0.38 | topic0 | The Dark Knight | Jim Kunsler is not a regular film critic. He is an accomplished published author and highly educated in arts and literature. I subscribe to his blog, usually for his political writing.There are hardly any film critics around. The newspaper, magazine, internetz,IMDb & TV folks are just reviewers as opposed to critics. They mumble about good or bad, thumbs up, how much money is involved and other nonsense in what is an ignorant parody of Consumer Reports, as if the art of film were a microwave... | negative | cluster0 | negative | 0.009 | -0.085530 | 0.022846 |
Doc_4 | 0.46 | 0.54 | topic1 | Inception | What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... | positive | cluster1 | positive | 0.021 | 0.114715 | -0.112303 |
Doc_5 | 0.98 | 0.02 | topic0 | Inception | This is the worst film I've seen in a long time. I think I can imagine what other people who are raving about the film like; but I can guarantee that the rating of this film will plummet in a year when the slight novelty of the special effects wears off. The "story" here is the story of a role-playing video game, where you get trapped into deeper and deeper levels without knowing what to expect. As you go, the rules change. This is convenient for the writers, who simply make it all up as the... | negative | cluster0 | negative | -0.005 | -0.077007 | -0.028335 |
Doc_7 | 0.79 | 0.21 | topic0 | The Usual Suspects | Ah, the Usual Suspects. My personal favorite movie of all time. Don't let my bias be a fool. Perhaps it's not THE best movie ever, but it's one that I never get tired of.If you like flash and bikinis and breath-taking camera angles, you won't find them here. Usual Suspects is not an "epic," and it doesn't pretend to be. It's a modestly-budgeted piece by a fresh director (who later went on to do the X-Men movies, a FAR departure).A great, gritty script, beautifully-acted characters, and what ... | positive | cluster1 | positive | 0.069 | 0.028520 | -0.033267 |
from bokeh.io import show, output_notebook, push_notebook, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.models import CategoricalColorMapper
from bokeh.layouts import row
from bokeh.layouts import gridplot
output_notebook()
# Make a source and a scatter plot
source = ColumnDataSource(df_document_topic)
# define plot 1
plot1 = figure(x_axis_label = 'PC 1',
y_axis_label = 'PC 2', title="Clustering Results",
width = 500, height = 400)
# add color
palette = ['#FF7373', '#61F2F5']
color_map1 = CategoricalColorMapper(factors=df_document_topic['cluster'].unique(), palette=palette)
plot1.circle(x = 'pc1',
y = 'pc2',
source = source,
color = {'field': 'cluster', 'transform': color_map1},
legend='cluster', alpha = .8)
# Create a HoverTool object
hover1 = HoverTool(tooltips = [('Movie', '@movie'),
('Dominant_topic', '@dominant_topic'),
('Cluster', '@cluster'),
('Predicted_sentiment', '@predicted_sentiment'),
('Actual_Sentiment', '@sentiment'),
('Sentiment_score', '@sentiment_score')
])
plot1.add_tools(hover1)
### plot2
plot2 = figure(x_axis_label = 'PC 1',
y_axis_label = 'PC 2', title="Predicted Sentiment",
width = 500, height = 400)
color_map2 = CategoricalColorMapper(factors=df_document_topic['predicted_sentiment'].unique(), palette=palette)
plot2.circle(x = 'pc1',
y = 'pc2',
source = source, color = {'field': 'predicted_sentiment', 'transform': color_map2},
legend='predicted_sentiment', alpha = .8)
# Create a HoverTool object
hover2 = HoverTool(tooltips = [('Movie', '@movie'),
('Dominant_topic', '@dominant_topic'),
('Cluster', '@cluster'),
('Predicted_sentiment', '@predicted_sentiment'),
('Actual_Sentiment', '@sentiment'),
('Sentiment_score', '@sentiment_score')
])
plot2.add_tools(hover2)
### plot3
plot3 = figure(x_axis_label = 'PC 1',
y_axis_label = 'PC 2', title="Dominant Topic",
width = 500, height = 400)
color_map3 = CategoricalColorMapper(factors=df_document_topic['dominant_topic'].unique(), palette=palette)
plot3.circle(x = 'pc1',
y = 'pc2',
source = source, color = {'field': 'dominant_topic', 'transform': color_map3},
legend='dominant_topic', alpha = .8)
# Create a HoverTool object
hover3 = HoverTool(tooltips = [('Movie', '@movie'),
('Dominant_topic', '@dominant_topic'),
('Cluster', '@cluster'),
('Predicted_sentiment', '@predicted_sentiment'),
('Actual_Sentiment', '@sentiment'),
('Sentiment_score', '@sentiment_score')
])
plot3.add_tools(hover3)
### plot4
plot4 = figure(x_axis_label = 'PC 1',
y_axis_label = 'PC 2', title="Actual Sentiment",
width = 500, height = 400)
color_map4 = CategoricalColorMapper(factors=df_document_topic['sentiment'].unique(), palette=palette)
plot4.circle(x = 'pc1',
y = 'pc2',
source = source, color = {'field': 'sentiment', 'transform': color_map4},
legend='sentiment', alpha = .8)
# Create a HoverTool object
hover4 = HoverTool(tooltips = [('Movie', '@movie'),
('Dominant_topic', '@dominant_topic'),
('Cluster', '@cluster'),
('Predicted_sentiment', '@predicted_sentiment'),
('Actual_Sentiment', '@sentiment'),
('Sentiment_score', '@sentiment_score')
])
plot4.add_tools(hover4)
# link the ranges
plot2.x_range = plot1.x_range
plot2.y_range = plot1.y_range
plot3.x_range = plot1.x_range
plot3.y_range = plot1.y_range
plot4.x_range = plot1.x_range
plot4.y_range = plot1.y_range
#layout
row2 = [plot4, plot2]
row1 = [plot1, plot3]
layout = gridplot([row1, row2])
# write the interactive plots to an HTML file and display them
output_file('topic-modeling.html')
show(layout)
Separate topics based on sentiment
From the interactive data visualization above, if you hover over an individual movie review, you can see the sentiment score, predicted sentiment, and actual sentiment for each of the user reviews (along with the topic it is assigned to).
Calculate the average sentiment score for each dominant topic
Looking at the average sentiment score for each topic does not, by itself, reveal much about the sentiment of each topic. However, since Topic 1 has a lower average sentiment score, and since more of the negative words are part of Topic 1, we can deduce that Topic 1 carries a slightly negative sentiment.
df_document_topic.groupby('dominant_topic')['sentiment_score'].mean()
dominant_topic
topic0 0.029157
topic1 0.017293
Name: sentiment_score, dtype: float64
Appendix
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("Numpy", numpy.__version__)
import pandas; print("Pandas", pandas.__version__)
import seaborn; print("Seaborn", seaborn.__version__)
import matplotlib; print("Matplotlib", matplotlib.__version__)
import nltk; print("NLTK", nltk.__version__)
import requests; print("requests", requests.__version__)
import bs4; print("BeautifulSoup", bs4.__version__)
import re; print("re", re.__version__)
import spacy; print("spacy", spacy.__version__)
import gensim; print("gensim", gensim.__version__)
import bokeh; print("bokeh", bokeh.__version__)
Darwin-17.7.0-x86_64-i386-64bit
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Numpy 1.15.4
Pandas 0.23.3
Seaborn 0.9.0
Matplotlib 2.2.2
NLTK 3.2.5
requests 2.19.1
BeautifulSoup 4.7.1
re 2.2.1
spacy 2.1.4
gensim 3.4.0
bokeh 1.3.4
#####################################
# Module: normalization2.py
# Author: Shravan Kuchkula
# Date: 08/08/2019
#####################################
import re
import pandas as pd
import numpy as np
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
custom_stopwords = ['movie', 'film', 'review']
def remove_hypens(book_text):
    return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', book_text)

# tokenize text
def tokenize_text(book_text):
    TOKEN_PATTERN = r'\s+'
    regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=True)
    word_tokens = regex_wt.tokenize(book_text)
    return word_tokens

def remove_characters_after_tokenization(tokens):
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    return filtered_tokens

def convert_to_lowercase(tokens):
    return [token.lower() for token in tokens if token.isalpha()]

def remove_stopwords(tokens, custom_stopwords):
    stopword_list = nltk.corpus.stopwords.words('english')
    stopword_list += custom_stopwords
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

def get_lemma(tokens):
    lemmas = []
    for word in tokens:
        lemma = wn.morphy(word)
        if lemma is None:
            lemmas.append(word)
        else:
            lemmas.append(lemma)
    return lemmas

def remove_short_tokens(tokens):
    return [token for token in tokens if len(token) > 3]

def keep_only_words_in_wordnet(tokens):
    return [token for token in tokens if wn.synsets(token)]

def apply_lemmatize(tokens, wnl=WordNetLemmatizer()):
    return [wnl.lemmatize(token) for token in tokens]

def cleanTextBooks(book_texts):
    clean_books = []
    for book in book_texts:
        book = remove_hypens(book)
        book_i = tokenize_text(book)
        book_i = remove_characters_after_tokenization(book_i)
        book_i = convert_to_lowercase(book_i)
        book_i = remove_stopwords(book_i, custom_stopwords)
        book_i = get_lemma(book_i)
        book_i = remove_short_tokens(book_i)
        book_i = keep_only_words_in_wordnet(book_i)
        book_i = apply_lemmatize(book_i)
        clean_books.append(book_i)
    return clean_books