Topic Modeling with NMF and LDA using sklearn

24 minute read

Author: Shravan Kuchkula

Introduction

Topic modeling involves extracting features from document terms and using mathematical structures such as matrix factorization and SVD to generate clusters of terms that are distinguishable from each other; these clusters of words form topics or concepts. The concepts can be used to interpret the main themes of a corpus and to make semantic connections among words that frequently co-occur across documents. There are various frameworks and algorithms for building topic models. Here, I will explore two:

  • Non-negative matrix factorization
  • Latent Dirichlet Allocation

The pros and cons of using NMF and LDA are discussed in the context of analyzing 1500 movie reviews extracted from IMDB. Shown below is the high-level Topic Modeling workflow:

  • 1500 movie reviews are sent through the NLP pipeline with the goal to normalize the text.
  • The normalized corpus is then fed into a Term Frequency Vectorizer or Tf-idf vectorizer depending on the algorithm.
  • Topic modeling is performed using NMF and LDA
  • The topic modeling results are evaluated and visualized using pyLDAvis.

[Figure: topic-modeling-workflow]

End Result

Shown below are the results of topic modeling with both NMF and LDA. These results show that there is some positive sentiment associated with James Bond movies. I will discuss this further down in the post.

NMF results

[Figure: NMF topic modeling results]

LDA results

[Figures: LDA topic modeling results]

Interactive plot showing results of K-means clustering, LDA topic modeling and Sentiment Analysis

By combining the results of Clustering, Topic Modeling and Sentiment Analysis, we can subjectively gauge how well our Topic Modeling has worked.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from normalization2 import *

pd.options.display.max_colwidth=500

import warnings
warnings.filterwarnings('ignore')

# some IPython magic to show the matplotlib plots inline
%matplotlib inline

Data Gathering and Normalization

A positive, a negative, and a neutral movie review is extracted for each of the top 500 Thriller movies from the IMDB website. Thus, our corpus contains 1500 reviews. Each movie review in the corpus (a document) is sent through the following NLP pipeline to normalize the text:

  • remove_hypens
  • tokenize_text
  • remove_special_characters
  • convert to lower case
  • remove stopwords
  • lemmatize the token
  • remove short tokens
  • keep only words in wordnet

After sending each review through this pipeline, we have a list of normalized reviews that can be used for further analysis.
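To make the pipeline concrete, here is a minimal, hypothetical example of running a single toy review through cleanTextBooks (defined in normalization2.py, listed in the Appendix). The exact tokens depend on the installed NLTK data, so the output shown is approximate:

# hypothetical toy input, not from the actual corpus
sample = ["The plot was well-crafted, but the pacing dragged in places."]

# each review comes back as a list of normalized tokens
print(cleanTextBooks(sample))
# roughly: [['plot', 'well', 'craft', 'pace', 'drag', 'place']]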

# get the collection of reviews
# first batch of reviews
df = pd.read_pickle('userReviews4.pkl')

# second batch of reviews
df1 = pd.read_pickle('userReviews3.pkl')

# dataframe containing all 1500 reviews
df_reviews = pd.concat([df, df1], ignore_index=True)

# display the first few rows of df_reviews
display(df_reviews.head())

# user reviews
reviews = list(df_reviews.user_review)
movie user_review_permalink user_review sentiment
0 The Dark Knight https://www.imdb.com/review/rw2081858/ I'm so relieved that I'm not the only one who doesn't think this movie is great. I saw this first in the theater, and to be honest, probably would've given it a bit higher rating...if I would've reviewed it then. But........I just watched it again on cable and it doesn't hold up as well outside of the movie theater with it's huge screen, dark room and loud speakers. First off...what I liked. Heath Leadger was fantastic as the Joker and could've done so much more with it if the writing and pl... neutral
1 The Dark Knight https://www.imdb.com/review/rw1908115/ We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... positive
2 The Dark Knight https://www.imdb.com/review/rw1922862/ Jim Kunsler is not a regular film critic. He is an accomplished published author and highly educated in arts and literature. I subscribe to his blog, usually for his political writing.There are hardly any film critics around. The newspaper, magazine, internetz,IMDb & TV folks are just reviewers as opposed to critics. They mumble about good or bad, thumbs up, how much money is involved and other nonsense in what is an ignorant parody of Consumer Reports, as if the art of film were a microwave... negative
3 Inception https://www.imdb.com/review/rw2286063/ I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten... neutral
4 Inception https://www.imdb.com/review/rw2276780/ What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... positive
# apply the NLP pipeline
clean_reviews = cleanTextBooks(reviews)

# rejoin the tokens to form strings which will be used to vectorize
clean_reviews_text = [' '.join(item) for item in clean_reviews]

Vectorize the reviews

Since our goal is to explore LDA and NMF and see how each performs on the corpus, we will use both raw term frequencies (used by LDA) and TF-IDF weights, i.e. term frequency-inverse document frequency (used by NMF). Thus, we will use both CountVectorizer and TfidfVectorizer.

# for LDA
from sklearn.feature_extraction.text import CountVectorizer

# for NMF
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorize the corpus
count_vectorizer = CountVectorizer(min_df=10, max_df=0.95, ngram_range=(1,1), stop_words='english')
tfidf_vectorizer = TfidfVectorizer(min_df=10, max_df=0.95, ngram_range=(1,1), stop_words='english')

# calculate the feature matrix
feature_matrix = count_vectorizer.fit_transform(clean_reviews_text)
tfidf_feature_matrix = tfidf_vectorizer.fit_transform(clean_reviews_text)

display(feature_matrix.shape)
display(tfidf_feature_matrix.shape)
(1500, 2516)
(1500, 2516)

Displaying the shape of the feature matrices indicates that there are a total of 2516 unique features in the corpus of 1500 documents.
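As a quick, optional sanity check, we can peek at the learned vocabulary; since both vectorizers were fit with identical parameters, they learn the same 2516 terms:

# both vocabularies contain the same 2516 unique terms
print(len(count_vectorizer.vocabulary_))          # 2516
print(count_vectorizer.get_feature_names()[:10])  # first few terms, alphabetically sorted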

Topic Modeling

Build NMF model using sklearn

Non-Negative Matrix Factorization (NMF): the goal of NMF is to find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction. We will be using sklearn’s implementation of NMF.

from sklearn.decomposition import NMF

nmf = NMF(n_components=2, random_state=43,  alpha=0.1, l1_ratio=0.5)
nmf_output = nmf.fit_transform(tfidf_feature_matrix)

nmf_feature_names = tfidf_vectorizer.get_feature_names()
nmf_weights = nmf.components_
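To connect this back to the factorization described above, nmf_output plays the role of W (document-topic weights) and nmf.components_ the role of H (topic-term weights), so that X ≈ WH. A quick, optional check of the shapes and the reconstruction error:

# W: document-topic matrix, H: topic-term matrix, X ~ W @ H
W, H = nmf_output, nmf.components_
print(W.shape, H.shape)          # (1500, 2) (2, 2516)
print(nmf.reconstruction_err_)   # Frobenius-norm error between X and WH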
#####################################
## Utility functions to help with NMF
# Code adapted from Sarkar text book
#####################################

# get topics with their terms and weights
def get_topics_terms_weights(weights, feature_names):
    feature_names = np.array(feature_names)
    sorted_indices = np.array([list(row[::-1]) for row in np.argsort(np.abs(weights))])
    sorted_weights = np.array([list(wt[index]) for wt, index in zip(weights, sorted_indices)])
    sorted_terms = np.array([list(feature_names[row]) for row in sorted_indices])

    topics = [np.vstack((terms.T, term_weights.T)).T for terms, term_weights in zip(sorted_terms, sorted_weights)]

    return topics


# prints components of all the topics
# obtained from topic modeling
def print_topics_udf(topics, total_topics=1,
                     weight_threshold=0.0001,
                     display_weights=False,
                     num_terms=None):

    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        #print(topic)
        topic = [(word, round(wt,2))
                 for word, wt in topic
                 if abs(wt) >= weight_threshold]

        if display_weights:
            print('Topic #'+str(index+1)+' with weights')
            print(topic[:num_terms] if num_terms else topic)
        else:
            print('Topic #'+str(index+1)+' without weights')
            tw = [term for term, wt in topic]
            print(tw[:num_terms] if num_terms else tw)

# prints components of all the topics
# obtained from topic modeling
def get_topics_udf(topics, total_topics=1,
                     weight_threshold=0.0001,
                     num_terms=None):

    topic_terms = []

    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        #print(topic)
        topic = [(word, round(wt,2))
                 for word, wt in topic
                 if abs(wt) >= weight_threshold]

        topic_terms.append(topic[:num_terms] if num_terms else topic)

    return topic_terms

def getTermsAndSizes(topic_display_list_item):
    terms = []
    sizes = []
    for term, size in topic_display_list_item:
        terms.append(term)
        sizes.append(size)
    return terms, sizes

Important terms in each Topic

topics = get_topics_terms_weights(nmf_weights, nmf_feature_names)
print_topics_udf(topics, total_topics=2, num_terms=30, display_weights=True)
Topic #1 with weights
[('like', 0.6), ('make', 0.57), ('character', 0.55), ('time', 0.52), ('story', 0.51), ('good', 0.5), ('really', 0.42), ('scene', 0.4), ('action', 0.4), ('great', 0.38), ('people', 0.38), ('film', 0.38), ('watch', 0.38), ('movie', 0.37), ('know', 0.37), ('plot', 0.34), ('best', 0.33), ('think', 0.31), ('thing', 0.3), ('performance', 0.29), ('want', 0.29), ('life', 0.29), ('actor', 0.28), ('play', 0.28), ('love', 0.27), ('work', 0.27), ('come', 0.27), ('acting', 0.26), ('year', 0.25), ('better', 0.24)]
Topic #2 with weights
[('bond', 1.93), ('james', 0.61), ('action', 0.21), ('love', 0.18), ('girl', 0.11), ('moore', 0.11), ('agent', 0.11), ('favorite', 0.11), ('villain', 0.1), ('best', 0.09), ('woman', 0.09), ('fight', 0.08), ('russia', 0.08), ('series', 0.07), ('kill', 0.07), ('soviet', 0.07), ('grant', 0.07), ('beautiful', 0.07), ('number', 0.06), ('pierce', 0.06), ('daniel', 0.06), ('russian', 0.05), ('bomb', 0.05), ('secret', 0.05), ('excellent', 0.04), ('train', 0.04), ('course', 0.04), ('sequence', 0.04), ('death', 0.04), ('film', 0.04)]
topics_display_list = get_topics_udf(topics, total_topics=2, num_terms=30)

Visualize NMF topics

Instead of using a wordcloud, I have used matplotlib to display the topic terms, sized by their weights.

terms, sizes = getTermsAndSizes(topics_display_list[0])

num_top_words = 30
fontsize_base = 30 / np.max(sizes) # font size for the word with the largest weight in the topic

num_topics = 1

for t in range(num_topics):
    fig, ax = plt.subplots(1, num_topics, figsize=(6, 12))
    plt.ylim(0, num_top_words + 1.0)
    plt.xticks([])
    plt.yticks([])
    plt.title('Topic #{}'.format(t+1))

    for i, (word, share) in enumerate(zip(terms, sizes)):
        word = word + " (" + str(share) + ")"
        plt.text(0.3, num_top_words-i-1.0, word, fontsize=fontsize_base*share)

plt.tight_layout()

[Figure: terms and weights for Topic #1]

terms, sizes = getTermsAndSizes(topics_display_list[1])

num_top_words = 30
fontsize_base = 0.8 * 160 / np.max(sizes) # font size for the word with the largest weight in the topic

num_topics = 1

for t in range(num_topics):
    fig, ax = plt.subplots(1, num_topics, figsize=(16, 30))
    plt.ylim(0, num_top_words + 1.0)
    plt.xticks([])
    plt.yticks([])
    plt.title('Topic #{}'.format(t+2))  # this block plots Topic #2 (topics_display_list[1])

    for i, (word, share) in enumerate(zip(terms, sizes)):
        word = word + " (" + str(share) + ")"
        plt.text(0.3, num_top_words-i-.5, word, fontsize=fontsize_base*share)

plt.tight_layout()

[Figure: terms and weights for Topic #2]

NMF results summary

There are three fundamental questions to ask when subjectively evaluating the NMF results:

  • What is the meaning of each topic? This question addresses topic coherence.
  • How prevalent (common) is each topic in the overall corpus? This question addresses the topic distribution across the corpus.
  • How do the topics relate to each other? This deals with inter-topic distance.

The NMF results show that the model identified the distinctive features of Topic #2 (the James Bond topic) with a lot of confidence. However, it failed to provide coherence for Topic #1. As NMF is a deterministic model, we have no way to adjust probabilities and see how the key terms vary within each topic. For better topic coherence, we can try a probabilistic model like LDA.

Why LDA ?

  • Latent Dirichlet Allocation learns the relationships between words, topics, and documents by assuming documents are generated by a particular probabilistic model.

  • A topic in LDA is a multinomial distribution over the (typically thousands of) terms in the vocabulary of the corpus.

  • To interpret a topic, one typically examines a ranked list of the most probable terms in that topic, using anywhere from three to thirty terms in the list. The problem with interpreting topics this way is that common terms in the corpus often appear near the top of such lists for multiple topics, making it hard to differentiate the meanings of these topics.

  • LDA allows ranking terms for a given topic by both the frequency of the term under that topic and the term’s exclusivity to the topic, which accounts for the degree to which it appears in that particular topic to the exclusion of others. By varying the lambda value, we have the flexibility to rank terms in order of their usefulness for interpreting topics (see the sketch after this list).

  • Thus, for applications in which a human end-user will interact with learned topics, the flexibility and coherence advantages of LDA warrant strong consideration.
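The term ranking mentioned above is the relevance measure of Sievert and Shirley (2014), which pyLDAvis implements: relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). Below is a minimal sketch of that computation, assuming topic_word is a topic-term weight matrix (e.g. a fitted model's components_) and term_counts holds overall corpus term counts; pyLDAvis computes this for you, so this is purely for illustration:

import numpy as np

def relevance(topic_word, term_counts, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))"""
    eps = 1e-12  # guard against log(0)
    p_w_given_t = topic_word / topic_word.sum(axis=1, keepdims=True)  # p(w|t)
    p_w = term_counts / term_counts.sum()                             # p(w)
    return (lam * np.log(p_w_given_t + eps)
            + (1 - lam) * np.log((p_w_given_t + eps) / (p_w + eps)))

# lam=1 ranks purely by within-topic frequency;
# lam=0 ranks purely by exclusivity (lift) to the topic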

Build LDA model using sklearn

from sklearn.decomposition import LatentDirichletAllocation

# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=2, max_iter=100, learning_method='online', random_state=43,
                                     batch_size=128, evaluate_every=-1, n_jobs=-1)

# fit transform the feature matrix
lda_output = lda_model.fit_transform(feature_matrix)

# display the lda_output and its shape
display(lda_output)
display(lda_output.shape)
array([[0.29220839, 0.70779161],
       [0.81608747, 0.18391253],
       [0.08911543, 0.91088457],
       ...,
       [0.68707845, 0.31292155],
       [0.54749114, 0.45250886],
       [0.11440662, 0.88559338]])
(1500, 2)

Diagnose model performance using perplexity and log-likelihood

A model with higher log-likelihood and lower perplexity is preferred.
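The two diagnostics are closely related: sklearn computes perplexity as (roughly) the exponent of the negative per-word log-likelihood, so they generally move together. A rough sanity check of that relation:

# approximate relation: perplexity ~ exp(-log_likelihood / total_word_count)
total_words = feature_matrix.sum()
print(np.exp(-lda_model.score(feature_matrix) / total_words))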

# print log-likelihood
print("Log likelihood: ", lda_model.score(feature_matrix))
Log likelihood:  -969519.4325965648
# print perplexity
print("Perplexity: ", lda_model.perplexity(feature_matrix))
Perplexity:  1329.485684613842

GridSearch the best LDA model

from sklearn.model_selection import GridSearchCV

# Define Search Param
search_params = {'n_components': [2, 3, 4, 5, 10, 15, 20, 25], 'learning_decay': [.5, .7, .9]}

# Init the model
lda = LatentDirichletAllocation()

# Init Grid Search class
model = GridSearchCV(lda, search_params)

model.fit(feature_matrix)
best_lda_model = model.best_estimator_
print("Best model's params: ", model.best_params_)
print("Best log likelihood score: ", model.best_score_)
print("Model perplexity: ", best_lda_model.perplexity(feature_matrix))
Best model's params:  {'learning_decay': 0.5, 'n_components': 2}
Best log likelihood score:  -331679.02058652363
Model perplexity:  1337.0782969491522

Compare the LDA performance score

df_cv_results = pd.DataFrame(model.cv_results_)
df_cv_results.to_csv("LDAGridSearchResults.csv", header=True, index=False, encoding='utf-8')
import seaborn as sns
sns.pointplot(x="param_n_components", y="mean_test_score", hue="param_learning_decay", data=df_cv_results)

[Figure: mean test score vs. n_components for each learning_decay value]

Dominant Topic in each document

# Take the best model
best_lda_model
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.5,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=2, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
# Create a document to topic matrix
lda_output = best_lda_model.transform(feature_matrix)
# column names
topicnames = ['Topic_' + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = ['Doc_' + str(i) for i in range(len(clean_reviews_text))]

# create a dataframe
df_document_topic = pd.DataFrame(np.round(lda_output,2), columns=topicnames, index=docnames)

df_document_topic.head()
Topic_0 Topic_1
Doc_0 0.99 0.01
Doc_1 0.64 0.36
Doc_2 0.62 0.38
Doc_3 0.99 0.01
Doc_4 0.46 0.54
# dominant topic
df_document_topic['dominant_topic'] = np.argmax(df_document_topic.values, axis=1)
df_document_topic.head()
Topic_0 Topic_1 dominant_topic
Doc_0 0.99 0.01 0
Doc_1 0.64 0.36 0
Doc_2 0.62 0.38 0
Doc_3 0.99 0.01 0
Doc_4 0.46 0.54 1
sns.countplot(df_document_topic.dominant_topic)

[Figure: countplot of documents per dominant topic]

Topic 1 is more dominant in the entire corpus.
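For the exact counts behind the plot, an optional one-liner works as well:

# numeric counts of documents per dominant topic
df_document_topic.dominant_topic.value_counts()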

Visualize using pyLDAvis

pyLDAvis visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic.

import pyLDAvis.sklearn
panel = pyLDAvis.sklearn.prepare(best_lda_model, feature_matrix, count_vectorizer, mds='tsne')
pyLDAvis.display(panel)
panel = pyLDAvis.sklearn.prepare(best_lda_model, feature_matrix, count_vectorizer, mds='PCoA')
pyLDAvis.display(panel)

Get each topic’s keywords

# components_ contains the word to topic matrix
best_lda_model.components_.shape
(2, 2516)
# check the shape
feature_matrix.shape
(1500, 2516)
# Topic - Keyword matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

# assign column and index
df_topic_keywords.columns = count_vectorizer.get_feature_names()
df_topic_keywords.index = topicnames


# check the head
df_topic_keywords.iloc[:,:10]
abandon ability able absence absolute absolutely absorb absurd abuse academy
Topic_0 2.854504 20.251906 67.314308 3.665994 21.150957 89.982076 6.728738 6.750981 3.943679 25.818699
Topic_1 16.145496 26.748094 47.685692 11.334006 18.849043 31.017924 10.271262 13.249019 30.056321 10.181301

Get the top 15 keywords from each topic

# Show top n keywords for each topic
def show_topics(vectorizer=count_vectorizer, lda_model=best_lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(count_vectorizer, best_lda_model, 20)
topic_keywords
[array(['like', 'make', 'time', 'good', 'action', 'character', 'really',
        'story', 'movie', 'scene', 'great', 'watch', 'best', 'love',
        'plot', 'people', 'know', 'film', 'think', 'thing'], dtype='<U15'),
 array(['character', 'make', 'like', 'story', 'time', 'film', 'life',
        'know', 'scene', 'people', 'bond', 'play', 'world', 'work', 'year',
        'come', 'best', 'performance', 'director', 'good'], dtype='<U15')]
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
Word 0 Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 Word 11 Word 12 Word 13 Word 14 Word 15 Word 16 Word 17 Word 18 Word 19
Topic 0 like make time good action character really story movie scene great watch best love plot people know film think thing
Topic 1 character make like story time film life know scene people bond play world work year come best performance director good

Predict topics for new text

test_corpus = ['''As a lifelong James Bond enthusiast who has been extremely disappointed with the franchise's
latest efforts (with the exception of Casino Royale), I was extremely pleased with this film. It strayed away
from the storyline of the previous two films and I couldn't have been happier after the mediocrity of Quantum
of Solace. This film has all the constituents from the Bond films that have preceded it. Big explosions, ridiculous
stunts that not a single person in the history of humanity can survive, and let's not forget to mention the beautiful
women that would make both genders stop and stare. So what does Skyfall have that the other Bond films don't? For the
first time, we get a glimpse into our mysterious hero's dark past. Where he came from and what made him the person
he is today. ''',
              '''When I watched this for the first time in over 30 years, I was surprised how little action there was
since I had remembered this as some intense horror movie. Of course, I was young and more impressionable so I guess I
just remembered those few dramatic, sensational scenes such as Janet Leigh murdered in the shower and the quick other
murder at the top of the stairs. Basically, that was about it, action-wise, BUT I have no complaints because the more
I watch this film, the more I like it. It has become my favorite Alfred Hitchcock movie, along with Rear Window.''']
# normalize the corpus
clean_test_corpus = cleanTextBooks(test_corpus)
clean_test_corpus = [' '.join(text) for text in clean_test_corpus]
# vectorize the corpus
test_feature_matrix = count_vectorizer.transform(clean_test_corpus)
# check the shape, it should have same features
test_feature_matrix.shape
(2, 2516)
test_lda_output = best_lda_model.transform(test_feature_matrix)
# column names
test_topicnames = ['Topic_' + str(i) for i in range(best_lda_model.n_components)]

# index names
test_docnames = ['Doc_' + str(i) for i in range(len(clean_test_corpus))]

# create a dataframe
test_df_document_topic = pd.DataFrame(np.round(test_lda_output,2), columns=test_topicnames, index=test_docnames)

# dominant topic
test_df_document_topic['dominant_topic'] = np.argmax(test_df_document_topic.values, axis=1)
test_df_document_topic.head()
Topic_0 Topic_1 dominant_topic
Doc_0 0.34 0.66 1
Doc_1 0.63 0.37 0

Interactive Data Visualization showing relation between Clustering, Sentiment and Topics

df_clusters = pd.read_csv('sentiment_clustering.csv')
df_clusters.head()
movie user_review sentiment predicted_sentiment k2 cluster sentiment_score pc1 pc2
0 The Dark Knight We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... positive positive 1 cluster1 0.017 0.224582 -0.059699
1 The Dark Knight Jim Kunsler is not a regular film critic. He is an accomplished published author and highly educated in arts and literature. I subscribe to his blog, usually for his political writing.There are hardly any film critics around. The newspaper, magazine, internetz,IMDb & TV folks are just reviewers as opposed to critics. They mumble about good or bad, thumbs up, how much money is involved and other nonsense in what is an ignorant parody of Consumer Reports, as if the art of film were a microwave... negative negative 0 cluster0 0.009 -0.085530 0.022846
2 Inception What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... positive positive 1 cluster1 0.021 0.114715 -0.112303
3 Inception This is the worst film I've seen in a long time. I think I can imagine what other people who are raving about the film like; but I can guarantee that the rating of this film will plummet in a year when the slight novelty of the special effects wears off. The "story" here is the story of a role-playing video game, where you get trapped into deeper and deeper levels without knowing what to expect. As you go, the rules change. This is convenient for the writers, who simply make it all up as the... negative negative 0 cluster0 -0.005 -0.077007 -0.028335
4 The Usual Suspects Ah, the Usual Suspects. My personal favorite movie of all time. Don't let my bias be a fool. Perhaps it's not THE best movie ever, but it's one that I never get tired of.If you like flash and bikinis and breath-taking camera angles, you won't find them here. Usual Suspects is not an "epic," and it doesn't pretend to be. It's a modestly-budgeted piece by a fresh director (who later went on to do the X-Men movies, a FAR departure).A great, gritty script, beautifully-acted characters, and what ... positive positive 1 cluster1 0.069 0.028520 -0.033267
# get movie, user_review, sentiment from df_reviews
df_document_topic['movie'] = df_reviews.movie.tolist()
df_document_topic['user_review'] = df_reviews.user_review.tolist()
df_document_topic['sentiment'] = df_reviews.sentiment.tolist()

# filter out neutral
df_document_topic = df_document_topic[df_document_topic.sentiment != 'neutral']

# now append clusters
df_document_topic['cluster'] = df_clusters['cluster'].tolist()

# now append predicted sentiment
df_document_topic['predicted_sentiment'] = df_clusters['predicted_sentiment'].tolist()

# overall sentiment score
df_document_topic['sentiment_score'] = df_clusters['sentiment_score'].tolist()

# pc1 and pc2
df_document_topic['pc1'] = df_clusters['pc1'].tolist()
df_document_topic['pc2'] = df_clusters['pc2'].tolist()

# map the dominant_topic values 0, 1 to the strings 'topic0', 'topic1' (Bokeh's categorical color mapper needs string factors)
df_document_topic['dominant_topic'] = np.where(df_document_topic['dominant_topic'] == 0, 'topic0', 'topic1')
df_document_topic.head()
Topic_0 Topic_1 dominant_topic movie user_review sentiment cluster predicted_sentiment sentiment_score pc1 pc2
Doc_1 0.64 0.36 topic0 The Dark Knight We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... positive cluster1 positive 0.017 0.224582 -0.059699
Doc_2 0.62 0.38 topic0 The Dark Knight Jim Kunsler is not a regular film critic. He is an accomplished published author and highly educated in arts and literature. I subscribe to his blog, usually for his political writing.There are hardly any film critics around. The newspaper, magazine, internetz,IMDb & TV folks are just reviewers as opposed to critics. They mumble about good or bad, thumbs up, how much money is involved and other nonsense in what is an ignorant parody of Consumer Reports, as if the art of film were a microwave... negative cluster0 negative 0.009 -0.085530 0.022846
Doc_4 0.46 0.54 topic1 Inception What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... positive cluster1 positive 0.021 0.114715 -0.112303
Doc_5 0.98 0.02 topic0 Inception This is the worst film I've seen in a long time. I think I can imagine what other people who are raving about the film like; but I can guarantee that the rating of this film will plummet in a year when the slight novelty of the special effects wears off. The "story" here is the story of a role-playing video game, where you get trapped into deeper and deeper levels without knowing what to expect. As you go, the rules change. This is convenient for the writers, who simply make it all up as the... negative cluster0 negative -0.005 -0.077007 -0.028335
Doc_7 0.79 0.21 topic0 The Usual Suspects Ah, the Usual Suspects. My personal favorite movie of all time. Don't let my bias be a fool. Perhaps it's not THE best movie ever, but it's one that I never get tired of.If you like flash and bikinis and breath-taking camera angles, you won't find them here. Usual Suspects is not an "epic," and it doesn't pretend to be. It's a modestly-budgeted piece by a fresh director (who later went on to do the X-Men movies, a FAR departure).A great, gritty script, beautifully-acted characters, and what ... positive cluster1 positive 0.069 0.028520 -0.033267
from bokeh.io import show, output_notebook, push_notebook, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.models import CategoricalColorMapper
from bokeh.layouts import row
from bokeh.layouts import gridplot
output_notebook()

# Make a source and a scatter plot  
source = ColumnDataSource(df_document_topic)

# define plot 1
plot1 = figure(x_axis_label = 'PC 1',
              y_axis_label = 'PC 2', title="Clustering Results",
              width = 500, height = 400)

# add color
palette = ['#FF7373', '#61F2F5']
color_map1 = CategoricalColorMapper(factors=df_document_topic['cluster'].unique(), palette=palette)

plot1.circle(x = 'pc1',
    y = 'pc2',
    source = source,
             color = {'field': 'cluster', 'transform': color_map1},
             legend='cluster', alpha = .8)


# Create a HoverTool object
hover1 = HoverTool(tooltips = [('Movie', '@movie'),
                               ('Dominant_topic', '@dominant_topic'),
                               ('Cluster', '@cluster'),
                               ('Predicted_sentiment', '@predicted_sentiment'),
                               ('Actual_Sentiment', '@sentiment'),
                               ('Sentiment_score', '@sentiment_score')
                              ])
plot1.add_tools(hover1)



### plot2
plot2 = figure(x_axis_label = 'PC 1',
              y_axis_label = 'PC 2', title="Predicted Sentiment",
              width = 500, height = 400)

color_map2 = CategoricalColorMapper(factors=df_document_topic['predicted_sentiment'].unique(), palette=palette)

plot2.circle(x = 'pc1',
    y = 'pc2',
    source = source, color = {'field': 'predicted_sentiment', 'transform': color_map2},
            legend='predicted_sentiment', alpha = .8)



# Create a HoverTool object
hover2 = HoverTool(tooltips = [('Movie', '@movie'),
                               ('Dominant_topic', '@dominant_topic'),
                               ('Cluster', '@cluster'),
                               ('Predicted_sentiment', '@predicted_sentiment'),
                               ('Actual_Sentiment', '@sentiment'),
                               ('Sentiment_score', '@sentiment_score')
                              ])

plot2.add_tools(hover2)



### plot3
plot3 = figure(x_axis_label = 'PC 1',
              y_axis_label = 'PC 2', title="Dominant Topic",
              width = 500, height = 400)

color_map3 = CategoricalColorMapper(factors=df_document_topic['dominant_topic'].unique(), palette=palette)

plot3.circle(x = 'pc1',
    y = 'pc2',
    source = source, color = {'field': 'dominant_topic', 'transform': color_map3},
            legend='dominant_topic', alpha = .8)



# Create a HoverTool object
hover3 = HoverTool(tooltips = [('Movie', '@movie'),
                               ('Dominant_topic', '@dominant_topic'),
                               ('Cluster', '@cluster'),
                               ('Predicted_sentiment', '@predicted_sentiment'),
                               ('Actual_Sentiment', '@sentiment'),
                               ('Sentiment_score', '@sentiment_score')
                              ])

plot3.add_tools(hover3)



### plot4
plot4 = figure(x_axis_label = 'PC 1',
              y_axis_label = 'PC 2', title="Actual Sentiment",
              width = 500, height = 400)

color_map4 = CategoricalColorMapper(factors=df_document_topic['sentiment'].unique(), palette=palette)

plot4.circle(x = 'pc1',
    y = 'pc2',
    source = source, color = {'field': 'sentiment', 'transform': color_map4},
            legend='sentiment', alpha = .8)



# Create a HoverTool object
hover4 = HoverTool(tooltips = [('Movie', '@movie'),
                               ('Dominant_topic', '@dominant_topic'),
                               ('Cluster', '@cluster'),
                               ('Predicted_sentiment', '@predicted_sentiment'),
                               ('Actual_Sentiment', '@sentiment'),
                               ('Sentiment_score', '@sentiment_score')
                              ])

plot4.add_tools(hover4)

# link the ranges
plot2.x_range = plot1.x_range
plot2.y_range = plot1.y_range
plot3.x_range = plot1.x_range
plot3.y_range = plot1.y_range
plot4.x_range = plot1.x_range
plot4.y_range = plot1.y_range

#layout
row2 = [plot4, plot2]
row1 = [plot1, plot3]
layout = gridplot([row1, row2])
<div class="bk-root">
    <a href="https://bokeh.pydata.org" target="_blank" class="bk-logo bk-logo-small bk-logo-notebook"></a>
    <span id="1243">Loading BokehJS ...</span>
</div>
# write the interactive plot to an HTML file and display it
output_file('topic-modeling.html')
show(layout)

Separate topics based on sentiment

In the interactive data visualization above, hovering over an individual movie review shows the sentiment score, predicted sentiment, and actual sentiment for that review, along with the topic it is assigned to.

Calculate the average sentiment score for each dominant topic

Looking at the average sentiment score for each dominant topic does not reveal much by itself. However, since topic 1 has the lower average sentiment score, and more of the negative words belong to topic 1, we can deduce that topic 1 carries a slightly more negative sentiment.

df_document_topic.groupby('dominant_topic')['sentiment_score'].mean()
dominant_topic
topic0    0.029157
topic1    0.017293
Name: sentiment_score, dtype: float64

Appendix

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("Numpy", numpy.__version__)
import pandas; print("Pandas", pandas.__version__)
import seaborn; print("Seaborn", seaborn.__version__)
import matplotlib; print("Matplotlib", matplotlib.__version__)
import nltk; print("NLTK", nltk.__version__)
import requests; print("requests", requests.__version__)
import bs4; print("BeautifulSoup", bs4.__version__)
import re; print("re", re.__version__)
import spacy; print("spacy", spacy.__version__)
import gensim; print("gensim", gensim.__version__)
import bokeh; print("bokeh", bokeh.__version__)
Darwin-17.7.0-x86_64-i386-64bit
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Numpy 1.15.4
Pandas 0.23.3
Seaborn 0.9.0
Matplotlib 2.2.2
NLTK 3.2.5
requests 2.19.1
BeautifulSoup 4.7.1
re 2.2.1
spacy 2.1.4
gensim 3.4.0
bokeh 1.3.4
#####################################
#  Module: normalization2.py
#  Author: Shravan Kuchkula
#  Date: 08/08/2019
#####################################

import re
import pandas as pd
import numpy as np
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn


custom_stopwords = ['movie', 'film', 'review']

# split hyphenated words (e.g. "well-crafted") into separate tokens
def remove_hypens(book_text):
    return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', book_text)

# tokenize text
def tokenize_text(book_text):
    TOKEN_PATTERN = r'\s+'
    regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=True)
    word_tokens = regex_wt.tokenize(book_text)
    return word_tokens

def remove_characters_after_tokenization(tokens):
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # materialize into a list (not a one-shot filter iterator) so the result can be reused
    filtered_tokens = list(filter(None, [pattern.sub('', token) for token in tokens]))
    return filtered_tokens

def convert_to_lowercase(tokens):
    return [token.lower() for token in tokens if token.isalpha()]

def remove_stopwords(tokens, custom_stopwords):
    stopword_list = nltk.corpus.stopwords.words('english')
    stopword_list += custom_stopwords
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

def get_lemma(tokens):
    lemmas = []
    for word in tokens:
        lemma = wn.morphy(word)
        if lemma is None:
            lemmas.append(word)
        else:
            lemmas.append(lemma)
    return lemmas

def remove_short_tokens(tokens):
    return [token for token in tokens if len(token) > 3]

def keep_only_words_in_wordnet(tokens):
    return [token for token in tokens if wn.synsets(token)]

def apply_lemmatize(tokens, wnl=WordNetLemmatizer()):
    return [wnl.lemmatize(token) for token in tokens]

def cleanTextBooks(book_texts):
    clean_books = []
    for book in book_texts:
        book = remove_hypens(book)
        book_i = tokenize_text(book)
        book_i = remove_characters_after_tokenization(book_i)
        book_i = convert_to_lowercase(book_i)
        book_i = remove_stopwords(book_i, custom_stopwords)
        book_i = get_lemma(book_i)
        book_i = remove_short_tokens(book_i)
        book_i = keep_only_words_in_wordnet(book_i)
        book_i = apply_lemmatize(book_i)
        clean_books.append(book_i)
    return clean_books
