Document Clustering

Author: Shravan Kuchkula


Document clustering, or cluster analysis, is an interesting area in NLP and text analytics that applies unsupervised ML concepts and techniques. The premise of document clustering is similar to that of document categorization: you start with a corpus of documents and are tasked with segregating them into groups based on distinctive properties, attributes, and features of the documents. The difference is that document classification needs pre-labeled training data to build a model before it can categorize documents, whereas document clustering uses unsupervised ML algorithms to group the documents into clusters. These clusters are formed such that documents inside one cluster are more similar and related to each other than to documents belonging to other clusters.

To perform document clustering, I will illustrate how to use:

  • K-means, varying the k-value.

  • Hierarchical clustering, varying where we cut the dendrogram.

Here are the high-level steps:

  • Get 1000 user movie reviews that were pickled as part of the Scrape IMDB movie reviews post.
  • Normalize the corpus of text.
  • Vectorize the corpus using TfidfVectorizer.
  • Run K-means clustering by varying the number of clusters.
  • Visualize K-means using PCA.
  • Conduct Silhouette analysis to quantitatively assess the clusters.
  • Run Hierarchical clustering using complete and ward linkage.
  • Plot the dendrogram and cut the tree to create clusters.

Get the data

A total of 1000 movie user reviews were collected in advance and stored as pickle files.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from normalization import *

pd.options.display.max_colwidth=500

import warnings
warnings.filterwarnings('ignore')

#some ipython magic to show the matplotlib plots inline
%matplotlib inline
# get the collection of reviews
# first batch of pickled reviews
df = pd.read_pickle('userReviews.pkl')

# second batch of pickled reviews
df1 = pd.read_pickle('userReviews1.pkl')

# dataframe containing all 1000 reviews
df_reviews = pd.concat([df, df1], ignore_index=True)

# display shape of df_reviews
display(df_reviews.shape)
display(df_reviews.head())

# user reviews
reviews = list(df_reviews.user_review)
(1000, 4)
movie user_review_permalink user_review sentiment
0 The Dark Knight https://www.imdb.com/review/rw1921967/ Many commenters said they were "blown away," so it probably has succeeded in blowing away the box office. I waited until the second week, and had high expectations from the 9 and 10 ratings it was receiving. But, fellow movie/film viewers (and especially great film lovers) ... really!? There's no doubt about the action and action and action in this one, and thus, the special effects. That's the main reason I enjoy such fantasy flicks -- the comic book genre. So, this one does more of it and ... negative
1 The Dark Knight https://www.imdb.com/review/rw1908115/ We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... positive
2 Inception https://www.imdb.com/review/rw2286063/ I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten... negative
3 Inception https://www.imdb.com/review/rw2276780/ What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... positive
4 The Usual Suspects https://www.imdb.com/review/rw0374462/ After a gun fight on the docks leaves only one survivor with the majority dead, NYC agent Dave Kujan flies in to ensure that ex-cop Dean Keaton is really dead. During the questioning the survivor, Verbal Kint, tells of how events came to happen. Five criminals are brought together in a line up and decide to use the events to plan a job. However another survivor tells an extra story – one involving master criminal Kyser Soze. Kint reveals how the gang were forced into the fateful job by S... negative

Normalize user reviews

I have already built an NLP pre-processing pipeline that normalizes the corpus of text; it is captured in the function cleanTextBooks(). Shown below is a snippet of text before and after normalization.

reviews_corpus = cleanTextBooks(reviews)
reviews_corpus = [' '.join(item) for item in reviews_corpus]

display(reviews[0][:200])
display(reviews_corpus[0][:200])
'Many commenters said they were "blown away," so it probably has succeeded in blowing away the box office. I waited until the second week, and had high expectations from the 9 and 10 ratings it was rec'



'many commenters said blown away probably succeeded blowing away box office waited second week high expectation rating receiving fellow moviefilm viewer especially great film lover really there doubt a'
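The cleanTextBooks() function lives in a separate normalization module and is not reproduced here. As a rough sketch of the kind of pipeline it implements (assuming NLTK for stop-word removal and lemmatization; the actual implementation may differ):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_text_sketch(docs):
    """Hypothetical stand-in for cleanTextBooks(): lowercase, strip
    punctuation and digits, drop stop words, lemmatize."""
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    cleaned = []
    for doc in docs:
        doc = re.sub(r'[^a-z\s]', '', doc.lower())  # keep only letters and whitespace
        tokens = [lemmatizer.lemmatize(tok) for tok in doc.split()
                  if tok not in stop_words]
        cleaned.append(tokens)  # one token list per document, matching the join above
    return cleaned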

Vectorize the corpus

We use scikit-learn's TfidfVectorizer to build the feature matrix. Terms that appear in fewer than 20 reviews (min_df=20) or in more than 40% of them (max_df=0.4) are filtered out, which is why far fewer than the max_features cap of 20000 features survive.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=20, max_df=0.4, max_features=20000, ngram_range=(1,1), stop_words='english')
vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.4, max_features=20000, min_df=20,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
# calculate the feature matrix
feature_matrix = vectorizer.fit_transform(reviews_corpus)

# display the feature matrix shape
display(feature_matrix.shape)
(1000, 1235)
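As a quick sanity check (illustrative, not part of the original notebook), we can peek at the highest-weighted tf-idf terms of the first review:

# illustrative: top tf-idf terms of the first review
terms = vectorizer.get_feature_names()
row = feature_matrix[0].toarray().ravel()
top_idx = row.argsort()[::-1][:10]
print([(terms[i], round(float(row[i]), 3)) for i in top_idx])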

K-means clustering

The k-means clustering algorithm is a centroid-based clustering model that tries to partition data into groups of roughly equal variance. The quantity the algorithm minimizes is the within-cluster distortion. Note that scipy's kmeans() reports distortion as the mean (non-squared) Euclidean distance between the observations and their closest centroid, which serves the same purpose as the within-cluster sum of squared errors (SSE).

The kmeans_cluster_terms() function runs the k-means algorithm on the feature matrix with a pre-specified number of clusters, and returns the cluster centers and distortion values. The vq() function assigns the cluster labels.

In addition, this function also returns the key-terms within each cluster. These key-terms can later be used to get a sense of the characteristics of the members of the cluster.

from scipy.cluster.vq import kmeans, vq
import seaborn as sns

def kmeans_cluster_terms(num_clusters, top_n):
    """Performs K-means clustering and returns top_n features in each cluster.

    Args:
        num_cluster: k in k-means.
        top_n: top n features closest to the centroid of each cluster.

    Returns:
        cluster_centers: centroids of each cluster.
        distortion: sum of squares within each cluster.
        key_terms: list of top_n features closest to each centroid.
        labels: cluster assignments
    """
    # Generate cluster centers through the kmeans function
    cluster_centers, distortion = kmeans(feature_matrix.todense(), num_clusters)

    # Generate terms from the tfidf_vectorizer object
    terms = vectorizer.get_feature_names()

    # Display the top_n terms in that cluster
    key_terms = []
    for i in range(num_clusters):
        # Sort the terms and print top_n terms
        center_terms = dict(zip(terms, list(cluster_centers[i])))
        sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
        key_terms.append(sorted_terms[:top_n])

    # label the clusters
    labels, _ = vq(feature_matrix.todense(), cluster_centers, check_finite=True)

    return cluster_centers, distortion, key_terms, labels

The function above returns the centroids, the distortion, the top_n key terms in each cluster, and the assigned labels.

Elbow method

One popular method to determine the number of clusters is the elbow method. The idea is that distortion decreases rapidly at first and then flattens out (like an elbow) as k increases. We visualize this in an elbow plot, a line plot of distortion against k. The value of k where the distortion starts to flatten out is typically regarded as a good starting point for choosing the k-value.

# vary k from 2 to 9
distortions = []
centroids = []
top_10 = []
cluster_labels = []

num_clusters = range(2, 10)

for i in num_clusters:
    cluster_centers, distortion, key_terms, labels = kmeans_cluster_terms(i, 10)

    centroids.append(cluster_centers)
    distortions.append(distortion)
    top_10.append(key_terms)
    cluster_labels.append(labels)

# plot the elbow plot
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,
                               'distortions': distortions})

sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot_data)
plt.show()

[Figure: elbow plot of distortion vs. number of clusters]

This roughly gives an indication that k=4 might be a good place to start with k-means.

Visualize K-means clustering

To aid in visualizing the clusters, we can use dimensionality reduction. I applied PCA and used the first and second principal components to draw the scatterplots.

I have developed two utility functions to help us visualize the clustering:

  • plot_kmeans() : Plots k-means clusters with first two principal components.
  • plot_silhouette() : Plots the silhouette plot.

See the Appendix for the code.

from sklearn.decomposition import PCA
pca = PCA()
components = pca.fit_transform(feature_matrix.todense())

xs, ys = components[:, 0], components[:, 1]
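Since only the first two components are plotted, it is worth checking how much variance they actually capture (an illustrative check, not in the original notebook); for high-dimensional tf-idf data this fraction is typically small, so the 2-D scatterplots should be read as rough summaries:

# fraction of variance captured by the first two principal components
print(pca.explained_variance_ratio_[:2])
print(pca.explained_variance_ratio_[:2].sum())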

With k=4

#set up colors per cluster using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a'}

#set up cluster names using a dict
cluster_names = {0: ', '.join(top_10[2][0]),
                 1: ', '.join(top_10[2][1]),
                 2: ', '.join(top_10[2][2]),
                 3: ', '.join(top_10[2][3])}

# get the cluster labels (index 2 corresponds to k=4, since num_clusters = range(2, 10))
labels_four = list(cluster_labels[2])
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)

ax1 = plot_kmeans(4, xs, ys, labels_four, cluster_names, cluster_colors, fig, ax1, 250)
ax2 = plot_silhouette(4, feature_matrix.todense(), labels_four, cluster_colors, fig, ax2)

plt.suptitle("Clustering user movie reviews using K-means with K=4", fontsize=14, fontweight='bold')
plt.show()
For n_clusters = 4 The average silhouette_score is : 0.007206153319679833

[Figure: K-means with k=4, PCA scatterplot (left) and silhouette plot (right)]

With k=2

#set up colors per cluster using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02'}

#set up cluster names using a dict
cluster_names = {0: ', '.join(top_10[0][0]),
                 1: ', '.join(top_10[0][1])}

# get the cluster labels (index 0 corresponds to k=2)
labels_two = list(cluster_labels[0])
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)

ax1 = plot_kmeans(2, xs, ys, labels_two, cluster_names, cluster_colors, fig, ax1)
ax2 = plot_silhouette(2, feature_matrix.todense(), labels_two, cluster_colors, fig, ax2)

plt.suptitle("Clustering user movie reviews using K-means with K=2", fontsize=14, fontweight='bold')
plt.show()
For n_clusters = 2 The average silhouette_score is : 0.004741655297679337

[Figure: K-means with k=2, PCA scatterplot (left) and silhouette plot (right)]

With k=3

#set up colors per cluster using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3'}

#set up cluster names using a dict
cluster_names = {0: ', '.join(top_10[1][0]),
                 1: ', '.join(top_10[1][1]),
                 2: ', '.join(top_10[1][2])}

# get the cluster labels (index 1 corresponds to k=3)
labels_three = list(cluster_labels[1])
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)

ax1 = plot_kmeans(3, xs, ys, labels_three, cluster_names, cluster_colors, fig, ax1)
ax2 = plot_silhouette(3, feature_matrix.todense(), labels_three, cluster_colors, fig, ax2)

plt.suptitle("Clustering user movie reviews using K-means with K=3", fontsize=14, fontweight='bold')
plt.show()
For n_clusters = 3 The average silhouette_score is : 0.004485589168121894

[Figure: K-means with k=3, PCA scatterplot (left) and silhouette plot (right)]

The figures on the left show scatter plots of the first and second principal components, color-coded by cluster. The top 10 terms from each cluster are used as the name of that cluster. For instance, when K=4, the terms bond, james, action, sean, really, dr, love, best, villain, spy label one cluster of points, shown in green. Rather than assessing the clustering visually, we can rely on a quantitative method called silhouette analysis.

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].

Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample might have been assigned to the wrong cluster.

In our case, the silhouette plots show that none of the k-values produce good results.
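The averages quoted in the figures are computed with sklearn's silhouette_score (the full plotting code is in the Appendix). Reproducing the k=4 number directly is a one-liner; this mirrors the computation inside plot_silhouette():

from sklearn.metrics import silhouette_score

# average silhouette coefficient for the k=4 assignment
print(silhouette_score(feature_matrix.todense(), np.asarray(labels_four)))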

Hierarchical Clustering

The hierarchical clustering family of algorithms tries to build a nested hierarchy of clusters by either merging or splitting them in succession. There are two main strategies:

Agglomerative (AGNES): These algorithms follow a bottom-up approach where initially all data points belong to their own individual cluster, and then from this bottom layer, we start merging clusters together, building a hierarchy of clusters as we go up.

Divisive (DIANA): These algorithms follow a top-down approach where initially all the data points belong to a single huge cluster and then we start recursively dividing them up as we move down gradually, and this produces a hierarchy of clusters going from the top-down.

I will be using the agglomerative hierarchical clustering algorithm.

In agglomerative clustering, to decide which clusters to combine when starting from the individual data-point clusters, we need two things:

  • Distance Metric: A distance metric to measure the degree of similarity or dissimilarity between data points. We will be using cosine distance/similarity in our implementation.

  • Linkage Criterion: A linkage criterion that determines the metric used in the merging strategy of clusters. We will be using the complete and Ward linkage methods here.

With complete linkage

# Import cosine_similarity to compute pairwise review similarity
from sklearn.metrics.pairwise import cosine_similarity

# Convert cosine similarity into a distance matrix
similarity_distance = 1 - cosine_similarity(feature_matrix)

# Import modules necessary to plot dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

# Create mergings matrix
mergings = linkage(similarity_distance, method='complete')
# Plot the dendrogram, using the movie title as the leaf label
dendrogram_ = dendrogram(mergings, orientation='top',
               labels=[x for x in df_reviews["movie"]],
               leaf_font_size=16
)

# Adjust the plot
fig = plt.gcf()
_ = [lbl.set_color('r') for lbl in plt.gca().get_xmajorticklabels()]
fig.set_size_inches(108, 50)

# Show the plotted dendrogram
plt.show()

[Figure: dendrogram, complete linkage]
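One caveat worth flagging: scipy's linkage() interprets a 2-D input array as raw observation vectors, not as a precomputed distance matrix, so the call above effectively clusters the rows of the distance matrix in Euclidean space. To feed the precomputed cosine distances to linkage() directly, the square matrix would be condensed first; a sketch of that variant (not what this notebook does, and the same caveat applies to the ward() call further below):

from scipy.spatial.distance import squareform

# linkage() accepts raw observations or a *condensed* distance vector;
# squareform converts the square cosine-distance matrix into the latter.
# checks=False tolerates tiny floating-point asymmetries.
condensed_distance = squareform(similarity_distance, checks=False)
mergings_precomputed = linkage(condensed_distance, method='complete')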

from scipy.cluster.hierarchy import fcluster

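# cut the dendrogram so that at most k flat clusters are formed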
k = 9
h_link_cluster_labels = fcluster(mergings, k, criterion='maxclust')

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)


ax2 = sns.scatterplot(x = xs, y=ys, hue=h_link_cluster_labels, palette="Set2",
                      alpha=0.8, legend="full")
ax2.set_title("Hierarchical clustering using complete linkage with 9 clusters")
ax2.set_xlabel("First principal component")
ax2.set_ylabel("Second principal component")

ax1 = plot_silhouette_hierarchical(9, feature_matrix.todense(), h_link_cluster_labels, fig, ax1)

plt.suptitle("Clustering user movie reviews using AGNES complete linkage with 9 clusters",
             fontsize=14, fontweight='bold')
plt.show()
For n_clusters = 9 The average silhouette_score is : -0.011373618013564598

[Figure: complete linkage with 9 clusters, silhouette plot (left) and PCA scatterplot (right)]

With ward linkage

from scipy.cluster.hierarchy import ward

linkage_matrix = ward(similarity_distance)

# Plot the dendrogram, using the movie title as the leaf label
dendrogram_ = dendrogram(linkage_matrix,
               labels=[x for x in df_reviews["movie"]],
               leaf_rotation=90,
               leaf_font_size=16,
)

# Adjust the plot
fig = plt.gcf()
_ = [lbl.set_color('r') for lbl in plt.gca().get_xmajorticklabels()]
fig.set_size_inches(108, 21)

plt.show()

[Figure: dendrogram, ward linkage]

k = 3
h_ward_cluster_labels = fcluster(linkage_matrix, k, criterion='maxclust')

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)


ax2 = sns.scatterplot(x = xs, y=ys, hue=h_ward_cluster_labels, palette="Set2",
                      alpha=0.8, legend="full")
ax2.set_title("Hierarchical clustering using ward linkage with 3 clusters")
ax2.set_xlabel("First principal component")
ax2.set_ylabel("Second principal component")

ax1 = plot_silhouette_hierarchical(3, feature_matrix.todense(), h_ward_cluster_labels, fig, ax1)

plt.suptitle("Clustering user movie reviews using AGNES ward linkage with 3 clusters",
             fontsize=14, fontweight='bold')
plt.show()
For n_clusters = 3 The average silhouette_score is : 0.0040936865567986445

[Figure: ward linkage with 3 clusters, silhouette plot (left) and PCA scatterplot (right)]

Key characteristics that make up each cluster

When using k-means, we get back the centroid of each cluster. Using the centroids, we can extract the top 10 features of each cluster. Shown below are the top terms in each cluster; based on these, we can get some sense of the members of the cluster. For instance, when K=2, we have the following two clusters:

  • cluster 1: ['bond', 'james', 'action', 'really', 'sean', 'dr', 'love', 'villain', 'best', 'spy']
  • cluster 2: ['action', 'great', 'scene', 'best', 'really', 'people', 'way', 'performance', 'plot', 'dont']

Since we are focused on the thriller genre, we expect to see James Bond movies. It appears that movie reviews in cluster 1 have the characteristics of a Bond movie: action, spy, villain, etc.
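Extracting these terms is straightforward once the centroids are available: each centroid is a vector over the tf-idf vocabulary, so its top terms are just the highest-weighted dimensions. A condensed version of what kmeans_cluster_terms() does internally:

# top terms of a centroid = vocabulary indices with the largest weights
terms = np.array(vectorizer.get_feature_names())
centroid = np.asarray(centroids[0])[0]  # first centroid of the k=2 solution
print(terms[np.argsort(centroid)[::-1][:10]])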

Shown below are the key terms within each cluster for every value of K from 2 to 9.

cluster_range = range(2,10)
for k, terms in list(zip(cluster_range, top_10)):
    print("-"*100)
    print("For K: " + str(k) + " clusters, the top terms are: ")
    print(terms)
    print("-"*100)
----------------------------------------------------------------------------------------------------
For K: 2 clusters, the top terms are:
[['bond', 'james', 'action', 'really', 'sean', 'dr', 'love', 'villain', 'best', 'spy'], ['action', 'great', 'scene', 'best', 'really', 'people', 'way', 'performance', 'plot', 'dont']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 3 clusters, the top terms are:
[['people', 'really', 'scene', 'dont', 'plot', 'end', 'know', 'think', 'thing', 'way'], ['action', 'bond', 'love', 'best', 'james', 'scene', 'great', 'really', 'favorite', 'hard'], ['great', 'performance', 'best', 'role', 'work', 'actor', 'life', 'world', 'scene', 'way']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 4 clusters, the top terms are:
[['bond', 'james', 'action', 'sean', 'really', 'dr', 'love', 'best', 'villain', 'spy'], ['people', 'really', 'dont', 'thing', 'scene', 'way', 'know', 'life', 'plot', 'think'], ['performance', 'great', 'best', 'role', 'actor', 'scene', 'work', 'director', 'man', 'cast'], ['action', 'scene', 'great', 'best', 'love', 'plot', 'really', 'hard', 'favorite', 'watch']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 5 clusters, the top terms are:
[['people', 'thing', 'know', 'dont', 'really', 'way', 'think', 'scene', 'watch', 'point'], ['performance', 'best', 'work', 'life', 'scene', 'man', 'role', 'director', 'great', 'actor'], ['great', 'really', 'acting', 'seen', 'book', 'scene', 'actor', 'plot', 'say', 'best'], ['bond', 'james', 'action', 'sean', 'really', 'dr', 'love', 'best', 'villain', 'spy'], ['action', 'love', 'best', 'great', 'scene', 'hard', 'favorite', 'fast', 'die', 'really']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 6 clusters, the top terms are:
[['really', 'people', 'thing', 'think', 'dont', 'know', 'scene', 'end', 'way', 'watch'], ['bond', 'james', 'action', 'sean', 'really', 'dr', 'love', 'best', 'villain', 'spy'], ['life', 'man', 'performance', 'work', 'world', 'play', 'way', 'people', 'best', 'role'], ['book', 'im', 'read', 'scene', 'didnt', 'say', 'great', 'plot', 'fan', 'ive'], ['action', 'love', 'best', 'great', 'scene', 'hard', 'favorite', 'really', 'fast', 'plot'], ['best', 'great', 'performance', 'acting', 'seen', 'actor', 'scene', 'plot', 'cinema', 'director']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 7 clusters, the top terms are:
[['scene', 'plot', 'hollywood', 'thriller', 'ending', 'script', 'acting', 'director', 'end', 'action'], ['great', 'really', 'best', 'seen', 'acting', 'performance', 'actor', 'ive', 'think', 'role'], ['life', 'man', 'world', 'performance', 'best', 'work', 'way', 'people', 'real', 'play'], ['action', 'love', 'best', 'scene', 'hard', 'fast', 'great', 'favorite', 'die', 'john'], ['bond', 'james', 'action', 'sean', 'really', 'dr', 'love', 'best', 'villain', 'spy'], ['dont', 'people', 'know', 'really', 'thing', 'bad', 'guy', 'hour', 'think', 'watch'], ['book', 'read', 'novel', 'way', 'plot', 'scene', 'great', 'say', 'know', 'based']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 8 clusters, the top terms are:
[['hitchcock', 'couple', 'suspense', 'love', 'murder', 'best', 'plot', 'script', 'james', 'thriller'], ['great', 'twist', 'seen', 'amazing', 'shot', 'director', 'tarantino', 'quite', 'dont', 'work'], ['action', 'scene', 'best', 'love', 'great', 'hard', 'plot', 'favorite', 'john', 'harry'], ['bond', 'james', 'action', 'sean', 'really', 'dr', 'love', 'best', 'series', 'villain'], ['man', 'life', 'people', 'performance', 'world', 'best', 'role', 'work', 'way', 'scene'], ['really', 'dont', 'scene', 'people', 'think', 'know', 'end', 'plot', 'thing', 'watch'], ['horror', 'review', 'im', 'seen', 'great', 'ive', 'read', 'scare', 'year', 'scary'], ['performance', 'best', 'acting', 'superb', 'great', 'scene', 'actor', 'script', 'year', 'role']]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
For K: 9 clusters, the top terms are:
[['know', 'dont', 'people', 'really', 'guy', 'think', 'thing', 'end', 'way', 'bad'], ['world', 'work', 'performance', 'way', 'life', 'people', 'great', 'really', 'point', 'reality'], ['great', 'best', 'performance', 'seen', 'actor', 'acting', 'ive', 'really', 'amazing', 'say'], ['horror', 'sense', 'scare', 'girl', 'dead', 'scary', 'classic', 'dont', 'gore', 'people'], ['review', 'im', 'bad', 'son', 'great', 'year', 'performance', 'acting', 'number', 'life'], ['man', 'life', 'wife', 'role', 'play', 'woman', 'john', 'way', 'performance', 'great'], ['scene', 'book', 'plot', 'little', 'action', 'work', 'bit', 'director', 'end', 'really'], ['action', 'love', 'best', 'scene', 'hard', 'great', 'die', 'favorite', 'lot', 'really'], ['bond', 'james', 'action', 'sean', 'really', 'dr', 'love', 'best', 'villain', 'spy']]
----------------------------------------------------------------------------------------------------

Conclusion

I would prefer k-means with 2 clusters over hierarchical clustering for this particular problem. Based on the results observed above, it looks like there are only two distinct clusters: increasing the number of clusters decreases the average silhouette value, which indicates that the resulting clusters are not well separated. Additionally, k-means is a lot faster than hierarchical clustering, and with k-means we can also retrieve the key features of each cluster directly from the centroids, something hierarchical clustering does not provide out of the box.

Appendix

Environment

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("Numpy", numpy.__version__)
import pandas; print("Pandas", pandas.__version__)
import seaborn; print("Seaborn", seaborn.__version__)
import matplotlib; print("Matplotlib", matplotlib.__version__)
import nltk; print("NLTK", nltk.__version__)
import requests; print("requests", requests.__version__)
import bs4; print("BeautifulSoup", bs4.__version__)
import re; print("re", re.__version__)
import spacy; print("spacy", spacy.__version__)
import gensim; print("gensim", gensim.__version__)
Darwin-17.7.0-x86_64-i386-64bit
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Numpy 1.15.4
Pandas 0.23.3
Seaborn 0.9.0
Matplotlib 2.2.2
NLTK 3.2.5
requests 2.19.1
BeautifulSoup 4.7.1
re 2.2.1
spacy 2.1.4
gensim 3.4.0

Code

from sklearn.metrics import silhouette_samples, silhouette_score

def plot_kmeans(k, xs, ys, labels, cluster_names, cluster_colors, fig, ax, num_points=None):
    """Plots k-means clusters with the first two principal components.

    Args:
        k: number of clusters
        xs: first principal component
        ys: second principal component
        labels: cluster labels assigned by the k-means algorithm
        cluster_names: top_10 features around each centroid
        cluster_colors: dictionary of pre-established colors
        fig: figure object to display in a single row
        ax: axes object to draw on
        num_points: number of observations to display (defaults to all)

    Returns:
        ax: axes object
    """
    # default to displaying every observation
    if num_points is None:
        num_points = len(xs)

    #create data frame that has the result of the PCA plus the cluster numbers
    df = pd.DataFrame(dict(x=xs[:num_points], y=ys[:num_points], label=labels[:num_points],
                           title=labels[:num_points]))

    #group by cluster
    groups = df.groupby('label')

    # set up plot
    #fig, ax = plt.subplots(figsize=(17, 9)) # set size
    ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling


    #iterate through groups to layer the plot
    #note that I use the cluster_name and cluster_color dicts
    #with the 'name' lookup to return the appropriate color/label
    for name, group in groups:
        ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
                label=cluster_names[name], color=cluster_colors[name],
                mec='none')
        ax.set_aspect('auto')
        ax.tick_params(
            axis='x',          # changes apply to the x-axis
            which='both',      # both major and minor ticks are affected
            bottom='off',      # ticks along the bottom edge are off
            top='off',         # ticks along the top edge are off
            labelbottom='off')
        ax.tick_params(
            axis='y',          # changes apply to the y-axis
            which='both',      # both major and minor ticks are affected
            left='off',        # ticks along the left edge are off
            right='off',       # ticks along the right edge are off
            labelleft='off')

    ax.legend(numpoints=1)  #show legend with only 1 point

    #add label in x,y position with the label as the film title
    #for i in range(len(df)):
        #ax.text(df.ix[i]['x'], df.ix[i]['y'], df.ix[i]['title'], size=10)


    ax.set_title("K-means with " + str(k) + " clusters showing " \
                 + str(num_points) + " movie reviews")
    ax.set_xlabel("first principal component")
    ax.set_ylabel("second principal component")

    return ax

def plot_silhouette(n_clusters, X, labels, cluster_colors, fig, ax):
    """Plots the silhouette plot
    Args:
        n_clusters: number of clusters
        X: dense tf-idf feature matrix, the same one used to fit the model
        labels: clusters guessed by the algorithm
        cluster_colors: dict of cluster colors - same colors used in scatterplot.
        fig: figure object to display in a single row
        ax: axes object to display in a single row

    Returns:
        ax: axes object
    """

    # set up plot
    # Create a subplot with 1 row and 2 columns
    #fig, ax = plt.subplots(figsize=(15, 9)) # set size

    # may need to set ax.set_xlim ax.set_ylim

    # get predicted cluster labels for k
    # convert to ndarray as fillcolor needs it that way.
    cluster_labels = np.asarray(labels)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
            "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    color = tuple(cluster_colors.values())
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        #print(ith_cluster_silhouette_values)

        #color = cm.nipy_spectral(float(i) / n_clusters)
        #color = tuple(cluster_colors.values())
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                            0, ith_cluster_silhouette_values,
                            facecolor=color[i], edgecolor=color[i], alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax.text(-0.02, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax.set_title("The silhouette plot for the various clusters.")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax.set_yticks([])  # Clear the yaxis labels / ticks
    ax.set_xticks([])

    return ax



import matplotlib.cm as cm
def plot_silhouette_hierarchical(n_clusters, X, labels, fig, ax):
    """Plots the silhouette plot
    Args:
        n_clusters: number of clusters
        X: dense tf-idf feature matrix, the same one used to fit the model
        labels: clusters guessed by the algorithm
        fig: figure object to display in a single row
        ax: axes object to display in a single row

    Returns:
        ax: axes object
    """

    # set up plot
    # Create a subplot with 1 row and 2 columns
    #fig, ax = plt.subplots(figsize=(15, 9)) # set size

    # may need to set ax.set_xlim ax.set_ylim

    # get predicted cluster labels for k
    # convert to ndarray as fillcolor needs it that way.
    cluster_labels = np.asarray(labels)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
            "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        #print(ith_cluster_silhouette_values)

        # color each cluster's silhouette band distinctly
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                            0, ith_cluster_silhouette_values,
                            facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax.set_title("The silhouette plot for the various clusters.")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax.set_yticks([])  # Clear the yaxis labels / ticks
    ax.set_xticks([])

    return ax
