Document Similarity

Author: Shravan Kuchkula

Document Similarity

“Two documents are similar if their vectors are similar”.


To illustrate the concept of text/term/document similarity, I will use Amazon’s book search to construct a corpus of documents. Suppose we searched for “Natural Language Processing” and got back several book titles. We can then manually collect these titles and store them in a list. Shown below is the list of titles, in the order in which the website returned the results; this will serve as our corpus.

Goal: The goal here is to show how we can leverage NLP to semantically compare documents.

Concepts: The following concepts are discussed here:

  • Normalizing a corpus of text.
  • Vectorizing a corpus of text using TfidfVectorizer.
  • Calculating the cosine similarity between documents/vectors.
  • Plotting cosine similarity using a heatmap.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
books = [
    "Natural Language Processing in Action: Understanding, analyzing, and generating text with Python",
    "Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit",
    "Neural Network Methods for Natural Language Processing (Synthesis Lectures on Human Language Technologies)",
    "Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning",
    "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning",
    "Natural Language Processing with TensorFlow: Teach language to machines using Python's deep learning library",
    "Speech and Language Processing, 2nd Edition",
    "Foundations of Statistical Natural Language Processing",
    "Natural Language Processing Fundamentals: Build intelligent applications that can interpret the human language to deliver impactful results",
    "Deep Learning for Natural Language Processing",
    "Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras",
    "Hands-On Unsupervised Learning Using Python: How to Build Applied Machine Learning Solutions from Unlabeled Data",
    "The Handbook of Computational Linguistics and Natural Language Processing",
    "Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems",
    "Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning using Python",
    "Deep Learning in Natural Language Processing",
    "Python Natural Language Processing: Advanced machine learning and deep learning techniques for natural language processing",
    "Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications",
    "Natural Language Processing: A Quick Introduction to NLP with Python and NLTK (Step-by-Step Tutorial for Beginners)",
    "Python Deep learning: Develop your first Neural Network in Python Using TensorFlow, Keras, and PyTorch (Step-by-Step Tutorial for Beginners)",
    "Deep Learning for Natural Language Processing: Creating Neural Networks with Python",
    "Introduction to Natural Language Processing (Adaptive Computation and Machine Learning series)",
    "Deep Learning for Natural Language Processing: Solve your natural language processing problems with smart deep neural networks",
    "Biomedical Natural Language Processing"
]

num_books = len(books)
print("A total of " + str(num_books) + " books have been collected")
A total of 24 books have been collected

In order to do a pairwise comparison of book titles, I first build a dictionary with book_id as the key and the book title as the value, and then construct a list of all possible pairs. The cosine similarity is calculated for each of these pairs.

# label books as book_1, book_2 .. book_n
bookids = ["book_" + str(i) for i in range(num_books)]

# create a dictionary
book_dict = dict(zip(bookids, books))

# get all the book ids in a list
ids = list(book_dict.keys())

# create all possible pairs
pairs = []
# create a list of tuples
for i, book_id in enumerate(ids):
    for j in ids[i+1:]:
        pairs.append((book_id, j))

print("There are a total of " + str(len(pairs)) + " pairs")
print("Displaying first 10 pairs: ")
pairs[:10]
There are a total of 276 pairs
Displaying first 10 pairs:

[('book_0', 'book_1'),
 ('book_0', 'book_2'),
 ('book_0', 'book_3'),
 ('book_0', 'book_4'),
 ('book_0', 'book_5'),
 ('book_0', 'book_6'),
 ('book_0', 'book_7'),
 ('book_0', 'book_8'),
 ('book_0', 'book_9'),
 ('book_0', 'book_10')]

print("Displaying last 10 pairs: ")
pairs[-10:]
Displaying last 10 pairs:

[('book_19', 'book_20'),
 ('book_19', 'book_21'),
 ('book_19', 'book_22'),
 ('book_19', 'book_23'),
 ('book_20', 'book_21'),
 ('book_20', 'book_22'),
 ('book_20', 'book_23'),
 ('book_21', 'book_22'),
 ('book_21', 'book_23'),
 ('book_22', 'book_23')]
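As an aside, the nested loop used to build the pairs can be written more compactly with itertools.combinations from the standard library, which yields the same 276 unordered pairs (a sketch, not the code used above):

```python
from itertools import combinations

# book ids for the 24-book corpus, matching the loop above
ids = ["book_" + str(i) for i in range(24)]

# all unordered pairs (i < j), equivalent to the nested for loops
pairs = list(combinations(ids, 2))

print(len(pairs))   # 276, i.e. 24 * 23 / 2
print(pairs[0])     # ('book_0', 'book_1')
```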

For each of these pairs, we will calculate the cosine similarity.

Calculating cosine similarity

The process for calculating cosine similarity can be summarized as follows:

  • Normalize the corpus of documents.
  • Vectorize the corpus of documents.
  • Take a dot product of the pairs of documents.
  • Plot a heatmap to visualize the similarity.
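These steps can be sketched end to end with scikit-learn alone; here is a minimal version on a toy three-document corpus (not the book titles), using sklearn's cosine_similarity helper in place of the manual dot product computed later in this post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus standing in for the normalized book titles
docs = [
    "deep learning natural language processing",
    "natural language processing python",
    "hands on machine learning",
]

# vectorize, then compute all pairwise cosine similarities at once
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)

print(sim.shape)   # (3, 3); diagonal entries are 1.0
```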

To normalize the corpus, I make use of the normalization module, which contains functions that tokenize and normalize a list of documents (its source is listed at the end of this post).

To vectorize the corpus, I make use of TfidfVectorizer.

To calculate the cosine similarity between pairs in the corpus, I first extract the feature vectors of the pairs and then compute their dot product.

Finally, I have plotted a heatmap of the cosine similarity scores to visually assess which two documents are most similar and most dissimilar to each other.

The below sections of code illustrate this:

Normalize the corpus of documents

from normalization import *

# cleanTextBooks takes a list of strings and returns a list of lists
corpus = cleanTextBooks(books)

# convert list of lists into a list of strings
norm_book_corpus = [' '.join(text) for text in corpus]

# display normalized corpus
norm_book_corpus
['natural language processing action understanding analyzing generating text python',
 'natural language processing python analyzing text natural language toolkit',
 'neural network method natural language processing synthesis lecture human language technology',
 'natural language processing pytorch build intelligent language application using deep learning',
 'applied text analysis python enabling languageaware data product machine learning',
 'natural language processing tensorflow teach language machine using python deep learning library',
 'speech language processing edition',
 'foundation statistical natural language processing',
 'natural language processing fundamental build intelligent application interpret human language deliver impactful result',
 'deep learning natural language processing',
 'natural language processing computational linguistics practical guide text analysis python gensim spacy kera',
 'handson unsupervised learning using python build applied machine learning solution unlabeled data',
 'handbook computational linguistics natural language processing',
 'handson machine learning scikitlearn tensorflow concept tool technique build intelligent system',
 'natural language processing recipe unlocking text data machine learning deep learning using python',
 'deep learning natural language processing',
 'python natural language processing advanced machine learning deep learning technique natural language processing',
 'natural language annotation machine learning guide corpusbuilding application',
 'natural language processing quick introduction nlp python nltk stepbystep tutorial beginner',
 'python deep learning develop first neural network python using tensorflow kera pytorch stepbystep tutorial beginner',
 'deep learning natural language processing creating neural network python',
 'introduction natural language processing adaptive computation machine learning series',
 'deep learning natural language processing solve natural language processing problem smart deep neural network',
 'biomedical natural language processing']

Vectorize the corpus of documents

The TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
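As a quick illustration on a toy two-document corpus (not the book titles), the vectorizer learns one column per unique token and returns one row per document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# two tiny documents sharing the token "language"
docs = ["natural language processing", "deep learning for language"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# one row per document, one column per unique token
print(X.shape)                          # (2, 6)
print(sorted(vec.vocabulary_.keys()))   # the 6 learned tokens
```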

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1,1))
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

# calculate the feature matrix
feature_matrix = vectorizer.fit_transform(norm_book_corpus).astype(float)

# display the shape of the feature matrix
feature_matrix.shape

# display the first feature vector
feature_matrix[0]

# display the dense version of the feature vector
feature_matrix[0].toarray()[0]

# display the first document text
norm_book_corpus[0]
(24, 82)

<1x82 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

array([0.4540467 , 0.        , 0.        , 0.        , 0.40183052,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.4540467 , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.15123435, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.15751759,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.15751759, 0.        , 0.22330221, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.31256636, 0.        ,
       0.        , 0.        , 0.4540467 , 0.        , 0.        ,
       0.        , 0.        ])


'natural language processing action understanding analyzing generating text python'

When we run the fit_transform function on the normalized corpus, we get back a feature matrix. The dimensions of the feature matrix are (24, 82): 24 documents and 82 unique words/tokens. Displaying the first feature vector (i.e., for the first document), we can see that it is a sparse matrix with 1x82 dimensions. The toarray function converts the sparse matrix into a dense feature vector.

The above feature vector represents TF-IDF vector of the document “natural language processing action understanding analyzing generating text python”. The benefit of converting this document into a vector is that we can now use dot product to calculate the cosine similarity. Moreover, representing a document in vector format opens up the possibility to use many other mathematical models which operate on numeric data.
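Because TfidfVectorizer L2-normalizes each row by default (norm='l2'), the dot product of two rows is exactly their cosine similarity. A small check of this claim, on a toy corpus rather than the book titles:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["natural language processing", "natural language toolkit"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# rows are unit vectors, so the dot product is the cosine of the angle
manual = np.dot(X[0], X[1])
reference = cosine_similarity(X)[0, 1]

print(np.isclose(manual, reference))   # True
```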

Take a dot product of the pairs of documents.

def compute_cosine_similarity(pair):

    # extract the indexes from the pair
    book1, book2 = pair

    # split on _ and get index
    book1_index = int(book1.split("_")[1])
    book2_index = int(book2.split("_")[1])

    # get the dense feature vector of each document
    book1_fm = feature_matrix.toarray()[book1_index]
    book2_fm = feature_matrix.toarray()[book2_index]

    # compute cosine similarity manually via a dot product
    # (TfidfVectorizer L2-normalizes each row, so the dot product is the cosine)
    manual_cosine_similarity = np.dot(book1_fm, book2_fm)

    return manual_cosine_similarity

pairwise_cosine_similarity = [compute_cosine_similarity(pair) for pair in pairs]

# create a dataframe
df = pd.DataFrame({'pair': pairs, 'similarity': pairwise_cosine_similarity})

# display the first 5 rows
df.head()
                  pair  similarity
0     (book_0, book_1)    0.502098
1     (book_0, book_2)    0.081986
2     (book_0, book_3)    0.101772
3     (book_0, book_4)    0.128064
4     (book_0, book_5)    0.145188

# display the last 5 rows
df.tail()
                   pair  similarity
271  (book_20, book_22)    0.551686
272  (book_20, book_23)    0.169933
273  (book_21, book_22)    0.156997
274  (book_21, book_23)    0.140307
275  (book_22, book_23)    0.230840

Plot a heatmap of cosine similarity values

from utils import plot_heatmap

# initialize an empty dataframe grid
df_hm = pd.DataFrame({'ind': range(24), 'cols': range(24), 'vals': pd.Series(np.zeros(24))})

# convert to a matrix
df_hm = df_hm.pivot(index='ind', columns='cols').fillna(0)

# make a copy
df_temp = df.copy()

# convert list of tuples into 2 lists
list1 = []
list2 = []
for item1, item2 in df_temp.pair:
    list1.append(item1)
    list2.append(item2)

# add two columns to df_temp
df_temp['book1'] = list1
df_temp['book2'] = list2

# drop the pair as it not needed
df_temp.drop('pair', axis=1, inplace=True)

# extract index so that you can construct pairs
df_temp['book1'] = df_temp['book1'].apply(lambda x: int(x.split('_')[-1]))
df_temp['book2'] = df_temp['book2'].apply(lambda x: int(x.split('_')[-1]))

# create tuples (0, 1, similarity)
df_temp['pairs'] = list(zip(df_temp.book1, df_temp.book2, round(df_temp.similarity, 2)))

# display(df_temp.head())

# to get the lower diagonal, swap the rows and cols.
for row, col, similarity in df_temp.pairs:
    df_hm.iloc[col, row] = similarity

ax = plot_heatmap(df_hm, ids, ids)


From the above heatmap, we can see that the most similar documents are book_9 and book_15, whereas the most dissimilar documents are the ones with a similarity score of 0.0. One example of a pair with no similarity is book_0 and book_13. Shown below are the titles of these books.

# display books which are most similar and least similar
df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
                  pair  similarity
176  (book_9, book_15)         1.0
12   (book_0, book_13)         0.0
print("Most similar books are: ")
print(book_dict['book_9'])
print(book_dict['book_15'])
Most similar books are:
Deep Learning for Natural Language Processing
Deep Learning in Natural Language Processing
print("Most dissimilar books are: ")
print(book_dict['book_0'])
print(book_dict['book_13'])
Most dissimilar books are:
Natural Language Processing in Action: Understanding, analyzing, and generating text with Python
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems



#  Module:
#  Author: Shravan Kuchkula
#  Date: 07/19/2019

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_heatmap(df_hm, xlabels, ylabels):
    """Given a dataframe containing a similarity grid, plot the heatmap."""

    # Set up the matplotlib figure
    # (to enlarge the cells, increase the figure size)
    f, ax = plt.subplots(figsize=(18, 18))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 20, as_cmap=True)

    # Generate a mask for the upper triangle
    mask = np.zeros_like(df_hm, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(df_hm, mask=mask, cmap=cmap, center=0.5,
            xticklabels=xlabels, yticklabels=ylabels,
            square=True, linewidths=.5, fmt='.2f',
            annot=True, cbar_kws={"shrink": .5}, vmax=1)

    ax.set_title("Heatmap of cosine similarity scores").set_fontsize(15)

    return ax

#  Module:
#  Author: Shravan Kuchkula
#  Date: 07/19/2019

import re
import pandas as pd
import numpy as np
import nltk
import string
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer

# tokenize text
def tokenize_text(book_text):
    TOKEN_PATTERN = r'\s+'
    regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=True)
    word_tokens = regex_wt.tokenize(book_text)
    return word_tokens

def remove_characters_after_tokenization(tokens):
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # materialize the filter object into a list so it can be iterated more than once
    filtered_tokens = list(filter(None, [pattern.sub('', token) for token in tokens]))
    return filtered_tokens

def convert_to_lowercase(tokens):
    return [token.lower() for token in tokens if token.isalpha()]

def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

def apply_lemmatization(tokens, wnl=WordNetLemmatizer()):
    return [wnl.lemmatize(token) for token in tokens]

def cleanTextBooks(book_texts):
    clean_books = []
    for book in book_texts:
        book_i = tokenize_text(book)
        book_i = remove_characters_after_tokenization(book_i)
        book_i = convert_to_lowercase(book_i)
        book_i = remove_stopwords(book_i)
        book_i = apply_lemmatization(book_i)
        clean_books.append(book_i)
    return clean_books