Lexical Diversity

7 minute read

Introduction

Lexical diversity is a measure of how many different words are used in a text. The goal of this notebook is to use NLTK to explore the lexical diversity of third-grade, sixth-grade and high-school books scraped from Project Gutenberg’s Children’s Instructional Books bookshelf.

Installing NLTK

# install nltk
!pip install nltk
Requirement already satisfied: nltk in /Users/Shravan/anaconda3/envs/NLP/lib/python3.6/site-packages (3.2.5)
Requirement already satisfied: six in /Users/Shravan/anaconda3/envs/NLP/lib/python3.6/site-packages (from nltk) (1.11.0)
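
Depending on the environment, the NLTK data used later in this notebook (the nltk.book texts, the punkt tokenizer models, the English stop word list and WordNet) may also need to be downloaded. A minimal sketch, assuming the standard NLTK downloader:

# fetch the NLTK data used in this notebook (skip if already present);
# the 'book' collection bundles the nltk.book corpora plus punkt, stopwords and wordnet
import nltk
nltk.download('book')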


NLTK’s text corpus

The nltk.book module contains a collection of nine texts that can be used readily to explore some key NLP concepts. These texts are loaded with the import statement below and can then be accessed as objects. Each text object provides methods that can be used to gain more insight into the text.

from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
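
For example, each Text object offers exploration methods such as concordance() and similar() (a quick aside; output omitted here):

# show every occurrence of a word in its context
text1.concordance('monstrous')

# show words that appear in similar contexts
text1.similar('monstrous')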

Lexical diversity

The lexical diversity (or lexical richness) of a text can be calculated very easily by taking the ratio of the number of unique word types to the total number of tokens.

# lexical diversity
def lexical_diversity(text):
    return len(set(text)) / len(text)

# percentage
def percentage(count, total):
    return 100 * count / total
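
As a quick sanity check on a made-up sentence (purely illustrative, not one of the corpora): six tokens, five unique types, so a lexical diversity of about 0.83.

# worked example: 6 tokens, 5 unique types -> 5/6 ≈ 0.83
lexical_diversity("the cat sat on the mat".split())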
import pandas as pd

# print the lexical diversity of all the texts.
texts = [text1, text2, text3, text4, text5,
        text6, text7, text8, text9]

lexicalDiversity = [lexical_diversity(text) for text in texts]
tokens = [len(text) for text in texts]
types = [len(set(text)) for text in texts]
title = [text.name for text in texts]

ld = pd.DataFrame({'title': title, 'tokens': tokens, 'types': types,
                   'lexical_diversity': lexicalDiversity})

ld.sort_values(by='lexical_diversity', ascending=False)
title tokens types lexical_diversity
7 Personals Corpus 4867 1108 0.227656
4 Chat Corpus 45010 6066 0.134770
5 Monty Python and the Holy Grail 16967 2166 0.127660
6 Wall Street Journal 100676 12408 0.123247
8 The Man Who Was Thursday by G . K . Chesterton... 69213 6807 0.098349
0 Moby Dick by Herman Melville 1851 260819 19317 0.074063
3 Inaugural Address Corpus 145735 9754 0.066930
2 The Book of Genesis 44764 2789 0.062305
1 Sense and Sensibility by Jane Austen 1811 141576 6833 0.048264

The text with the highest lexical diversity is the Personals Corpus.

Gutenberg’s children’s instructional books (bookshelf)

Three books are chosen from the bookshelf for the lexical diversity analysis: a high-school reader, a third-grade reader and a sixth-grade reader (see the grade-level comments next to each URL below).

We will use the urllib.request module to download the texts. Two helper functions are defined to get the tokens, texts and titles from the response:

  • retrieveFromUrl() : returns the Text object or token list for the given URL.
  • getTitle() : extracts the title from the text.
from urllib import request
from nltk import word_tokenize
import nltk
import re

# returns text/tokens
def retrieveFromUrl(url, text=False):
    # make a request
    response = request.urlopen(url)

    # extract the raw response
    raw = response.read().decode('utf8')

    # tokenize the words
    tokens = word_tokenize(raw)

    if text:
        textObj = nltk.Text(tokens)
        return textObj

    return tokens

# find the title name
def getTitle(url):
    # make a request
    response = request.urlopen(url)

    # extract the raw response
    raw = response.read().decode('utf8')

    # look for the "Title: ..." line in the Project Gutenberg header;
    # match.group() returns the whole matched line, including the "Title: " prefix
    match = re.search(r'Title: ([\w\']+) (.+)', raw)
    if match:
        title = match.group()
    else:
        return None

    return title.strip('\r')
# grade-level: high school
url1 = "http://www.gutenberg.org/files/22795/22795-0.txt"

# grade-level: third
url2 = "http://www.gutenberg.org/cache/epub/14766/pg14766.txt"

# grade-level: sixth
url3 = "http://www.gutenberg.org/cache/epub/16751/pg16751.txt"

urls = [url1, url2, url3]
# construct a list of texts
texts = [retrieveFromUrl(url, text=True) for url in urls]

# construct a list of token counts
tokens = [len(retrieveFromUrl(url, text=False)) for url in urls]

# construct a list of unique types
types = [len(set(retrieveFromUrl(url, text=False))) for url in urls]

# construct a list of titles
titles = [getTitle(url) for url in urls]

# construct a list of lexical diversity scores
lexicalDiversity = [lexical_diversity(text) for text in texts]
# create a dataframe
lexical_summary = pd.DataFrame({'title': titles, 'tokens': tokens, 'types': types,
                               'lexical_diversity': lexicalDiversity})

# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
title tokens types lexical_diversity
1 Title: McGuffey's Third Eclectic Reader 37971 4715 0.124174
0 Title: The Ontario High School Reader 100434 11934 0.118824
2 Title: McGuffey's Sixth Eclectic Reader 171050 17258 0.100894

The lexical diversity of the third-grade reader is higher than that of the high-school reader, which is a little surprising. However, the third-grade reader also has far fewer tokens and types than the other two, and lexical diversity tends to be inflated for shorter texts. The number of unique types in the sixth-grade reader is much larger than in the high-school reader, which is interesting.
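
To see why shorter texts tend to score higher, here is a minimal sketch (reusing text1 from nltk.book, which is already loaded) that computes the lexical diversity of progressively longer prefixes of the same text; the score falls as the prefix grows:

# lexical diversity of increasingly long prefixes of one text
for n in [1000, 10000, 100000]:
    print(n, lexical_diversity(text1[:n]))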

Since types can include non-alphabetic tokens such as punctuation and numbers, it is worth checking the vocabulary size as well.

Vocabulary size

The vocabulary size is calculated by filtering out the types that are not alphabetic and converting the rest to lower case. Additional techniques like stemming and lemmatization may be useful to get a more accurate vocabulary size.

# get the vocab size of the 3 texts
def getVocabSize(text):
    vocab = set(w.lower() for w in text if w.isalpha())
    return len(vocab)
vocabSize = [getVocabSize(text) for text in texts]
lexical_summary['VocabularySize'] = vocabSize
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
title tokens types lexical_diversity VocabularySize
1 Title: McGuffey's Third Eclectic Reader 37971 4715 0.124174 3683
0 Title: The Ontario High School Reader 100434 11934 0.118824 8916
2 Title: McGuffey's Sixth Eclectic Reader 171050 17258 0.100894 13709

Remove stop words
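
Stop words are high-frequency function words such as “the”, “of” and “and” that carry little lexical content, so removing them gives a better picture of the content-bearing vocabulary.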

# Getting the English stop words from nltk
from nltk.corpus import stopwords
sw = stopwords.words('english')

# remove stop words
def removeStopWords(text):
    # A new list to hold text with No Stop words
    words_ns = []

    # Appending to words_ns all words that are in words but not in sw
    for word in text:
        if word not in sw:
            words_ns.append(word)

    return words_ns
# vocabulary without stopwords
vocabs = []
for text in texts:
    vocab = set(w.lower() for w in text if w.isalpha())
    vocabs.append(vocab)
vocab_without_stopwords = [len(removeStopWords(vocab)) for vocab in vocabs]
lexical_summary['VocabSize - stopwords'] = vocab_without_stopwords
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
title tokens types lexical_diversity VocabularySize VocabSize - stopwords
1 Title: McGuffey's Third Eclectic Reader 37971 4715 0.124174 3683 3558
0 Title: The Ontario High School Reader 100434 11934 0.118824 8916 8784
2 Title: McGuffey's Sixth Eclectic Reader 171050 17258 0.100894 13709 13574

Normalizing text to understand vocabulary

In the earlier sections, we normalized the text by converting it to lower case, filtering out non-alphabetic tokens and removing stop words. In addition to these techniques, we can go further and strip off affixes, a task known as stemming. A further step is to make sure that the resulting form is a known dictionary word, a task known as lemmatization.

NLTK includes several off-the-shelf stemmers, such as the Porter stemmer and the Lancaster stemmer. The Porter stemmer typically handles more cases correctly, so we will use it to see if we can gauge the vocabulary of the above texts.
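
As a quick illustration of the difference between the two (a minimal sketch; exact outputs may vary slightly across NLTK versions):

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()

# stemming strips affixes, which can leave non-dictionary forms
ps.stem('running'), ps.stem('lying')            # ('run', 'lie')

# lemmatization maps words to dictionary forms (noun POS by default)
wnl.lemmatize('women'), wnl.lemmatize('lying')  # ('woman', 'lying')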

# construct a list of token lists (one per URL)
tokens = [retrieveFromUrl(url, text=False) for url in urls]

cleanTokens = []
for token in tokens:
    clean_w = set(w.lower() for w in token if w.isalpha())
    cleanTokens.append(list(clean_w))
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
porterStemmer = PorterStemmer()
wnl = WordNetLemmatizer()
ps_tokens_len = []
wnl_tokens_len = []
for token in cleanTokens:
    ps_token = [porterStemmer.stem(t) for t in token]
    wnl_token = [wnl.lemmatize(t) for t in token]
    ps_tokens_len.append(len(set(ps_token)))
    wnl_tokens_len.append(len(set(wnl_token)))
lexical_summary['VocabSizePorterStemmer'] = ps_tokens_len
lexical_summary['VocabSizeLemmatizer'] = wnl_tokens_len

# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
title tokens types lexical_diversity VocabularySize VocabSize - stopwords VocabSizePorterStemmer VocabSizeLemmatizer
1 Title: McGuffey's Third Eclectic Reader 37971 4715 0.124174 3683 3558 2822 3355
0 Title: The Ontario High School Reader 100434 11934 0.118824 8916 8784 6354 7988
2 Title: McGuffey's Sixth Eclectic Reader 171050 17258 0.100894 13709 13574 9142 12086

Normalizing the text with a Porter stemmer and a WordNet lemmatizer reveals the same trend: the sixth-grade reader has a much larger vocabulary than the other two. However, because of its large number of tokens, its lexical diversity is still the lowest.

Understanding text difficulty

Using lexical diversity alone to measure text difficulty is not accurate, especially when comparing three texts of very different lengths. Lexical diversity depends on the total number of tokens: the sixth-grade reader has by far the most tokens, so even though its vocabulary is the largest, its lexical diversity comes out lowest. It is therefore safer not to use lexical diversity on its own as a measure of text difficulty, but to augment it with a vocabulary analysis.
