Lexical Diversity
Introduction
Lexical diversity is a measure of how many different words are used in a text. The goal of this notebook is to use NLTK to explore the lexical diversity of third-grade, sixth-grade, and high-school books scraped from Project Gutenberg's Children's Instructional Books bookshelf.
Installing NLTK
# install nltk
!pip install nltk
Requirement already satisfied: nltk in /Users/Shravan/anaconda3/envs/NLP/lib/python3.6/site-packages (3.2.5)
Requirement already satisfied: six in /Users/Shravan/anaconda3/envs/NLP/lib/python3.6/site-packages (from nltk) (1.11.0)
NLTK's text corpus
The nltk.book module contains a collection of nine texts that can readily be used to explore some key NLP concepts. The texts are loaded by the import statement below and can then be accessed as objects; each text object provides methods that give more insight into the text.
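If the NLTK data has not been downloaded yet, the import below will fail with a LookupError. A minimal one-time setup using the standard NLTK downloader (the 'book' collection also bundles the punkt tokenizer, stopwords list, and WordNet data used later in this notebook):
import nltk
nltk.download('book')  # one-time download of the corpora and models used here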
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Lexical diversity
The lexical diversity (or lexical richness) of a text can be calculated very easily by taking the ratio of the number of unique types to the total number of tokens.
# lexical diversity: unique types / total tokens
def lexical_diversity(text):
    return len(set(text)) / len(text)

# express a count as a percentage of a total
def percentage(count, total):
    return 100 * count / total
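As a quick sanity check, the helpers can be applied to a single text; the value for text3 matches the table computed below:
# the Book of Genesis has roughly 6.2% unique tokens
lexical_diversity(text3)                  # ~0.062305
percentage(len(set(text3)), len(text3))   # the same figure as a percentage, ~6.23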
import pandas as pd
# tabulate the lexical diversity of all the texts
texts = [text1, text2, text3, text4, text5,
         text6, text7, text8, text9]
lexicalDiversity = [lexical_diversity(text) for text in texts]
tokens = [len(text) for text in texts]
types = [len(set(text)) for text in texts]
title = [text.name for text in texts]
ld = pd.DataFrame({'title': title, 'tokens': tokens, 'types': types,
                   'lexical_diversity': lexicalDiversity})
ld.sort_values(by='lexical_diversity', ascending=False)
|   | title | tokens | types | lexical_diversity |
|---|-------|--------|-------|-------------------|
| 7 | Personals Corpus | 4867 | 1108 | 0.227656 |
| 4 | Chat Corpus | 45010 | 6066 | 0.134770 |
| 5 | Monty Python and the Holy Grail | 16967 | 2166 | 0.127660 |
| 6 | Wall Street Journal | 100676 | 12408 | 0.123247 |
| 8 | The Man Who Was Thursday by G . K . Chesterton... | 69213 | 6807 | 0.098349 |
| 0 | Moby Dick by Herman Melville 1851 | 260819 | 19317 | 0.074063 |
| 3 | Inaugural Address Corpus | 145735 | 9754 | 0.066930 |
| 2 | The Book of Genesis | 44764 | 2789 | 0.062305 |
| 1 | Sense and Sensibility by Jane Austen 1811 | 141576 | 6833 | 0.048264 |
The text with the highest lexical diversity is the Personals Corpus, which is also by far the shortest text in the collection.
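This is no accident: the type-token ratio falls as a text grows, because frequent words keep repeating while new types appear ever more rarely. A quick illustration (the 5,000-token cutoff is an arbitrary choice):
# a short slice of Moby Dick scores far higher than the complete text
lexical_diversity(text1[:5000]), lexical_diversity(text1)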
Gutenberg's Children's Instructional Books bookshelf
Three books are chosen from the bookshelf for the lexical diversity analysis:
- High School: The Ontario High School Reader by Aletta E. Marty
- Sixth Grade: McGuffey's Sixth Eclectic Reader
- Third Grade: McGuffey's Third Eclectic Reader
We will be using the urllib.request module to download the texts. Two helper functions are defined to get the tokens, text objects, and titles from the responses:
- retrieveFromUrl(): returns the Text object or token list for a given URL.
- getTitle(): extracts the title from the raw text.
from urllib import request
from nltk import word_tokenize
import nltk
import re

# returns an nltk.Text object (text=True) or a plain token list
def retrieveFromUrl(url, text=False):
    # make a request
    response = request.urlopen(url)
    # extract the raw response
    raw = response.read().decode('utf8')
    # tokenize the words
    tokens = word_tokenize(raw)
    if text:
        textObj = nltk.Text(tokens)
        return textObj
    return tokens
# extract the title line from the Gutenberg header
def getTitle(url):
    # make a request
    response = request.urlopen(url)
    # extract the raw response
    raw = response.read().decode('utf8')
    # match the 'Title: ...' line of the header
    match = re.search(r'Title: ([\w\']+) (.+)', raw)
    if match:
        title = match.group()
    else:
        return None
    return title.strip('\r')
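As a quick check, the helper should pick up the Gutenberg header line; the expected string below is taken from the summary table later in the notebook (note that match.group() keeps the 'Title: ' prefix):
getTitle("http://www.gutenberg.org/files/22795/22795-0.txt")
# "Title: The Ontario High School Reader"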
# grade-level: high school
url1 = "http://www.gutenberg.org/files/22795/22795-0.txt"
# grade-level: third
url2 = "http://www.gutenberg.org/cache/epub/14766/pg14766.txt"
# grade-level: sixth
url3 = "http://www.gutenberg.org/cache/epub/16751/pg16751.txt"
urls = [url1, url2, url3]
# fetch each text once and reuse it for all of the statistics below
texts = [retrieveFromUrl(url, text=True) for url in urls]
# construct a list of token counts
tokens = [len(text) for text in texts]
# construct a list of unique type counts
types = [len(set(text)) for text in texts]
# construct a list of titles
titles = [getTitle(url) for url in urls]
# construct a list of lexical diversity scores
lexicalDiversity = [lexical_diversity(text) for text in texts]
# create a dataframe
lexical_summary = pd.DataFrame({'title': titles, 'tokens': tokens, 'types': types,
                                'lexical_diversity': lexicalDiversity})
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
|   | title | tokens | types | lexical_diversity |
|---|-------|--------|-------|-------------------|
| 1 | Title: McGuffey's Third Eclectic Reader | 37971 | 4715 | 0.124174 |
| 0 | Title: The Ontario High School Reader | 100434 | 11934 | 0.118824 |
| 2 | Title: McGuffey's Sixth Eclectic Reader | 171050 | 17258 | 0.100894 |
The lexical diversity of the Third Grade reader is higher than that of the High School reader, which is a little surprising. However, the Third Grade reader also has far fewer tokens and types than the other two, which inflates its score. It is also interesting that the Sixth Grade reader has considerably more unique types than the High School reader.
Since types can include non-alphabetic tokens such as punctuation and numbers, it is worth checking the vocabulary size as well.
Vocabulary size
The vocabulary size is calculated by filtering out the types that are not alphabetic and converting the rest to lower case. Additional techniques like stemming and lemmatization may be useful for a more accurate vocabulary size.
# get the vocab size of a text: lower-cased, alphabetic-only types
def getVocabSize(text):
    vocab = set(w.lower() for w in text if w.isalpha())
    return len(vocab)

vocabSize = [getVocabSize(text) for text in texts]
lexical_summary['VocabularySize'] = vocabSize
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
|   | title | tokens | types | lexical_diversity | VocabularySize |
|---|-------|--------|-------|-------------------|----------------|
| 1 | Title: McGuffey's Third Eclectic Reader | 37971 | 4715 | 0.124174 | 3683 |
| 0 | Title: The Ontario High School Reader | 100434 | 11934 | 0.118824 | 8916 |
| 2 | Title: McGuffey's Sixth Eclectic Reader | 171050 | 17258 | 0.100894 | 13709 |
Removing stop words
# Getting the English stop words from NLTK
from nltk.corpus import stopwords
sw = stopwords.words('english')

# remove stop words from a list of words
def removeStopWords(text):
    # a new list to hold the words that survive filtering
    words_ns = []
    # append every word that is not in the stop word list
    for word in text:
        if word not in sw:
            words_ns.append(word)
    return words_ns
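Since the vocabularies built below are already sets, the same filtering could also be written as a set difference; this is an equivalent alternative sketch, not what the following cells use:
# equivalent, and faster for large inputs: set difference against the stop words
def removeStopWordsSet(vocab):
    return set(vocab) - set(sw)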
# vocabulary without stop words
vocabs = []
for text in texts:
    vocab = set(w.lower() for w in text if w.isalpha())
    vocabs.append(vocab)

vocab_without_stopwords = [len(removeStopWords(vocab)) for vocab in vocabs]
lexical_summary['VocabSize - stopwords'] = vocab_without_stopwords
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
|   | title | tokens | types | lexical_diversity | VocabularySize | VocabSize - stopwords |
|---|---|---|---|---|---|---|
| 1 | Title: McGuffey's Third Eclectic Reader | 37971 | 4715 | 0.124174 | 3683 | 3558 |
| 0 | Title: The Ontario High School Reader | 100434 | 11934 | 0.118824 | 8916 | 8784 |
| 2 | Title: McGuffey's Sixth Eclectic Reader | 171050 | 17258 | 0.100894 | 13709 | 13574 |
Normalizing text to understand vocabulary
In the earlier sections, we normalized the text by converting it to lower case, filtering out non-alphabetic tokens, and removing stop words. In addition to these techniques, we can go further and strip off affixes, a task known as stemming. A further step is to make sure the resulting form is a known word in a dictionary, a task known as lemmatization.
NLTK includes several off-the-shelf stemmers, such as the Porter stemmer and the Lancaster stemmer. The Porter stemmer typically handles more cases correctly, so we will use it to gauge the vocabulary of the above texts.
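To see the difference between the two, here is a small illustration (the example words are chosen arbitrarily; the outputs come from NLTK's PorterStemmer and WordNetLemmatizer):
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps, wnl = PorterStemmer(), WordNetLemmatizer()
words = ['running', 'studies', 'children']
[ps.stem(w) for w in words]        # ['run', 'studi', 'children'] -- stems need not be real words
[wnl.lemmatize(w) for w in words]  # ['running', 'study', 'child'] -- dictionary forms (noun POS by default)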
# build a normalized (lower-cased, alphabetic-only) type list for each text
cleanTokens = []
for text in texts:
    clean_w = set(w.lower() for w in text if w.isalpha())
    cleanTokens.append(list(clean_w))
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

porterStemmer = PorterStemmer()
wnl = WordNetLemmatizer()

ps_tokens_len = []
wnl_tokens_len = []
for token in cleanTokens:
    ps_token = [porterStemmer.stem(t) for t in token]
    wnl_token = [wnl.lemmatize(t) for t in token]
    ps_tokens_len.append(len(set(ps_token)))
    wnl_tokens_len.append(len(set(wnl_token)))

lexical_summary['VocabSizePorterStemmer'] = ps_tokens_len
lexical_summary['VocabSizeLemmatizer'] = wnl_tokens_len
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='lexical_diversity', ascending=False)
|   | title | tokens | types | lexical_diversity | VocabularySize | VocabSize - stopwords | VocabSizePorterStemmer | VocabSizeLemmatizer |
|---|---|---|---|---|---|---|---|---|
| 1 | Title: McGuffey's Third Eclectic Reader | 37971 | 4715 | 0.124174 | 3683 | 3558 | 2822 | 3355 |
| 0 | Title: The Ontario High School Reader | 100434 | 11934 | 0.118824 | 8916 | 8784 | 6354 | 7988 |
| 2 | Title: McGuffey's Sixth Eclectic Reader | 171050 | 17258 | 0.100894 | 13709 | 13574 | 9142 | 12086 |
Normalizing the text with the Porter stemmer and the WordNet lemmatizer reveals the same trend: the Sixth Grade reader has a much larger vocabulary than the other two. However, because of its much larger total token count, its lexical diversity is still the lowest.
Understanding text difficulty
Lexical diversity alone is not an accurate measure of text difficulty, especially when comparing three texts of very different lengths. Lexical diversity depends on the total number of tokens: for a long text like the Sixth Grade reader, the score comes out lowest even though its vocabulary is the largest. Thus, it is safe to say that lexical diversity alone should not be used as a measure of text difficulty; it should be augmented with a vocabulary analysis.
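One way to control for this length effect is to average the type-token ratio over fixed-size windows, sometimes called a standardized TTR. Below is a minimal sketch of that idea; the 1,000-token window size is an arbitrary choice, and this goes beyond the analysis above:
# mean type-token ratio over non-overlapping fixed-size windows, so that
# texts of very different lengths can be compared on an equal footing
def windowed_ttr(text, window=1000):
    words = [w.lower() for w in text if w.isalpha()]
    if not words:
        return 0.0
    chunks = [words[i:i + window]
              for i in range(0, len(words) - window + 1, window)]
    if not chunks:  # text shorter than a single window
        return len(set(words)) / len(words)
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks)

# e.g. compare the three readers on an equal footing:
# [windowed_ttr(text) for text in texts]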
Using lexical diversity alone to measure the text difficultly is not accurate, especially when we are comparing against three texts with different number of types. In other words lexical diversity is dependent on the total number of tokens, if the tokens are a large number like those of Sixth Grade, then even though the Vocabulary size of sixth grade is larger, its lexical diversity is the lowest. Thus, it is safe to say that lexical_diversity alone shouldn’t be a measure of text difficulty but it should be augmented with a Vocabulary analysis.