Text as numerical data
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.
For instance, consider the example below of three text documents (i.e. strings). They cannot be fed into a machine learning algorithm as they are, because they are:
- raw text, not numerical feature vectors
- of variable length, not a fixed size
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
Because the training data above is not in the format scikit-learn expects, we will use CountVectorizer() to convert the text into a matrix of token counts. I will show you why we need to do that.
CountVectorizer follows the same pattern as every scikit-learn estimator. That is,
- You import
- You instantiate
- You fit
Even though CountVectorizer is not a model, it has the same API as an estimator, including a fit method.
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
Next, we run fit() on the vectorizer object. Notice that we pass only the simple_train list. Typically, in supervised learning problems, you would pass X and y to the fit method to learn the relationship between X and y. Here there is no y: the vectorizer's fit() method learns the vocabulary, i.e. it literally learns which words are used in the training documents. It does this in place, storing the result on the vect object.
# learn the vocabulary of the training data (occurs in place)
vect.fit(simple_train)
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
So, let's see what it learned.
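Once a vectorizer is fit, it exposes a method called get_feature_names(), which returns the fitted vocabulary. (In scikit-learn 1.0 and later this method is named get_feature_names_out(); the old name was removed in version 1.2.)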
# examine the fitted vocabulary
vect.get_feature_names()
['cab', 'call', 'me', 'please', 'tonight', 'you']
This is the vocabulary that it learned, from the raw text.
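The vocabulary above can be reproduced by hand, which helps to see what fit() is doing. The following is only a minimal pure-Python sketch, assuming the default settings shown in the CountVectorizer output (lowercase=True and the default token_pattern). The helper names tokenize and learn_vocabulary are hypothetical, they are not part of scikit-learn, and the real implementation does considerably more than this.

```python
import re

# Default token pattern from the CountVectorizer output above:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(document.lower())


def learn_vocabulary(documents):
    """Return the sorted list of distinct tokens found in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = learn_vocabulary(simple_train)
print(vocabulary)
```

Two details are worth noticing: the word "a" is dropped because the pattern requires at least two characters, and "Call" and "PLEASE" are folded into their lowercase forms before counting.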
Next, we run transform() on the fitted vectorizer object to get what is known as a document-term matrix.
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<class 'numpy.int64'>' with 9 stored elements in Compressed Sparse Row format>
So, why did we get a 3x6 sparse matrix?
Because there are 3 documents and 6 vocabulary words.
What is a document?
A document is a string (possibly with multiple lines) that you give scikit-learn as a unit. For example, an email passed in as a string is treated as one document. If you wanted to predict the gender of an author from a book, the entire book would be a document. If you were trying to predict the sentiment of a chapter within the book, that chapter would be a document. So, what counts as a document depends on the task you are performing. Even a single tweet could be a document.
What is a term?
A term is one entry in the learned vocabulary. Here we have 3 documents and 6 terms. Terms are also known as vocabulary words, features, or tokens.
What is a document-term matrix (DTM)?
When you apply transform() to the fitted vectorizer, you get the DTM. This is a sparse matrix with dimensions (documents x terms). The DTM has a toarray() method, which returns a dense matrix. A term-document matrix holds the same counts with the rows and columns swapped (terms x documents).
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])
Now, let’s label the columns of this dense matrix to understand how this all works.
import pandas as pd
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
So, now you can see that it took text, which was non-numeric and of variable length, and represented it as a feature matrix with a fixed number of columns. This is what we were looking for, so that we can feed it into machine learning models.
In other words, each entry of a DTM is literally just a count of the number of times a token appears in a document.
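To make the counting explicit, here is a small sketch that rebuilds the dense matrix without scikit-learn. It assumes the vocabulary learned earlier and the default lowercasing and token pattern; tokenize and count_matrix are hypothetical helper names used only for illustration.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(document.lower())


def count_matrix(documents, vocabulary):
    """Build a dense document-term matrix: one row per document,
    one column per vocabulary term, each cell a token count."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Counter returns 0 for terms that do not occur in this document.
        rows.append([counts[term] for term in vocabulary])
    return rows


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']

dtm = count_matrix(simple_train, vocabulary)
for doc, row in zip(simple_train, dtm):
    print(row, '<-', doc)
```

The third row has a 2 in the "please" column because that word appears twice in the third message once case is ignored.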
From the scikit-learn documentation:
In this scheme, features and samples (observations) are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
What is a corpus of documents? A set of related documents.
What is Bag of Words? A representation that describes documents by word occurrences while ignoring where the words appear in the document. The ordering is lost: it is as if you put the words in a bag and counted them.
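The loss of ordering can be shown in a few lines. This is a simplified sketch that splits on whitespace instead of using CountVectorizer's tokenizer, and bag_of_words is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(document):
    """Count word occurrences, ignoring word order (whitespace split only)."""
    return Counter(document.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Same words, same counts: the two bags are indistinguishable,
# even though the sentences mean different things.
print(first == second)
```

This is the trade-off of the representation: it is simple and has a fixed size, but any information carried by word order is gone.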
Let’s briefly talk about sparse matrices. Our document-term matrix, simple_train_dtm, is a sparse matrix. Printing it lists only the non-zero entries, as (row, column) value pairs:
(0, 1) 1
(0, 4) 1
(0, 5) 1
(1, 0) 1
(1, 1) 1
(1, 2) 1
(2, 1) 1
(2, 2) 1
(2, 3) 2
From the scikit-learn documentation:
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the
What is a sparse matrix and why was it used?
A sparse matrix is a more efficient storage strategy for sparse data: it stores only the non-zero values. As most of the values are zeros, storing all of them would be a waste of space. For example, the entry (2, 3) refers to the third row and fourth column of the toarray() output (indices start at 0), where you will notice that the value is 2.
NOTE: We converted the sparse matrix to a dense matrix for display purposes only. In reality, the models use the sparse matrix.
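The idea of storing only the non-zero values can be sketched with plain Python lists. Note the assumption: this uses a simple coordinate list of ((row, column), value) pairs for illustration, which is not the Compressed Sparse Row layout that scipy actually uses. The function names to_coordinates and to_dense are hypothetical.

```python
dense = [
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 2, 0, 0],
]


def to_coordinates(matrix):
    """Keep only the non-zero cells as ((row, column), value) pairs."""
    return [
        ((i, j), value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


def to_dense(entries, n_rows, n_cols):
    """Rebuild the full matrix, filling every missing cell with zero."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for (i, j), value in entries:
        matrix[i][j] = value
    return matrix


entries = to_coordinates(dense)
total_cells = len(dense) * len(dense[0])
print(len(entries), 'stored values out of', total_cells, 'cells')
```

On a 3x6 example the saving is small, but with a vocabulary of around 100,000 words and documents that each use a few hundred of them, almost every cell is a zero that never needs to be stored.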
A very important thing to consider is how our model handles new text data. Since our DTM contains only the 6 terms found in the training corpus, how does it handle new words that were not seen during training? In the example below, the word don’t does not appear in the training corpus, so it gets ignored.
# example text for model testing
simple_test = ["please don't call me"]
NOTE: A very important rule in scikit-learn (and modelling in general): in order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
In the cell below, we transform the test data into a DTM using the existing vocabulary.
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]])
Now, if you examine the vocabulary and DTM together, you will notice that the word don’t is dropped. This is because the training vocabulary does not contain it. (With the default token pattern, don’t is split at the apostrophe into don and t; t is discarded for being a single character, and don is not in the vocabulary.) It is similar to training a supervised learning model on the iris data set with 4 features (sepal.length, sepal.width, petal.length, petal.width) and then, during the testing phase, supplying a new feature such as sepal.height: the model cannot use a feature it never saw during training.
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
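To see exactly which tokens fall outside the vocabulary, the transform step can be imitated by hand. This is a rough sketch under the same assumptions as before (default lowercasing and token pattern); tokenize and transform_one are hypothetical helpers, and unlike this sketch the real transform() does not report the unseen tokens, it simply skips them.

```python
import re

# Default CountVectorizer token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(document.lower())


def transform_one(document, vocabulary):
    """Count tokens against a fixed vocabulary.

    Returns the count row plus the tokens that were ignored because
    they are not in the vocabulary.
    """
    column = {term: i for i, term in enumerate(vocabulary)}
    row = [0] * len(vocabulary)
    unseen = []
    for token in tokenize(document):
        if token in column:
            row[column[token]] += 1
        else:
            unseen.append(token)
    return row, unseen


vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']
row, unseen = transform_one("please don't call me", vocabulary)
print(row)
print('ignored tokens:', unseen)
```

The row always has six columns, whatever the test message contains, which is what keeps the test features aligned with the training features.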
- vect.fit(train) learns the vocabulary of the training data
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn’t seen before)
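Putting the three steps together, the whole walkthrough fits in one short script. This sketch assumes scikit-learn is installed. It reads the column order from the fitted vocabulary_ attribute (a mapping from term to column index) so that it does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vect = CountVectorizer()

# 1. learn the vocabulary from the training data only
vect.fit(simple_train)

# 2. build the training document-term matrix with that vocabulary
train_dtm = vect.transform(simple_train)

# 3. build the testing document-term matrix with the SAME vocabulary
test_dtm = vect.transform(simple_test)

# vocabulary_ maps each term to its column index; sort terms by column
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(terms)
print(train_dtm.toarray())
print(test_dtm.toarray())
```

The important design point is that fit() is called on the training data only. Calling it again on the test data would learn a different vocabulary, and the test columns would no longer line up with the training columns. For the training data, vect.fit_transform(simple_train) combines the first two steps in a single call.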