Text as numeric data

7 minute read

Text analysis is a major application field for machine learning algorithms. However, the raw data (a sequence of symbols) cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length.

For instance, consider the example below of three text documents (i.e. strings). These cannot be fed into a machine learning algorithm because they are:

  • not numerical feature vectors
  • not of a fixed size
  • raw text documents of variable length

# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

As the above training data is not in the format scikit-learn expects, we will use CountVectorizer() to convert the text into a matrix of token counts. I will show you why we need to do that.

CountVectorizer's purpose is to convert text into a matrix of token counts.

CountVectorizer follows the same pattern that all scikit-learn estimators follow. That is:

  • You import
  • You instantiate
  • You fit a model

Even though CountVectorizer is not a model, it exposes the same API as an estimator: in particular, it has a fit method.

# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

Next, we run fit() on the vectorizer object. Notice how we only pass the simple_train list here. Typically, in supervised learning problems, you would pass both X and y to the fit method so it can learn the relationship between X and y. Here, however, we pass only the simple_train list. What the vectorizer's fit() method does is learn the vocabulary: it literally learns which words are used. And it does this in place.

# learn the 'vocabulary' of the training data (happens in place)
vect.fit(simple_train)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

So, let's see: what did it learn?

Once a vectorizer is fit, it exposes a method called get_feature_names() (renamed to get_feature_names_out() in newer scikit-learn versions). If we run that, what we get is the fitted vocabulary.

# examine the fitted vocabulary
vect.get_feature_names()
['cab', 'call', 'me', 'please', 'tonight', 'you']

This is the vocabulary that it learned from the raw text.
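Under the hood, the fitted vectorizer also stores this vocabulary as a dict mapping each token to its column index, via the vocabulary_ attribute. A quick sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

vect = CountVectorizer()
vect.fit(simple_train)

# vocabulary_ maps each learned token to its column index in the DTM;
# columns are in alphabetical order, and 'a' is dropped because the
# default token_pattern only keeps tokens of 2+ characters
print(sorted(vect.vocabulary_.items()))
# [('cab', 0), ('call', 1), ('me', 2), ('please', 3), ('tonight', 4), ('you', 5)]
```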

Next, we run transform() on the fitted Vectorizer object to get what is known as a Document Term Matrix.

# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>
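As an aside, fitting and transforming on the same data can be collapsed into a single call with fit_transform(), which is equivalent to fit() followed by transform():

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

vect = CountVectorizer()
# fit_transform() learns the vocabulary and builds the DTM in one step
simple_train_dtm = vect.fit_transform(simple_train)

print(simple_train_dtm.shape)  # (3, 6): 3 documents x 6 terms
print(simple_train_dtm.nnz)    # 9 stored (non-zero) elements
```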

So, why did we get a 3x6 sparse matrix?

Because we have 3 documents and 6 vocabulary words.

What is a document ?

A document is a string (possibly with multiple lines) that you give scikit-learn as a unit. An example would be an "email" as one document (the email is passed in as a string, which is then treated as one document). Another example: suppose you wanted to predict the gender of an author from a book; then the entire book becomes a document. Or, if you are trying to predict the sentiment of a chapter within the book, then that chapter becomes a document. So, basically, what counts as a document depends on the task you are performing. A god-damn tweet could be a document.

What is a term ?

We have 3 documents and 6 terms. Terms are also known as vocabulary words, features, or tokens.

What is a document-term-matrix (DTM) ?

When you apply transform() to the fitted vectorizer, you get the DTM. This is a sparse matrix with dimensions (documents x terms). DTMs have a toarray() method which returns a dense matrix. A term-document matrix is the same as a document-term matrix; it is just the order in which you say it.

# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

Now, let’s label the columns of this dense matrix to understand how this all works.

import pandas as pd
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
   cab  call  me  please  tonight  you
0    0     1   0       0        1    1
1    1     1   1       0        0    0
2    0     1   1       2        0    0

So now you see: it took text, which was non-numeric and of variable length, and represented it as a feature matrix with a fixed number of columns. This is exactly what we were looking for, so that we can feed it into machine learning models.

So, in other words, a DTM is literally just a count of the number of times each token appears in each document.
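You can convince yourself of this by counting the tokens of the third document by hand. The sketch below mimics CountVectorizer's default preprocessing (lowercasing, plus a token pattern that only keeps words of 2+ characters) and checks the result against the third row of the DTM:

```python
import re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

vect = CountVectorizer()
dtm = vect.fit_transform(simple_train).toarray()

# mimic the default tokenization: lowercase + token_pattern '(?u)\b\w\w+\b'
tokens = re.findall(r'\b\w\w+\b', simple_train[2].lower())
counts = Counter(tokens)
print(counts)  # Counter({'please': 2, 'call': 1, 'me': 1})

# every count matches the third row of the document-term matrix
for term, col in vect.vocabulary_.items():
    assert dtm[2, col] == counts.get(term, 0)
```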

From sklearn documentation:

In this scheme, features and samples (observations) are defined as follows:

  • Each individual token occurrence frequency (normalized or not) is treated as a feature.
  • The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

What is a corpus of documents? A set of related documents.

What is Bag of Words? Documents are described by word occurrences while completely ignoring the relative position information of the words in the document. The ordering is lost: it is literally a bag of words, as if you put all the words in a bag and counted them.
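A quick sketch to see this order-blindness in action: two documents containing the same words in a different order get identical rows in the DTM.

```python
from sklearn.feature_extraction.text import CountVectorizer

# same three words, different order
docs = ['call me tonight', 'tonight me call']

vect = CountVectorizer()
dtm = vect.fit_transform(docs).toarray()

# both rows are identical: the bag-of-words representation discards order
print(dtm)
# [[1 1 1]
#  [1 1 1]]
```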

Let’s briefly talk about sparse matrices for a second. We have our sparse matrix simple_train_dtm, which is our document-term matrix. Printing it shows each stored (row, column) pair with its count:

print(simple_train_dtm)
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2

From the scikit-learn documentation:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.

What is a sparse matrix and why was it used ?

This is a more efficient storage strategy when you have sparse data: it stores only the non-zero values. As most of the values are 0s, it is a waste of space to store that many zeros. For example, the entry (2, 3) with value 2 corresponds to row index 2, column index 3 of the toarray() output, i.e. the third document and the token 'please' (which appears twice).

NOTE: We converted the sparse matrix to a dense matrix for display purposes only; in reality, the models use the sparse matrix.
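To make the storage saving concrete, here is a small sketch with scipy.sparse (the same package scikit-learn uses under the hood): the CSR representation stores only the 9 non-zero cells of our 3x6 matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# the dense document-term matrix from above
dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])

sparse = csr_matrix(dense)

print(sparse.nnz)    # 9 stored values instead of 18 cells
print(sparse[2, 3])  # 2: third document, token 'please'
```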

Test data

A very important thing to consider is how our model handles new text data. Since the DTM only contains the 6 terms seen in the training corpus, how does it handle new words (unseen during training)? In the example below, the word don’t does not appear in the training corpus, so that word gets ignored!

# example text for model testing
simple_test = ["please don't call me"]

NOTE: A very important part of scikit-learn (and modelling in general) is that: In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.

In the cell below, we transform the test data into a DTM using the existing vocabulary.

# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]])

Now, if you examine the vocabulary and DTM together, you will notice that the word don’t is dropped! This is because the training corpus does not contain that feature. It is similar to training a supervised learning model on the iris dataset with 4 features (sepal.length, sepal.width, petal.length, petal.width): if during the testing phase you provide a new feature, for example sepal.height, the model will not work; or rather, it will ignore that feature.

# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
   cab  call  me  please  tonight  you
0    0     1   1       1        0    0


  • vect.fit(train) learns the vocabulary of the training data
  • vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
  • vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn’t seen before)
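Putting it all together, here is a minimal end-to-end sketch. The labels y and the choice of MultinomialNB are purely illustrative assumptions (the original messages come with no labels); any estimator that accepts a sparse matrix would work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
y = [0, 0, 1]  # made-up labels, purely for illustration

# fit learns the vocabulary; transform builds the document-term matrix
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(simple_train)

# MultinomialNB works directly on the sparse matrix of token counts
model = MultinomialNB()
model.fit(X_train_dtm, y)

# new text only goes through transform(); unseen tokens like "don't" are dropped
simple_test = ["please don't call me"]
X_test_dtm = vect.transform(simple_test)
pred = model.predict(X_test_dtm)
print(pred)  # an array with one predicted label
```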