Create features out of text

15 minute read

Engineer columnar features out of a text corpus

Summary: Here we will work with unstructured text data and explore ways to engineer columnar features out of a text corpus. We will compare how different approaches affect how much context is extracted from the text, and how to balance the need for context against creating too many features.

Data that is not in a pre-defined form is called unstructured data, and free text data is a good example of this.

Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month.

Throughout this post, we will be working with the US inaugural speeches dataset.

import pandas as pd
speech_df = pd.read_csv('inaugural_speeches.csv')
print(speech_df.shape)
speech_df.head()
(58, 4)
Name Inaugural Address Date text
0 George Washington First Inaugural Address Thursday, April 30, 1789 Fellow-Citizens of the Senate and of the House...
1 George Washington Second Inaugural Address Monday, March 4, 1793 Fellow Citizens: I AM again called upon by th...
2 John Adams Inaugural Address Saturday, March 4, 1797 WHEN it was first perceived, in early times, t...
3 Thomas Jefferson First Inaugural Address Wednesday, March 4, 1801 Friends and Fellow-Citizens: CALLED upon to u...
4 Thomas Jefferson Second Inaugural Address Monday, March 4, 1805 PROCEEDING, fellow-citizens, to that qualifica...

It is clear that free-text like this is not in tabular form. Before any text analytics can be performed, we must ensure that the text data is in a format that can be used.

Standardize the text

Unstructured text data cannot be directly used in most analyses. Multiple steps need to be taken to go from a long free form string to a set of numeric columns in the right format that can be ingested by a machine learning model. The first step of this process is to standardize the data and eliminate any characters that could cause problems later on in your analytic pipeline.

We will start by cleaning up the text column: removing all non-letter characters and converting the text to lower case.

# Replace all non-letter characters with a whitespace
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())
0    fellow citizens of the senate and of the house...
1    fellow citizens   i am again called upon by th...
2    when it was first perceived  in early times  t...
3    friends and fellow citizens   called upon to u...
4    proceeding  fellow citizens  to that qualifica...
Name: text_clean, dtype: object

High level text features

Once the text has been cleaned and standardized you can begin creating features from the data. The most fundamental information you can calculate about free form text is its size, such as its length and number of words.

# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average word length
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print the first 5 rows of these columns
speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head()
text_clean char_cnt word_cnt avg_word_length
0 fellow citizens of the senate and of the house... 8616 1432 6.016760
1 fellow citizens i am again called upon by th... 787 135 5.829630
2 when it was first perceived in early times t... 13871 2323 5.971158
3 friends and fellow citizens called upon to u... 10144 1736 5.843318
4 proceeding fellow citizens to that qualifica... 12902 2169 5.948363

Word count representation

Once high level information has been recorded you can begin creating features based on the actual content of each text.

The most common approach to this is to create a column for each word and record the number of times each particular word appears in the text.

This results in a set of columns equal in width to the number of unique words in the dataset, with counts filling each entry.
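To make the idea concrete, here is a minimal sketch of this counting on a tiny made-up corpus, using only the standard library (the example strings are purely illustrative):

from collections import Counter

toy_corpus = ['the people of the united states',
              'we the people']

# Count how often each word appears in each text
counts = [Counter(text.split()) for text in toy_corpus]

# The set of unique words across all texts becomes the set of columns
vocabulary = sorted(set(word for text in toy_corpus for word in text.split()))

# One row per text, one column per unique word
rows = [[c.get(word, 0) for word in vocabulary] for c in counts]
print(vocabulary)
print(rows)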

While you could of course write a script to do this counting yourself, scikit-learn already has this functionality built in via its CountVectorizer class, as shown here:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
print(cv)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

It may have become apparent that creating a column for every word will result in far too many features for most analyses. Thankfully, you can specify arguments when initializing your CountVectorizer to limit this:

  • min_df: the minimum fraction (or number) of documents a word must occur in to be kept
  • max_df: the maximum fraction (or number) of documents a word can occur in and still be kept

For example, you can specify the minimum number of texts that a word must be contained in using the argument min_df. If an integer is given, the word must appear in at least that many documents; if a float is given, it must appear in at least that proportion of documents. This threshold eliminates words that occur so rarely that they would not be useful when generalizing to new texts.

Conversely, max_df keeps only words that occur in less than a certain proportion (or number) of the documents. This can be useful for removing words that occur too frequently to be of any value.
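For instance, here is a minimal sketch of the two ways these thresholds can be given (the specific values are arbitrary, chosen purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# min_df as an integer: keep only words that appear in at least 2 documents
cv_int = CountVectorizer(min_df=2)

# min_df / max_df as floats: keep only words that appear in at least 20%
# and at most 80% of the documents
cv_frac = CountVectorizer(min_df=0.2, max_df=0.8)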

Once the vectorizer has been instantiated you can fit it on the data you want to create your features from. This is done by calling the fit() method on the relevant column.

Once the vectorizer has been fit, you can call the transform() method on the column you want to transform.

This outputs a sparse matrix, with a row for every text and a column for every word that has been counted.

To convert this to a regular (dense) array you can use the toarray() method. You may notice that the output is a plain array, with no concept of column names.

To get the names of the features that have been generated you can call the get_feature_names() method on the vectorizer (renamed to get_feature_names_out() in newer versions of scikit-learn). It returns a list of the generated features, in the same order as the columns of the converted array.

As an aside, while fitting and transforming separately can be useful, particularly when you need to transform a different data set than the one the vectorizer was fit on, you can accomplish both steps at once using the fit_transform() method. This gives you an array containing the count values for each of the words of interest; together with the feature names, you can combine these into a DataFrame as shown below. The add_prefix() method makes it easy to distinguish these columns later on.

Finally, you can combine this DataFrame with your original DataFrame using pandas' concat() method, so the new features can be used in future analytical models.

# Fit the vectorizer
cv.fit(speech_df['text_clean'])

# Print feature names
print(cv.get_feature_names()[:100])
['abandon', 'abandoned', 'abandonment', 'abate', 'abdicated', 'abeyance', 'abhorring', 'abide', 'abiding', 'abilities', 'ability', 'abject', 'able', 'ably', 'abnormal', 'abode', 'abolish', 'abolished', 'abolishing', 'aboriginal', 'aborigines', 'abound', 'abounding', 'abounds', 'about', 'above', 'abraham', 'abreast', 'abridging', 'abroad', 'absence', 'absent', 'absolute', 'absolutely', 'absolutism', 'absorb', 'absorbed', 'absorbing', 'absorbs', 'abstain', 'abstaining', 'abstract', 'abstractions', 'absurd', 'abundance', 'abundant', 'abundantly', 'abuse', 'abused', 'abuses', 'academies', 'accept', 'acceptance', 'accepted', 'accepting', 'accepts', 'access', 'accessible', 'accession', 'accident', 'accidental', 'accidents', 'acclaim', 'accommodation', 'accommodations', 'accompanied', 'accompany', 'accomplish', 'accomplished', 'accomplishing', 'accomplishment', 'accomplishments', 'accord', 'accordance', 'accorded', 'according', 'accordingly', 'accords', 'account', 'accountability', 'accountable', 'accounted', 'accrue', 'accrued', 'accruing', 'accumulate', 'accumulated', 'accumulation', 'accurately', 'accustom', 'achieve', 'achieved', 'achievement', 'achievements', 'achieving', 'acknowledge', 'acknowledged', 'acknowledging', 'acknowledgment', 'acquaintance']

Great, this vectorizer can be applied both to the text it was trained on and to new texts.

Once the vectorizer has been fit to the data, it can be used to transform the text to an array representing the word counts. This array will have a row per block of text and a column for each of the features generated by the vectorizer that you observed above.

# Apply the vectorizer
cv_transformed = cv.transform(speech_df['text_clean'])

# Convert the sparse output to a dense array
cv_array = cv_transformed.toarray()

# Print the shape of cv_array
print(cv_array.shape)
(58, 9043)

The speeches have 9043 unique words, which is a lot! In the next step, you will see how to create a limited set of features.

As you have seen, using the CountVectorizer with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.

For this purpose CountVectorizer has parameters that you can set to reduce the number of features:

  • min_df : Use only words that occur in at least this proportion (or number) of documents. This can be used to remove outlier words that will not generalize across texts.
  • max_df : Use only words that occur in at most this proportion (or number) of documents. This is useful for eliminating very common words, such as "and" or "the", that occur in almost every document without adding value.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=0.2, max_df=0.8)

# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()

# Print the array shape
print(cv_array.shape)
(58, 818)

Did you notice that the number of features (unique words) was greatly reduced, from 9043 to 818?

Now that you have generated these count-based features in an array, you will need to reformat them so that they can be combined with the rest of the dataset. This can be achieved by converting the array into a pandas DataFrame, with the feature names you found earlier as the column names, and then concatenating it with the original DataFrame.

# Create a DataFrame with these features
cv_df = pd.DataFrame(cv_array,
                     columns=cv.get_feature_names()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
speech_df_new.head()
Name Inaugural Address Date text text_clean char_cnt word_cnt avg_word_length Counts_abiding Counts_ability ... Counts_women Counts_words Counts_work Counts_wrong Counts_year Counts_years Counts_yet Counts_you Counts_young Counts_your
0 George Washington First Inaugural Address Thursday, April 30, 1789 Fellow-Citizens of the Senate and of the House... fellow citizens of the senate and of the house... 8616 1432 6.016760 0 0 ... 0 0 0 0 0 1 0 5 0 9
1 George Washington Second Inaugural Address Monday, March 4, 1793 Fellow Citizens: I AM again called upon by th... fellow citizens i am again called upon by th... 787 135 5.829630 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 John Adams Inaugural Address Saturday, March 4, 1797 WHEN it was first perceived, in early times, t... when it was first perceived in early times t... 13871 2323 5.971158 0 0 ... 0 0 0 0 2 3 0 0 0 1
3 Thomas Jefferson First Inaugural Address Wednesday, March 4, 1801 Friends and Fellow-Citizens: CALLED upon to u... friends and fellow citizens called upon to u... 10144 1736 5.843318 0 0 ... 0 0 1 2 0 0 2 7 0 7
4 Thomas Jefferson Second Inaugural Address Monday, March 4, 1805 PROCEEDING, fellow-citizens, to that qualifica... proceeding fellow citizens to that qualifica... 12902 2169 5.948363 0 0 ... 0 0 0 0 2 2 2 4 0 4

5 rows × 826 columns

With the new features combined with the original DataFrame, they can now be used for ML models or analysis.

Tf-idf representation

While counts of word occurrences can be a good first step towards encoding your text to build models, this approach has some limitations. The main issue is that counts will be much higher for very common words that occur across all texts, which provides little value as a distinguishing feature.

To prevent these common words from overpowering your model, some form of normalization can be used. One of the most effective approaches is called "term frequency-inverse document frequency", or TF-IDF.

TF-IDF divides the number of times a word occurs in a document by a measure of the proportion of all documents in which that word occurs. This has the effect of reducing the weight of common words, while increasing the weight of words that occur in only a few documents.
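To make this concrete, here is a minimal sketch of the idea on a made-up count matrix. Note this is a simplified formulation: scikit-learn's TfidfVectorizer additionally applies smoothing and normalizes each row, so its exact values will differ.

import numpy as np

# Made-up word counts: 3 documents (rows) x 4 words (columns)
counts = np.array([[3, 0, 1, 2],
                   [2, 1, 0, 2],
                   [4, 0, 0, 1]])

n_docs = counts.shape[0]

# Document frequency: how many documents does each word appear in?
doc_freq = (counts > 0).sum(axis=0)

# Inverse document frequency: words in fewer documents get a larger weight
idf = np.log(n_docs / doc_freq)

# TF-IDF: raw counts scaled down for common words, up for rare ones
tfidf = counts * idf
print(tfidf.round(3))

In this sketch, words that appear in every document end up with a weight of zero, while words confined to a single document are weighted most heavily. In practice you can let scikit-learn's TfidfVectorizer do all of this for you: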

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.toarray(),
                     columns=tv.get_feature_names()).add_prefix('TFIDF_')
tv_df.head()
TFIDF_action TFIDF_administration TFIDF_america TFIDF_american TFIDF_americans TFIDF_believe TFIDF_best TFIDF_better TFIDF_change TFIDF_citizens ... TFIDF_things TFIDF_time TFIDF_today TFIDF_union TFIDF_united TFIDF_war TFIDF_way TFIDF_work TFIDF_world TFIDF_years
0 0.000000 0.133415 0.000000 0.105388 0.0 0.000000 0.000000 0.000000 0.000000 0.229644 ... 0.000000 0.045929 0.0 0.136012 0.203593 0.000000 0.060755 0.000000 0.045929 0.052694
1 0.000000 0.261016 0.266097 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.179712 ... 0.000000 0.000000 0.0 0.000000 0.199157 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.092436 0.157058 0.073018 0.0 0.000000 0.026112 0.060460 0.000000 0.106072 ... 0.032030 0.021214 0.0 0.062823 0.070529 0.024339 0.000000 0.000000 0.063643 0.073018
3 0.000000 0.092693 0.000000 0.000000 0.0 0.090942 0.117831 0.045471 0.053335 0.223369 ... 0.048179 0.000000 0.0 0.094497 0.000000 0.036610 0.000000 0.039277 0.095729 0.000000
4 0.041334 0.039761 0.000000 0.031408 0.0 0.000000 0.067393 0.039011 0.091514 0.273760 ... 0.082667 0.164256 0.0 0.121605 0.030338 0.094225 0.000000 0.000000 0.054752 0.062817

5 rows × 100 columns

Inspecting Tf-idf values: After creating Tf-idf features you will often want to understand which words score highest for each document. This can be achieved by isolating the row you want to examine and then sorting the scores from high to low.

# Isolate the row to be examined
sample_row = tv_df.iloc[0]

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())
TFIDF_government    0.367430
TFIDF_public        0.333237
TFIDF_present       0.315182
TFIDF_duty          0.238637
TFIDF_citizens      0.229644
Name: 0, dtype: float64
# Isolate the row to be examined
sample_row = tv_df.iloc[1]

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())
TFIDF_shall             0.567446
TFIDF_america           0.266097
TFIDF_administration    0.261016
TFIDF_present           0.246652
TFIDF_president         0.242128
Name: 1, dtype: float64

Transforming unseen data

When creating vectors from text, any transformations that you perform before training a machine learning model also need to be applied to the new, unseen (test) data.

Fit the vectorizer only on the training data, and apply it to the test data.

speech_df.shape
(58, 8)

# Split the speeches into a training and a test set
train_speech_df = speech_df.iloc[:45]
test_speech_df = speech_df.iloc[45:]

# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer on the training data and transform it
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(),
                          columns=tv.get_feature_names()).add_prefix('TFIDF_')
test_tv_df.head()
TFIDF_action TFIDF_administration TFIDF_america TFIDF_american TFIDF_authority TFIDF_best TFIDF_business TFIDF_citizens TFIDF_commerce TFIDF_common ... TFIDF_subject TFIDF_support TFIDF_time TFIDF_union TFIDF_united TFIDF_war TFIDF_way TFIDF_work TFIDF_world TFIDF_years
0 0.000000 0.029540 0.233954 0.082703 0.000000 0.000000 0.000000 0.022577 0.0 0.000000 ... 0.0 0.000000 0.115378 0.000000 0.024648 0.079050 0.033313 0.000000 0.299983 0.134749
1 0.000000 0.000000 0.547457 0.036862 0.000000 0.036036 0.000000 0.015094 0.0 0.000000 ... 0.0 0.019296 0.092567 0.000000 0.000000 0.052851 0.066817 0.078999 0.277701 0.126126
2 0.000000 0.000000 0.126987 0.134669 0.000000 0.131652 0.000000 0.000000 0.0 0.046997 ... 0.0 0.000000 0.075151 0.000000 0.080272 0.042907 0.054245 0.096203 0.225452 0.043884
3 0.037094 0.067428 0.267012 0.031463 0.039990 0.061516 0.050085 0.077301 0.0 0.000000 ... 0.0 0.098819 0.210690 0.000000 0.056262 0.030073 0.038020 0.235998 0.237026 0.061516
4 0.000000 0.000000 0.221561 0.156644 0.028442 0.087505 0.000000 0.109959 0.0 0.023428 ... 0.0 0.023428 0.187313 0.131913 0.040016 0.021389 0.081124 0.119894 0.299701 0.153133

5 rows × 100 columns

test_tv_df.shape
(13, 100)

Moral of the story: the vectorizer should only ever be fit on the training set, never on your test set.
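One way to make this hard to get wrong is to wrap the vectorizer and a model together in a scikit-learn Pipeline, so the vectorizer is only ever fit as part of fitting the pipeline on the training data. A minimal sketch (the LogisticRegression model and the y_train labels are hypothetical here, since this dataset has no target column):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# The pipeline fits the vectorizer and the model on the training data only,
# then reuses the fitted vectorizer when predicting on unseen text
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=100, stop_words='english')),
    ('model', LogisticRegression())
])

# Hypothetical usage, assuming a target column y_train exists:
# text_pipeline.fit(train_speech_df['text_clean'], y_train)
# predictions = text_pipeline.predict(test_speech_df['text_clean'])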

N-grams

So far we have looked at individual words on their own, without any context or word order. This approach is called a bag-of-words model, as the words are treated as if they are drawn from a bag, with no concept of order or grammar.

This can be quite powerful when used in a machine learning model, but you may be concerned that looking at words individually ignores a lot of context. To deal with this you can use n-grams, which are sequences of n consecutive words grouped together. For example:

  • bigrams: Sequences of two consecutive words
  • trigrams: Sequences of three consecutive words

These can be automatically created in your dataset by specifying the ngram_range argument as a tuple (n1, n2) where all n-grams in the n1 to n2 range are included.
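For instance, a minimal sketch on a single made-up sentence, using ngram_range=(1, 2) so that both unigrams and bigrams are generated:

from sklearn.feature_extraction.text import CountVectorizer

toy_text = ['we the people of the united states']

# Include every n-gram from length 1 up to length 2
cv_unibigram = CountVectorizer(ngram_range=(1, 2))
cv_unibigram.fit(toy_text)

# get_feature_names_out() in newer versions of scikit-learn
print(cv_unibigram.get_feature_names())

Applied to the full set of speeches, you can create a trigram-only vectorizer in the same way: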

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100,
                                 stop_words='english',
                                 ngram_range=(3,3))

# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the trigram features
print(cv_trigram_vec.get_feature_names())
['ability preserve protect', 'agriculture commerce manufactures', 'america ideal freedom', 'amity mutual concession', 'anchor peace home', 'ask bow heads', 'best ability preserve', 'best interests country', 'bless god bless', 'bless united states', 'chief justice mr', 'children children children', 'citizens united states', 'civil religious liberty', 'civil service reform', 'commerce united states', 'confidence fellow citizens', 'congress extraordinary session', 'constitution does expressly', 'constitution united states', 'coordinate branches government', 'day task people', 'defend constitution united', 'distinction powers granted', 'distinguished guests fellow', 'does expressly say', 'equal exact justice', 'era good feeling', 'executive branch government', 'faithfully execute office', 'fellow citizens assembled', 'fellow citizens called', 'fellow citizens large', 'fellow citizens world', 'form perfect union', 'general welfare secure', 'god bless america', 'god bless god', 'good greatest number', 'government peace war', 'government united states', 'granted federal government', 'great body people', 'great political parties', 'greatest good greatest', 'guests fellow citizens', 'invasion wars powers', 'land new promise', 'laws faithfully executed', 'letter spirit constitution', 'liberty pursuit happiness', 'life liberty pursuit', 'local self government', 'make hard choices', 'men women children', 'mr chief justice', 'mr majority leader', 'mr president vice', 'mr speaker mr', 'mr vice president', 'nation like person', 'new breeze blowing', 'new states admitted', 'north south east', 'oath prescribed constitution', 'office president united', 'passed generation generation', 'peace shall strive', 'people united states', 'physical moral political', 'policy united states', 'power general government', 'preservation general government', 'preservation sacred liberty', 'preserve protect defend', 'president united states', 'president vice president', 'promote general welfare', 'proof confidence fellow', 'protect defend constitution', 'protection great interests', 'reform civil service', 'reserved states people', 'respect individual human', 'right self government', 'secure blessings liberty', 'south east west', 'sovereignty general government', 'states admitted union', 'territories united states', 'thank god bless', 'turning away old', 'united states america', 'united states best', 'united states government', 'united states great', 'united states maintain', 'united states territory', 'vice president mr', 'welfare secure blessings']

Here you can see that by taking sequences of consecutive words, some context is preserved.

Finding the most common words: It's always advisable, once you have created your features, to inspect them and ensure that they look as you would expect. This allows you to catch errors early, and may influence what further feature engineering you need to do.

# Create a DataFrame of the features
cv_tri_df = pd.DataFrame(cv_trigram.toarray(),
                 columns=cv_trigram_vec.get_feature_names()).add_prefix('Counts_')

# Print the top 5 words in the sorted output
print(cv_tri_df.sum().sort_values(ascending=False).head())
Counts_constitution united states    20
Counts_people united states          13
Counts_preserve protect defend       10
Counts_mr chief justice              10
Counts_president united states        8
dtype: int64

Great, the fact that the most common trigram is "constitution united states" makes a lot of sense for US presidents' inaugural speeches.