Build a Machine Learning Pipeline to Predict a Multi-class Multi-label Target Variable

27 minute read

Predicting a Multi-class Multi-label Target variable using a Pipeline

In this post, I will cover how to tackle a problem where we are trying to predict a multi-class multi-label response variable. Along the way, we will cover the following:

  • How to work with text and numerical features to predict a multi-class multi-label response?
  • How to preprocess text using simple NLP tricks?
  • How to preprocess Numerical columns?
  • How to build a Machine Learning pipeline using sklearn’s Pipeline object?
  • How to use FunctionTransformer to convert a function to be usable inside a pipeline?
  • How to combine the results of two sub-pipelines using FeatureUnion?
  • How to use Feature Engineering techniques like interactions (stats trick) to improve performance?
  • How to improve computational efficiency using hashing (HashingVectorizer)?

Predicting School Budgets

Here’s the problem statement:

The goal is to predict the probability that a certain label is attached to a budget line item. Each row in the budget consists mostly of free-form text features, except for the two noted below as float. Any of the fields may be empty. A budget line contains the following fields, which are the explanatory variables/features with which we will predict the labels (response variables).

Explanatory Variables/Features:

  • FTE float - If an employee, the percentage of full-time that the employee works.
  • Facility_or_Department - If expenditure is tied to a department/facility, that department/facility.
  • Function_Description - A description of the function the expenditure was serving.
  • Fund_Description - A description of the source of the funds.
  • Job_Title_Description - If this is an employee, a description of that employee’s job title.
  • Location_Description - A description of where the funds were spent.
  • Object_Description - A description of what the funds were used for.
  • Position_Extra - Any extra information about the position that we have.
  • Program_Description - A description of the program that the funds were used for.
  • SubFund_Description - More detail on Fund_Description
  • Sub_Object_Description - More detail on Object_Description
  • Text_1 - Any additional text supplied by the district.
  • Text_2 - Any additional text supplied by the district.
  • Text_3 - Any additional text supplied by the district.
  • Text_4 - Any additional text supplied by the district.
  • Total float - The total cost of the expenditure.

The response/target variables are the following 9 label columns, each of which can take one of the listed values:

  • Function:
    • Aides Compensation
    • Career & Academic Counseling
    • Communications
    • Curriculum Development
    • Data Processing & Information Services
    • Development & Fundraising
    • Enrichment
    • Extended Time & Tutoring
    • Facilities & Maintenance
    • Facilities Planning
    • Finance, Budget, Purchasing & Distribution
    • Food Services
    • Governance
    • Human Resources
    • Instructional Materials & Supplies
    • Insurance
    • Legal
    • Library & Media
    • NO_LABEL
    • Other Compensation
    • Other Non-Compensation
    • Parent & Community Relations
    • Physical Health & Services
    • Professional Development
    • Recruitment
    • Research & Accountability
    • School Administration
    • School Supervision
    • Security & Safety
    • Social & Emotional
    • Special Population Program Management & Support
    • Student Assignment
    • Student Transportation
    • Substitute Compensation
    • Teacher Compensation
    • Untracked Budget Set-Aside
    • Utilities
  • Object_Type:
    • Base Salary/Compensation
    • Benefits
    • Contracted Services
    • Equipment & Equipment Lease
    • NO_LABEL
    • Other Compensation/Stipend
    • Other Non-Compensation
    • Rent/Utilities
    • Substitute Compensation
    • Supplies/Materials
    • Travel & Conferences
  • Operating_Status:
    • Non-Operating
    • Operating, Not PreK-12
    • PreK-12 Operating
  • Position_Type:
    • (Exec) Director
    • Area Officers
    • Club Advisor/Coach
    • Coordinator/Manager
    • Custodian
    • Guidance Counselor
    • Instructional Coach
    • Librarian
    • NO_LABEL
    • Non-Position
    • Nurse
    • Nurse Aide
    • Occupational Therapist
    • Other
    • Physical Therapist
    • Principal
    • Psychologist
    • School Monitor/Security
    • Sec/Clerk/Other Admin
    • Social Worker
    • Speech Therapist
    • Substitute
    • TA
    • Teacher
    • Vice Principal
  • Pre_K:
    • NO_LABEL
    • Non PreK
    • PreK
  • Reporting:
    • NO_LABEL
    • Non-School
    • School
  • Sharing:
    • Leadership & Management
    • NO_LABEL
    • School Reported
    • School on Central Budgets
    • Shared Services
  • Student_Type:
    • Alternative
    • At Risk
    • ELL
    • Gifted
    • NO_LABEL
    • Poverty
    • PreK
    • Special Education
    • Unspecified
  • Use:
    • Business Services
    • ISPD
    • Instruction
    • Leadership
    • NO_LABEL
    • O&M
    • Pupil Services & Enrichment
    • Untracked Budget Set-Aside

Introduction

School budgets in the United States are incredibly complex, and there are no standards for reporting how money is spent. Schools want to be able to measure their performance: for example, are we spending more on textbooks than our neighboring schools, and is that investment worthwhile? However, this type of analysis takes hundreds of hours each year, in which analysts hand-categorize each line item. Our goal is to build a machine learning algorithm that can automate that process.

For each line item, we have some text fields that tell us about the expense. For example, a line might say something like “Algebra books for 8th grade students”. We also have the amount of the expense in dollars. Each line item then has a set of labels attached to it, for example “Text books”, “Math”, “Middle School”. These labels are our target variables.

Listed above are the categories we need to determine. For example, one of the categories is Pre_K, so the question becomes: is this expense for pre-kindergarten education?

Overall, there are 9 columns with many different possible categories in each column. If you talk to people who actually do this work, they will tell you that it is impossible for a human to label these lines with 100% accuracy. To take this into account, we don’t want our algorithm to just say “This line is for textbooks”. Instead, we want it to say “It’s most likely that this line is for textbooks, and I am 60% sure of it; if it is not textbooks, then I am 30% sure it is office supplies”. With these suggestions, analysts can prioritize their time. This is called a human-in-the-loop machine learning system. We will predict a probability between 0 (the algorithm thinks this label is very unlikely for this line item) and 1 (the algorithm thinks this label is very likely for this line item).

Is it supervised or unsupervised?

This is a supervised learning problem where we want to use correctly labelled data to build an algorithm that can suggest labels for unlabeled lines.

Is it classification or regression?

For this problem, there are over 100 unique labels that could be attached to a single line item. Because we want to predict a category for each line item, this is a classification problem.

So, in short: our goal is to develop a model that predicts the probability of each possible label by relying on some correctly labeled examples. Alternately stated: our goal is to correctly label budget line items by training a supervised model to predict the probability of each possible label, taking the most probable label as the correct label.

%matplotlib inline
from __future__ import division
from __future__ import print_function

# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

from data.multilabel import multilabel_sample_dataframe, multilabel_train_test_split
from features.SparseInteractions import SparseInteractions
from models.metrics import multi_multi_log_loss

Load Data

First, we’ll load the entire training dataset available from DrivenData. In order to run this notebook, you will need the training data saved as data/TrainingData.csv:

path_to_training_data = os.path.join(os.pardir,
                                     'data',
                                     'TrainingData.csv')

df = pd.read_csv(path_to_training_data, index_col=0)

print(df.shape)
(400277, 25)

Exploratory Data Analysis

Some of the columns correspond to features - descriptions of the budget items - such as the Job_Title_Description column. The values in this column tell us whether a budget item is for a teacher, custodian, or other employee.

Other columns correspond to budget item labels - the ones we are trying to predict with our model. For example, the Object_Type column describes whether a budget item is related to classroom supplies, salary, travel expenses, etc.

df.columns
Index(['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type',
       'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status',
       'Object_Description', 'Text_2', 'SubFund_Description',
       'Job_Title_Description', 'Text_3', 'Text_4', 'Sub_Object_Description',
       'Location_Description', 'FTE', 'Function_Description',
       'Facility_or_Department', 'Position_Extra', 'Total',
       'Program_Description', 'Fund_Description', 'Text_1'],
      dtype='object')

In this dataset, there are a total of 25 columns. The breakdown is as follows:

  • 9 of these columns are target variable/labels.
  • 16 of these columns are features. Of these, 14 are text features and 2 are numeric features.

Shown below are the head, the tail, and the datatype of each of these columns.

df.head()
Function Use Sharing Reporting Student_Type Position_Type Object_Type Pre_K Operating_Status Object_Description ... Sub_Object_Description Location_Description FTE Function_Description Facility_or_Department Position_Extra Total Program_Description Fund_Description Text_1
134338 Teacher Compensation Instruction School Reported School NO_LABEL Teacher NO_LABEL NO_LABEL PreK-12 Operating NaN ... NaN NaN 1.0 NaN NaN KINDERGARTEN 50471.810 KINDERGARTEN General Fund NaN
206341 NO_LABEL NO_LABEL NO_LABEL NO_LABEL NO_LABEL NO_LABEL NO_LABEL NO_LABEL Non-Operating CONTRACTOR SERVICES ... NaN NaN NaN RGN GOB NaN UNDESIGNATED 3477.860 BUILDING IMPROVEMENT SERVICES NaN BUILDING IMPROVEMENT SERVICES
326408 Teacher Compensation Instruction School Reported School Unspecified Teacher Base Salary/Compensation Non PreK PreK-12 Operating Personal Services - Teachers ... NaN NaN 1.0 NaN NaN TEACHER 62237.130 Instruction - Regular General Purpose School NaN
364634 Substitute Compensation Instruction School Reported School Unspecified Substitute Benefits NO_LABEL PreK-12 Operating EMPLOYEE BENEFITS ... NaN NaN NaN UNALLOC BUDGETS/SCHOOLS NaN PROFESSIONAL-INSTRUCTIONAL 22.300 GENERAL MIDDLE/JUNIOR HIGH SCH NaN REGULAR INSTRUCTION
47683 Substitute Compensation Instruction School Reported School Unspecified Teacher Substitute Compensation NO_LABEL PreK-12 Operating TEACHER COVERAGE FOR TEACHER ... NaN NaN NaN NON-PROJECT NaN PROFESSIONAL-INSTRUCTIONAL 54.166 GENERAL HIGH SCHOOL EDUCATION NaN REGULAR INSTRUCTION

5 rows × 25 columns

df.tail()
Function Use Sharing Reporting Student_Type Position_Type Object_Type Pre_K Operating_Status Object_Description ... Sub_Object_Description Location_Description FTE Function_Description Facility_or_Department Position_Extra Total Program_Description Fund_Description Text_1
109283 Professional Development ISPD Shared Services Non-School Unspecified Instructional Coach Other Compensation/Stipend NO_LABEL PreK-12 Operating WORKSHOP PARTICIPANT ... NaN STAFF DEV AND INSTR MEDIA NaN INST STAFF TRAINING SVCS NaN NaN 48.620000 NaN GENERAL FUND STAFF DEV AND INSTR MEDIA
102430 Substitute Compensation Instruction School Reported School Unspecified Substitute Base Salary/Compensation NO_LABEL PreK-12 Operating SALARIES OF PART TIME EMPLOYEE ... NaN NaN 0.00431 TITLE II,D NaN PROFESSIONAL-INSTRUCTIONAL 128.824985 INSTRUCTIONAL STAFF TRAINING NaN INSTRUCTIONAL STAFF
413949 Parent & Community Relations NO_LABEL School Reported School NO_LABEL Other NO_LABEL NO_LABEL PreK-12 Operating NaN ... NaN NaN 1.00000 NaN NaN PARENT/TITLE I 4902.290000 Misc Schoolwide Schools NaN
433672 Library & Media Instruction School on Central Budgets Non-School Unspecified Librarian Benefits NO_LABEL PreK-12 Operating EMPLOYEE BENEFITS ... NaN ED RESOURCE SERVICES NaN NON-PROJECT NaN OFFICE/ADMINISTRATIVE SUPPORT 4020.290000 MEDIA SUPPORT SERVICES NaN INSTRUCTIONAL STAFF
415831 Substitute Compensation Instruction School Reported School Poverty Substitute Substitute Compensation Non PreK PreK-12 Operating Salaries And Wages For Substitute Professionals ... Inservice Substitute Teachers Grant Funded School NaN Instruction Instruction And Curriculum CERTIFIED SUBSTITUTE 46.530000 Accelerated Education "Title Part A Improving Basic Programs" MISCELLANEOUS

5 rows × 25 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400277 entries, 134338 to 415831
Data columns (total 25 columns):
Function                  400277 non-null object
Use                       400277 non-null object
Sharing                   400277 non-null object
Reporting                 400277 non-null object
Student_Type              400277 non-null object
Position_Type             400277 non-null object
Object_Type               400277 non-null object
Pre_K                     400277 non-null object
Operating_Status          400277 non-null object
Object_Description        375493 non-null object
Text_2                    88217 non-null object
SubFund_Description       306855 non-null object
Job_Title_Description     292743 non-null object
Text_3                    109152 non-null object
Text_4                    53746 non-null object
Sub_Object_Description    91603 non-null object
Location_Description      162054 non-null object
FTE                       126071 non-null float64
Function_Description      342195 non-null object
Facility_or_Department    53886 non-null object
Position_Extra            264764 non-null object
Total                     395722 non-null float64
Program_Description       304660 non-null object
Fund_Description          202877 non-null object
Text_1                    292285 non-null object
dtypes: float64(2), object(23)
memory usage: 89.4+ MB
# check the number of object types
df.dtypes.value_counts()
object     23
float64     2
dtype: int64

Encode the labels as categories

Remember, our ultimate goal is to predict the probability that a certain label is attached to a budget line item. We just saw that many columns in the data are of the inefficient object type.

There are 9 columns of labels in the dataset. Each of these columns is a category that can take many possible values. We store these column names in a list called LABELS, inspect their dtypes, and then convert them to the category dtype.

LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type',
          'Pre_K',
          'Operating_Status']
df[LABELS].dtypes
Function            object
Use                 object
Sharing             object
Reporting           object
Student_Type        object
Position_Type       object
Object_Type         object
Pre_K               object
Operating_Status    object
dtype: object
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis='rows')

# Print the converted dtypes
print(df[LABELS].dtypes)
Function            category
Use                 category
Sharing             category
Reporting           category
Student_Type        category
Position_Type       category
Object_Type         category
Pre_K               category
Operating_Status    category
dtype: object

Count the number of unique labels

There are over 100 unique labels. We will explore this fact by counting and plotting the number of unique values for each label column, using the pd.Series.nunique method, which counts the number of unique values in a Series.

# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique, axis='rows')

# Plot number of unique values for each label
num_unique_labels.plot(kind='bar')

# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()

[Bar plot: number of unique values in each of the 9 label columns]

Choosing a metric to evaluate our algorithm

Choosing how to evaluate our machine learning algorithm is one of the most important decisions an analyst makes. Instead of using accuracy, we will use log loss. Log loss is what is called a loss function: a measure of error that heavily penalizes predictions that are confident but wrong. We want our error to be as small as possible.
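As a minimal sketch of why log loss is a good fit here (a toy helper of our own, not the competition code), consider a single 0/1 label; the custom multi_multi_log_loss metric used later applies, roughly speaking, the same idea across the 9 groups of label columns:

import numpy as np

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Log loss for a single 0/1 label (toy illustration)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # clip to avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_log_loss(np.array([1]), np.array([0.9])))  # ~0.105: confident and right
print(binary_log_loss(np.array([1]), np.array([0.1])))  # ~2.303: confident and wrong

Being confidently wrong costs roughly twenty times more than being confidently right, which is exactly the behavior we want in a human-in-the-loop system.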

Splitting the multi-class dataset

As we are dealing with a multi-class multi-label target variable, a simple train-test split may not be sufficient: some of the rarer classes/labels might end up appearing only in the training data. In order to have a representative training dataset, we will use a custom function: multilabel_train_test_split(). This function is an extension of StratifiedShuffleSplit, which only works when we have one target variable. Since we have many target variables, we need this custom function, which ensures that all the classes are represented in both the training and test sets.

The first step is to split the data into a training set and a test set. Some labels don’t occur very often, but we want to make sure that they appear in both sets. multilabel_train_test_split makes sure that at least min_count examples of each label appear in each split.

Building a simple model

We will be using a multi-class logistic regression model.

We’ll start with a simple model that uses just the numeric columns of the DataFrame. The data has been read into a DataFrame df, and a list consisting of just the numeric columns is available as NUMERIC_COLUMNS.

NUMERIC_COLUMNS = ['FTE', "Total"]

# Create the new DataFrame: numeric_data_only.
# Do some pre-processing by replacing NaN values with -1000,
# so that missing values are distinguishable from real 0 values.
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

The get_dummies function performs one-hot encoding of each of the categorical variables. We need to do this because sklearn models expect categorical target variables in this format for model building.
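As a toy illustration (made-up values, not rows from the budget data), one categorical column becomes one 0/1 indicator column per category:

import pandas as pd

toy = pd.DataFrame({'Pre_K': ['PreK', 'Non PreK', 'NO_LABEL']})
print(pd.get_dummies(toy))
#    Pre_K_NO_LABEL  Pre_K_Non PreK  Pre_K_PreK
# 0               0               0           1
# 1               0               1           0
# 2               1               0           0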

label_dummies.head()
Function_Aides Compensation Function_Career & Academic Counseling Function_Communications Function_Curriculum Development Function_Data Processing & Information Services Function_Development & Fundraising Function_Enrichment Function_Extended Time & Tutoring Function_Facilities & Maintenance Function_Facilities Planning ... Object_Type_Rent/Utilities Object_Type_Substitute Compensation Object_Type_Supplies/Materials Object_Type_Travel & Conferences Pre_K_NO_LABEL Pre_K_Non PreK Pre_K_PreK Operating_Status_Non-Operating Operating_Status_Operating, Not PreK-12 Operating_Status_PreK-12 Operating
134338 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
206341 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
326408 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
364634 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
47683 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 1 0 0 0 0 1

5 rows × 104 columns

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2,
                                                               seed=123)

# Print the info
print("X_train info:")
print(X_train.info())
print("\nX_test info:")  
print(X_test.info())
print("\ny_train info:")  
print(y_train.info())
print("\ny_test info:")  
print(y_test.info())
X_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Data columns (total 2 columns):
FTE      320222 non-null float64
Total    320222 non-null float64
dtypes: float64(2)
memory usage: 7.3 MB
None

X_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80055 entries, 206341 to 72072
Data columns (total 2 columns):
FTE      80055 non-null float64
Total    80055 non-null float64
dtypes: float64(2)
memory usage: 1.8 MB
None

y_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 34.2 MB
None

y_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80055 entries, 206341 to 72072
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 8.6 MB
None

If you look closely, we have 2 features and 104 target columns. It is highly unlikely that we will get good results, but our goal is to build a simple model first and then iterate to improve it.

We will import LogisticRegression and OneVsRestClassifier in order to fit a multi-class logistic regression model to the NUMERIC_COLUMNS of our feature data.

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))
Accuracy: 0.0

Remember, we’re ultimately going to be using logloss to score our model, so don’t worry too much about the accuracy here. Keep in mind that we’re throwing away all of the text data in the dataset - that’s by far most of the data! So don’t get your hopes up for killer performance just yet. We’re just interested in getting things up and running at the moment.

Since we are dealing with a multi-class problem, we use the sklearn.multiclass module’s OneVsRestClassifier. OneVsRest lets us treat each column of the target variable independently: essentially, it fits a separate classifier for each of the columns. This is one strategy you can use when dealing with multi-class problems.
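Conceptually, the strategy looks like the sketch below (ignoring edge cases such as constant columns, which sklearn handles internally):

from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# fit one independent binary classifier per target column (104 of them here)
fitted = []
for col in y_train.columns:
    est = clone(LogisticRegression())
    est.fit(X_train, y_train[col])  # each column is its own 0/1 problem
    fitted.append((col, est))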

For the sake of completeness, we will run the last step, which is running the model against the holdout dataset and saving the predicted probabilities into a csv file.

path_to_holdout_data = os.path.join(os.pardir,
                                    'data',
                                    'TestData.csv')

# Load holdout data
holdout = pd.read_csv(path_to_holdout_data, index_col=0)

# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))

Improving the model

Now that we have trained a simple model and observed how poorly it performed, we can try to improve the model’s performance by adding in the text data. In addition to adding the text data, we will be using Pipelines to better organize our model building and evaluation phases.

Pipelines provide a repeatable way to go from raw data to a trained model.

sklearn’s Pipeline object takes a sequential list of steps where the output of one step is the input to the next. Each step is represented as a tuple, with a name for that step and an object that implements fit and transform methods. Pipelines are a very flexible way to represent your workflow. The beauty of a pipeline is that it encapsulates every transformation from raw data to a trained model.
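Here is a minimal generic sketch of the API (toy step names and estimators, not our budget model yet):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# each step is a (name, estimator) tuple; every step except the last must
# implement fit/transform, and the last step only needs fit
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression())
])
# pipe.fit(X, y) runs scale.fit_transform(X), then clf.fit on the result;
# pipe.predict(X_new) applies scale.transform before clf.predict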

Resample Data

400,277 rows is too many to work with locally while we develop our approach. We’ll sample down to 40,000 rows so that it is easy and quick to run our analysis.

We’ll also create dummy variables for our labels and split our sampled dataset into a training set and a test set.

Notice the use of multilabel_sample_dataframe(), which samples the dataframe so that we have representation from all the classes. The implementation of this function can be found in the src directory.

Before we build the pipeline, let’s sample 40,000 observations so that we can quickly train models locally.

# target variables
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type',
          'Pre_K',
          'Operating_Status']

# all features
NON_LABELS = [c for c in df.columns if c not in LABELS]

# numeric features
NUMERIC_COLUMNS = ['FTE', "Total"]

# text features
TEXT_COLUMNS = [c for c in NON_LABELS if c not in NUMERIC_COLUMNS]

# sample size to work on local laptop
SAMPLE_SIZE = 40000

# returns a dataframe
sampling = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=43)

# create the dummy labels
# Note: since we are using df, these LABELS were already converted to category
dummy_labels = pd.get_dummies(sampling[LABELS])

Build a simple pipeline for numeric data

OK, so let’s build our simple numeric pipeline!

# split into train test
X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NUMERIC_COLUMNS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)

print("Training data shape with only Numeric Features: ")
print(X_train.shape)
print(y_train.shape)

print("Test data shape with only Numeric Features: ")
print(X_test.shape)
print(y_test.shape)
Training data shape with only Numeric Features:
(32000, 2)
(32000, 104)
Test data shape with only Numeric Features:
(8000, 2)
(8000, 104)

As we can see, we are using 32,000 observations with 2 numeric features. The 9 label columns are our targets, which have been one-hot encoded into 104 columns.

Using Imputation for missing numeric values

print(sampling[NUMERIC_COLUMNS].shape)
(40000, 2)
sampling[NUMERIC_COLUMNS].isna().sum()
FTE      27532
Total      443
dtype: int64

As you can see, there are a lot of missing values in the FTE column; one simple strategy is mean imputation.

# Import the Imputer and Pipeline objects
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline

# Instantiate Pipeline object: pl
pl = Pipeline([
        ('imp', Imputer()),  # mean imputation by default
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)
Accuracy on sample data - all numeric, incl nans:  0.0

We got the same accuracy. No surprise here, because all we did was put the earlier implementation into a pipeline.

Build a pipeline for Text columns

Before diving into working with the text columns, let’s take a peek at what each of the 14 text features looks like.

sampling_text_df = sampling[TEXT_COLUMNS]
sampling_text_df.head()
Object_Description Text_2 SubFund_Description Job_Title_Description Text_3 Text_4 Sub_Object_Description Location_Description Function_Description Facility_or_Department Position_Extra Program_Description Fund_Description Text_1
38 OTHER PURCHASED SERVICES NaN SCHOOL-WIDE SCHOOL PGMS FOR TITLE GRANTS NaN NaN NaN NaN NaN STUDENT TRANSPORT SERVICE NaN NaN Misc Schoolwide Schools NaN
70 Extra Duty Pay/Overtime For Support Personnel NaN Operations SECURITY OFFICER NaN NaN Extra Duty Pay/Overtime For Support Personnel Unallocated Security And Monitoring Services Security Department POLICE PATROL MAN Undistributed General Operating Fund OVERTIME
198 Supplemental * NaN Operation and Maintenance of Plant Services NaN NaN NaN Non-Certificated Salaries And Wages NaN Care and Upkeep of Building Services NaN NaN NaN Title I - Disadvantaged Children/Targeted Assi... TITLE I CARRYOVER
209 REPAIR AND MAINTENANCE SERVICES NaN PUPIL TRANSPORTATION NaN NaN NaN NaN ADMIN. SERVICES STUDENT TRANSPORT SERVICE NaN NaN PUPIL TRANSPORTATION General Fund NaN
614 NaN GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 HRS NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

As you can see, there are many NaN values. These first have to be replaced with empty strings “”. Secondly, the CountVectorizer object expects a single string to vectorize. The idea here is to treat each row as one single string that is passed into the vectorizer to produce the features. Let’s take one example of how CountVectorizer works on a single column; then we will extend that to take all the columns of one row, and finally we will apply it to all the rows and columns of the dataframe.

Using CountVectorizer on 1 single column

Let’s consider Position_Extra as the column that we will vectorize.

# showing first 10 values of Position_Extra
sampling_text_df.Position_Extra[:10]
38                             NaN
70               POLICE PATROL MAN
198                            NaN
209                            NaN
614                            NaN
662     PROFESSIONAL-INSTRUCTIONAL
750                        TEACHER
931                            NaN
1265                           NaN
1350                  UNDESIGNATED
Name: Position_Extra, dtype: object
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Fill missing values in df.Position_Extra
sampling_text_df.Position_Extra.fillna('', inplace=True)

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit(sampling_text_df.Position_Extra)

# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alpha numeric"
print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:30])
There are 299 tokens in Position_Extra if we split on non-alpha numeric
['1st', '2nd', '3rd', '4th', '5th', '9th', 'a', 'ab', 'accountability', 'adaptive', 'addit', 'additional', 'adm', 'admin', 'administrative', 'adult', 'aide', 'air', 'and', 'any', 'area', 'arra', 'art', 'assessment', 'assistant', 'assistive', 'asst', 'at', 'athletic', 'attendance']
sampling_text_df.Position_Extra[:10]
38                                
70               POLICE PATROL MAN
198                               
209                               
614                               
662     PROFESSIONAL-INSTRUCTIONAL
750                        TEACHER
931                               
1265                              
1350                  UNDESIGNATED
Name: Position_Extra, dtype: object
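As a quick sanity check on the token pattern itself, outside of CountVectorizer (which also lowercases tokens by default): the lookahead (?=\s+) requires trailing whitespace, so the final token of a string is not matched.

import re

print(re.findall(TOKENS_ALPHANUMERIC, 'POLICE PATROL MAN'))
# ['POLICE', 'PATROL'] - 'MAN' has no trailing whitespace, so it is dropped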

Using CountVectorizer on 1 single row/observation

Combining text columns for tokenization: in order to get a bag-of-words representation of all the text data in our DataFrame, we must first convert the text data in each row into a single string.

In the earlier case, when we dealt with only one column, this wasn’t necessary, because each row was already just a single string. CountVectorizer expects each row to be a single string, so in order to use all of the text columns, we need a method that turns a list of strings into a single string. The function combine_text_columns() does exactly that.

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.

        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # replace nans with blanks
    text_data.fillna("", inplace=True)

    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)
# lets reload the dataframe
sampling_text_df = sampling[TEXT_COLUMNS]

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = combine_text_columns(sampling_text_df)

display(text_vector.shape)

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.get_feature_names())))
(40000,)


There are 2375 alpha-numeric tokens in the dataset

FunctionTransformer to the rescue

Can we add everything we did in the above cell to the Pipeline object? Not as-is. Remember, one of the constraints for adding an object to the Pipeline is that it needs to have fit() and transform() methods implemented. To add the combine_text_columns() function to the pipeline, we first need to wrap it using the FunctionTransformer utility.

Any step in the pipeline must be an object that implements the fit and transform methods. The FunctionTransformer creates an object with these methods out of any Python function that you pass to it.

Since we are working with numeric data that needs imputation and text data that needs to be converted into a bag-of-words, we’ll create functions that separate the text from the numeric variables and see how the .fit() and .transform() methods work.

# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
# convert combine_text_columns into a FunctionTransformer
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_text_data.fit_transform(sampling_text_df[:5])
38     OTHER PURCHASED SERVICES  SCHOOL-WIDE SCHOOL P...
70     Extra Duty Pay/Overtime For Support Personnel ...
198    Supplemental *  Operation and Maintenance of P...
209    REPAIR AND MAINTENANCE SERVICES  PUPIL TRANSPO...
614     GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 H...
dtype: object

Combining Numeric and Text pipelines using FeatureUnion

So far, we have seen how to build a pipeline for numeric data and a separate pipeline for text data. The reason we need to separate these two into their own pipelines is that we can’t use imputation on the text data; similarly, we can’t use CountVectorizer on numeric data.

What we really need is to combine the results, so that the outputs of the numeric pipeline and the text pipeline are merged into one single feature matrix, all within the pipeline’s workflow.

This can be accomplished using FeatureUnion. To put it more eloquently:

Now that we can separate text and numeric data in our pipeline, we’re ready to perform separate steps on each by nesting pipelines and using FeatureUnion().

These tools allow us to streamline all preprocessing steps for our model, even when multiple datatypes are involved. Here, for example, we don’t want to impute our text data, and we don’t want to create a bag-of-words from our numeric data. Instead, we want to deal with these separately and then join the results together using FeatureUnion().

In the end, we’ll still have only two high-level steps in our pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data, whose results are joined using FeatureUnion().

Build the numeric and text preprocessing sub-pipelines

# Numeric Pipeline
## define a function that gets only numeric cols
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

numeric_pipeline = Pipeline([
    ('selector', get_numeric_data),
    ('imputer', Imputer())
])

# Text Pipeline
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

text_pipeline = Pipeline([
    ('selector', get_text_data),
    ('vectorizer', CountVectorizer())
])

Fuse the outputs of two preprocessing sub-pipelines

from sklearn.pipeline import FeatureUnion
join_numeric_text_features = FeatureUnion(
    transformer_list= [
        ('numeric_features', numeric_pipeline),
        ('text_features', text_pipeline)
    ] )

Combine preprocessing and modeling steps into one pipeline object

# Overall pipeline
pl = Pipeline([
    ('union', join_numeric_text_features),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

Pass the entire dataframe through the complete pipeline

# split into train test
X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)

print("Training data shape with both Numeric and Text Features: ")
print(X_train.shape)
print(y_train.shape)

print("Test data shape with both Numeric and Text Features: ")
print(X_test.shape)
print(y_test.shape)
Training data shape with both Numeric and Text Features:
(32000, 16)
(32000, 104)
Test data shape with both Numeric and Text Features:
(8000, 16)
(8000, 104)
# Fit to the training data
pl.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('union', FeatureUnion(transformer_list=[('numeric_features', Pipeline(...)),
                                                     ('text_features', Pipeline(...))])),
            ('clf', OneVsRestClassifier(LogisticRegression(...), n_jobs=None))])
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Accuracy on budget dataset:  0.366375

Evaluate the model using logloss

Instead of accuracy, we will use the custom function multi_multi_log_loss to evaluate the logloss for this problem.

from sklearn.metrics.scorer import make_scorer

log_loss_scorer = make_scorer(multi_multi_log_loss)
# print the score of our trained pipeline on our test set
print("Logloss score of trained pipeline: ", log_loss_scorer(pl, X_test, y_test.values))
Logloss score of trained pipeline:  3.1857894491765064

Improve logloss by adding some tricks

From the above basic pipeline, we got a logloss score of 3.18. We will try adding some tricks to see if we can improve this score. The first trick is to trim down the number of features using SelectKBest. Reducing the number of features can improve model generalization, since we discard features that don’t really contribute to the signal in the data.

Secondly, we will use HashingVectorizer to improve computational efficiency. We need this because we are using an ngram range of (1, 2), which greatly increases the number of tokens. Hashing maps tokens onto a fixed number of columns, which has been shown to be more computationally efficient. Although this does not improve the logloss metric itself, it does help speed up the calculations.
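As a rough illustration (the n_features value here is arbitrary, picked just for the demo):

from sklearn.feature_extraction.text import HashingVectorizer

# hashing maps each token to one of n_features columns via a hash function,
# so no vocabulary needs to be stored; unrelated tokens may occasionally collide
hv = HashingVectorizer(n_features=2**10, norm=None, non_negative=True)
X = hv.transform(['TEACHER COVERAGE FOR TEACHER', 'EMPLOYEE BENEFITS'])
print(X.shape)  # (2, 1024), no matter how many distinct tokens we feed in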

Lastly, we are using SparseInteractions, which plays the role of sklearn’s PolynomialFeatures class. PolynomialFeatures does not support the sparse matrices our text vectorizer produces, so SparseInteractions extends the idea to the sparse case; the code is in the src directory. The reason for adding interactions is to check whether specific combinations of words make better features than the individual words taken alone.
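To see what degree-2 interaction terms look like, here is a toy dense example using sklearn’s PolynomialFeatures; SparseInteractions produces the same kind of columns, but on sparse matrices:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two input features x1, x2 expand to x1, x2, and the interaction x1*x2
X = np.array([[2., 3.]])
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interactions.fit_transform(X))  # [[2. 3. 6.]]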

%%time
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler


# set a reasonable number of features before adding interactions
chi_k = 300

# create the pipeline object
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                     non_negative=True, norm=None, binary=False,
                                                     ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# fit the pipeline to our training data
pl.fit(X_train, y_train.values)

# print the score of our trained pipeline on our test set
print("Logloss score of trained pipeline: ", log_loss_scorer(pl, X_test, y_test.values))

Logloss score of trained pipeline:  2.1864848617308343
CPU times: user 9min 54s, sys: 46.7 s, total: 10min 41s
Wall time: 6min 13s

The logloss has improved to 2.18. We can further tweak this pipeline by trying out other classifiers and by adding additional pre-processing steps and/or more tricks. The beauty of putting everything into the pipeline is that all the transformations are captured very neatly in a single construct. This proves especially useful when we run the model on new, unseen data: the new data will undergo all the necessary transformations without us worrying about missing any steps.

Testing the improved pipeline on a holdout dataset

As mentioned above, all we need to do with the holdout dataset is invoke the pipeline object’s predict_proba to get the predictions. We can then save these predicted probabilities and upload them to the validator to check how well our algorithm works on new and unseen data.

path_to_holdout_data = os.path.join(os.pardir,
                                    'data',
                                    'TestData.csv')

# Load holdout data
holdout = pd.read_csv(path_to_holdout_data, index_col=0)

# Make predictions
predictions = pl.predict_proba(holdout)

# Format correctly in new DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)


# Save prediction_df to csv called "predictions.csv"
prediction_df.to_csv("predictions.csv")

Conclusion

Some key takeaways from this post are as follows:

  • We have seen how to tackle a problem with a mix of text and numeric features.
  • We have seen how to build pre-processing sub-pipelines.
  • We have seen how to combine these sub-pipelines into one single pipeline object that we can use to fit and predict.

Now we can extend the analysis by simply updating the pipeline steps. It is very easy to try out multiple algorithms with different parameters and build a table of the models we have tried, as sketched below.
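As a closing sketch (RandomForestClassifier here is just an illustrative substitute, not a tuned choice for this dataset), the named clf step of our pipeline can be swapped out with set_params and the scores collected into a small table:

from sklearn.ensemble import RandomForestClassifier

# swap the final estimator of the existing pipeline and score each candidate
candidates = {'logistic_regression': LogisticRegression(),
              'random_forest': RandomForestClassifier(n_estimators=50)}

scores = {}
for name, clf in candidates.items():
    pl.set_params(clf=OneVsRestClassifier(clf))  # replace the 'clf' step by name
    pl.fit(X_train, y_train.values)
    scores[name] = log_loss_scorer(pl, X_test, y_test.values)

print(pd.Series(scores, name='logloss'))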