Binary Classification Problem

42 minute read

Abstract

A common workflow used in applied machine learning is illustrated through a binary classification problem. The focus of this project is to predict the binary class label (0 or 1) of the response y from a set of 50 features.

Introduction

Any predictive modeling machine learning project can be broken down into seven common tasks:

  • Define the problem
  • Do exploratory data analysis
  • Prepare the Data
  • Evaluate Algorithms
  • Improve Results
  • Deploy the model
  • Present Results

Each of these tasks can be further sub-divided based on the problem and the dataset.

- Define the problem
- Do exploratory data analysis
|- Load the libraries and data
|- Provide Descriptive statistics
|- Check for missing data and outliers
|- Visualize missingness
- Prepare the data
|- Replace missing data
|- One-hot encode categorical variables
|- Remove highly correlated variables
- Evaluate Algorithms
|- Choose the metric
|- Compare linear and non-linear models
|- Try Ensemble Methods:
    |- Bagging (Bootstrap aggregation) methods:
        |- Ensemble of different weak learners - (VotingClassifier)
        |- BaggingClassifier with DecisionTree base learner
        |- BaggingClassifier with KNeighbors base learner
        |- RandomForestClassifier with hyperparameter tuning
        |- Extract best features from RandomForestClassifier
    |- Boosting methods:
        |- Adaptive Boosting with AdaBoostClassifier
        |- Adaptive Boosting with tuned DecisionTree base learner
        |- Extract best features from Adaptive Boosting
    |- Compare and Combine Bagging and Boosting features
- Improve Results
|- Tune RandomForestClassifier with only the best parameters
|- Tune AdaBoostClassifier with only the best parameters
|- Tune GradientBoostingClassifier with only the best parameters
|- Summarize the results
- Deploy the model
|- Persist the best model for future predictions
- Present Results and Insights
|- What did we learn from this analysis?

Define the problem

The goal is to build a model which can predict the class label of the variable y. Each observation contains 50 features, most of which are numeric, with a few categorical features. The dataset contains 160000 labeled observations. Since the data is labeled and y takes one of two values, 0 or 1, a supervised machine learning classifier is used.

Do exploratory data analysis

Load the libraries and data

# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# visualize missing values
import missingno as msno

# model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# linear classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# non-linear classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# ensemble learners
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier

# metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# model persistence
from pickle import dump
from pickle import load

# ignore deprecation warnings
import warnings
warnings.filterwarnings('ignore')
# read in the dataset
data = pd.read_csv('final_project.csv')
data.shape
(160000, 51)

Descriptive statistics

There are 160000 observations with 50 predictor variables, represented by x0-x49, and a response variable, represented by y. For descriptive statistics, it is often useful to divide the variables based on their type. The describe() output for object-type variables returns a useful summary, including the number of unique values present in each feature. If the number of unique values taken by a feature is large, it is most likely mis-represented as an object type. For instance, variable x37 is clearly not of type object; it represents a monetary value.

# divide the variables based on their types
objectTypes = data.loc[:, data.dtypes == object]
numericTypes = data.loc[:, data.dtypes == float]
responseVariable = data.y

Categorical variables

# object type value counts
objectTypes.describe().T
     count   unique  top        freq
x24  159972  3       asia       138965
x29  159970  12      July       45569
x30  159970  5       wednesday  101535
x32  159969  12      0.01%      40767
x37  159977  129198  $618.22    6
  • Based on the number of unique values, x37 looks to be a numeric variable containing a dollar amount. It is converted to float by stripping the dollar sign.
  • x32 is a percentage value, which can likewise be converted to a float.
# cleaning x37
def removeDollarSign(var):
    # strip the literal '$' (regex=False so it is not treated as a regex anchor)
    var = var.str.replace('$', '', regex=False)
    return var.astype("float")

data['x37'] = removeDollarSign(data['x37'])
data.x37.head()
0    1313.96
1    1962.78
2     430.47
3   -2366.29
4    -620.66
Name: x37, dtype: float64
# cleaning x32
def removePercentSign(var):
    var = var.str.replace("%", "")
    return var.astype("float")

data['x32'] = removePercentSign(data['x32'])
data.x32.head()
0    0.00
1   -0.02
2   -0.01
3    0.01
4    0.01
Name: x32, dtype: float64

Numeric variables

When it comes to numeric values, the output of describe() by default provides the min, max and quantile values for each feature. However, it is very useful to augment these descriptive statistics with the number of missing values and outliers for each feature. This is accomplished by defining functions which return the number of missing values and the number of outliers. An outlier is a data point that is distinctly separate from the rest of the data; one common definition is any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. In addition to the summary statistics, the following columns are added:

  • num_missing: number of missing values
  • outliers: a True value indicates that this column has outliers in it.
  • num_outliers: number of data points considered outliers for this column.
  • IQR: interquartile range.
# get the float columns
numericTypes = data.loc[:, data.dtypes == float]
numericDescribe = (numericTypes.describe(include='all').T).round(decimals=5)

# Flag columns that contain outliers: any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
numericDescribe['IQR'] = numericDescribe['75%'] - numericDescribe['25%']
numericDescribe['outliers'] = (numericDescribe['max'] > (numericDescribe['75%'] + (1.5 * numericDescribe['IQR']))) \
                            | (numericDescribe['min'] < (numericDescribe['25%'] - (1.5 * numericDescribe['IQR'])))

# Calculate IQR for each column of the dataframe.
IQR = numericTypes.quantile(.75) - numericTypes.quantile(.25)

# Calculate the upper and lower outlier values
upperOutlier = numericTypes.quantile(.75) + (1.5 * (IQR))
lowerOutlier = numericTypes.quantile(.25) - (1.5 * (IQR))

# Check if each value of mySeries lies within the outlier range.
# If the value is greater than upperOutlier mark it True
# If the value is lower than lowerOutlier mark it True
# return the total number of outliers for that series.
def numberOfOutliers(mySeries):
    return sum((mySeries > upperOutlier.loc[mySeries.name, ]) |\
               (mySeries < lowerOutlier.loc[mySeries.name,]))

def numberOfMissing(mySeries):
    return sum(mySeries.isnull())

# append to the describe output
numericDescribe['num_outliers'] = numericTypes.apply(numberOfOutliers)
numericDescribe['num_missing'] = numericTypes.apply(numberOfMissing)

# re-order and filter columns
numericDescribe.sort_values('num_missing', ascending=False, inplace=True)
newColOrder = ['count', 'num_missing', 'outliers', 'num_outliers', 'mean',\
               'min', 'max', 'std']
numericDescribe.reindex(columns=newColOrder)
count num_missing outliers num_outliers mean min max std
x23 159953.0 47 True 1012 0.72203 -66.64034 58.49050 14.90913
x10 159957.0 43 True 1103 0.00225 -36.30657 37.94558 7.87143
x33 159959.0 41 True 1049 -0.00657 -7.45145 7.78712 1.74776
x34 159959.0 41 True 1120 -0.00043 -36.11661 34.84143 8.01418
x44 159960.0 40 True 1141 -0.00625 -17.98349 19.06976 4.16460
x41 159960.0 40 True 1264 6.70108 -82.16722 100.05043 18.68020
x18 159960.0 40 True 1114 0.01269 -20.19869 19.65299 4.54071
x31 159961.0 39 True 1113 -0.00595 -12.28936 12.27936 2.76751
x2 159962.0 38 True 1139 -1.15014 -59.39405 63.54565 13.27348
x20 159962.0 38 True 946 0.29907 -26.67740 27.81456 5.80620
x3 159963.0 37 True 1111 -0.02464 -35.47659 38.90603 8.06503
x5 159963.0 37 True 1078 0.01358 -33.82299 35.55011 7.67008
x47 159963.0 37 True 1040 0.02862 -21.08633 20.83685 4.78816
x43 159963.0 37 True 1108 -0.00209 -6.87623 6.68092 1.53495
x40 159964.0 36 True 1265 -2.31653 -74.05920 88.82448 17.04355
x12 159964.0 36 True 1447 -1.33440 -64.19797 73.27935 14.75099
x26 159964.0 36 True 1142 -0.00107 -3.85748 3.95165 0.84326
x15 159965.0 35 True 1110 0.00121 -17.00236 13.78256 3.27178
x19 159965.0 35 True 1075 0.02456 -35.63340 33.51555 7.59532
x28 159965.0 35 True 1017 0.03154 -72.89671 67.75384 14.43953
x14 159966.0 34 True 1131 0.00810 -30.90521 32.54634 6.96410
x48 159968.0 32 True 957 -0.00022 -8.49016 8.22655 1.93550
x49 159968.0 32 True 883 -0.67422 -65.79119 66.87760 15.03674
x38 159969.0 31 True 1265 6.05913 -74.29756 90.46798 16.89160
x32 159969.0 31 True 7 -0.00001 -0.05000 0.05000 0.01040
x13 159969.0 31 True 1133 0.00767 -38.72351 42.39218 8.95384
x46 159969.0 31 True 1890 -12.75540 -201.82683 150.85942 36.60864
x27 159970.0 30 True 870 -0.00416 -32.00356 28.64507 6.77405
x35 159970.0 30 True 1131 0.00094 -10.00815 9.89243 2.37956
x11 159970.0 30 True 1122 0.03023 -38.09287 36.36044 8.76963
x9 159970.0 30 True 1087 0.00546 -27.98066 27.37784 6.35504
x21 159971.0 29 True 1119 -0.02914 -43.50185 46.23750 9.40964
x45 159971.0 29 True 1091 0.00088 -1.75322 1.66920 0.39662
x7 159973.0 27 True 2195 -7.69279 -181.50698 149.15063 30.54226
x22 159973.0 27 True 1128 0.00840 -23.64419 24.86301 5.41201
x17 159973.0 27 True 1177 0.01204 -34.39590 37.05705 7.56935
x36 159973.0 27 True 1150 0.00645 -6.86602 6.99954 1.59318
x4 159974.0 26 True 1085 -0.00055 -28.46754 26.24781 6.38229
x6 159974.0 26 True 1139 -1.67067 -86.35448 92.39060 19.29866
x0 159974.0 26 True 1109 -0.00103 -1.59263 1.60085 0.37114
x42 159974.0 26 True 2021 -1.83382 -27.93375 22.66804 5.11071
x16 159974.0 26 True 1149 0.00622 -26.04298 21.96112 4.98407
x1 159975.0 25 True 1115 0.00136 -26.27830 27.98818 6.34063
x39 159977.0 23 True 1083 0.00425 -22.10165 21.54559 5.13432
x37 159977.0 23 True 1349 0.37763 -4734.54000 5546.99000 999.94533
x25 159978.0 22 True 1142 -0.00081 -6.36465 5.31417 1.26366
x8 159979.0 21 True 1081 -0.03054 -37.69105 39.04983 8.90119

A visual inspection of these variables is provided below. Most of the variables are normally distributed, except for variable x32, a percentage which appears to take only a small, fixed set of values.
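
A quick sanity check (a sketch, not part of the original run) confirms this; per the describe() output above, x32 has only 12 distinct values:

# count the distinct values of x32 after the percent-sign cleanup
data['x32'].value_counts()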

While most of the machine learning algorithms used here are non-parametric, i.e., they do not assume any underlying distribution of the data, it is still good practice to check the distribution of variables when possible. Outliers, in particular, tend to have a greater impact on distance-based algorithms like k-nearest neighbors and support vector machines. When using linear models or models that rely on a distance metric, it is advisable to scale the data. For this particular case, I chose not to scale the data, as I am mostly using non-linear models like decision trees, which do not care whether the data is scaled.

%matplotlib inline
ax = numericTypes.plot(kind='density', figsize=(16,8), subplots=True, layout=(7,7), sharex=False)
plt.xticks([]),plt.yticks([])

[png: density plots of the numeric variables]

Visualize missingness

One other use of checking the distribution of each variable is that it informs how to deal with missing values. If the values are heavily skewed (right or left), median imputation is a good choice. On the other hand, if the values are normally distributed, as above, either the mean or the median can be used to impute the missing values. In this case, median imputation was chosen.
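
As an aside, the same median imputation can be written more concisely with scikit-learn's SimpleImputer (a sketch assuming scikit-learn >= 0.20; the custom function used below is kept for its before/after NA reporting):

from sklearn.impute import SimpleImputer

# median-impute every numeric column in one step
imputer = SimpleImputer(strategy='median')
numericImputed = pd.DataFrame(imputer.fit_transform(numericTypes),
                              columns=numericTypes.columns,
                              index=numericTypes.index)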

Understanding the missingness pattern in the data is something that should not be overlooked, since the pattern of missingness should guide the imputation method. Based on the figure below, there is no obvious pattern; the missingness appears to be random.

temp = numericTypes.isnull().sum()
temp = temp[temp > 0]
numericTypesMissing = numericTypes[temp.index]
msno.matrix(numericTypesMissing)

[png: missingness matrix of the numeric variables]

Prepare the Data

Replacing numeric missing values with median

def replaceNumMissingData(df):
    temp_df = df.copy()

    print("##############################")
    print("## NA count Before Cleaning ##")
    print("##############################")
    print(df.isnull().sum())

    # replace missing continuous values with the column median
    # (note: a `!= np.nan` comparison is always True, so NaNs are handled
    #  with fillna/median, which skip them by default)
    for col in df.columns:
        try:
            temp_df[col] = temp_df[col].fillna(df[col].median())
        except Exception as e:
            print(e.args)

    print("##############################")
    print("## NA Count After Cleaning  ##")
    print("##############################")
    print(temp_df.isnull().sum())
    return temp_df

dataWithoutMissing = replaceNumMissingData(numericTypes)
dataWithoutMissing.info()
##############################
## NA count Before Cleaning ##
##############################
x0     26
x1     25
x2     38
x3     37
x4     26
x5     37
x6     26
x7     27
x8     21
x9     30
x10    43
x11    30
x12    36
x13    31
x14    34
x15    35
x16    26
x17    27
x18    40
x19    35
x20    38
x21    29
x22    27
x23    47
x25    22
x26    36
x27    30
x28    35
x31    39
x32    31
x33    41
x34    41
x35    30
x36    27
x37    23
x38    31
x39    23
x40    36
x41    40
x42    26
x43    37
x44    40
x45    29
x46    31
x47    37
x48    32
x49    32
dtype: int64
##############################
## NA Count After Cleaning  ##
##############################
x0     0
x1     0
x2     0
x3     0
x4     0
x5     0
x6     0
x7     0
x8     0
x9     0
x10    0
x11    0
x12    0
x13    0
x14    0
x15    0
x16    0
x17    0
x18    0
x19    0
x20    0
x21    0
x22    0
x23    0
x25    0
x26    0
x27    0
x28    0
x31    0
x32    0
x33    0
x34    0
x35    0
x36    0
x37    0
x38    0
x39    0
x40    0
x41    0
x42    0
x43    0
x44    0
x45    0
x46    0
x47    0
x48    0
x49    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 47 columns):
x0     160000 non-null float64
x1     160000 non-null float64
x2     160000 non-null float64
x3     160000 non-null float64
x4     160000 non-null float64
x5     160000 non-null float64
x6     160000 non-null float64
x7     160000 non-null float64
x8     160000 non-null float64
x9     160000 non-null float64
x10    160000 non-null float64
x11    160000 non-null float64
x12    160000 non-null float64
x13    160000 non-null float64
x14    160000 non-null float64
x15    160000 non-null float64
x16    160000 non-null float64
x17    160000 non-null float64
x18    160000 non-null float64
x19    160000 non-null float64
x20    160000 non-null float64
x21    160000 non-null float64
x22    160000 non-null float64
x23    160000 non-null float64
x25    160000 non-null float64
x26    160000 non-null float64
x27    160000 non-null float64
x28    160000 non-null float64
x31    160000 non-null float64
x32    160000 non-null float64
x33    160000 non-null float64
x34    160000 non-null float64
x35    160000 non-null float64
x36    160000 non-null float64
x37    160000 non-null float64
x38    160000 non-null float64
x39    160000 non-null float64
x40    160000 non-null float64
x41    160000 non-null float64
x42    160000 non-null float64
x43    160000 non-null float64
x44    160000 non-null float64
x45    160000 non-null float64
x46    160000 non-null float64
x47    160000 non-null float64
x48    160000 non-null float64
x49    160000 non-null float64
dtypes: float64(47)
memory usage: 57.4 MB

Replace categorical missing values with mode

def replaceCatMissingData(df):
    temp_df = df.copy()

    print("##############################")
    print("## NA count Before Cleaning ##")
    print("##############################")
    print(df.isnull().sum())

    # replace missing categorical values with the column mode
    for col in df.columns:
        try:
            temp_df[col] = temp_df[col].fillna(df[col].mode()[0])
        except Exception as e:
            print(e.args)

    print("##############################")
    print("## NA Count After Cleaning  ##")
    print("##############################")
    print(temp_df.isnull().sum())
    return temp_df
# x32 and x37 were already converted to numeric above, so drop them here
del objectTypes['x32']
del objectTypes['x37']
dataWithoutMissingCat = replaceCatMissingData(objectTypes)
dataWithoutMissingCat.info()
##############################
## NA count Before Cleaning ##
##############################
x24    28
x29    30
x30    30
dtype: int64
##############################
## NA Count After Cleaning  ##
##############################
x24    0
x29    0
x30    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 3 columns):
x24    160000 non-null object
x29    160000 non-null object
x30    160000 non-null object
dtypes: object(3)
memory usage: 3.7+ MB

Concatenate dataWithoutMissing, dataWithoutMissingCat and responseVariable to create a clean dataset.

cleanData = pd.concat([dataWithoutMissing, dataWithoutMissingCat, responseVariable], axis=1)
cleanData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 51 columns):
x0     160000 non-null float64
x1     160000 non-null float64
x2     160000 non-null float64
x3     160000 non-null float64
x4     160000 non-null float64
x5     160000 non-null float64
x6     160000 non-null float64
x7     160000 non-null float64
x8     160000 non-null float64
x9     160000 non-null float64
x10    160000 non-null float64
x11    160000 non-null float64
x12    160000 non-null float64
x13    160000 non-null float64
x14    160000 non-null float64
x15    160000 non-null float64
x16    160000 non-null float64
x17    160000 non-null float64
x18    160000 non-null float64
x19    160000 non-null float64
x20    160000 non-null float64
x21    160000 non-null float64
x22    160000 non-null float64
x23    160000 non-null float64
x25    160000 non-null float64
x26    160000 non-null float64
x27    160000 non-null float64
x28    160000 non-null float64
x31    160000 non-null float64
x32    160000 non-null float64
x33    160000 non-null float64
x34    160000 non-null float64
x35    160000 non-null float64
x36    160000 non-null float64
x37    160000 non-null float64
x38    160000 non-null float64
x39    160000 non-null float64
x40    160000 non-null float64
x41    160000 non-null float64
x42    160000 non-null float64
x43    160000 non-null float64
x44    160000 non-null float64
x45    160000 non-null float64
x46    160000 non-null float64
x47    160000 non-null float64
x48    160000 non-null float64
x49    160000 non-null float64
x24    160000 non-null object
x29    160000 non-null object
x30    160000 non-null object
y      160000 non-null int64
dtypes: float64(47), int64(1), object(3)
memory usage: 62.3+ MB

One-hot encoding the categorical variables

df = cleanData.copy()
dummyCols = pd.get_dummies(df.loc[:, df.dtypes == 'object'])
dummyCols.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 20 columns):
x24_america      160000 non-null uint8
x24_asia         160000 non-null uint8
x24_euorpe       160000 non-null uint8
x29_Apr          160000 non-null uint8
x29_Aug          160000 non-null uint8
x29_Dev          160000 non-null uint8
x29_Feb          160000 non-null uint8
x29_January      160000 non-null uint8
x29_July         160000 non-null uint8
x29_Jun          160000 non-null uint8
x29_Mar          160000 non-null uint8
x29_May          160000 non-null uint8
x29_Nov          160000 non-null uint8
x29_Oct          160000 non-null uint8
x29_sept.        160000 non-null uint8
x30_friday       160000 non-null uint8
x30_monday       160000 non-null uint8
x30_thurday      160000 non-null uint8
x30_tuesday      160000 non-null uint8
x30_wednesday    160000 non-null uint8
dtypes: uint8(20)
memory usage: 3.1 MB
df = pd.concat([df, dummyCols], axis=1)
df.drop(['x24', 'x29', 'x30'], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 68 columns):
x0               160000 non-null float64
x1               160000 non-null float64
x2               160000 non-null float64
x3               160000 non-null float64
x4               160000 non-null float64
x5               160000 non-null float64
x6               160000 non-null float64
x7               160000 non-null float64
x8               160000 non-null float64
x9               160000 non-null float64
x10              160000 non-null float64
x11              160000 non-null float64
x12              160000 non-null float64
x13              160000 non-null float64
x14              160000 non-null float64
x15              160000 non-null float64
x16              160000 non-null float64
x17              160000 non-null float64
x18              160000 non-null float64
x19              160000 non-null float64
x20              160000 non-null float64
x21              160000 non-null float64
x22              160000 non-null float64
x23              160000 non-null float64
x25              160000 non-null float64
x26              160000 non-null float64
x27              160000 non-null float64
x28              160000 non-null float64
x31              160000 non-null float64
x32              160000 non-null float64
x33              160000 non-null float64
x34              160000 non-null float64
x35              160000 non-null float64
x36              160000 non-null float64
x37              160000 non-null float64
x38              160000 non-null float64
x39              160000 non-null float64
x40              160000 non-null float64
x41              160000 non-null float64
x42              160000 non-null float64
x43              160000 non-null float64
x44              160000 non-null float64
x45              160000 non-null float64
x46              160000 non-null float64
x47              160000 non-null float64
x48              160000 non-null float64
x49              160000 non-null float64
y                160000 non-null int64
x24_america      160000 non-null uint8
x24_asia         160000 non-null uint8
x24_euorpe       160000 non-null uint8
x29_Apr          160000 non-null uint8
x29_Aug          160000 non-null uint8
x29_Dev          160000 non-null uint8
x29_Feb          160000 non-null uint8
x29_January      160000 non-null uint8
x29_July         160000 non-null uint8
x29_Jun          160000 non-null uint8
x29_Mar          160000 non-null uint8
x29_May          160000 non-null uint8
x29_Nov          160000 non-null uint8
x29_Oct          160000 non-null uint8
x29_sept.        160000 non-null uint8
x30_friday       160000 non-null uint8
x30_monday       160000 non-null uint8
x30_thurday      160000 non-null uint8
x30_tuesday      160000 non-null uint8
x30_wednesday    160000 non-null uint8
dtypes: float64(47), int64(1), uint8(20)
memory usage: 61.6 MB

Remove highly correlated variables


sns.set(style="white")

# Compute the correlation matrix
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in newer NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin = -1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

[png: correlation heatmap of all features]

The off-diagonal correlations are all weak, so no variables need to be removed and we can move on.
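
Had any strong correlations shown up, a common recipe (a sketch, with 0.9 as an assumed threshold) is to scan the upper triangle of the correlation matrix and drop one feature from each highly correlated pair:

# features whose absolute correlation with an earlier feature exceeds 0.9
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
toDrop = [col for col in upper.columns if (upper[col].abs() > 0.9).any()]
print(toDrop)  # expected to be empty for this dataset
# df = df.drop(columns=toDrop)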

Evaluate Algorithms

Choose a metric

The first step in evaluating the algorithms is to choose a metric. For an imbalanced binary classification problem, the most commonly used evaluation metric is the roc_auc_score, which represents the area under the ROC curve. The area under the curve (AUC) is a single-number summary of the performance of a classifier: it is the percentage of the ROC plot that is underneath the curve. AUC can also be interpreted as follows: if we randomly choose one positive observation and one negative observation from our testing set, AUC represents the likelihood that our classifier assigns a higher predicted probability to the positive observation. This is a useful goal because, ultimately, we want the classifier to rank positive observations higher than negative observations in terms of predicted probability. This is the reason why the predicted probability is extracted from the second column of the predict_proba output, for instance mdl.predict_proba(X_test)[:,1], where X_test is the feature matrix of the test set and mdl is the classifier model we are predicting with.
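
The pairwise-ranking interpretation of AUC is easy to verify on toy data (a sketch, not drawn from this project's dataset):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# fraction of (positive, negative) pairs where the positive observation
# receives the higher predicted probability
pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(pairwise)                       # 0.888...
print(roc_auc_score(y_true, y_prob))  # 0.888... (identical)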


#Create data = x
Cls_X = df.copy()
#create target = Y
Cls_y = Cls_X.y
print('X shape before ')
print(Cls_X.shape)
#drop the response column from the x dataset
Cls_X.pop('y')

print('X shape after ')
print(Cls_X.shape)
print('Y shape after ')
print(Cls_y.shape)
X shape before
(160000, 68)
X shape after
(160000, 67)
Y shape after
(160000,)

Class Distribution and Null Accuracy

There is a 60:40 ratio of observations with class labels 0 and 1 respectively, and hence a null accuracy of roughly 60%. That is, a model which blindly predicts 0 for every observation is still about 60% accurate. This is typically used as a benchmark to see how well our model predictions compare with the null model.

# Explore distribuion of our class
print("Class distibution")
print(Cls_y.value_counts())

# calculate the percentage of zeros
print('Percentage of 0s: ',round(1 - Cls_y.mean(),4))

# calculate the percentage of ones
print('Percentage of 1s: ', round(Cls_y.mean(),4))

# calculate null accuracy
print('')
print('##############################')
print('NULL Accuracy: ', max(round(Cls_y.mean(),4), round(1 - Cls_y.mean(),4)))
print('##############################')
Class distribution
0    95803
1    64197
Name: y, dtype: int64
Percentage of 0s:  0.5988
Percentage of 1s:  0.4012

##############################
NULL Accuracy:  0.5988
##############################

Compare linear and non-linear models

The goal here is to train many models and see which ones are learning the pattern in the data. The models variable is a list of tuples, each holding a name and the classifier object. Initially, defaults are used to get an idea of which models are able to learn the pattern in the data. 10-fold cross-validation is used on the entire dataset, and for each fold the roc_auc_score is calculated on the test indices. The kfold.split() method returns a generator which yields each fold's training and test indices. Within each loop iteration, the model is fit on the training indices and evaluated on the test indices. The results of each fold are stored in the cvscores list; at the end of the loop, the cvscores are stored inside a results dictionary, which is then converted to a dataframe to compute the mean scores.

Cls = df.copy()
target_data=Cls.y
Cls.pop('y')
train_data=Cls
# prepare models
models = []

# linear
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
# non-linear
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
# evaluate each model in turn
results = {}
oofs = {}
scoring = 'roc_auc'
for name, model in models:
    print("Training model :" + name)
    kfold = KFold(n_splits=10)  # unshuffled folds; random_state applies only when shuffle=True
    oofs[name] = np.zeros(Cls.shape[0])
    cvscores = []
    for train_idx, test_idx in kfold.split(Cls):
        model.fit(train_data.loc[train_idx], target_data.loc[train_idx])
        oofs[name][test_idx] = model.predict_proba(train_data.loc[test_idx])[:,1]
        print("Fold ROC score is %f" % roc_auc_score(target_data[test_idx],oofs[name][test_idx]))
        cvscores.append(roc_auc_score(target_data[test_idx],oofs[name][test_idx]))
    results[name] = cvscores
Training model :LR
Fold ROC score is 0.755776
Fold ROC score is 0.754715
Fold ROC score is 0.758968
Fold ROC score is 0.760445
Fold ROC score is 0.761291
Fold ROC score is 0.758513
Fold ROC score is 0.763135
Fold ROC score is 0.759802
Fold ROC score is 0.759879
Fold ROC score is 0.759665
Training model :LDA
Fold ROC score is 0.755638
Fold ROC score is 0.754448
Fold ROC score is 0.758941
Fold ROC score is 0.760134
Fold ROC score is 0.761071
Fold ROC score is 0.758328
Fold ROC score is 0.763140
Fold ROC score is 0.759336
Fold ROC score is 0.759578
Fold ROC score is 0.759365
Training model :KNN
Fold ROC score is 0.896217
Fold ROC score is 0.897008
Fold ROC score is 0.897484
Fold ROC score is 0.896951
Fold ROC score is 0.893606
Fold ROC score is 0.894688
Fold ROC score is 0.895257
Fold ROC score is 0.892476
Fold ROC score is 0.895564
Fold ROC score is 0.893309
Training model :CART
Fold ROC score is 0.842974
Fold ROC score is 0.841589
Fold ROC score is 0.844041
Fold ROC score is 0.847429
Fold ROC score is 0.846501
Fold ROC score is 0.845466
Fold ROC score is 0.843749
Fold ROC score is 0.843105
Fold ROC score is 0.845250
Fold ROC score is 0.840554
Training model :NB
Fold ROC score is 0.746454
Fold ROC score is 0.740179
Fold ROC score is 0.748718
Fold ROC score is 0.745380
Fold ROC score is 0.751330
Fold ROC score is 0.746224
Fold ROC score is 0.749450
Fold ROC score is 0.747154
Fold ROC score is 0.752555
Fold ROC score is 0.749893
resultsDF = pd.DataFrame(results)
print(resultsDF.mean())
LR      0.759219
LDA     0.758998
KNN     0.895256
CART    0.844066
NB      0.747734
dtype: float64

KNN and CART are performing far better than the rest of the classifiers. In the next section, Logistic Regression, KNN and CART are combined into one meta-algorithm called VotingClassifier to yield slightly better results.

Ensemble methods

Bootstrap Aggregation methods:

Ensemble of different weak learners
Voting Classifier:

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses. Here, an ensemble of Logistic Regression, KNN and CART is used.

Three models are initialized and a list of tuples is constructed. In each iteration, the individual models are fit on the X_train split and validated on the X_test split.

# select a seed to use for repeatability
seed = 100

#Split data to  into train and holding set.
X_train,X_test,y_train,y_test = train_test_split(Cls_X, Cls_y, test_size = 0.3, random_state = seed)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(112000, 67)
(112000,)
(48000, 67)
(48000,)
vlr = LogisticRegression(random_state=7)
vknn = KNeighborsClassifier()
vdt = DecisionTreeClassifier(random_state=7)
vClassifiers = [('Logistic Regression', vlr), ('K Nearest Neighbors', vknn), ('Classification Tree', vdt)]

# Iterate over the pre-defined list of classifiers
for vclf_name, vclf in vClassifiers:    

    # Fit clf to the training set
    vclf.fit(X_train, y_train)    

    # Predict y_pred and get proba
    y_pred = vclf.predict(X_test)
    y_pred_proba = vclf.predict_proba(X_test)[:,1]

    # Calculate roc_auc_score
    v_roc_auc_score = roc_auc_score(y_test, y_pred_proba)

    # Evaluate clf's roc_auc_score on the test set
    print('{:s} : {:.3f}'.format(vclf_name, v_roc_auc_score))
Logistic Regression : 0.761
K Nearest Neighbors : 0.892
Classification Tree : 0.838

Next, a VotingClassifier is instantiated with the list of estimators. Voting is set to soft since predicted probabilities are used. By combining the above three models, the VotingClassifier's roc_auc_score improves to 0.929.

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=vClassifiers, voting='soft')     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)
y_pred_proba = vc.predict_proba(X_test)[:,1]

# Calculate roc_auc_score
vc_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print('Voting Classifier: {:.3f}'.format(vc_roc_auc_score))
Voting Classifier: 0.929
BaggingClassifier with DecisionTree base learner

The VotingClassifier uses the same training dataset with different algorithms, and the final predictions are obtained by majority voting. Another type of ensemble learner is bootstrap aggregation, or bagging for short. In bagging, the same algorithm is used on different training data subsets; that is, each model is trained on a different subset of the data. Overall, bagging has the effect of reducing the variance of the individual models in the ensemble. The base estimator to use can be specified when instantiating the BaggingClassifier object. In this case, the BaggingClassifier did quite well, with a roc_auc_score of 0.97.

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(random_state=7)

# Instantiate bc
bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=7)

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)
y_pred_proba = bc.predict_proba(X_test)[:,1]

# Evaluate acc_test
bc_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print('Test set roc_auc_score of bc: {:.2f}'.format(bc_roc_auc_score))
Test set roc_auc_score of bc: 0.97
BaggingClassifier with KNeighborsClassifier base learner

The base estimator of the BaggingClassifier can also be a KNeighborsClassifier. Trying this approach yielded a roc_auc_score of 0.90.

# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate knc
knc = KNeighborsClassifier()

# Instantiate bc
bc_kn = BaggingClassifier(base_estimator=knc, n_estimators=10, random_state=1, n_jobs=-1)

# Fit bc to the training set
bc_kn.fit(X_train, y_train)

# Predict test set labels
y_pred = bc_kn.predict(X_test)
y_pred_proba = bc_kn.predict_proba(X_test)[:,1]

# Evaluate acc_test
bc_kn_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print('Test set roc_auc_score of bc with KNN: {:.2f}'.format(bc_kn_roc_auc_score))
Test set roc_auc_score of bc with KNN: 0.90
Ensemble of DecisionTrees - RandomForestClassifier

With BaggingClassifier, the base estimator can be any model; each estimator is trained on a distinct bootstrap sample of the training set, and all features are used for training and prediction. RandomForest is another type of ensemble learner, with DecisionTree as the base estimator. As in BaggingClassifier, each estimator is trained on a different bootstrap sample, but RF introduces further randomization into the training of individual trees: at each node, only d features are sampled without replacement, where d is less than the total number of features, and the node is split using the feature that maximizes information gain (IG). By default, d is the square root of the number of features; for example, with 100 features, only 10 are sampled at each node. Due to this peculiarity, it is possible to extract the best features from the model, which is most useful for building robust machine learning models.
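
As a small aside, the bootstrap sampling also enables out-of-bag (OOB) evaluation: each tree can be scored on the rows left out of its bootstrap sample, giving a validation estimate without a separate split. A sketch (not part of the original analysis; max_features='sqrt' makes the d = sqrt(features) sampling explicit):

# each of the 50 trees sees a bootstrap sample of X_train and sqrt(67) ~ 8
# candidate features per split; oob_score_ is the accuracy on held-out rows
rf_demo = RandomForestClassifier(n_estimators=50, max_features='sqrt',
                                 oob_score=True, random_state=7, n_jobs=-1)
rf_demo.fit(X_train, y_train)
print(rf_demo.oob_score_)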

The RandomForest itself can be tuned. A grid of parameters is passed to the RandomizedSearchCV object along with the RandomForestClassifier object. Because RandomForest inherently makes use of DecisionTrees, it is also possible to pass in parameters that tune the individual trees, like the IG criterion, max_depth, etc.

tuned_parameters={'n_estimators':[10,20,30,40,50,100], 'criterion':['gini','entropy'], 'max_depth':[5,10,15],
                 'min_samples_split':[2,4,9]}
scores = ['roc_auc']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = RandomizedSearchCV(RandomForestClassifier(), tuned_parameters, cv=5,
                       scoring='%s' % score, n_iter=10)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
# Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'n_estimators': 40, 'min_samples_split': 9, 'max_depth': 15, 'criterion': 'entropy'}

Grid scores on development set:

0.860 (+/-0.005) for {'n_estimators': 40, 'min_samples_split': 2, 'max_depth': 5, 'criterion': 'gini'}
0.963 (+/-0.002) for {'n_estimators': 40, 'min_samples_split': 9, 'max_depth': 15, 'criterion': 'gini'}
0.964 (+/-0.002) for {'n_estimators': 50, 'min_samples_split': 9, 'max_depth': 15, 'criterion': 'gini'}
0.944 (+/-0.004) for {'n_estimators': 100, 'min_samples_split': 4, 'max_depth': 10, 'criterion': 'entropy'}
0.951 (+/-0.004) for {'n_estimators': 10, 'min_samples_split': 4, 'max_depth': 15, 'criterion': 'gini'}
0.959 (+/-0.002) for {'n_estimators': 20, 'min_samples_split': 9, 'max_depth': 15, 'criterion': 'gini'}
0.927 (+/-0.003) for {'n_estimators': 10, 'min_samples_split': 4, 'max_depth': 10, 'criterion': 'entropy'}
0.938 (+/-0.003) for {'n_estimators': 30, 'min_samples_split': 9, 'max_depth': 10, 'criterion': 'entropy'}
0.964 (+/-0.002) for {'n_estimators': 40, 'min_samples_split': 9, 'max_depth': 15, 'criterion': 'entropy'}
0.952 (+/-0.003) for {'n_estimators': 10, 'min_samples_split': 2, 'max_depth': 15, 'criterion': 'entropy'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.90      0.95      0.92     28776
          1       0.92      0.84      0.87     19224

avg / total       0.90      0.90      0.90     48000

The best estimator is then used to make predictions on the test set. The roc_auc_score in this case was observed to be 0.966.

clf.best_estimator_
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=9,
            min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
# get the best model and make predictions
bestRF_mdl = clf.best_estimator_
RF_ypred = bestRF_mdl.predict(X_test)
RF_ypred_prob = bestRF_mdl.predict_proba(X_test)[:,1]
# calculate the roc_auc_score on X_test
roc_auc_score(y_test, RF_ypred_prob)
0.9665118297620746
Extract important features from RandomForestClassifier

The features contributing most to the model can be extracted using the feature_importances_ attribute of the best estimator. The top features are plotted, and it can be readily observed that only the first 17 attributes contribute significantly to the model. In the next section, the model is trained using only these top 17 features, to show how a simpler model can at times yield even better results than the full model. This principle is called Occam's razor: simple models tend to be more robust than complex models.

# get the best predictors
importances = pd.DataFrame({'feature':X_train.columns, 'importance': bestRF_mdl.feature_importances_})
importances = importances.sort_values('importance',ascending=False)

# display top 20
topDF1 = importances.head(20)
print(topDF1)

f, ax = plt.subplots(figsize=(15, 6))

g = sns.barplot(x='feature', y='importance', data=topDF1, color="red", saturation=.9, ci=68)
g = plt.xticks(rotation=90)
plt.xlabel('Feature')
plt.ylabel('importance')
plt.title('Random Forest Features importance')
   feature  importance
23     x23    0.085280
20     x20    0.070743
46     x49    0.067521
45     x48    0.062922
27     x28    0.057728
39     x42    0.057319
12     x12    0.053220
26     x27    0.052421
34     x37    0.046483
7       x7    0.045977
37     x40    0.045037
35     x38    0.039564
38     x41    0.037709
43     x46    0.036343
2       x2    0.034945
6       x6    0.031490
29     x32    0.031396
3       x3    0.004695
41     x44    0.004548
0       x0    0.004461


[png: Random Forest feature importance bar chart]

Boosting methods

Boosting refers to an ensemble method in which several models are trained sequentially with each model learning from the errors of its predecessors.

Adaptive boosting - AdaBoostClassifier

In adaptive boosting, each predictor pays more attention to the instances wrongly predicted by its predecessor by constantly re-weighting the training instances. Furthermore, each predictor is assigned a coefficient alpha that weighs its contribution to the ensemble's final prediction.
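
For the binary case, the textbook AdaBoost coefficient for a predictor with weighted error rate err is alpha = learning_rate * log((1 - err) / err); a sketch of that formula (sklearn's SAMME variant adds a log(K - 1) term for K classes):

import numpy as np

def adaboost_alpha(err, learning_rate=1.0):
    # a predictor barely better than chance (err near 0.5) gets weight near 0;
    # a more accurate predictor (smaller err) gets a larger weight
    return learning_rate * np.log((1.0 - err) / err)

print(adaboost_alpha(0.30))  # ~0.85
print(adaboost_alpha(0.45))  # ~0.20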

Any classifier can be used as the base estimator. At first, an un-tuned DecisionTree with 50 estimators is tried. This resulted in roughly the same performance as the baseline DecisionTree model, with a roc_auc_score of 0.84.

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

# Instantiate dt
dt_ada = DecisionTreeClassifier(random_state=7)

# Instantiate ada
ada = AdaBoostClassifier(base_estimator=dt_ada, n_estimators=50, random_state=7)

#Fit ada to the training set
ada.fit(X_train, y_train)

# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]

# Evaluate acc_test
ada_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print('Test set roc_auc_score of adaboosting: {:.2f}'.format(ada_roc_auc_score))
Test set roc_auc_score of adaboosting: 0.84

A tuned DecisionTree performed much better than the baseline DecisionTree, with a roc_auc_score of 0.908.

Find the best tree
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

dt_tune = DecisionTreeClassifier(random_state=7)

# optimize max_depth; max_features and min_samples_leaf
params_dt = {
    'max_depth': [5,10,15,20],
    'min_samples_leaf': [0.08, 1],
    'criterion': ['gini', 'entropy']
}

# instantiate a RandomSearchCV object
grid_dt = RandomizedSearchCV(estimator=dt_tune, param_distributions = params_dt,
                             scoring='roc_auc', cv=5, n_jobs=-1, n_iter=10)
grid_dt.fit(X_train, y_train)

ypred = grid_dt.predict(X_test)
ypred_proba = grid_dt.predict_proba(X_test)[:,1]

# Evaluate acc_test
dt_tune_roc_auc_score = roc_auc_score(y_test, ypred_proba)
print('Test set roc_auc_score of DecisionTree: {:.5f}'.format(dt_tune_roc_auc_score))
Test set roc_auc_score of DecisionTree: 0.90866
grid_dt.best_params_
{'min_samples_leaf': 1, 'max_depth': 10, 'criterion': 'entropy'}

The model with these hyperparameters is used in the next section, where the boosting is done.

Tuning AdaBoostClassifier

The AdaBoostClassifier itself can be tuned further to find the best learning_rate and the optimal number of estimators to use.

tuned_parameters={'n_estimators':[10,20,30,40,50,100,200], 'learning_rate': [0.01, 0.1, 1]}
scores = ['roc_auc']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Instantiate dt
    dt_ada = DecisionTreeClassifier(min_samples_leaf=1, max_depth=10,
                                    criterion='entropy', random_state=7)

    clf_ada = RandomizedSearchCV(AdaBoostClassifier(base_estimator=dt_ada), tuned_parameters, cv=5,
                       scoring='%s' % score, n_iter=10, verbose=1, n_jobs=-1)
    clf_ada.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf_ada.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf_ada.cv_results_['mean_test_score']
    stds = clf_ada.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf_ada.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf_ada.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
# Tuning hyper-parameters for roc_auc

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 140.6min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 144.3min finished


Best parameters set found on development set:

{'n_estimators': 100, 'learning_rate': 0.01}

Grid scores on development set:

0.945 (+/-0.009) for {'n_estimators': 20, 'learning_rate': 0.01}
0.962 (+/-0.005) for {'n_estimators': 100, 'learning_rate': 0.1}
0.925 (+/-0.006) for {'n_estimators': 100, 'learning_rate': 1}
0.965 (+/-0.004) for {'n_estimators': 200, 'learning_rate': 0.1}
0.958 (+/-0.002) for {'n_estimators': 20, 'learning_rate': 0.1}
0.953 (+/-0.007) for {'n_estimators': 30, 'learning_rate': 0.01}
0.932 (+/-0.008) for {'n_estimators': 10, 'learning_rate': 0.01}
0.969 (+/-0.003) for {'n_estimators': 100, 'learning_rate': 0.01}
0.956 (+/-0.006) for {'n_estimators': 40, 'learning_rate': 0.01}
0.910 (+/-0.003) for {'n_estimators': 10, 'learning_rate': 1}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.92      0.95      0.94     28776
          1       0.92      0.88      0.90     19224

avg / total       0.92      0.92      0.92     48000

The best estimator was found to be the one which uses a learning rate of 0.01 with 100 estimators. Predictions using this estimator resulted in a roc_auc_score of 0.968.

clf_ada.best_estimator_
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=7,
            splitter='best'),
          learning_rate=0.01, n_estimators=100, random_state=None)
# get the best model and make predictions
ada_best = clf_ada.best_estimator_
ada_best_ypred = ada_best.predict(X_test)
ada_best_ypred_prob = ada_best.predict_proba(X_test)[:,1]

# calculate the roc_auc_score on X_test
roc_auc_score(y_test, ada_best_ypred_prob)
0.9686331486097618

Since a DecisionTree base estimator was used in the adaptive boosting meta-algorithm, it is again possible to extract the best features. The features contributing most to the model are extracted and plotted.

Interestingly, AdaBoosting also yielded 17 predictors that contribute the most to the model's performance.

# get the best predictors
importances = pd.DataFrame({'feature': X_train.columns, 'importance': ada_best.feature_importances_})
importances = importances.sort_values('importance', ascending=False)

# only display the top 20
topDF1 = importances.head(20)
print(topDF1)

f, ax = plt.subplots(figsize=(15, 6))

g = sns.barplot(x='feature', y='importance', data=topDF1, color="red", saturation=.9, ci=68)
g = plt.xticks(rotation=90)
plt.xlabel('Feature')
plt.ylabel('importance')
plt.title('AdaBoosting Features importance')
   feature  importance
45     x48    0.108718
12     x12    0.096015
23     x23    0.074627
39     x42    0.072313
26     x27    0.071410
20     x20    0.070695
27     x28    0.069197
46     x49    0.068749
34     x37    0.063748
43     x46    0.044468
29     x32    0.041231
7       x7    0.040534
38     x41    0.036494
2       x2    0.029015
35     x38    0.027606
37     x40    0.023242
6       x6    0.017873
36     x39    0.001677
41     x44    0.001653
32     x35    0.001636



[png: AdaBoosting feature importance bar chart]

The top 17 features from AdaBoosting are collected into Ada_importantFeatures for later use.

Ada_importantFeatures = list(topDF1.iloc[0:17,0])
X_train_simple_ada = X_train[Ada_importantFeatures]
X_test_simple_ada = X_test[Ada_importantFeatures]
print(X_train_simple_ada.shape)
print(y_train.shape)
print(X_test_simple_ada.shape)
print(y_test.shape)
(112000, 17)
(112000,)
(48000, 17)
(48000,)

Compare Bagging and Boosting methods

Compare RandomForestClassifier with AdaBoostClassifier

As the RandomForestClassifier and AdaBoostClassifier ensemble methods both gave 17 features as the most useful in predicting the y label, it is interesting to see whether these are the exact same 17 features.

The intersection of the features from the bagging and boosting ensembles is exactly the same set of 17; only the order of importance varies.

def intersection(list1, list2):
    return [value for value in list1 if value in list2]

# importantFeatures holds the top-17 RandomForest features, collected the
# same way as Ada_importantFeatures (see the next section)
print(intersection(Ada_importantFeatures, importantFeatures))
print(len(intersection(Ada_importantFeatures, importantFeatures)))
['x48', 'x12', 'x23', 'x42', 'x27', 'x20', 'x28', 'x49', 'x37', 'x46', 'x32', 'x7', 'x41', 'x2', 'x38', 'x40', 'x6']
17

Improve Results

In this section, the features extracted from the bagging and boosting methods are used to tune the RandomForestClassifier and the Stochastic Gradient Boosting classifier. The tuned models are then used on the test data to see if there is any improvement in the roc_auc_score.

Ensemble of decision trees with important features

First, the training and test sets are trimmed to use only these top 17 features.

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(112000, 67)
(112000,)
(48000, 67)
(48000,)
importantFeatures = list(topDF1.iloc[0:17,0])
X_train_simple = X_train[importantFeatures]
X_test_simple = X_test[importantFeatures]
print(X_train_simple.shape)
print(y_train.shape)
print(X_test_simple.shape)
print(y_test.shape)
(112000, 17)
(112000,)
(48000, 17)
(48000,)
tuned_parameters={'n_estimators':[10,20,30,40,50,100], 'criterion':['gini','entropy'], 'max_depth':[5,10,15],
                 'min_samples_split':[2,4,9]}
scores = ['roc_auc']

for score in scores:
    print("# Tuning REDUCED MODEL's hyper-parameters for %s" % score)
    print()

    clf_reduced = RandomizedSearchCV(RandomForestClassifier(), tuned_parameters, cv=5,
                       scoring='%s' % score, n_iter=10)

    ## NOTE this time using reduced model
    clf_reduced.fit(X_train_simple, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf_reduced.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf_reduced.cv_results_['mean_test_score']
    stds = clf_reduced.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf_reduced.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf_reduced.predict(X_test_simple)
    print(classification_report(y_true, y_pred))
    print()
# Tuning REDUCED MODEL's hyper-parameters for roc_auc

Best parameters set found on development set:

{'n_estimators': 50, 'min_samples_split': 2, 'max_depth': 15, 'criterion': 'entropy'}

Grid scores on development set:

0.968 (+/-0.002) for {'n_estimators': 10, 'min_samples_split': 4, 'max_depth': 15, 'criterion': 'gini'}
0.975 (+/-0.001) for {'n_estimators': 40, 'min_samples_split': 2, 'max_depth': 15, 'criterion': 'gini'}
0.875 (+/-0.007) for {'n_estimators': 100, 'min_samples_split': 9, 'max_depth': 5, 'criterion': 'gini'}
0.873 (+/-0.006) for {'n_estimators': 100, 'min_samples_split': 4, 'max_depth': 5, 'criterion': 'entropy'}
0.975 (+/-0.001) for {'n_estimators': 40, 'min_samples_split': 4, 'max_depth': 15, 'criterion': 'gini'}
0.973 (+/-0.001) for {'n_estimators': 20, 'min_samples_split': 2, 'max_depth': 15, 'criterion': 'gini'}
0.873 (+/-0.003) for {'n_estimators': 50, 'min_samples_split': 4, 'max_depth': 5, 'criterion': 'entropy'}
0.977 (+/-0.001) for {'n_estimators': 50, 'min_samples_split': 2, 'max_depth': 15, 'criterion': 'entropy'}
0.871 (+/-0.007) for {'n_estimators': 40, 'min_samples_split': 4, 'max_depth': 5, 'criterion': 'entropy'}
0.953 (+/-0.002) for {'n_estimators': 20, 'min_samples_split': 9, 'max_depth': 10, 'criterion': 'gini'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.93      0.96      0.94     28776
          1       0.93      0.88      0.91     19224

avg / total       0.93      0.93      0.93     48000
# get the best model and make predictions
bestRF_mdl_reduced = clf_reduced.best_estimator_
RF_ypred_reduced = bestRF_mdl_reduced.predict(X_test_simple)
RF_ypred_prob_reduced = bestRF_mdl_reduced.predict_proba(X_test_simple)[:,1]

# calculate the roc_auc_score on X_test_simple
roc_auc_score(y_test, RF_ypred_prob_reduced)
0.9776587629348727

Using the simpler model gives better performance than the full model, with a roc_auc_score of 0.977. Simple models are more robust.

bestRF_mdl_reduced
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Gradient Boosting with important features

Gradient boosting takes the errors of the previous predictor and corrects them. In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using its predecessor's residual errors as labels. Gradient boosting trees use CART models as base learners, hence the parameter grid can include the parameters applicable to tuning trees.

In this case, only the top 17 features identified in the previous section are used to perform Stochastic Gradient Boosting (SGB). In SGB, each tree is trained on a random subset of the rows of the training data, sampled without replacement; the fraction of rows to use is given by the subsample parameter of the GradientBoostingClassifier. Furthermore, features are sampled (without replacement) when choosing split points. The result is further ensemble diversity, with the net effect of adding further variance to the individual trees.

from sklearn.ensemble import GradientBoostingClassifier

# search over tree-specific hyperparameters plus per-split column subsampling
tuned_parameters = {'n_estimators': [40, 50, 100], 'max_depth': [5, 10, 15],
                    'min_samples_split': [2, 4, 9], 'max_features': [0.6, 0.8]}
scores = ['roc_auc']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # subsample=0.9 makes this Stochastic Gradient Boosting (row subsampling)
    clf_gb = RandomizedSearchCV(GradientBoostingClassifier(subsample=0.9),
                                tuned_parameters, cv=5, scoring=score,
                                n_iter=10, n_jobs=-1, verbose=2)
    clf_gb.fit(X_train_simple_ada, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf_gb.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf_gb.cv_results_['mean_test_score']
    stds = clf_gb.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf_gb.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf_gb.predict(X_test_simple_ada)
    print(classification_report(y_true, y_pred))
    print()
# Tuning hyper-parameters for roc_auc

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV]  n_estimators=50, min_samples_split=9, max_features=0.6, max_depth=5, total=  37.4s
...
[CV]  n_estimators=100, min_samples_split=2, max_features=0.8, max_depth=15, total=61.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 209.9min finished


Best parameters set found on development set:

{'n_estimators': 100, 'min_samples_split': 9, 'max_features': 0.8, 'max_depth': 15}

Grid scores on development set:

0.941 (+/-0.003) for {'n_estimators': 50, 'min_samples_split': 9, 'max_features': 0.6, 'max_depth': 5}
0.959 (+/-0.004) for {'n_estimators': 100, 'min_samples_split': 9, 'max_features': 0.6, 'max_depth': 5}
0.985 (+/-0.001) for {'n_estimators': 100, 'min_samples_split': 9, 'max_features': 0.8, 'max_depth': 15}
0.983 (+/-0.001) for {'n_estimators': 100, 'min_samples_split': 9, 'max_features': 0.6, 'max_depth': 10}
0.983 (+/-0.001) for {'n_estimators': 100, 'min_samples_split': 2, 'max_features': 0.8, 'max_depth': 10}
0.978 (+/-0.001) for {'n_estimators': 40, 'min_samples_split': 4, 'max_features': 0.8, 'max_depth': 10}
0.981 (+/-0.002) for {'n_estimators': 40, 'min_samples_split': 9, 'max_features': 0.8, 'max_depth': 15}
0.984 (+/-0.001) for {'n_estimators': 100, 'min_samples_split': 2, 'max_features': 0.8, 'max_depth': 15}
0.982 (+/-0.001) for {'n_estimators': 50, 'min_samples_split': 2, 'max_features': 0.8, 'max_depth': 15}
0.959 (+/-0.002) for {'n_estimators': 100, 'min_samples_split': 2, 'max_features': 0.8, 'max_depth': 5}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.95      0.96      0.96     28776
          1       0.94      0.92      0.93     19224

avg / total       0.95      0.95      0.95     48000

The best parameter set was found to be {'n_estimators': 100, 'min_samples_split': 9, 'max_features': 0.8, 'max_depth': 15}, with a roc_auc_score of 0.985 on the test set.

clf_gb.best_estimator_
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=15,
              max_features=0.8, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=9,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=0.9, verbose=0,
              warm_start=False)
# get the best model and make predictions
gb_best = clf_gb.best_estimator_
gb_best_ypred = gb_best.predict(X_test_simple_ada)
gb_best_ypred_prob = gb_best.predict_proba(X_test_simple_ada)[:,1]

# calculate the roc_auc_score on X_test
roc_auc_score(y_test, gb_best_ypred_prob)
0.9853109879331402

Summarize the models

Using roc_auc_score as the metric, the best model for predicting the label of y is the tuned Stochastic Gradient Boosting classifier, with a score of 0.985.

modelSummary = {
    'Tuned Stochastic GradientBoostingClassifier with top 17 features': 0.985,
    'Tuned RandomForestClassifier with top 17 features': 0.977,
    'Tuned Adaptive Boosting DecisionTree': 0.968,
    'UnTuned Adaptive Boosting DecisionTree': 0.84,
    'Tuned DecisionTreeClassifier': 0.908,
    'Tuned RandomForestClassifier': 0.966,
    'VotingClassifier': 0.929,
    'BaggingClassifier with DT': 0.97,
    'BaggingClassifier with KNN': 0.90,
    'LogisticRegression': 0.759,
    'LinearDiscriminantAnalysis': 0.758,
    'KNeighborsClassifier': 0.895,
    'CART': 0.844,
    'NaiveBayesClassifier': 0.747
}

modelSummarySeries = pd.Series(modelSummary)
modelSummarySeries.sort_values(ascending=False)
Tuned Stochastic GradientBoostingClassifier with top 17 features   0.985
Tuned RandomForestClassifier with top 17 features                  0.977
BaggingClassifier with DT                                          0.970
Tuned Adaptive Boosting DecisionTree                               0.968
Tuned RandomForestClassifier                                       0.966
VotingClassifier                                                   0.929
Tuned DecisionTreeClassifier                                       0.908
BaggingClassifier with KNN                                         0.900
KNeighborsClassifier                                               0.895
CART                                                               0.844
UnTuned Adaptive Boosting DecisionTree                             0.840
LogisticRegression                                                 0.759
LinearDiscriminantAnalysis                                         0.758
NaiveBayesClassifier                                               0.747
dtype: float64
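
As a quick visual check, the comparison can also be plotted with the matplotlib and pandas utilities already imported (a minimal sketch; the figure size and labels are arbitrary choices):

# plot the sorted scores; the tuned ensembles cluster at the top
modelSummarySeries.sort_values().plot(kind='barh', figsize=(8, 6))
plt.xlabel('roc_auc_score')
plt.title('Model comparison by roc_auc_score')
plt.tight_layout()
plt.show()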

Deploy the model

Finding an accurate machine learning model is not the end of the project. The model must be saved so it can make predictions on new data, especially since training computationally intensive models like gradient boosting takes a considerable amount of time. One way to save models is with the pickle module’s dump utility. The persisted model can be exported to other systems and integrated into a larger pipeline. Keep in mind that the model expects the same number (and order) of features it was trained on.

from pickle import dump
from pickle import load

# persist the best models and the selected feature lists for future use
filename = 'gb_best.sav'
dump(gb_best, open(filename, 'wb'))
dump(importantFeatures, open('importantFeaturesRF.sav', 'wb'))
dump(Ada_importantFeatures, open('importantFeaturesAda.sav', 'wb'))
dump(ada_best, open('ada_best_full.sav', 'wb'))
dump(bestRF_mdl_reduced, open('rf_best.sav', 'wb'))
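
As a sketch of how the persisted model would be used in a later session (new_data is a hypothetical DataFrame standing in for unseen observations):

# load the persisted model and score unseen data
loaded_gb = load(open('gb_best.sav', 'rb'))
# new_data is hypothetical: it must contain the same 17 features,
# in the same order, that the model was trained on
new_pred_prob = loaded_gb.predict_proba(new_data)[:, 1]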

Present Results and Insights

A variety of classifiers were tried out to find which model yields the best score. Machine learning is an empirical process that involves some trial and error, but a model's performance can be improved systematically by understanding the role each hyperparameter plays. A general approach was followed: a variety of classifiers were trained, and the knowledge gained at each step was applied to training meta-algorithms like bagging and boosting. A useful property of these ensemble learners with decision tree base estimators is that the top features contributing to the patterns in the data can be extracted.

Interestingly, both ensemble methods identified the same 17 features as the ones needed to predict the label of y. Given this knowledge, researchers can narrow their analysis by focusing on these features.

A simple model is more robust and the predictions obtained from such a model can be considered more reliable.

importantFeaturesRF = load(open('importantFeaturesRF.sav', 'rb'))
importantFeaturesADA = load(open('importantFeaturesAda.sav', 'rb'))
impFeatures = pd.DataFrame({'Bagging': importantFeaturesRF,
                          'Boosting': importantFeaturesADA})
impFeatures
   Bagging Boosting
0      x23      x48
1      x20      x12
2      x49      x23
3      x48      x42
4      x28      x27
5      x42      x20
6      x12      x28
7      x27      x49
8      x37      x37
9       x7      x46
10     x40      x32
11     x38       x7
12     x41      x41
13     x46       x2
14      x2      x38
15      x6      x40
16     x32       x6
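
A quick set comparison confirms that the two methods agree on which features matter, even though their rankings differ:

# same 17 features, different ordering
set(importantFeaturesRF) == set(importantFeaturesADA)
True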