New York City Airbnb Data Preprocessing
New York City Airbnb Pre-processing
Two important preprocessing tasks for models that rely on distance metrics are:
- Dealing with Outliers
- Scaling data
In this notebook, I will first identify features with outliers and decide whether keeping these outliers is necessary for the overall analysis or not. Secondly, I will explore three different types of Scaling techniques and choose the one that is most relevant to our analysis.
Get the data
Read in the data that was cleaned as part of this blog post: New York City Airbnb Data Cleaning . The cleansed dataset is loaded from s3://skuchkula-sagemaker-airbnb/airbnb_clean.csv
.
# data managing and display libs
import pandas as pd
import numpy as np
import os
import io
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
# sagemaker libraries
import boto3
df_airbnb = pd.read_csv('airbnb_clean.csv')
Identifying outliers
The first step here is to isolate numeric features from text/categorical features. Shown below are the columns which are either text/categorical in nature. I have also excluded some features which don’t make much sense in the context of the problem.
drop_cols = list(df_airbnb.select_dtypes(['O']).columns) + ['id', 'latitude', 'longitude', 'maximum_nights']
drop_cols
['name',
'summary',
'description',
'host_verifications',
'neighbourhood_cleansed',
'neighbourhood_group_cleansed',
'amenities',
'id',
'latitude',
'longitude',
'maximum_nights']
df_airbnb.drop(drop_cols, axis=1, inplace=True)
Ensure that all the columns are in float64
format.
# convert all these to float64
df_airbnb = df_airbnb.astype(float)
df_airbnb.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45605 entries, 0 to 45604
Data columns (total 56 columns):
host_listings_count 45605 non-null float64
host_total_listings_count 45605 non-null float64
accommodates 45605 non-null float64
bathrooms 45605 non-null float64
bedrooms 45605 non-null float64
beds 45605 non-null float64
price 45605 non-null float64
guests_included 45605 non-null float64
extra_people 45605 non-null float64
minimum_nights 45605 non-null float64
availability_30 45605 non-null float64
availability_60 45605 non-null float64
availability_90 45605 non-null float64
availability_365 45605 non-null float64
number_of_reviews 45605 non-null float64
number_of_reviews_ltm 45605 non-null float64
calculated_host_listings_count 45605 non-null float64
calculated_host_listings_count_entire_homes 45605 non-null float64
calculated_host_listings_count_private_rooms 45605 non-null float64
calculated_host_listings_count_shared_rooms 45605 non-null float64
host_is_superhost_f 45605 non-null float64
host_is_superhost_t 45605 non-null float64
host_has_profile_pic_f 45605 non-null float64
host_has_profile_pic_t 45605 non-null float64
host_identity_verified_f 45605 non-null float64
host_identity_verified_t 45605 non-null float64
is_location_exact_f 45605 non-null float64
is_location_exact_t 45605 non-null float64
property_type_Apartment 45605 non-null float64
property_type_Condominium 45605 non-null float64
property_type_House 45605 non-null float64
property_type_Townhouse 45605 non-null float64
room_type_Entire home/apt 45605 non-null float64
room_type_Private room 45605 non-null float64
room_type_Shared room 45605 non-null float64
bed_type_Airbed 45605 non-null float64
bed_type_Couch 45605 non-null float64
bed_type_Futon 45605 non-null float64
bed_type_Pull-out Sofa 45605 non-null float64
bed_type_Real Bed 45605 non-null float64
has_availability_t 45605 non-null float64
requires_license_f 45605 non-null float64
requires_license_t 45605 non-null float64
instant_bookable_f 45605 non-null float64
instant_bookable_t 45605 non-null float64
is_business_travel_ready_f 45605 non-null float64
cancellation_policy_flexible 45605 non-null float64
cancellation_policy_moderate 45605 non-null float64
cancellation_policy_strict 45605 non-null float64
cancellation_policy_strict_14_with_grace_period 45605 non-null float64
cancellation_policy_super_strict_30 45605 non-null float64
cancellation_policy_super_strict_60 45605 non-null float64
require_guest_profile_picture_f 45605 non-null float64
require_guest_profile_picture_t 45605 non-null float64
require_guest_phone_verification_f 45605 non-null float64
require_guest_phone_verification_t 45605 non-null float64
dtypes: float64(56)
memory usage: 19.5 MB
outlier_stats
takes a dataframe and returns outlier statistics for that dataframe. One standard way of identifying outliers is using this formula
Formula: {data_point < first quartile – (1.5) * IQR} (or) {data_point > third quartile + (1.5) * IQR}
# Check if each value of mySeries lies within the outlier range.
# If the value is greater than upperOutlier mark it True
# If the value is lower than lowerOutlier mark it True
# return the total number of outliers for that series.
def numberOfOutliers(mySeries, upperOutlier, lowerOutlier):
return sum((mySeries > upperOutlier.loc[mySeries.name, ]) |\
(mySeries < lowerOutlier.loc[mySeries.name,]))
def outlier_stats(df):
numericDescribe = (df.describe(include='all').T).round(decimals=3)
# Calculate outliers using this formula: first quartile – 1.5·IQR > outlier > third quartile + 1.5·IQR
numericDescribe['IQR'] = numericDescribe['75%'] - numericDescribe['25%']
numericDescribe['outliers'] = (numericDescribe['max'] > (numericDescribe['75%'] + (1.5 * numericDescribe['IQR']))) \
| (numericDescribe['min'] < (numericDescribe['25%'] - (1.5 * numericDescribe['IQR'])))
# Calculate IQR for each column of the dataframe.
IQR = df.quantile(.75) - df.quantile(.25)
# Calculate the upper and lower outlier values
upperOutlier = df.quantile(.75) + (1.5 * (IQR))
lowerOutlier = df.quantile(.25) - (1.5 * (IQR))
# Store the result in a new column
numericDescribe['num_outliers'] = df.apply(numberOfOutliers, args=(upperOutlier, lowerOutlier))
numericDescribe.sort_values('num_outliers', ascending=False, inplace=True)
newColOrder = ['count', 'outliers', 'num_outliers', 'IQR', 'mean', 'std', \
'min', '25%', '50%', '75%', 'max']
numericDescribe = numericDescribe.reindex(columns=newColOrder)
return numericDescribe
outlier_stats_df = outlier_stats(df_airbnb)
outlier_stats_df
count | outliers | num_outliers | IQR | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
bedrooms | 45605.0 | True | 12668 | 0.0 | 1.179 | 0.741 | 0.0 | 1.0 | 1.0 | 1.0 | 14.0 |
cancellation_policy_moderate | 45605.0 | True | 10558 | 0.0 | 0.232 | 0.422 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
host_is_superhost_f | 45605.0 | True | 8752 | 0.0 | 0.808 | 0.394 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
host_is_superhost_t | 45605.0 | True | 8752 | 0.0 | 0.192 | 0.394 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
is_location_exact_t | 45605.0 | True | 7887 | 0.0 | 0.827 | 0.378 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
is_location_exact_f | 45605.0 | True | 7887 | 0.0 | 0.173 | 0.378 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
property_type_Apartment | 45605.0 | True | 7000 | 0.0 | 0.847 | 0.360 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
bathrooms | 45605.0 | True | 6949 | 0.0 | 1.143 | 0.428 | 0.0 | 1.0 | 1.0 | 1.0 | 15.5 |
host_total_listings_count | 45605.0 | True | 6858 | 1.0 | 14.828 | 90.301 | 0.0 | 1.0 | 1.0 | 2.0 | 1465.0 |
host_listings_count | 45605.0 | True | 6858 | 1.0 | 14.828 | 90.301 | 0.0 | 1.0 | 1.0 | 2.0 | 1465.0 |
minimum_nights | 45605.0 | True | 6107 | 4.0 | 7.079 | 20.644 | 1.0 | 1.0 | 3.0 | 5.0 | 1250.0 |
calculated_host_listings_count | 45605.0 | True | 6034 | 1.0 | 7.317 | 35.596 | 1.0 | 1.0 | 1.0 | 2.0 | 343.0 |
number_of_reviews | 45605.0 | True | 5618 | 23.0 | 23.397 | 45.101 | 0.0 | 1.0 | 5.0 | 24.0 | 639.0 |
number_of_reviews_ltm | 45605.0 | True | 5360 | 11.0 | 9.109 | 15.985 | 0.0 | 0.0 | 2.0 | 11.0 | 359.0 |
calculated_host_listings_count_private_rooms | 45605.0 | True | 5014 | 1.0 | 1.485 | 6.507 | 0.0 | 0.0 | 1.0 | 1.0 | 115.0 |
availability_30 | 45605.0 | True | 4129 | 10.0 | 6.614 | 9.460 | 0.0 | 0.0 | 1.0 | 10.0 | 30.0 |
property_type_House | 45605.0 | True | 3846 | 0.0 | 0.084 | 0.278 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
guests_included | 45605.0 | True | 3261 | 1.0 | 1.510 | 1.138 | 1.0 | 1.0 | 1.0 | 2.0 | 16.0 |
calculated_host_listings_count_entire_homes | 45605.0 | True | 3152 | 1.0 | 5.728 | 34.625 | 0.0 | 0.0 | 1.0 | 1.0 | 335.0 |
beds | 45605.0 | True | 2412 | 1.0 | 1.542 | 1.108 | 0.0 | 1.0 | 1.0 | 2.0 | 40.0 |
price | 45605.0 | True | 2359 | 108.0 | 146.268 | 226.386 | 0.0 | 67.0 | 100.0 | 175.0 | 10000.0 |
property_type_Townhouse | 45605.0 | True | 1659 | 0.0 | 0.036 | 0.187 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
calculated_host_listings_count_shared_rooms | 45605.0 | True | 1539 | 0.0 | 0.104 | 0.944 | 0.0 | 0.0 | 0.0 | 0.0 | 23.0 |
property_type_Condominium | 45605.0 | True | 1495 | 0.0 | 0.033 | 0.178 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
extra_people | 45605.0 | True | 1349 | 25.0 | 14.257 | 24.298 | 0.0 | 0.0 | 0.0 | 25.0 | 300.0 |
accommodates | 45605.0 | True | 1261 | 2.0 | 2.819 | 1.848 | 1.0 | 2.0 | 2.0 | 4.0 | 26.0 |
require_guest_phone_verification_f | 45605.0 | True | 1086 | 0.0 | 0.976 | 0.152 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
require_guest_phone_verification_t | 45605.0 | True | 1086 | 0.0 | 0.024 | 0.152 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
room_type_Shared room | 45605.0 | True | 1062 | 0.0 | 0.023 | 0.151 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
require_guest_profile_picture_t | 45605.0 | True | 1020 | 0.0 | 0.022 | 0.148 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
require_guest_profile_picture_f | 45605.0 | True | 1020 | 0.0 | 0.978 | 0.148 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
bed_type_Real Bed | 45605.0 | True | 750 | 0.0 | 0.984 | 0.127 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
bed_type_Futon | 45605.0 | True | 280 | 0.0 | 0.006 | 0.078 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
bed_type_Pull-out Sofa | 45605.0 | True | 240 | 0.0 | 0.005 | 0.072 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
bed_type_Airbed | 45605.0 | True | 166 | 0.0 | 0.004 | 0.060 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
host_has_profile_pic_t | 45605.0 | True | 93 | 0.0 | 0.998 | 0.045 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
host_has_profile_pic_f | 45605.0 | True | 93 | 0.0 | 0.002 | 0.045 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_super_strict_60 | 45605.0 | True | 93 | 0.0 | 0.002 | 0.045 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
bed_type_Couch | 45605.0 | True | 64 | 0.0 | 0.001 | 0.037 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_super_strict_30 | 45605.0 | True | 22 | 0.0 | 0.000 | 0.022 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_strict | 45605.0 | True | 5 | 0.0 | 0.000 | 0.010 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_strict_14_with_grace_period | 45605.0 | False | 0 | 1.0 | 0.448 | 0.497 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
host_identity_verified_t | 45605.0 | False | 0 | 1.0 | 0.484 | 0.500 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
cancellation_policy_flexible | 45605.0 | False | 0 | 1.0 | 0.318 | 0.466 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
availability_90 | 45605.0 | False | 0 | 47.0 | 24.799 | 30.896 | 0.0 | 0.0 | 8.0 | 47.0 | 90.0 |
availability_365 | 45605.0 | False | 0 | 220.0 | 108.421 | 130.655 | 0.0 | 0.0 | 35.0 | 220.0 | 365.0 |
requires_license_t | 45605.0 | False | 0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
is_business_travel_ready_f | 45605.0 | False | 0 | 0.0 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
instant_bookable_t | 45605.0 | False | 0 | 1.0 | 0.366 | 0.482 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
instant_bookable_f | 45605.0 | False | 0 | 1.0 | 0.634 | 0.482 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
host_identity_verified_f | 45605.0 | False | 0 | 1.0 | 0.516 | 0.500 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
requires_license_f | 45605.0 | False | 0 | 0.0 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
has_availability_t | 45605.0 | False | 0 | 0.0 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
availability_60 | 45605.0 | False | 0 | 26.0 | 14.747 | 19.611 | 0.0 | 0.0 | 4.0 | 26.0 | 60.0 |
room_type_Entire home/apt | 45605.0 | False | 0 | 1.0 | 0.515 | 0.500 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
room_type_Private room | 45605.0 | False | 0 | 1.0 | 0.462 | 0.499 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Dealing with outliers
From the above output, we can clearly see that the price column and couple of others have outliers. Using subject matter expertise, we must investigate these outliers individually. For instance the host_listings and minimum_nights have absurdly high values which could be an error on the part of the owner in entering these values. Alternatively, these could be special properties which are not open for all customers but reserved for people who want to sub-lease these properties. By including this feature in our analysis may skew our principal components as these might get more weightage over other features in the dataset. For this reason, I decided to drop these columns.
price has many outliers, however, upon closer examination, it was observed that the price given was for minimum_nights and not per night. For this reason, I have constructed a new feature here called price_adjusted which is the price per night.
df_airbnb['price_adjusted'] = df_airbnb['price'] / df_airbnb['minimum_nights']
df_airbnb.drop(['price', 'minimum_nights'], axis='columns', inplace=True)
drop_more_cols = ['host_total_listings_count',
'host_listings_count',
'host_total_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'availability_365']
df_airbnb.drop(drop_more_cols, axis=1, inplace=True)
outlier_stats(df_airbnb)
count | outliers | num_outliers | IQR | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
bedrooms | 45605.0 | True | 12668 | 0.0 | 1.179 | 0.741 | 0.0 | 1.0 | 1.0 | 1.0 | 14.0 |
cancellation_policy_moderate | 45605.0 | True | 10558 | 0.0 | 0.232 | 0.422 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
host_is_superhost_t | 45605.0 | True | 8752 | 0.0 | 0.192 | 0.394 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
host_is_superhost_f | 45605.0 | True | 8752 | 0.0 | 0.808 | 0.394 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
is_location_exact_f | 45605.0 | True | 7887 | 0.0 | 0.173 | 0.378 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
is_location_exact_t | 45605.0 | True | 7887 | 0.0 | 0.827 | 0.378 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
property_type_Apartment | 45605.0 | True | 7000 | 0.0 | 0.847 | 0.360 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
bathrooms | 45605.0 | True | 6949 | 0.0 | 1.143 | 0.428 | 0.0 | 1.0 | 1.0 | 1.0 | 15.5 |
calculated_host_listings_count | 45605.0 | True | 6034 | 1.0 | 7.317 | 35.596 | 1.0 | 1.0 | 1.0 | 2.0 | 343.0 |
number_of_reviews | 45605.0 | True | 5618 | 23.0 | 23.397 | 45.101 | 0.0 | 1.0 | 5.0 | 24.0 | 639.0 |
number_of_reviews_ltm | 45605.0 | True | 5360 | 11.0 | 9.109 | 15.985 | 0.0 | 0.0 | 2.0 | 11.0 | 359.0 |
availability_30 | 45605.0 | True | 4129 | 10.0 | 6.614 | 9.460 | 0.0 | 0.0 | 1.0 | 10.0 | 30.0 |
property_type_House | 45605.0 | True | 3846 | 0.0 | 0.084 | 0.278 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
guests_included | 45605.0 | True | 3261 | 1.0 | 1.510 | 1.138 | 1.0 | 1.0 | 1.0 | 2.0 | 16.0 |
price_adjusted | 45605.0 | True | 2798 | 60.0 | 65.534 | 130.939 | 0.0 | 19.8 | 43.0 | 79.8 | 8000.0 |
beds | 45605.0 | True | 2412 | 1.0 | 1.542 | 1.108 | 0.0 | 1.0 | 1.0 | 2.0 | 40.0 |
property_type_Townhouse | 45605.0 | True | 1659 | 0.0 | 0.036 | 0.187 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
calculated_host_listings_count_shared_rooms | 45605.0 | True | 1539 | 0.0 | 0.104 | 0.944 | 0.0 | 0.0 | 0.0 | 0.0 | 23.0 |
property_type_Condominium | 45605.0 | True | 1495 | 0.0 | 0.033 | 0.178 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
extra_people | 45605.0 | True | 1349 | 25.0 | 14.257 | 24.298 | 0.0 | 0.0 | 0.0 | 25.0 | 300.0 |
accommodates | 45605.0 | True | 1261 | 2.0 | 2.819 | 1.848 | 1.0 | 2.0 | 2.0 | 4.0 | 26.0 |
require_guest_phone_verification_t | 45605.0 | True | 1086 | 0.0 | 0.024 | 0.152 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
require_guest_phone_verification_f | 45605.0 | True | 1086 | 0.0 | 0.976 | 0.152 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
room_type_Shared room | 45605.0 | True | 1062 | 0.0 | 0.023 | 0.151 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
require_guest_profile_picture_t | 45605.0 | True | 1020 | 0.0 | 0.022 | 0.148 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
require_guest_profile_picture_f | 45605.0 | True | 1020 | 0.0 | 0.978 | 0.148 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
bed_type_Real Bed | 45605.0 | True | 750 | 0.0 | 0.984 | 0.127 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
bed_type_Futon | 45605.0 | True | 280 | 0.0 | 0.006 | 0.078 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
bed_type_Pull-out Sofa | 45605.0 | True | 240 | 0.0 | 0.005 | 0.072 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
bed_type_Airbed | 45605.0 | True | 166 | 0.0 | 0.004 | 0.060 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_super_strict_60 | 45605.0 | True | 93 | 0.0 | 0.002 | 0.045 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
host_has_profile_pic_t | 45605.0 | True | 93 | 0.0 | 0.998 | 0.045 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
host_has_profile_pic_f | 45605.0 | True | 93 | 0.0 | 0.002 | 0.045 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
bed_type_Couch | 45605.0 | True | 64 | 0.0 | 0.001 | 0.037 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_super_strict_30 | 45605.0 | True | 22 | 0.0 | 0.000 | 0.022 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
cancellation_policy_strict | 45605.0 | True | 5 | 0.0 | 0.000 | 0.010 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
availability_60 | 45605.0 | False | 0 | 26.0 | 14.747 | 19.611 | 0.0 | 0.0 | 4.0 | 26.0 | 60.0 |
cancellation_policy_strict_14_with_grace_period | 45605.0 | False | 0 | 1.0 | 0.448 | 0.497 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
host_identity_verified_t | 45605.0 | False | 0 | 1.0 | 0.484 | 0.500 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
requires_license_t | 45605.0 | False | 0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cancellation_policy_flexible | 45605.0 | False | 0 | 1.0 | 0.318 | 0.466 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
is_business_travel_ready_f | 45605.0 | False | 0 | 0.0 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
instant_bookable_t | 45605.0 | False | 0 | 1.0 | 0.366 | 0.482 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
instant_bookable_f | 45605.0 | False | 0 | 1.0 | 0.634 | 0.482 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
host_identity_verified_f | 45605.0 | False | 0 | 1.0 | 0.516 | 0.500 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
requires_license_f | 45605.0 | False | 0 | 0.0 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
has_availability_t | 45605.0 | False | 0 | 0.0 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
room_type_Private room | 45605.0 | False | 0 | 1.0 | 0.462 | 0.499 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
availability_90 | 45605.0 | False | 0 | 47.0 | 24.799 | 30.896 | 0.0 | 0.0 | 8.0 | 47.0 | 90.0 |
room_type_Entire home/apt | 45605.0 | False | 0 | 1.0 | 0.515 | 0.500 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
df=df_airbnb.copy()
Scaling
Many machine learning algorithms work better when features are on a relatively similar scale and close to normally distributed. MinMaxScaler, RobustScaler, StandardScaler, and Normalizer are scikit-learn methods to preprocess data for machine learning.
What does scaling actually mean?: To scale generally means to change the range of the values. The shape of the distribution doesn’t change. Think about how a scale model of a building has the same proportions as the original, just smaller. That’s why we say it is drawn to scale. The range is often set at 0 to 1.
What does standardize mean?: Standardize generally means changing the values so that the distribution standard deviation from the mean equals one. It outputs something very close to a normal distribution. Scaling is often implied.
Why do we need to Scale or Standardize or Normalize? : Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed. Examples of such algorithm families include:
- linear and logistic regression (Remember the famous regression assumptions?)
- nearest neighbors (Better results with similar scale)
- principal components analysis (Finds linear combinations of features, if not on same scale, the explained variance will be attributed to features with larger values.)
- neural networks
- support vector machines (with some kernels)
- linear discriminant analysis
StandardScaler
The StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation. This means that some values might be negative and some positive and are not guaranteed to be in -1 to 1 range.
By looking at the values for bedrooms for instance, we can see that we have a min value of -1.591 and a max value of 17.313. While having negative values by itself is not an issue for PCA analysis, it is not well suited for this particular dataset. As we shall see when we merge this dataset with the features extracted from text which are all in the range 0 to 1, it is better if we have all our numeric features also in the same range of 0 to 1.
For this reason, we are not going to use the StandardScaler.
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
scaled_df = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)
outlier_stats(scaled_df)
count | outliers | num_outliers | IQR | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
bedrooms | 45605.0 | True | 12668 | 0.000 | 0.0 | 1.0 | -1.591 | -0.241 | -0.241 | -0.241 | 17.313 |
cancellation_policy_moderate | 45605.0 | True | 10558 | 0.000 | -0.0 | 1.0 | -0.549 | -0.549 | -0.549 | -0.549 | 1.822 |
host_is_superhost_t | 45605.0 | True | 8752 | 0.000 | -0.0 | 1.0 | -0.487 | -0.487 | -0.487 | -0.487 | 2.052 |
host_is_superhost_f | 45605.0 | True | 8752 | 0.000 | 0.0 | 1.0 | -2.052 | 0.487 | 0.487 | 0.487 | 0.487 |
is_location_exact_f | 45605.0 | True | 7887 | 0.000 | 0.0 | 1.0 | -0.457 | -0.457 | -0.457 | -0.457 | 2.187 |
is_location_exact_t | 45605.0 | True | 7887 | 0.000 | -0.0 | 1.0 | -2.187 | 0.457 | 0.457 | 0.457 | 0.457 |
property_type_Apartment | 45605.0 | True | 7000 | 0.000 | -0.0 | 1.0 | -2.348 | 0.426 | 0.426 | 0.426 | 0.426 |
bathrooms | 45605.0 | True | 6949 | 0.000 | 0.0 | 1.0 | -2.667 | -0.333 | -0.333 | -0.333 | 33.511 |
calculated_host_listings_count | 45605.0 | True | 6034 | 0.028 | -0.0 | 1.0 | -0.177 | -0.177 | -0.177 | -0.149 | 9.431 |
number_of_reviews | 45605.0 | True | 5618 | 0.510 | 0.0 | 1.0 | -0.519 | -0.497 | -0.408 | 0.013 | 13.650 |
number_of_reviews_ltm | 45605.0 | True | 5360 | 0.688 | -0.0 | 1.0 | -0.570 | -0.570 | -0.445 | 0.118 | 21.889 |
availability_30 | 45605.0 | True | 4488 | 1.057 | -0.0 | 1.0 | -0.699 | -0.699 | -0.594 | 0.358 | 2.472 |
property_type_House | 45605.0 | True | 3846 | 0.000 | -0.0 | 1.0 | -0.303 | -0.303 | -0.303 | -0.303 | 3.295 |
guests_included | 45605.0 | True | 3261 | 0.878 | -0.0 | 1.0 | -0.448 | -0.448 | -0.448 | 0.430 | 12.727 |
price_adjusted | 45605.0 | True | 2798 | 0.458 | 0.0 | 1.0 | -0.500 | -0.349 | -0.172 | 0.109 | 60.597 |
beds | 45605.0 | True | 2412 | 0.902 | -0.0 | 1.0 | -1.391 | -0.489 | -0.489 | 0.413 | 34.699 |
property_type_Townhouse | 45605.0 | True | 1659 | 0.000 | 0.0 | 1.0 | -0.194 | -0.194 | -0.194 | -0.194 | 5.147 |
calculated_host_listings_count_shared_rooms | 45605.0 | True | 1539 | 0.000 | 0.0 | 1.0 | -0.110 | -0.110 | -0.110 | -0.110 | 24.257 |
property_type_Condominium | 45605.0 | True | 1495 | 0.000 | 0.0 | 1.0 | -0.184 | -0.184 | -0.184 | -0.184 | 5.432 |
extra_people | 45605.0 | True | 1349 | 1.029 | 0.0 | 1.0 | -0.587 | -0.587 | -0.587 | 0.442 | 11.760 |
accommodates | 45605.0 | True | 1261 | 1.082 | -0.0 | 1.0 | -0.984 | -0.443 | -0.443 | 0.639 | 12.542 |
require_guest_phone_verification_t | 45605.0 | True | 1086 | 0.000 | 0.0 | 1.0 | -0.156 | -0.156 | -0.156 | -0.156 | 6.403 |
require_guest_phone_verification_f | 45605.0 | True | 1086 | 0.000 | -0.0 | 1.0 | -6.403 | 0.156 | 0.156 | 0.156 | 0.156 |
room_type_Shared room | 45605.0 | True | 1062 | 0.000 | 0.0 | 1.0 | -0.154 | -0.154 | -0.154 | -0.154 | 6.476 |
require_guest_profile_picture_t | 45605.0 | True | 1020 | 0.000 | -0.0 | 1.0 | -0.151 | -0.151 | -0.151 | -0.151 | 6.611 |
require_guest_profile_picture_f | 45605.0 | True | 1020 | 0.000 | 0.0 | 1.0 | -6.611 | 0.151 | 0.151 | 0.151 | 0.151 |
bed_type_Real Bed | 45605.0 | True | 750 | 0.000 | -0.0 | 1.0 | -7.733 | 0.129 | 0.129 | 0.129 | 0.129 |
bed_type_Futon | 45605.0 | True | 280 | 0.000 | 0.0 | 1.0 | -0.079 | -0.079 | -0.079 | -0.079 | 12.723 |
bed_type_Pull-out Sofa | 45605.0 | True | 240 | 0.000 | -0.0 | 1.0 | -0.073 | -0.073 | -0.073 | -0.073 | 13.748 |
bed_type_Airbed | 45605.0 | True | 166 | 0.000 | -0.0 | 1.0 | -0.060 | -0.060 | -0.060 | -0.060 | 16.545 |
cancellation_policy_super_strict_60 | 45605.0 | True | 93 | 0.000 | -0.0 | 1.0 | -0.045 | -0.045 | -0.045 | -0.045 | 22.122 |
host_has_profile_pic_t | 45605.0 | True | 93 | 0.000 | -0.0 | 1.0 | -22.122 | 0.045 | 0.045 | 0.045 | 0.045 |
host_has_profile_pic_f | 45605.0 | True | 93 | 0.000 | 0.0 | 1.0 | -0.045 | -0.045 | -0.045 | -0.045 | 22.122 |
bed_type_Couch | 45605.0 | True | 64 | 0.000 | 0.0 | 1.0 | -0.037 | -0.037 | -0.037 | -0.037 | 26.675 |
cancellation_policy_super_strict_30 | 45605.0 | True | 22 | 0.000 | 0.0 | 1.0 | -0.022 | -0.022 | -0.022 | -0.022 | 45.519 |
cancellation_policy_strict | 45605.0 | True | 5 | 0.000 | -0.0 | 1.0 | -0.010 | -0.010 | -0.010 | -0.010 | 95.499 |
availability_60 | 45605.0 | False | 0 | 1.326 | 0.0 | 1.0 | -0.752 | -0.752 | -0.548 | 0.574 | 2.308 |
cancellation_policy_strict_14_with_grace_period | 45605.0 | False | 0 | 2.011 | -0.0 | 1.0 | -0.902 | -0.902 | -0.902 | 1.109 | 1.109 |
host_identity_verified_t | 45605.0 | False | 0 | 2.001 | -0.0 | 1.0 | -0.969 | -0.969 | -0.969 | 1.032 | 1.032 |
requires_license_t | 45605.0 | False | 0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
cancellation_policy_flexible | 45605.0 | False | 0 | 2.148 | 0.0 | 1.0 | -0.682 | -0.682 | -0.682 | 1.466 | 1.466 |
is_business_travel_ready_f | 45605.0 | False | 0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
instant_bookable_t | 45605.0 | False | 0 | 2.076 | -0.0 | 1.0 | -0.760 | -0.760 | -0.760 | 1.316 | 1.316 |
instant_bookable_f | 45605.0 | False | 0 | 2.076 | 0.0 | 1.0 | -1.316 | -1.316 | 0.760 | 0.760 | 0.760 |
host_identity_verified_f | 45605.0 | False | 0 | 2.001 | 0.0 | 1.0 | -1.032 | -1.032 | 0.969 | 0.969 | 0.969 |
requires_license_f | 45605.0 | False | 0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
has_availability_t | 45605.0 | False | 0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
room_type_Private room | 45605.0 | False | 0 | 2.006 | -0.0 | 1.0 | -0.926 | -0.926 | -0.926 | 1.080 | 1.080 |
availability_90 | 45605.0 | False | 0 | 1.522 | -0.0 | 1.0 | -0.803 | -0.803 | -0.544 | 0.719 | 2.110 |
room_type_Entire home/apt | 45605.0 | False | 0 | 2.001 | -0.0 | 1.0 | -1.031 | -1.031 | 0.970 | 0.970 | 0.970 |
MinMaxScaler
The MinMaxScaler works by taking each value in a feature and subtracts it by the minimum value and divides it by the range. The range is the difference between the original maximum and original minimum.
MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data. This is exactly what we want for our dataset. We want to preserve the individual feature distributions while bringing all the features to a 0 to 1 scale. The reason we are doing this is because for the Text Features, we have made use of TfidfVectorizer
which translates the text features to numeric values that are in range 0 to 1.
For this reason, we will be using MinMaxScaler.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
min_max_scaled_df = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)
outlier_stats(min_max_scaled_df)
count | outliers | num_outliers | IQR | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
bedrooms | 45605.0 | True | 12668 | 0.000 | 0.084 | 0.053 | 0.0 | 0.071 | 0.071 | 0.071 | 1.0 |
cancellation_policy_moderate | 45605.0 | True | 10558 | 0.000 | 0.232 | 0.422 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
host_is_superhost_f | 45605.0 | True | 8752 | 0.000 | 0.808 | 0.394 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
host_is_superhost_t | 45605.0 | True | 8752 | 0.000 | 0.192 | 0.394 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
is_location_exact_f | 45605.0 | True | 7887 | 0.000 | 0.173 | 0.378 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
is_location_exact_t | 45605.0 | True | 7887 | 0.000 | 0.827 | 0.378 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
property_type_Apartment | 45605.0 | True | 7000 | 0.000 | 0.847 | 0.360 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
bathrooms | 45605.0 | True | 6949 | 0.000 | 0.074 | 0.028 | 0.0 | 0.065 | 0.065 | 0.065 | 1.0 |
calculated_host_listings_count | 45605.0 | True | 6034 | 0.003 | 0.018 | 0.104 | 0.0 | 0.000 | 0.000 | 0.003 | 1.0 |
number_of_reviews | 45605.0 | True | 5618 | 0.036 | 0.037 | 0.071 | 0.0 | 0.002 | 0.008 | 0.038 | 1.0 |
number_of_reviews_ltm | 45605.0 | True | 5360 | 0.031 | 0.025 | 0.045 | 0.0 | 0.000 | 0.006 | 0.031 | 1.0 |
availability_30 | 45605.0 | True | 4488 | 0.333 | 0.220 | 0.315 | 0.0 | 0.000 | 0.033 | 0.333 | 1.0 |
property_type_House | 45605.0 | True | 3846 | 0.000 | 0.084 | 0.278 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
guests_included | 45605.0 | True | 3261 | 0.067 | 0.034 | 0.076 | 0.0 | 0.000 | 0.000 | 0.067 | 1.0 |
price_adjusted | 45605.0 | True | 2798 | 0.008 | 0.008 | 0.016 | 0.0 | 0.002 | 0.005 | 0.010 | 1.0 |
beds | 45605.0 | True | 2412 | 0.025 | 0.039 | 0.028 | 0.0 | 0.025 | 0.025 | 0.050 | 1.0 |
accommodates | 45605.0 | True | 1703 | 0.080 | 0.073 | 0.074 | 0.0 | 0.040 | 0.040 | 0.120 | 1.0 |
property_type_Townhouse | 45605.0 | True | 1659 | 0.000 | 0.036 | 0.187 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
calculated_host_listings_count_shared_rooms | 45605.0 | True | 1539 | 0.000 | 0.005 | 0.041 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
property_type_Condominium | 45605.0 | True | 1495 | 0.000 | 0.033 | 0.178 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
extra_people | 45605.0 | True | 1349 | 0.083 | 0.048 | 0.081 | 0.0 | 0.000 | 0.000 | 0.083 | 1.0 |
require_guest_phone_verification_t | 45605.0 | True | 1086 | 0.000 | 0.024 | 0.152 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
require_guest_phone_verification_f | 45605.0 | True | 1086 | 0.000 | 0.976 | 0.152 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
room_type_Shared room | 45605.0 | True | 1062 | 0.000 | 0.023 | 0.151 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
require_guest_profile_picture_f | 45605.0 | True | 1020 | 0.000 | 0.978 | 0.148 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
require_guest_profile_picture_t | 45605.0 | True | 1020 | 0.000 | 0.022 | 0.148 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
bed_type_Real Bed | 45605.0 | True | 750 | 0.000 | 0.984 | 0.127 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
bed_type_Futon | 45605.0 | True | 280 | 0.000 | 0.006 | 0.078 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
bed_type_Pull-out Sofa | 45605.0 | True | 240 | 0.000 | 0.005 | 0.072 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
bed_type_Airbed | 45605.0 | True | 166 | 0.000 | 0.004 | 0.060 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
host_has_profile_pic_f | 45605.0 | True | 93 | 0.000 | 0.002 | 0.045 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
host_has_profile_pic_t | 45605.0 | True | 93 | 0.000 | 0.998 | 0.045 | 0.0 | 1.000 | 1.000 | 1.000 | 1.0 |
cancellation_policy_super_strict_60 | 45605.0 | True | 93 | 0.000 | 0.002 | 0.045 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
bed_type_Couch | 45605.0 | True | 64 | 0.000 | 0.001 | 0.037 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
cancellation_policy_super_strict_30 | 45605.0 | True | 22 | 0.000 | 0.000 | 0.022 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
cancellation_policy_strict | 45605.0 | True | 5 | 0.000 | 0.000 | 0.010 | 0.0 | 0.000 | 0.000 | 0.000 | 1.0 |
cancellation_policy_flexible | 45605.0 | False | 0 | 1.000 | 0.318 | 0.466 | 0.0 | 0.000 | 0.000 | 1.000 | 1.0 |
cancellation_policy_strict_14_with_grace_period | 45605.0 | False | 0 | 1.000 | 0.448 | 0.497 | 0.0 | 0.000 | 0.000 | 1.000 | 1.0 |
host_identity_verified_f | 45605.0 | False | 0 | 1.000 | 0.516 | 0.500 | 0.0 | 0.000 | 1.000 | 1.000 | 1.0 |
has_availability_t | 45605.0 | False | 0 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 0.0 |
is_business_travel_ready_f | 45605.0 | False | 0 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 0.0 |
instant_bookable_t | 45605.0 | False | 0 | 1.000 | 0.366 | 0.482 | 0.0 | 0.000 | 0.000 | 1.000 | 1.0 |
instant_bookable_f | 45605.0 | False | 0 | 1.000 | 0.634 | 0.482 | 0.0 | 0.000 | 1.000 | 1.000 | 1.0 |
requires_license_t | 45605.0 | False | 0 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 0.0 |
requires_license_f | 45605.0 | False | 0 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 0.0 |
host_identity_verified_t | 45605.0 | False | 0 | 1.000 | 0.484 | 0.500 | 0.0 | 0.000 | 0.000 | 1.000 | 1.0 |
room_type_Private room | 45605.0 | False | 0 | 1.000 | 0.462 | 0.499 | 0.0 | 0.000 | 0.000 | 1.000 | 1.0 |
availability_60 | 45605.0 | False | 0 | 0.433 | 0.246 | 0.327 | 0.0 | 0.000 | 0.067 | 0.433 | 1.0 |
availability_90 | 45605.0 | False | 0 | 0.522 | 0.276 | 0.343 | 0.0 | 0.000 | 0.089 | 0.522 | 1.0 |
room_type_Entire home/apt | 45605.0 | False | 0 | 1.000 | 0.515 | 0.500 | 0.0 | 0.000 | 1.000 | 1.000 | 1.0 |
RobustScaler
Yet another Scaler is the RobustScaler. The MinMaxScaler is suseptible to outliers, whereas the RobustScaler is not. However, RobustScaler can generate values that outside the range 0 to 1. Moreover, we have dealt with the outlier above in a reasonable manner, so we should be good with using the MinMaxScaler for this dataset.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaled_final_df = scaler.fit_transform(final_df)
scaled_final_df = pd.DataFrame(scaled_final_df, columns=final_df.columns)
scaled_final_df.head()
host_listings_count | host_total_listings_count | accommodates | bathrooms | bedrooms | beds | price | guests_included | extra_people | minimum_nights | ... | description_contains_yankee | description_contains_yard | description_contains_year | description_contains_yellow | description_contains_yoga | description_contains_york | description_contains_young | description_contains_yummy | description_contains_zero | description_contains_zone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.0 | 4.0 | 0.0 | 0.0 | -1.0 | 0.0 | 1.157407 | 1.0 | 0.0 | -0.50 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.462963 | 1.0 | 0.8 | 0.00 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | -0.5 | 0.0 | 0.0 | 0.0 | -0.185185 | 0.0 | 0.8 | 1.75 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.137645 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.925926 | 1.0 | 4.0 | 0.00 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.074083 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.370370 | 0.0 | 1.2 | 10.50 | ... | 0.0 | 0.0 | 0.18384 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 2208 columns
Merge the feature engineered dataset this dataset
In this section, we will be taking the feature engineered dataset that was created as part of this post: test and merge it with the current dataset. Recall, that the featured engineered data set contains Text features along with some newly created categorical features.
merged_df = pd.read_csv('merged_df.csv')
merged_df.head()
amenities_tv | amenities_wifi | amenities_air_conditioning | amenities_kitchen | amenities_paid_parking_off_premises | amenities_free_street_parking | amenities_indoor_fireplace | amenities_heating | amenities_family/kid_friendly | amenities_smoke_detector | ... | description_contains_yankee | description_contains_yard | description_contains_year | description_contains_yellow | description_contains_yoga | description_contains_york | description_contains_young | description_contains_yummy | description_contains_zero | description_contains_zone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.137645 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.074083 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.18384 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 2151 columns
# check if there are any glaring outliers.
outlier_stats(merged_df)
count | outliers | num_outliers | IQR | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
description_contains_home | 45605.0 | True | 11384 | 0.0 | 0.026 | 0.054 | 0.0 | 0.0 | 0.0 | 0.0 | 0.916 |
description_contains_place | 45605.0 | True | 11317 | 0.0 | 0.030 | 0.069 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_city | 45605.0 | True | 11249 | 0.0 | 0.024 | 0.048 | 0.0 | 0.0 | 0.0 | 0.0 | 0.596 |
host_verification_by_selfie | 45605.0 | True | 11126 | 0.0 | 0.244 | 0.429 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_coffee_maker | 45605.0 | True | 11087 | 0.0 | 0.243 | 0.429 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_quiet | 45605.0 | True | 11035 | 0.0 | 0.024 | 0.048 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
host_verification_by_kba | 45605.0 | True | 10985 | 0.0 | 0.241 | 0.428 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_brooklyn | 45605.0 | True | 10794 | 0.0 | 0.029 | 0.063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.679 |
amenities_cable_tv | 45605.0 | True | 10706 | 0.0 | 0.235 | 0.424 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_large | 45605.0 | True | 10567 | 0.0 | 0.025 | 0.055 | 0.0 | 0.0 | 0.0 | 0.0 | 0.762 |
host_verification_by_facebook | 45605.0 | True | 10428 | 0.0 | 0.229 | 0.420 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_time | 45605.0 | True | 10350 | 0.0 | 0.022 | 0.047 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_self_check-in | 45605.0 | True | 10233 | 0.0 | 0.224 | 0.417 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_buzzer/wireless_intercom | 45605.0 | True | 10076 | 0.0 | 0.221 | 0.415 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
host_verification_by_identity_manual | 45605.0 | True | 10025 | 0.0 | 0.220 | 0.414 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_beautiful | 45605.0 | True | 9867 | 0.0 | 0.022 | 0.049 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_need | 45605.0 | True | 9819 | 0.0 | 0.020 | 0.044 | 0.0 | 0.0 | 0.0 | 0.0 | 0.421 |
description_contains_spacious | 45605.0 | True | 9774 | 0.0 | 0.021 | 0.047 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_clean | 45605.0 | True | 9447 | 0.0 | 0.022 | 0.050 | 0.0 | 0.0 | 0.0 | 0.0 | 0.709 |
description_contains_line | 45605.0 | True | 9067 | 0.0 | 0.022 | 0.051 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_long_term_stays_allowed | 45605.0 | True | 8831 | 0.0 | 0.194 | 0.395 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_extra_pillows_and_blankets | 45605.0 | True | 8798 | 0.0 | 0.193 | 0.395 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_share | 45605.0 | True | 8777 | 0.0 | 0.022 | 0.053 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_include | 45605.0 | True | 8605 | 0.0 | 0.018 | 0.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.489 |
description_contains_central | 45605.0 | True | 8570 | 0.0 | 0.022 | 0.052 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
description_contains_comfortable | 45605.0 | True | 8331 | 0.0 | 0.018 | 0.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.748 |
description_contains_walking | 45605.0 | True | 8194 | 0.0 | 0.019 | 0.047 | 0.0 | 0.0 | 0.0 | 0.0 | 0.571 |
description_contains_fully | 45605.0 | True | 8175 | 0.0 | 0.018 | 0.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.658 |
amenities_private_entrance | 45605.0 | True | 8109 | 0.0 | 0.178 | 0.382 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_translation_missing:_en.hosting_amenity_50 | 45605.0 | True | 8104 | 0.0 | 0.178 | 0.382 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
amenities_ceiling_hoist | 45605.0 | True | 3 | 0.0 | 0.000 | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_fixed_grab_bars_for_shower.1 | 45605.0 | True | 1 | 0.0 | 0.000 | 0.005 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000 |
amenities_lock_on_bedroom_door | 45605.0 | False | 0 | 1.0 | 0.372 | 0.483 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_laptop_friendly_workspace | 45605.0 | False | 0 | 1.0 | 0.623 | 0.485 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_iron | 45605.0 | False | 0 | 1.0 | 0.607 | 0.488 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_hair_dryer | 45605.0 | False | 0 | 1.0 | 0.649 | 0.477 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_hangers | 45605.0 | False | 0 | 1.0 | 0.734 | 0.442 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_carbon_monoxide_detector | 45605.0 | False | 0 | 1.0 | 0.689 | 0.463 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_shampoo | 45605.0 | False | 0 | 1.0 | 0.671 | 0.470 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_fire_extinguisher | 45605.0 | False | 0 | 1.0 | 0.367 | 0.482 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
host_verification_by_government_id | 45605.0 | False | 0 | 1.0 | 0.601 | 0.490 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_family/kid_friendly | 45605.0 | False | 0 | 1.0 | 0.269 | 0.443 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
host_verification_by_offline_government_id | 45605.0 | False | 0 | 1.0 | 0.389 | 0.488 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_free_street_parking | 45605.0 | False | 0 | 1.0 | 0.301 | 0.459 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_hot_water | 45605.0 | False | 0 | 1.0 | 0.492 | 0.500 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_dishes_and_silverware | 45605.0 | False | 0 | 1.0 | 0.292 | 0.455 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_bed_linens | 45605.0 | False | 0 | 1.0 | 0.286 | 0.452 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_refrigerator | 45605.0 | False | 0 | 1.0 | 0.327 | 0.469 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_cooking_basics | 45605.0 | False | 0 | 1.0 | 0.273 | 0.446 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_oven | 45605.0 | False | 0 | 1.0 | 0.283 | 0.451 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_stove | 45605.0 | False | 0 | 1.0 | 0.287 | 0.452 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_internet | 45605.0 | False | 0 | 1.0 | 0.306 | 0.461 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_elevator | 45605.0 | False | 0 | 1.0 | 0.270 | 0.444 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_washer | 45605.0 | False | 0 | 1.0 | 0.405 | 0.491 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_dryer | 45605.0 | False | 0 | 1.0 | 0.397 | 0.489 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_microwave | 45605.0 | False | 0 | 1.0 | 0.275 | 0.447 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
amenities_first_aid_kit | 45605.0 | False | 0 | 1.0 | 0.327 | 0.469 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
host_verification_by_jumio | 45605.0 | False | 0 | 1.0 | 0.456 | 0.498 | 0.0 | 0.0 | 0.0 | 1.0 | 1.000 |
host_verification_by_reviews | 45605.0 | False | 0 | 1.0 | 0.722 | 0.448 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
amenities_tv | 45605.0 | False | 0 | 1.0 | 0.671 | 0.470 | 0.0 | 0.0 | 1.0 | 1.0 | 1.000 |
2151 rows × 11 columns
As pointed out earlier, we will be using the MinMaxScaler
as our preferred scaling option. We will be merging this with the feature engineered dataset.
# merge them both
final_df = pd.concat([min_max_scaled_df, merged_df], axis='columns')
Upload pre-processed dataset to S3
final_df.to_csv('min_max_scaled_final_df.csv', index=False)
# boto3 client to get S3 data
s3_client = boto3.client('s3')
bucket_name='skuchkula-sagemaker-airbnb'
# upload it to S3
s3_client.upload_file(Bucket=bucket_name,
Filename='min_max_scaled_final_df.csv',
Key='feature_eng/min_max_scaled_final_df.csv')
Conclusion
We have come a long way from the raw dataset to the final dataset which is ready for modeling. Along the way we dealt with missing values, incorrect data types, outliers, scaling and created several new features that will help us group Airbnb listings that are similar to each other. Our final data set consists of 45,604 listings, with each having 2201 features. Shown below is the shape and head of this final dataset.
final_df.shape
(45605, 2201)
final_df.head()
accommodates | bathrooms | bedrooms | beds | guests_included | extra_people | availability_30 | availability_60 | availability_90 | number_of_reviews | ... | description_contains_yankee | description_contains_yard | description_contains_year | description_contains_yellow | description_contains_yoga | description_contains_york | description_contains_young | description_contains_yummy | description_contains_zero | description_contains_zone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.04 | 0.064516 | 0.000000 | 0.025 | 0.066667 | 0.000000 | 0.433333 | 0.283333 | 0.344444 | 0.071987 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.04 | 0.064516 | 0.071429 | 0.025 | 0.066667 | 0.066667 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.00 | 0.064516 | 0.071429 | 0.025 | 0.000000 | 0.066667 | 0.000000 | 0.000000 | 0.000000 | 0.014085 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.137645 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.04 | 0.064516 | 0.071429 | 0.025 | 0.066667 | 0.333333 | 0.800000 | 0.550000 | 0.700000 | 0.117371 | ... | 0.0 | 0.0 | 0.00000 | 0.0 | 0.000000 | 0.074083 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.04 | 0.064516 | 0.071429 | 0.025 | 0.000000 | 0.100000 | 0.000000 | 0.000000 | 0.000000 | 0.076682 | ... | 0.0 | 0.0 | 0.18384 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 2201 columns