New York City Airbnb Data Preprocessing

26 minute read

New York City Airbnb Pre-processing

Two important preprocessing tasks for models that rely on distance metrics are:

  • Dealing with Outliers
  • Scaling data

In this notebook, I will first identify features with outliers and decide whether keeping these outliers is necessary for the overall analysis or not. Secondly, I will explore three different types of Scaling techniques and choose the one that is most relevant to our analysis.

Get the data

Read in the data that was cleaned as part of this blog post: New York City Airbnb Data Cleaning . The cleansed dataset is loaded from s3://skuchkula-sagemaker-airbnb/airbnb_clean.csv.

# data managing and display libs
import pandas as pd
import numpy as np
import os
import io

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline 

# sagemaker libraries
import boto3
df_airbnb = pd.read_csv('airbnb_clean.csv')

Identifying outliers

The first step here is to isolate numeric features from text/categorical features. Shown below are the columns which are either text/categorical in nature. I have also excluded some features which don’t make much sense in the context of the problem.

drop_cols = list(df_airbnb.select_dtypes(['O']).columns) + ['id', 'latitude', 'longitude', 'maximum_nights']
drop_cols
['name',
 'summary',
 'description',
 'host_verifications',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'amenities',
 'id',
 'latitude',
 'longitude',
 'maximum_nights']
df_airbnb.drop(drop_cols, axis=1, inplace=True)

Ensure that all the columns are in float64 format.

# convert all these to float64
df_airbnb = df_airbnb.astype(float)
df_airbnb.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45605 entries, 0 to 45604
Data columns (total 56 columns):
host_listings_count                                45605 non-null float64
host_total_listings_count                          45605 non-null float64
accommodates                                       45605 non-null float64
bathrooms                                          45605 non-null float64
bedrooms                                           45605 non-null float64
beds                                               45605 non-null float64
price                                              45605 non-null float64
guests_included                                    45605 non-null float64
extra_people                                       45605 non-null float64
minimum_nights                                     45605 non-null float64
availability_30                                    45605 non-null float64
availability_60                                    45605 non-null float64
availability_90                                    45605 non-null float64
availability_365                                   45605 non-null float64
number_of_reviews                                  45605 non-null float64
number_of_reviews_ltm                              45605 non-null float64
calculated_host_listings_count                     45605 non-null float64
calculated_host_listings_count_entire_homes        45605 non-null float64
calculated_host_listings_count_private_rooms       45605 non-null float64
calculated_host_listings_count_shared_rooms        45605 non-null float64
host_is_superhost_f                                45605 non-null float64
host_is_superhost_t                                45605 non-null float64
host_has_profile_pic_f                             45605 non-null float64
host_has_profile_pic_t                             45605 non-null float64
host_identity_verified_f                           45605 non-null float64
host_identity_verified_t                           45605 non-null float64
is_location_exact_f                                45605 non-null float64
is_location_exact_t                                45605 non-null float64
property_type_Apartment                            45605 non-null float64
property_type_Condominium                          45605 non-null float64
property_type_House                                45605 non-null float64
property_type_Townhouse                            45605 non-null float64
room_type_Entire home/apt                          45605 non-null float64
room_type_Private room                             45605 non-null float64
room_type_Shared room                              45605 non-null float64
bed_type_Airbed                                    45605 non-null float64
bed_type_Couch                                     45605 non-null float64
bed_type_Futon                                     45605 non-null float64
bed_type_Pull-out Sofa                             45605 non-null float64
bed_type_Real Bed                                  45605 non-null float64
has_availability_t                                 45605 non-null float64
requires_license_f                                 45605 non-null float64
requires_license_t                                 45605 non-null float64
instant_bookable_f                                 45605 non-null float64
instant_bookable_t                                 45605 non-null float64
is_business_travel_ready_f                         45605 non-null float64
cancellation_policy_flexible                       45605 non-null float64
cancellation_policy_moderate                       45605 non-null float64
cancellation_policy_strict                         45605 non-null float64
cancellation_policy_strict_14_with_grace_period    45605 non-null float64
cancellation_policy_super_strict_30                45605 non-null float64
cancellation_policy_super_strict_60                45605 non-null float64
require_guest_profile_picture_f                    45605 non-null float64
require_guest_profile_picture_t                    45605 non-null float64
require_guest_phone_verification_f                 45605 non-null float64
require_guest_phone_verification_t                 45605 non-null float64
dtypes: float64(56)
memory usage: 19.5 MB

outlier_stats takes a dataframe and returns outlier statistics for that dataframe. One standard way of identifying outliers is using this formula

Formula: {data_point < first quartile – (1.5) * IQR} (or) {data_point > third quartile + (1.5) * IQR}

# Check if each value of mySeries lies within the outlier range.
# If the value is greater than upperOutlier mark it True
# If the value is lower than lowerOutlier mark it True
# return the total number of outliers for that series.
def numberOfOutliers(mySeries, upperOutlier, lowerOutlier):
    return sum((mySeries > upperOutlier.loc[mySeries.name, ]) |\
               (mySeries < lowerOutlier.loc[mySeries.name,]))

def outlier_stats(df):
    numericDescribe = (df.describe(include='all').T).round(decimals=3)
    
    # Calculate outliers using this formula: first quartile – 1.5·IQR > outlier > third quartile + 1.5·IQR
    numericDescribe['IQR'] = numericDescribe['75%'] - numericDescribe['25%']
    numericDescribe['outliers'] = (numericDescribe['max'] > (numericDescribe['75%'] + (1.5 * numericDescribe['IQR']))) \
                            | (numericDescribe['min'] < (numericDescribe['25%'] - (1.5 * numericDescribe['IQR'])))
    
    # Calculate IQR for each column of the dataframe.
    IQR = df.quantile(.75) - df.quantile(.25)
    
    # Calculate the upper and lower outlier values
    upperOutlier = df.quantile(.75) + (1.5 * (IQR))
    lowerOutlier = df.quantile(.25) - (1.5 * (IQR))
    
    # Store the result in a new column
    numericDescribe['num_outliers'] = df.apply(numberOfOutliers, args=(upperOutlier, lowerOutlier))
    numericDescribe.sort_values('num_outliers', ascending=False, inplace=True)
    newColOrder = ['count', 'outliers', 'num_outliers', 'IQR', 'mean', 'std', \
               'min', '25%', '50%', '75%', 'max']
    numericDescribe = numericDescribe.reindex(columns=newColOrder)
    
    return numericDescribe
outlier_stats_df = outlier_stats(df_airbnb)
outlier_stats_df
count outliers num_outliers IQR mean std min 25% 50% 75% max
bedrooms 45605.0 True 12668 0.0 1.179 0.741 0.0 1.0 1.0 1.0 14.0
cancellation_policy_moderate 45605.0 True 10558 0.0 0.232 0.422 0.0 0.0 0.0 0.0 1.0
host_is_superhost_f 45605.0 True 8752 0.0 0.808 0.394 0.0 1.0 1.0 1.0 1.0
host_is_superhost_t 45605.0 True 8752 0.0 0.192 0.394 0.0 0.0 0.0 0.0 1.0
is_location_exact_t 45605.0 True 7887 0.0 0.827 0.378 0.0 1.0 1.0 1.0 1.0
is_location_exact_f 45605.0 True 7887 0.0 0.173 0.378 0.0 0.0 0.0 0.0 1.0
property_type_Apartment 45605.0 True 7000 0.0 0.847 0.360 0.0 1.0 1.0 1.0 1.0
bathrooms 45605.0 True 6949 0.0 1.143 0.428 0.0 1.0 1.0 1.0 15.5
host_total_listings_count 45605.0 True 6858 1.0 14.828 90.301 0.0 1.0 1.0 2.0 1465.0
host_listings_count 45605.0 True 6858 1.0 14.828 90.301 0.0 1.0 1.0 2.0 1465.0
minimum_nights 45605.0 True 6107 4.0 7.079 20.644 1.0 1.0 3.0 5.0 1250.0
calculated_host_listings_count 45605.0 True 6034 1.0 7.317 35.596 1.0 1.0 1.0 2.0 343.0
number_of_reviews 45605.0 True 5618 23.0 23.397 45.101 0.0 1.0 5.0 24.0 639.0
number_of_reviews_ltm 45605.0 True 5360 11.0 9.109 15.985 0.0 0.0 2.0 11.0 359.0
calculated_host_listings_count_private_rooms 45605.0 True 5014 1.0 1.485 6.507 0.0 0.0 1.0 1.0 115.0
availability_30 45605.0 True 4129 10.0 6.614 9.460 0.0 0.0 1.0 10.0 30.0
property_type_House 45605.0 True 3846 0.0 0.084 0.278 0.0 0.0 0.0 0.0 1.0
guests_included 45605.0 True 3261 1.0 1.510 1.138 1.0 1.0 1.0 2.0 16.0
calculated_host_listings_count_entire_homes 45605.0 True 3152 1.0 5.728 34.625 0.0 0.0 1.0 1.0 335.0
beds 45605.0 True 2412 1.0 1.542 1.108 0.0 1.0 1.0 2.0 40.0
price 45605.0 True 2359 108.0 146.268 226.386 0.0 67.0 100.0 175.0 10000.0
property_type_Townhouse 45605.0 True 1659 0.0 0.036 0.187 0.0 0.0 0.0 0.0 1.0
calculated_host_listings_count_shared_rooms 45605.0 True 1539 0.0 0.104 0.944 0.0 0.0 0.0 0.0 23.0
property_type_Condominium 45605.0 True 1495 0.0 0.033 0.178 0.0 0.0 0.0 0.0 1.0
extra_people 45605.0 True 1349 25.0 14.257 24.298 0.0 0.0 0.0 25.0 300.0
accommodates 45605.0 True 1261 2.0 2.819 1.848 1.0 2.0 2.0 4.0 26.0
require_guest_phone_verification_f 45605.0 True 1086 0.0 0.976 0.152 0.0 1.0 1.0 1.0 1.0
require_guest_phone_verification_t 45605.0 True 1086 0.0 0.024 0.152 0.0 0.0 0.0 0.0 1.0
room_type_Shared room 45605.0 True 1062 0.0 0.023 0.151 0.0 0.0 0.0 0.0 1.0
require_guest_profile_picture_t 45605.0 True 1020 0.0 0.022 0.148 0.0 0.0 0.0 0.0 1.0
require_guest_profile_picture_f 45605.0 True 1020 0.0 0.978 0.148 0.0 1.0 1.0 1.0 1.0
bed_type_Real Bed 45605.0 True 750 0.0 0.984 0.127 0.0 1.0 1.0 1.0 1.0
bed_type_Futon 45605.0 True 280 0.0 0.006 0.078 0.0 0.0 0.0 0.0 1.0
bed_type_Pull-out Sofa 45605.0 True 240 0.0 0.005 0.072 0.0 0.0 0.0 0.0 1.0
bed_type_Airbed 45605.0 True 166 0.0 0.004 0.060 0.0 0.0 0.0 0.0 1.0
host_has_profile_pic_t 45605.0 True 93 0.0 0.998 0.045 0.0 1.0 1.0 1.0 1.0
host_has_profile_pic_f 45605.0 True 93 0.0 0.002 0.045 0.0 0.0 0.0 0.0 1.0
cancellation_policy_super_strict_60 45605.0 True 93 0.0 0.002 0.045 0.0 0.0 0.0 0.0 1.0
bed_type_Couch 45605.0 True 64 0.0 0.001 0.037 0.0 0.0 0.0 0.0 1.0
cancellation_policy_super_strict_30 45605.0 True 22 0.0 0.000 0.022 0.0 0.0 0.0 0.0 1.0
cancellation_policy_strict 45605.0 True 5 0.0 0.000 0.010 0.0 0.0 0.0 0.0 1.0
cancellation_policy_strict_14_with_grace_period 45605.0 False 0 1.0 0.448 0.497 0.0 0.0 0.0 1.0 1.0
host_identity_verified_t 45605.0 False 0 1.0 0.484 0.500 0.0 0.0 0.0 1.0 1.0
cancellation_policy_flexible 45605.0 False 0 1.0 0.318 0.466 0.0 0.0 0.0 1.0 1.0
availability_90 45605.0 False 0 47.0 24.799 30.896 0.0 0.0 8.0 47.0 90.0
availability_365 45605.0 False 0 220.0 108.421 130.655 0.0 0.0 35.0 220.0 365.0
requires_license_t 45605.0 False 0 0.0 0.000 0.000 0.0 0.0 0.0 0.0 0.0
is_business_travel_ready_f 45605.0 False 0 0.0 1.000 0.000 1.0 1.0 1.0 1.0 1.0
instant_bookable_t 45605.0 False 0 1.0 0.366 0.482 0.0 0.0 0.0 1.0 1.0
instant_bookable_f 45605.0 False 0 1.0 0.634 0.482 0.0 0.0 1.0 1.0 1.0
host_identity_verified_f 45605.0 False 0 1.0 0.516 0.500 0.0 0.0 1.0 1.0 1.0
requires_license_f 45605.0 False 0 0.0 1.000 0.000 1.0 1.0 1.0 1.0 1.0
has_availability_t 45605.0 False 0 0.0 1.000 0.000 1.0 1.0 1.0 1.0 1.0
availability_60 45605.0 False 0 26.0 14.747 19.611 0.0 0.0 4.0 26.0 60.0
room_type_Entire home/apt 45605.0 False 0 1.0 0.515 0.500 0.0 0.0 1.0 1.0 1.0
room_type_Private room 45605.0 False 0 1.0 0.462 0.499 0.0 0.0 0.0 1.0 1.0

Dealing with outliers

From the above output, we can clearly see that the price column and couple of others have outliers. Using subject matter expertise, we must investigate these outliers individually. For instance the host_listings and minimum_nights have absurdly high values which could be an error on the part of the owner in entering these values. Alternatively, these could be special properties which are not open for all customers but reserved for people who want to sub-lease these properties. By including this feature in our analysis may skew our principal components as these might get more weightage over other features in the dataset. For this reason, I decided to drop these columns.

price has many outliers, however, upon closer examination, it was observed that the price given was for minimum_nights and not per night. For this reason, I have constructed a new feature here called price_adjusted which is the price per night.

df_airbnb['price_adjusted'] = df_airbnb['price'] / df_airbnb['minimum_nights']
df_airbnb.drop(['price', 'minimum_nights'], axis='columns', inplace=True)
drop_more_cols = ['host_total_listings_count',
                  'host_listings_count',
                 'host_total_listings_count',
                 'calculated_host_listings_count_entire_homes',
                 'calculated_host_listings_count_private_rooms',
                 'availability_365']
df_airbnb.drop(drop_more_cols, axis=1, inplace=True)
outlier_stats(df_airbnb)
count outliers num_outliers IQR mean std min 25% 50% 75% max
bedrooms 45605.0 True 12668 0.0 1.179 0.741 0.0 1.0 1.0 1.0 14.0
cancellation_policy_moderate 45605.0 True 10558 0.0 0.232 0.422 0.0 0.0 0.0 0.0 1.0
host_is_superhost_t 45605.0 True 8752 0.0 0.192 0.394 0.0 0.0 0.0 0.0 1.0
host_is_superhost_f 45605.0 True 8752 0.0 0.808 0.394 0.0 1.0 1.0 1.0 1.0
is_location_exact_f 45605.0 True 7887 0.0 0.173 0.378 0.0 0.0 0.0 0.0 1.0
is_location_exact_t 45605.0 True 7887 0.0 0.827 0.378 0.0 1.0 1.0 1.0 1.0
property_type_Apartment 45605.0 True 7000 0.0 0.847 0.360 0.0 1.0 1.0 1.0 1.0
bathrooms 45605.0 True 6949 0.0 1.143 0.428 0.0 1.0 1.0 1.0 15.5
calculated_host_listings_count 45605.0 True 6034 1.0 7.317 35.596 1.0 1.0 1.0 2.0 343.0
number_of_reviews 45605.0 True 5618 23.0 23.397 45.101 0.0 1.0 5.0 24.0 639.0
number_of_reviews_ltm 45605.0 True 5360 11.0 9.109 15.985 0.0 0.0 2.0 11.0 359.0
availability_30 45605.0 True 4129 10.0 6.614 9.460 0.0 0.0 1.0 10.0 30.0
property_type_House 45605.0 True 3846 0.0 0.084 0.278 0.0 0.0 0.0 0.0 1.0
guests_included 45605.0 True 3261 1.0 1.510 1.138 1.0 1.0 1.0 2.0 16.0
price_adjusted 45605.0 True 2798 60.0 65.534 130.939 0.0 19.8 43.0 79.8 8000.0
beds 45605.0 True 2412 1.0 1.542 1.108 0.0 1.0 1.0 2.0 40.0
property_type_Townhouse 45605.0 True 1659 0.0 0.036 0.187 0.0 0.0 0.0 0.0 1.0
calculated_host_listings_count_shared_rooms 45605.0 True 1539 0.0 0.104 0.944 0.0 0.0 0.0 0.0 23.0
property_type_Condominium 45605.0 True 1495 0.0 0.033 0.178 0.0 0.0 0.0 0.0 1.0
extra_people 45605.0 True 1349 25.0 14.257 24.298 0.0 0.0 0.0 25.0 300.0
accommodates 45605.0 True 1261 2.0 2.819 1.848 1.0 2.0 2.0 4.0 26.0
require_guest_phone_verification_t 45605.0 True 1086 0.0 0.024 0.152 0.0 0.0 0.0 0.0 1.0
require_guest_phone_verification_f 45605.0 True 1086 0.0 0.976 0.152 0.0 1.0 1.0 1.0 1.0
room_type_Shared room 45605.0 True 1062 0.0 0.023 0.151 0.0 0.0 0.0 0.0 1.0
require_guest_profile_picture_t 45605.0 True 1020 0.0 0.022 0.148 0.0 0.0 0.0 0.0 1.0
require_guest_profile_picture_f 45605.0 True 1020 0.0 0.978 0.148 0.0 1.0 1.0 1.0 1.0
bed_type_Real Bed 45605.0 True 750 0.0 0.984 0.127 0.0 1.0 1.0 1.0 1.0
bed_type_Futon 45605.0 True 280 0.0 0.006 0.078 0.0 0.0 0.0 0.0 1.0
bed_type_Pull-out Sofa 45605.0 True 240 0.0 0.005 0.072 0.0 0.0 0.0 0.0 1.0
bed_type_Airbed 45605.0 True 166 0.0 0.004 0.060 0.0 0.0 0.0 0.0 1.0
cancellation_policy_super_strict_60 45605.0 True 93 0.0 0.002 0.045 0.0 0.0 0.0 0.0 1.0
host_has_profile_pic_t 45605.0 True 93 0.0 0.998 0.045 0.0 1.0 1.0 1.0 1.0
host_has_profile_pic_f 45605.0 True 93 0.0 0.002 0.045 0.0 0.0 0.0 0.0 1.0
bed_type_Couch 45605.0 True 64 0.0 0.001 0.037 0.0 0.0 0.0 0.0 1.0
cancellation_policy_super_strict_30 45605.0 True 22 0.0 0.000 0.022 0.0 0.0 0.0 0.0 1.0
cancellation_policy_strict 45605.0 True 5 0.0 0.000 0.010 0.0 0.0 0.0 0.0 1.0
availability_60 45605.0 False 0 26.0 14.747 19.611 0.0 0.0 4.0 26.0 60.0
cancellation_policy_strict_14_with_grace_period 45605.0 False 0 1.0 0.448 0.497 0.0 0.0 0.0 1.0 1.0
host_identity_verified_t 45605.0 False 0 1.0 0.484 0.500 0.0 0.0 0.0 1.0 1.0
requires_license_t 45605.0 False 0 0.0 0.000 0.000 0.0 0.0 0.0 0.0 0.0
cancellation_policy_flexible 45605.0 False 0 1.0 0.318 0.466 0.0 0.0 0.0 1.0 1.0
is_business_travel_ready_f 45605.0 False 0 0.0 1.000 0.000 1.0 1.0 1.0 1.0 1.0
instant_bookable_t 45605.0 False 0 1.0 0.366 0.482 0.0 0.0 0.0 1.0 1.0
instant_bookable_f 45605.0 False 0 1.0 0.634 0.482 0.0 0.0 1.0 1.0 1.0
host_identity_verified_f 45605.0 False 0 1.0 0.516 0.500 0.0 0.0 1.0 1.0 1.0
requires_license_f 45605.0 False 0 0.0 1.000 0.000 1.0 1.0 1.0 1.0 1.0
has_availability_t 45605.0 False 0 0.0 1.000 0.000 1.0 1.0 1.0 1.0 1.0
room_type_Private room 45605.0 False 0 1.0 0.462 0.499 0.0 0.0 0.0 1.0 1.0
availability_90 45605.0 False 0 47.0 24.799 30.896 0.0 0.0 8.0 47.0 90.0
room_type_Entire home/apt 45605.0 False 0 1.0 0.515 0.500 0.0 0.0 1.0 1.0 1.0
df=df_airbnb.copy()

Scaling

Many machine learning algorithms work better when features are on a relatively similar scale and close to normally distributed. MinMaxScaler, RobustScaler, StandardScaler, and Normalizer are scikit-learn methods to preprocess data for machine learning.

What does scaling actually mean?: To scale generally means to change the range of the values. The shape of the distribution doesn’t change. Think about how a scale model of a building has the same proportions as the original, just smaller. That’s why we say it is drawn to scale. The range is often set at 0 to 1.

What does standardize mean?: Standardize generally means changing the values so that the distribution standard deviation from the mean equals one. It outputs something very close to a normal distribution. Scaling is often implied.

Why do we need to Scale or Standardize or Normalize? : Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed. Examples of such algorithm families include:

  • linear and logistic regression (Remember the famous regression assumptions?)
  • nearest neighbors (Better results with similar scale)
  • principal components analysis (Finds linear combinations of features, if not on same scale, the explained variance will be attributed to features with larger values.)
  • neural networks
  • support vector machines (with some kernels)
  • linear discriminant analysis

StandardScaler

The StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation. This means that some values might be negative and some positive and are not guaranteed to be in -1 to 1 range.

By looking at the values for bedrooms for instance, we can see that we have a min value of -1.591 and a max value of 17.313. While having negative values by itself is not an issue for PCA analysis, it is not well suited for this particular dataset. As we shall see when we merge this dataset with the features extracted from text which are all in the range 0 to 1, it is better if we have all our numeric features also in the same range of 0 to 1.

For this reason, we are not going to use the StandardScaler.

from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
scaled_df = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)
outlier_stats(scaled_df)
count outliers num_outliers IQR mean std min 25% 50% 75% max
bedrooms 45605.0 True 12668 0.000 0.0 1.0 -1.591 -0.241 -0.241 -0.241 17.313
cancellation_policy_moderate 45605.0 True 10558 0.000 -0.0 1.0 -0.549 -0.549 -0.549 -0.549 1.822
host_is_superhost_t 45605.0 True 8752 0.000 -0.0 1.0 -0.487 -0.487 -0.487 -0.487 2.052
host_is_superhost_f 45605.0 True 8752 0.000 0.0 1.0 -2.052 0.487 0.487 0.487 0.487
is_location_exact_f 45605.0 True 7887 0.000 0.0 1.0 -0.457 -0.457 -0.457 -0.457 2.187
is_location_exact_t 45605.0 True 7887 0.000 -0.0 1.0 -2.187 0.457 0.457 0.457 0.457
property_type_Apartment 45605.0 True 7000 0.000 -0.0 1.0 -2.348 0.426 0.426 0.426 0.426
bathrooms 45605.0 True 6949 0.000 0.0 1.0 -2.667 -0.333 -0.333 -0.333 33.511
calculated_host_listings_count 45605.0 True 6034 0.028 -0.0 1.0 -0.177 -0.177 -0.177 -0.149 9.431
number_of_reviews 45605.0 True 5618 0.510 0.0 1.0 -0.519 -0.497 -0.408 0.013 13.650
number_of_reviews_ltm 45605.0 True 5360 0.688 -0.0 1.0 -0.570 -0.570 -0.445 0.118 21.889
availability_30 45605.0 True 4488 1.057 -0.0 1.0 -0.699 -0.699 -0.594 0.358 2.472
property_type_House 45605.0 True 3846 0.000 -0.0 1.0 -0.303 -0.303 -0.303 -0.303 3.295
guests_included 45605.0 True 3261 0.878 -0.0 1.0 -0.448 -0.448 -0.448 0.430 12.727
price_adjusted 45605.0 True 2798 0.458 0.0 1.0 -0.500 -0.349 -0.172 0.109 60.597
beds 45605.0 True 2412 0.902 -0.0 1.0 -1.391 -0.489 -0.489 0.413 34.699
property_type_Townhouse 45605.0 True 1659 0.000 0.0 1.0 -0.194 -0.194 -0.194 -0.194 5.147
calculated_host_listings_count_shared_rooms 45605.0 True 1539 0.000 0.0 1.0 -0.110 -0.110 -0.110 -0.110 24.257
property_type_Condominium 45605.0 True 1495 0.000 0.0 1.0 -0.184 -0.184 -0.184 -0.184 5.432
extra_people 45605.0 True 1349 1.029 0.0 1.0 -0.587 -0.587 -0.587 0.442 11.760
accommodates 45605.0 True 1261 1.082 -0.0 1.0 -0.984 -0.443 -0.443 0.639 12.542
require_guest_phone_verification_t 45605.0 True 1086 0.000 0.0 1.0 -0.156 -0.156 -0.156 -0.156 6.403
require_guest_phone_verification_f 45605.0 True 1086 0.000 -0.0 1.0 -6.403 0.156 0.156 0.156 0.156
room_type_Shared room 45605.0 True 1062 0.000 0.0 1.0 -0.154 -0.154 -0.154 -0.154 6.476
require_guest_profile_picture_t 45605.0 True 1020 0.000 -0.0 1.0 -0.151 -0.151 -0.151 -0.151 6.611
require_guest_profile_picture_f 45605.0 True 1020 0.000 0.0 1.0 -6.611 0.151 0.151 0.151 0.151
bed_type_Real Bed 45605.0 True 750 0.000 -0.0 1.0 -7.733 0.129 0.129 0.129 0.129
bed_type_Futon 45605.0 True 280 0.000 0.0 1.0 -0.079 -0.079 -0.079 -0.079 12.723
bed_type_Pull-out Sofa 45605.0 True 240 0.000 -0.0 1.0 -0.073 -0.073 -0.073 -0.073 13.748
bed_type_Airbed 45605.0 True 166 0.000 -0.0 1.0 -0.060 -0.060 -0.060 -0.060 16.545
cancellation_policy_super_strict_60 45605.0 True 93 0.000 -0.0 1.0 -0.045 -0.045 -0.045 -0.045 22.122
host_has_profile_pic_t 45605.0 True 93 0.000 -0.0 1.0 -22.122 0.045 0.045 0.045 0.045
host_has_profile_pic_f 45605.0 True 93 0.000 0.0 1.0 -0.045 -0.045 -0.045 -0.045 22.122
bed_type_Couch 45605.0 True 64 0.000 0.0 1.0 -0.037 -0.037 -0.037 -0.037 26.675
cancellation_policy_super_strict_30 45605.0 True 22 0.000 0.0 1.0 -0.022 -0.022 -0.022 -0.022 45.519
cancellation_policy_strict 45605.0 True 5 0.000 -0.0 1.0 -0.010 -0.010 -0.010 -0.010 95.499
availability_60 45605.0 False 0 1.326 0.0 1.0 -0.752 -0.752 -0.548 0.574 2.308
cancellation_policy_strict_14_with_grace_period 45605.0 False 0 2.011 -0.0 1.0 -0.902 -0.902 -0.902 1.109 1.109
host_identity_verified_t 45605.0 False 0 2.001 -0.0 1.0 -0.969 -0.969 -0.969 1.032 1.032
requires_license_t 45605.0 False 0 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000
cancellation_policy_flexible 45605.0 False 0 2.148 0.0 1.0 -0.682 -0.682 -0.682 1.466 1.466
is_business_travel_ready_f 45605.0 False 0 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000
instant_bookable_t 45605.0 False 0 2.076 -0.0 1.0 -0.760 -0.760 -0.760 1.316 1.316
instant_bookable_f 45605.0 False 0 2.076 0.0 1.0 -1.316 -1.316 0.760 0.760 0.760
host_identity_verified_f 45605.0 False 0 2.001 0.0 1.0 -1.032 -1.032 0.969 0.969 0.969
requires_license_f 45605.0 False 0 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000
has_availability_t 45605.0 False 0 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000
room_type_Private room 45605.0 False 0 2.006 -0.0 1.0 -0.926 -0.926 -0.926 1.080 1.080
availability_90 45605.0 False 0 1.522 -0.0 1.0 -0.803 -0.803 -0.544 0.719 2.110
room_type_Entire home/apt 45605.0 False 0 2.001 -0.0 1.0 -1.031 -1.031 0.970 0.970 0.970

MinMaxScaler

The MinMaxScaler works by taking each value in a feature and subtracts it by the minimum value and divides it by the range. The range is the difference between the original maximum and original minimum.

MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data. This is exactly what we want for our dataset. We want to preserve the individual feature distributions while bringing all the features to a 0 to 1 scale. The reason we are doing this is because for the Text Features, we have made use of TfidfVectorizer which translates the text features to numeric values that are in range 0 to 1.

For this reason, we will be using MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
min_max_scaled_df = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)
outlier_stats(min_max_scaled_df)
count outliers num_outliers IQR mean std min 25% 50% 75% max
bedrooms 45605.0 True 12668 0.000 0.084 0.053 0.0 0.071 0.071 0.071 1.0
cancellation_policy_moderate 45605.0 True 10558 0.000 0.232 0.422 0.0 0.000 0.000 0.000 1.0
host_is_superhost_f 45605.0 True 8752 0.000 0.808 0.394 0.0 1.000 1.000 1.000 1.0
host_is_superhost_t 45605.0 True 8752 0.000 0.192 0.394 0.0 0.000 0.000 0.000 1.0
is_location_exact_f 45605.0 True 7887 0.000 0.173 0.378 0.0 0.000 0.000 0.000 1.0
is_location_exact_t 45605.0 True 7887 0.000 0.827 0.378 0.0 1.000 1.000 1.000 1.0
property_type_Apartment 45605.0 True 7000 0.000 0.847 0.360 0.0 1.000 1.000 1.000 1.0
bathrooms 45605.0 True 6949 0.000 0.074 0.028 0.0 0.065 0.065 0.065 1.0
calculated_host_listings_count 45605.0 True 6034 0.003 0.018 0.104 0.0 0.000 0.000 0.003 1.0
number_of_reviews 45605.0 True 5618 0.036 0.037 0.071 0.0 0.002 0.008 0.038 1.0
number_of_reviews_ltm 45605.0 True 5360 0.031 0.025 0.045 0.0 0.000 0.006 0.031 1.0
availability_30 45605.0 True 4488 0.333 0.220 0.315 0.0 0.000 0.033 0.333 1.0
property_type_House 45605.0 True 3846 0.000 0.084 0.278 0.0 0.000 0.000 0.000 1.0
guests_included 45605.0 True 3261 0.067 0.034 0.076 0.0 0.000 0.000 0.067 1.0
price_adjusted 45605.0 True 2798 0.008 0.008 0.016 0.0 0.002 0.005 0.010 1.0
beds 45605.0 True 2412 0.025 0.039 0.028 0.0 0.025 0.025 0.050 1.0
accommodates 45605.0 True 1703 0.080 0.073 0.074 0.0 0.040 0.040 0.120 1.0
property_type_Townhouse 45605.0 True 1659 0.000 0.036 0.187 0.0 0.000 0.000 0.000 1.0
calculated_host_listings_count_shared_rooms 45605.0 True 1539 0.000 0.005 0.041 0.0 0.000 0.000 0.000 1.0
property_type_Condominium 45605.0 True 1495 0.000 0.033 0.178 0.0 0.000 0.000 0.000 1.0
extra_people 45605.0 True 1349 0.083 0.048 0.081 0.0 0.000 0.000 0.083 1.0
require_guest_phone_verification_t 45605.0 True 1086 0.000 0.024 0.152 0.0 0.000 0.000 0.000 1.0
require_guest_phone_verification_f 45605.0 True 1086 0.000 0.976 0.152 0.0 1.000 1.000 1.000 1.0
room_type_Shared room 45605.0 True 1062 0.000 0.023 0.151 0.0 0.000 0.000 0.000 1.0
require_guest_profile_picture_f 45605.0 True 1020 0.000 0.978 0.148 0.0 1.000 1.000 1.000 1.0
require_guest_profile_picture_t 45605.0 True 1020 0.000 0.022 0.148 0.0 0.000 0.000 0.000 1.0
bed_type_Real Bed 45605.0 True 750 0.000 0.984 0.127 0.0 1.000 1.000 1.000 1.0
bed_type_Futon 45605.0 True 280 0.000 0.006 0.078 0.0 0.000 0.000 0.000 1.0
bed_type_Pull-out Sofa 45605.0 True 240 0.000 0.005 0.072 0.0 0.000 0.000 0.000 1.0
bed_type_Airbed 45605.0 True 166 0.000 0.004 0.060 0.0 0.000 0.000 0.000 1.0
host_has_profile_pic_f 45605.0 True 93 0.000 0.002 0.045 0.0 0.000 0.000 0.000 1.0
host_has_profile_pic_t 45605.0 True 93 0.000 0.998 0.045 0.0 1.000 1.000 1.000 1.0
cancellation_policy_super_strict_60 45605.0 True 93 0.000 0.002 0.045 0.0 0.000 0.000 0.000 1.0
bed_type_Couch 45605.0 True 64 0.000 0.001 0.037 0.0 0.000 0.000 0.000 1.0
cancellation_policy_super_strict_30 45605.0 True 22 0.000 0.000 0.022 0.0 0.000 0.000 0.000 1.0
cancellation_policy_strict 45605.0 True 5 0.000 0.000 0.010 0.0 0.000 0.000 0.000 1.0
cancellation_policy_flexible 45605.0 False 0 1.000 0.318 0.466 0.0 0.000 0.000 1.000 1.0
cancellation_policy_strict_14_with_grace_period 45605.0 False 0 1.000 0.448 0.497 0.0 0.000 0.000 1.000 1.0
host_identity_verified_f 45605.0 False 0 1.000 0.516 0.500 0.0 0.000 1.000 1.000 1.0
has_availability_t 45605.0 False 0 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0
is_business_travel_ready_f 45605.0 False 0 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0
instant_bookable_t 45605.0 False 0 1.000 0.366 0.482 0.0 0.000 0.000 1.000 1.0
instant_bookable_f 45605.0 False 0 1.000 0.634 0.482 0.0 0.000 1.000 1.000 1.0
requires_license_t 45605.0 False 0 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0
requires_license_f 45605.0 False 0 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0
host_identity_verified_t 45605.0 False 0 1.000 0.484 0.500 0.0 0.000 0.000 1.000 1.0
room_type_Private room 45605.0 False 0 1.000 0.462 0.499 0.0 0.000 0.000 1.000 1.0
availability_60 45605.0 False 0 0.433 0.246 0.327 0.0 0.000 0.067 0.433 1.0
availability_90 45605.0 False 0 0.522 0.276 0.343 0.0 0.000 0.089 0.522 1.0
room_type_Entire home/apt 45605.0 False 0 1.000 0.515 0.500 0.0 0.000 1.000 1.000 1.0

RobustScaler

Yet another Scaler is the RobustScaler. The MinMaxScaler is suseptible to outliers, whereas the RobustScaler is not. However, RobustScaler can generate values that outside the range 0 to 1. Moreover, we have dealt with the outlier above in a reasonable manner, so we should be good with using the MinMaxScaler for this dataset.

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaled_final_df = scaler.fit_transform(final_df)
scaled_final_df = pd.DataFrame(scaled_final_df, columns=final_df.columns)

scaled_final_df.head()
host_listings_count host_total_listings_count accommodates bathrooms bedrooms beds price guests_included extra_people minimum_nights ... description_contains_yankee description_contains_yard description_contains_year description_contains_yellow description_contains_yoga description_contains_york description_contains_young description_contains_yummy description_contains_zero description_contains_zone
0 4.0 4.0 0.0 0.0 -1.0 0.0 1.157407 1.0 0.0 -0.50 ... 0.0 0.0 0.00000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.462963 1.0 0.8 0.00 ... 0.0 0.0 0.00000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
2 0.0 0.0 -0.5 0.0 0.0 0.0 -0.185185 0.0 0.8 1.75 ... 0.0 0.0 0.00000 0.0 0.137645 0.000000 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.925926 1.0 4.0 0.00 ... 0.0 0.0 0.00000 0.0 0.000000 0.074083 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 -0.370370 0.0 1.2 10.50 ... 0.0 0.0 0.18384 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

5 rows × 2208 columns

Merge the feature engineered dataset this dataset

In this section, we will be taking the feature engineered dataset that was created as part of this post: test and merge it with the current dataset. Recall, that the featured engineered data set contains Text features along with some newly created categorical features.

merged_df = pd.read_csv('merged_df.csv')
merged_df.head()
amenities_tv amenities_wifi amenities_air_conditioning amenities_kitchen amenities_paid_parking_off_premises amenities_free_street_parking amenities_indoor_fireplace amenities_heating amenities_family/kid_friendly amenities_smoke_detector ... description_contains_yankee description_contains_yard description_contains_year description_contains_yellow description_contains_yoga description_contains_york description_contains_young description_contains_yummy description_contains_zero description_contains_zone
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.00000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
1 0.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.00000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
2 0.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.00000 0.0 0.137645 0.000000 0.0 0.0 0.0 0.0
3 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.00000 0.0 0.000000 0.074083 0.0 0.0 0.0 0.0
4 0.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.18384 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

5 rows × 2151 columns

# check if there are any glaring outliers.
outlier_stats(merged_df)
count outliers num_outliers IQR mean std min 25% 50% 75% max
description_contains_home 45605.0 True 11384 0.0 0.026 0.054 0.0 0.0 0.0 0.0 0.916
description_contains_place 45605.0 True 11317 0.0 0.030 0.069 0.0 0.0 0.0 0.0 1.000
description_contains_city 45605.0 True 11249 0.0 0.024 0.048 0.0 0.0 0.0 0.0 0.596
host_verification_by_selfie 45605.0 True 11126 0.0 0.244 0.429 0.0 0.0 0.0 0.0 1.000
amenities_coffee_maker 45605.0 True 11087 0.0 0.243 0.429 0.0 0.0 0.0 0.0 1.000
description_contains_quiet 45605.0 True 11035 0.0 0.024 0.048 0.0 0.0 0.0 0.0 1.000
host_verification_by_kba 45605.0 True 10985 0.0 0.241 0.428 0.0 0.0 0.0 0.0 1.000
description_contains_brooklyn 45605.0 True 10794 0.0 0.029 0.063 0.0 0.0 0.0 0.0 0.679
amenities_cable_tv 45605.0 True 10706 0.0 0.235 0.424 0.0 0.0 0.0 0.0 1.000
description_contains_large 45605.0 True 10567 0.0 0.025 0.055 0.0 0.0 0.0 0.0 0.762
host_verification_by_facebook 45605.0 True 10428 0.0 0.229 0.420 0.0 0.0 0.0 0.0 1.000
description_contains_time 45605.0 True 10350 0.0 0.022 0.047 0.0 0.0 0.0 0.0 1.000
amenities_self_check-in 45605.0 True 10233 0.0 0.224 0.417 0.0 0.0 0.0 0.0 1.000
amenities_buzzer/wireless_intercom 45605.0 True 10076 0.0 0.221 0.415 0.0 0.0 0.0 0.0 1.000
host_verification_by_identity_manual 45605.0 True 10025 0.0 0.220 0.414 0.0 0.0 0.0 0.0 1.000
description_contains_beautiful 45605.0 True 9867 0.0 0.022 0.049 0.0 0.0 0.0 0.0 1.000
description_contains_need 45605.0 True 9819 0.0 0.020 0.044 0.0 0.0 0.0 0.0 0.421
description_contains_spacious 45605.0 True 9774 0.0 0.021 0.047 0.0 0.0 0.0 0.0 1.000
description_contains_clean 45605.0 True 9447 0.0 0.022 0.050 0.0 0.0 0.0 0.0 0.709
description_contains_line 45605.0 True 9067 0.0 0.022 0.051 0.0 0.0 0.0 0.0 1.000
amenities_long_term_stays_allowed 45605.0 True 8831 0.0 0.194 0.395 0.0 0.0 0.0 0.0 1.000
amenities_extra_pillows_and_blankets 45605.0 True 8798 0.0 0.193 0.395 0.0 0.0 0.0 0.0 1.000
description_contains_share 45605.0 True 8777 0.0 0.022 0.053 0.0 0.0 0.0 0.0 1.000
description_contains_include 45605.0 True 8605 0.0 0.018 0.043 0.0 0.0 0.0 0.0 0.489
description_contains_central 45605.0 True 8570 0.0 0.022 0.052 0.0 0.0 0.0 0.0 1.000
description_contains_comfortable 45605.0 True 8331 0.0 0.018 0.043 0.0 0.0 0.0 0.0 0.748
description_contains_walking 45605.0 True 8194 0.0 0.019 0.047 0.0 0.0 0.0 0.0 0.571
description_contains_fully 45605.0 True 8175 0.0 0.018 0.043 0.0 0.0 0.0 0.0 0.658
amenities_private_entrance 45605.0 True 8109 0.0 0.178 0.382 0.0 0.0 0.0 0.0 1.000
amenities_translation_missing:_en.hosting_amenity_50 45605.0 True 8104 0.0 0.178 0.382 0.0 0.0 0.0 0.0 1.000
... ... ... ... ... ... ... ... ... ... ... ...
amenities_ceiling_hoist 45605.0 True 3 0.0 0.000 0.008 0.0 0.0 0.0 0.0 1.000
amenities_fixed_grab_bars_for_shower.1 45605.0 True 1 0.0 0.000 0.005 0.0 0.0 0.0 0.0 1.000
amenities_lock_on_bedroom_door 45605.0 False 0 1.0 0.372 0.483 0.0 0.0 0.0 1.0 1.000
amenities_laptop_friendly_workspace 45605.0 False 0 1.0 0.623 0.485 0.0 0.0 1.0 1.0 1.000
amenities_iron 45605.0 False 0 1.0 0.607 0.488 0.0 0.0 1.0 1.0 1.000
amenities_hair_dryer 45605.0 False 0 1.0 0.649 0.477 0.0 0.0 1.0 1.0 1.000
amenities_hangers 45605.0 False 0 1.0 0.734 0.442 0.0 0.0 1.0 1.0 1.000
amenities_carbon_monoxide_detector 45605.0 False 0 1.0 0.689 0.463 0.0 0.0 1.0 1.0 1.000
amenities_shampoo 45605.0 False 0 1.0 0.671 0.470 0.0 0.0 1.0 1.0 1.000
amenities_fire_extinguisher 45605.0 False 0 1.0 0.367 0.482 0.0 0.0 0.0 1.0 1.000
host_verification_by_government_id 45605.0 False 0 1.0 0.601 0.490 0.0 0.0 1.0 1.0 1.000
amenities_family/kid_friendly 45605.0 False 0 1.0 0.269 0.443 0.0 0.0 0.0 1.0 1.000
host_verification_by_offline_government_id 45605.0 False 0 1.0 0.389 0.488 0.0 0.0 0.0 1.0 1.000
amenities_free_street_parking 45605.0 False 0 1.0 0.301 0.459 0.0 0.0 0.0 1.0 1.000
amenities_hot_water 45605.0 False 0 1.0 0.492 0.500 0.0 0.0 0.0 1.0 1.000
amenities_dishes_and_silverware 45605.0 False 0 1.0 0.292 0.455 0.0 0.0 0.0 1.0 1.000
amenities_bed_linens 45605.0 False 0 1.0 0.286 0.452 0.0 0.0 0.0 1.0 1.000
amenities_refrigerator 45605.0 False 0 1.0 0.327 0.469 0.0 0.0 0.0 1.0 1.000
amenities_cooking_basics 45605.0 False 0 1.0 0.273 0.446 0.0 0.0 0.0 1.0 1.000
amenities_oven 45605.0 False 0 1.0 0.283 0.451 0.0 0.0 0.0 1.0 1.000
amenities_stove 45605.0 False 0 1.0 0.287 0.452 0.0 0.0 0.0 1.0 1.000
amenities_internet 45605.0 False 0 1.0 0.306 0.461 0.0 0.0 0.0 1.0 1.000
amenities_elevator 45605.0 False 0 1.0 0.270 0.444 0.0 0.0 0.0 1.0 1.000
amenities_washer 45605.0 False 0 1.0 0.405 0.491 0.0 0.0 0.0 1.0 1.000
amenities_dryer 45605.0 False 0 1.0 0.397 0.489 0.0 0.0 0.0 1.0 1.000
amenities_microwave 45605.0 False 0 1.0 0.275 0.447 0.0 0.0 0.0 1.0 1.000
amenities_first_aid_kit 45605.0 False 0 1.0 0.327 0.469 0.0 0.0 0.0 1.0 1.000
host_verification_by_jumio 45605.0 False 0 1.0 0.456 0.498 0.0 0.0 0.0 1.0 1.000
host_verification_by_reviews 45605.0 False 0 1.0 0.722 0.448 0.0 0.0 1.0 1.0 1.000
amenities_tv 45605.0 False 0 1.0 0.671 0.470 0.0 0.0 1.0 1.0 1.000

2151 rows × 11 columns

As pointed out earlier, we will be using the MinMaxScaler as our preferred scaling option. We will be merging this with the feature engineered dataset.

# merge them both
final_df = pd.concat([min_max_scaled_df, merged_df], axis='columns')

Upload pre-processed dataset to S3

final_df.to_csv('min_max_scaled_final_df.csv', index=False)

# boto3 client to get S3 data
s3_client = boto3.client('s3')
bucket_name='skuchkula-sagemaker-airbnb'

# upload it to S3
s3_client.upload_file(Bucket=bucket_name, 
                      Filename='min_max_scaled_final_df.csv', 
                      Key='feature_eng/min_max_scaled_final_df.csv')

Conclusion

We have come a long way from the raw dataset to the final dataset which is ready for modeling. Along the way we dealt with missing values, incorrect data types, outliers, scaling and created several new features that will help us group Airbnb listings that are similar to each other. Our final data set consists of 45,604 listings, with each having 2201 features. Shown below is the shape and head of this final dataset.

final_df.shape
(45605, 2201)
final_df.head()
accommodates bathrooms bedrooms beds guests_included extra_people availability_30 availability_60 availability_90 number_of_reviews ... description_contains_yankee description_contains_yard description_contains_year description_contains_yellow description_contains_yoga description_contains_york description_contains_young description_contains_yummy description_contains_zero description_contains_zone
0 0.04 0.064516 0.000000 0.025 0.066667 0.000000 0.433333 0.283333 0.344444 0.071987 ... 0.0 0.0 0.00000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
1 0.04 0.064516 0.071429 0.025 0.066667 0.066667 1.000000 1.000000 1.000000 0.000000 ... 0.0 0.0 0.00000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
2 0.00 0.064516 0.071429 0.025 0.000000 0.066667 0.000000 0.000000 0.000000 0.014085 ... 0.0 0.0 0.00000 0.0 0.137645 0.000000 0.0 0.0 0.0 0.0
3 0.04 0.064516 0.071429 0.025 0.066667 0.333333 0.800000 0.550000 0.700000 0.117371 ... 0.0 0.0 0.00000 0.0 0.000000 0.074083 0.0 0.0 0.0 0.0
4 0.04 0.064516 0.071429 0.025 0.000000 0.100000 0.000000 0.000000 0.000000 0.076682 ... 0.0 0.0 0.18384 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

5 rows × 2201 columns