New York City Airbnb Data Cleaning

25 minute read

Cleaning NYC Airbnb detailed listings dataset

The goal of this notebook is to clean the raw airbnb dataset which resides on S3 in s3://skuchkula-sagemaker-airbnb/ location. The original dataset can be found here: Inside Airbnb. I used the New York City datasets for the month of August 2019.

Note: I have used a SageMaker notebook instance, but you could easily do this on your local environment.

# Import the necessary packages
# data managing and display libs
import pandas as pd
import numpy as np
import os
import io

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

# sagemaker libraries
import boto3
import sagemaker

Get data from S3

Due to the size of the dataset, I had to store them in S3 as github woudln’t allow large files. Using boto3 package we can directly read the dataset from S3 instead of bringing it locally. This is logic is captured in the function get_data_frame, which takes the bucket-name and file-name inside the bucket and returns a pandas dataframe.

# boto3 client to get S3 data
s3_client = boto3.client('s3')
bucket_name='skuchkula-sagemaker-airbnb'

List the files inside s3://skuchkula-sagemaker-airbnb/

# get a list of objects in the bucket
obj_list=s3_client.list_objects(Bucket=bucket_name)

# print object(s)in S3 bucket
files=[]
for contents in obj_list['Contents']:
    files.append(contents['Key'])

print(files)
['detailed_listings.csv', 'summary_listings.csv']

We will be using the detailed_listings.csv file instead of summary_listings.csv. The summary dataset is mainly useful for data visualization, whereas, the detailed listings dataset contains much more information about each listing, that we can use to build a machine learning model.

detailed_listings=files[0]
def get_data_frame(bucket_name, file_name):
    '''
        Takes the location of the dataset on S3 and returns a dataframe.
        arguments:
            bucket_name: the name of the bucket
            file_name: the key inside the bucket
    '''
    # get an S3 object by passing in the bucket and file name
    data_object = s3_client.get_object(Bucket=bucket_name, Key=file_name)

    # information is in the "Body" of the object
    data_body = data_object["Body"].read()

    # read in bytes data
    data_stream = io.BytesIO(data_body)

    # create a dataframe
    df = pd.read_csv(data_stream, header=0, delimiter=",", low_memory=False)

    return df
df_detailed_listings = get_data_frame(bucket_name, detailed_listings)
df_detailed_listings.head()
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2595 https://www.airbnb.com/rooms/2595 20190806030549 2019-08-07 Skylit Midtown Castle Find your romantic getaway to this beautiful, ... - Spacious (500+ft²), immaculate and nicely fu... Find your romantic getaway to this beautiful, ... none Centrally located in the heart of Manhattan ju... ... f f strict_14_with_grace_period t t 2 1 0 1 0.39
1 3647 https://www.airbnb.com/rooms/3647 20190806030549 2019-08-06 THE VILLAGE OF HARLEM....NEW YORK ! NaN WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... none NaN ... f f strict_14_with_grace_period t t 1 0 1 0 NaN
2 3831 https://www.airbnb.com/rooms/3831 20190806030549 2019-08-06 Cozy Entire Floor of Brownstone Urban retreat: enjoy 500 s.f. floor in 1899 br... Greetings! We own a double-duplex brownst... Urban retreat: enjoy 500 s.f. floor in 1899 br... none Just the right mix of urban center and local n... ... f f moderate f f 1 1 0 0 4.64
3 5022 https://www.airbnb.com/rooms/5022 20190806030549 2019-08-06 Entire Apt: Spacious Studio/Loft by central park NaN Loft apartment with high ceiling and wood floo... Loft apartment with high ceiling and wood floo... none NaN ... f f strict_14_with_grace_period t t 1 1 0 0 0.10
4 5099 https://www.airbnb.com/rooms/5099 20190806030549 2019-08-06 Large Cozy 1 BR Apartment In Midtown East My large 1 bedroom apartment is true New York ... I have a large 1 bedroom apartment centrally l... My large 1 bedroom apartment is true New York ... none My neighborhood in Midtown East is called Murr... ... f f strict_14_with_grace_period t t 1 1 0 0 0.60

5 rows × 106 columns

The Detailed listings contains about 106 different features for each of the Airbnb locations. We need to do some EDA to check if we can trim down some features.

print("Detailed Listings row, cols: ", df_detailed_listings.shape)
Detailed Listings row, cols:  (48864, 106)
# check how the data types are distributed.
# pandas does its best to interpret the datatype while reading in
# however it is our duty to check if the datatype makes sense
df_detailed_listings.dtypes.value_counts()
object     63
float64    22
int64      21
dtype: int64

There are about 48,864 airbnb listings and each of these listing has associated with it 106 features describing all aspects of the airbnb listing. However, some of these features might be having missing values, whereas some of them might not be relevant to Machine Learning. The data cleaning step will investigate and tackle these problems.

Data Cleaning

The focus of this notebook is on cleaning the data and storing the cleaned version back into S3. The strategy employed for cleaning the data is as follows:

  • Drop columns that are not relevant to the problem. Example: URL, host picture etc.
  • Find missing values for each column.
  • Drop columns which have more than 20% missing data. Make a note of the columns.
  • Convert columns to their correct data type.
  • Subset the dataset based on certain criterion. Ex: property_type = Apartment/House/Townhouse/Condo
  • One-hot-encode the categorical variables.
# Set the display properties so that we can inspect the data
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

When working with a dataset which contains many different types of columns, it is best to break them down by type and inspect the cleanliness of each type. We will start with Object types.

Drop columns

Drop irrelevant Object types

Using select_dtypes we can filter out the desired type of columns.

df_detailed_listings.select_dtypes(include=['object']).columns
Index(['listing_url', 'last_scraped', 'name', 'summary', 'space',
       'description', 'experiences_offered', 'neighborhood_overview', 'notes',
       'transit', 'access', 'interaction', 'house_rules', 'picture_url',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'is_location_exact',
       'property_type', 'room_type', 'bed_type', 'amenities', 'price',
       'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee',
       'extra_people', 'calendar_updated', 'has_availability',
       'calendar_last_scraped', 'first_review', 'last_review',
       'requires_license', 'license', 'jurisdiction_names', 'instant_bookable',
       'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification'],
      dtype='object')

By inspecting some values of object types, it becomes evident that some columns are just not that useful for our analysis. Example, the urls, last_scraped etc. Shown below are a list of columns that we can safely drop without much loss of information needed for our analysis. I have also displayed the head of the object columns to get a sense of why we can drop these columns.

drop_object_cols = ['listing_url',
             'last_scraped',
             'experiences_offered',
             'picture_url',
             'host_url',
             'host_name',
             'host_since',
             'host_location',
             'host_about',
             'host_thumbnail_url',
             'host_picture_url',
             'host_neighbourhood',
             'street',
             'neighbourhood',
             'city',
             'state',
             'zipcode',
             'market',
             'smart_location',
             'country_code',
             'country',
             'calendar_updated',
             'calendar_last_scraped',
             'first_review',
             'last_review'
            ]
df_detailed_listings.select_dtypes(include=['object']).head()
listing_url last_scraped name summary space description experiences_offered neighborhood_overview notes transit access interaction house_rules picture_url host_url host_name host_since host_location host_about host_response_time host_response_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_verifications host_has_profile_pic host_identity_verified street neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed city state zipcode market smart_location country_code country is_location_exact property_type room_type bed_type amenities price weekly_price monthly_price security_deposit cleaning_fee extra_people calendar_updated has_availability calendar_last_scraped first_review last_review requires_license license jurisdiction_names instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification
0 https://www.airbnb.com/rooms/2595 2019-08-07 Skylit Midtown Castle Find your romantic getaway to this beautiful, ... - Spacious (500+ft²), immaculate and nicely fu... Find your romantic getaway to this beautiful, ... none Centrally located in the heart of Manhattan ju... NaN Apartment is located on 37th Street between 5t... Guests have full access to the kitchen, bathro... I am a Sound Therapy Practitioner and Kundalin... Make yourself at home, respect the space and t... https://a0.muscache.com/im/pictures/f0813a11-4... https://www.airbnb.com/users/show/2845 Jennifer 2008-09-09 New York, New York, United States A New Yorker since 2000! My passion is creatin... within a few hours 90% f https://a0.muscache.com/im/users/2845/profile_... https://a0.muscache.com/im/users/2845/profile_... Midtown ['email', 'phone', 'reviews', 'kba', 'work_ema... t t New York, NY, United States Midtown Midtown Manhattan New York NY 10018 New York New York, NY US United States f Apartment Entire home/apt Real Bed {TV,Wifi,"Air conditioning",Kitchen,"Paid park... $225.00 $1,995.00 NaN $350.00 $100.00 $0.00 a week ago t 2019-08-07 2009-11-21 2019-07-14 f NaN NaN f f strict_14_with_grace_period t t
1 https://www.airbnb.com/rooms/3647 2019-08-06 THE VILLAGE OF HARLEM....NEW YORK ! NaN WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... none NaN NaN NaN NaN NaN Upon arrival please have a legibile copy of yo... https://a0.muscache.com/im/pictures/838341/9b3... https://www.airbnb.com/users/show/4632 Elisabeth 2008-11-25 New York, New York, United States Make Up Artist National/ (Website hidden by Ai... within a day 100% f https://a0.muscache.com/im/users/4632/profile_... https://a0.muscache.com/im/users/4632/profile_... Harlem ['email', 'phone', 'google', 'reviews', 'jumio... t t New York, NY, United States Harlem Harlem Manhattan New York NY 10027 New York New York, NY US United States t Apartment Private room Pull-out Sofa {"Cable TV",Internet,Wifi,"Air conditioning",K... $150.00 NaN NaN $200.00 $75.00 $20.00 35 months ago t 2019-08-06 NaN NaN f NaN NaN f f strict_14_with_grace_period t t
2 https://www.airbnb.com/rooms/3831 2019-08-06 Cozy Entire Floor of Brownstone Urban retreat: enjoy 500 s.f. floor in 1899 br... Greetings! We own a double-duplex brownst... Urban retreat: enjoy 500 s.f. floor in 1899 br... none Just the right mix of urban center and local n... NaN B52 bus for a 10-minute ride to downtown Brook... You will have exclusive use of and access to: ... We'll be around, but since you have the top fl... Smoking - outside please; pets allowed but ple... https://a0.muscache.com/im/pictures/e49999c2-9... https://www.airbnb.com/users/show/4869 LisaRoxanne 2008-12-07 New York, New York, United States Laid-back bi-coastal actor/professor/attorney. within an hour 90% f https://a0.muscache.com/im/users/4869/profile_... https://a0.muscache.com/im/users/4869/profile_... Clinton Hill ['email', 'phone', 'reviews', 'kba'] t t Brooklyn, NY, United States Brooklyn Clinton Hill Brooklyn Brooklyn NY 11238 New York Brooklyn, NY US United States t Guest suite Entire home/apt Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... $89.00 $575.00 $2,100.00 $500.00 NaN $0.00 today t 2019-08-06 2014-09-30 2019-07-26 f NaN NaN f f moderate f f
3 https://www.airbnb.com/rooms/5022 2019-08-06 Entire Apt: Spacious Studio/Loft by central park NaN Loft apartment with high ceiling and wood floo... Loft apartment with high ceiling and wood floo... none NaN NaN NaN NaN NaN Please be considerate when staying in the apar... https://a0.muscache.com/im/pictures/feb453bd-f... https://www.airbnb.com/users/show/7192 Laura 2009-01-29 Miami, Florida, United States I have been a NYer for almost 10 years. I came... NaN NaN f https://a0.muscache.com/im/users/7192/profile_... https://a0.muscache.com/im/users/7192/profile_... East Harlem ['email', 'phone', 'facebook', 'reviews', 'kba'] t t New York, NY, United States East Harlem East Harlem Manhattan New York NY 10029 New York New York, NY US United States t Apartment Entire home/apt Real Bed {Internet,Wifi,"Air conditioning",Kitchen,Elev... $80.00 $600.00 $1,600.00 $100.00 $80.00 $20.00 4 months ago t 2019-08-06 2012-03-20 2018-11-19 f NaN NaN f f strict_14_with_grace_period t t
4 https://www.airbnb.com/rooms/5099 2019-08-06 Large Cozy 1 BR Apartment In Midtown East My large 1 bedroom apartment is true New York ... I have a large 1 bedroom apartment centrally l... My large 1 bedroom apartment is true New York ... none My neighborhood in Midtown East is called Murr... Read My Full Listing For All Information. New ... From the apartment is a 10 minute walk to Gran... I will meet you upon arrival. I usually check in with guests via text or ema... • Check-in time is 2PM. • Check-out time is 12... https://a0.muscache.com/im/pictures/be2fdcf6-e... https://www.airbnb.com/users/show/7322 Chris 2009-02-02 New York, New York, United States I'm an artist, writer, traveler, and a native ... within a few hours 90% f https://a0.muscache.com/im/pictures/user/26745... https://a0.muscache.com/im/pictures/user/26745... Flatiron District ['email', 'phone', 'reviews', 'jumio', 'govern... t f New York, NY, United States Midtown East Murray Hill Manhattan New York NY 10016 New York New York, NY US United States f Apartment Entire home/apt Real Bed {TV,"Cable TV",Internet,Wifi,Kitchen,"Buzzer/w... $200.00 NaN NaN $300.00 $125.00 $100.00 4 days ago t 2019-08-06 2009-04-20 2019-07-21 f NaN NaN f f strict_14_with_grace_period t t

Drop irrelevant float64 types

df_detailed_listings.select_dtypes(include=['float64']).columns
Index(['thumbnail_url', 'medium_url', 'xl_picture_url', 'host_acceptance_rate',
       'host_listings_count', 'host_total_listings_count', 'latitude',
       'longitude', 'bathrooms', 'bedrooms', 'beds', 'square_feet',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month'],
      dtype='object')
df_detailed_listings.select_dtypes(include=['float64']).head()
thumbnail_url medium_url xl_picture_url host_acceptance_rate host_listings_count host_total_listings_count latitude longitude bathrooms bedrooms beds square_feet minimum_nights_avg_ntm maximum_nights_avg_ntm review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value reviews_per_month
0 NaN NaN NaN NaN 5.0 5.0 40.75362 -73.98377 1.0 0.0 1.0 NaN 1.0 1125.0 95.0 10.0 9.0 10.0 10.0 10.0 9.0 0.39
1 NaN NaN NaN NaN 1.0 1.0 40.80902 -73.94190 1.0 1.0 1.0 NaN 3.0 7.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 1.0 1.0 40.68514 -73.95976 1.0 1.0 4.0 500.0 1.0 730.0 90.0 10.0 9.0 10.0 9.0 10.0 9.0 4.64
3 NaN NaN NaN NaN 1.0 1.0 40.79851 -73.94399 1.0 NaN 1.0 NaN 10.0 120.0 93.0 10.0 9.0 10.0 10.0 9.0 10.0 0.10
4 NaN NaN NaN NaN 1.0 1.0 40.74767 -73.97500 1.0 1.0 1.0 NaN 3.0 21.0 89.0 10.0 9.0 10.0 10.0 9.0 9.0 0.60
drop_float_cols = ['thumbnail_url',
                   'medium_url',
                   'xl_picture_url',
                   'minimum_nights_avg_ntm',
                   'maximum_nights_avg_ntm'
                  ]

Drop irrelevant int64 types

df_detailed_listings.select_dtypes(include=['int64']).columns
Index(['id', 'scrape_id', 'host_id', 'accommodates', 'guests_included',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'availability_30', 'availability_60',
       'availability_90', 'availability_365', 'number_of_reviews',
       'number_of_reviews_ltm', 'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms'],
      dtype='object')
df_detailed_listings.select_dtypes(include=['int64']).head()
id scrape_id host_id accommodates guests_included minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights availability_30 availability_60 availability_90 availability_365 number_of_reviews number_of_reviews_ltm calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms
0 2595 20190806030549 2845 2 2 1 1125 1 1 1125 1125 13 17 31 288 46 12 2 1 0 1
1 3647 20190806030549 4632 2 2 3 7 3 3 7 7 30 60 90 365 0 0 1 0 1 0
2 3831 20190806030549 4869 3 1 1 730 1 1 730 730 0 1 4 212 274 70 1 1 0 0
3 5022 20190806030549 7192 1 1 10 120 10 10 120 120 0 0 0 0 9 2 1 1 0 0
4 5099 20190806030549 7322 2 2 3 21 3 3 21 21 24 33 63 127 75 10 1 1 0 0
drop_int_cols = ['scrape_id',
                 'host_id',
                 'minimum_minimum_nights',
                 'maximum_minimum_nights',
                 'minimum_maximum_nights',
                 'maximum_maximum_nights'
                ]
# combine all the columns and drop them
drop_cols = drop_object_cols + drop_float_cols + drop_int_cols
print("Shape of the dataset before dropping: ", df_detailed_listings.shape)
print("Dropping {} columns".format(len(drop_cols)))
df_dl_clean_df = df_detailed_listings.drop(columns=drop_cols)
print("Shape of the dataset after dropping: ", df_dl_clean_df.shape)
Shape of the dataset before dropping:  (48864, 106)
Dropping 36 columns
Shape of the dataset after dropping:  (48864, 70)

Missing Value Statistics

An important step in data cleaning is identifying columns with missing values. Let’s first calculate the missing value statistics.

def missing_statistics(df):
    '''
        Calculates missing value statistics for a given dataframe and
        returns a dataframe containing number of missing values per column
        and the percentage of values missing per column.
        arguments:
            df: the dataframe for which missing values need to be calculated.
    '''
    missing_stats = df.isnull().sum().to_frame()
    missing_stats.columns = ['num_missing']
    missing_stats['pct_missing'] = np.round(100 * (missing_stats['num_missing'] / df.shape[0]))
    missing_stats.sort_values(by='num_missing', ascending=False, inplace=True)

    return missing_stats
num_missing = missing_statistics(df_dl_clean_df)
num_missing
num_missing pct_missing
host_acceptance_rate 48864 100.0
jurisdiction_names 48853 100.0
license 48842 100.0
square_feet 48469 99.0
monthly_price 43715 89.0
weekly_price 42964 88.0
notes 28692 59.0
access 21916 45.0
interaction 19947 41.0
house_rules 18912 39.0
neighborhood_overview 17297 35.0
security_deposit 17290 35.0
transit 16975 35.0
host_response_rate 16582 34.0
host_response_time 16582 34.0
space 13985 29.0
review_scores_location 11163 23.0
review_scores_value 11161 23.0
review_scores_checkin 11158 23.0
review_scores_accuracy 11142 23.0
review_scores_communication 11136 23.0
review_scores_cleanliness 11126 23.0
review_scores_rating 11104 23.0
cleaning_fee 10584 22.0
reviews_per_month 10131 21.0
summary 2075 4.0
description 843 2.0
bathrooms 56 0.0
beds 42 0.0
bedrooms 27 0.0
host_identity_verified 18 0.0
host_has_profile_pic 18 0.0
host_total_listings_count 18 0.0
host_listings_count 18 0.0
host_is_superhost 18 0.0
name 16 0.0
cancellation_policy 1 0.0
neighbourhood_cleansed 0 0.0
is_business_travel_ready 0 0.0
instant_bookable 0 0.0
require_guest_profile_picture 0 0.0
require_guest_phone_verification 0 0.0
requires_license 0 0.0
calculated_host_listings_count 0 0.0
calculated_host_listings_count_entire_homes 0 0.0
calculated_host_listings_count_private_rooms 0 0.0
host_verifications 0 0.0
calculated_host_listings_count_shared_rooms 0 0.0
property_type 0 0.0
availability_365 0 0.0
number_of_reviews_ltm 0 0.0
number_of_reviews 0 0.0
accommodates 0 0.0
bed_type 0 0.0
amenities 0 0.0
is_location_exact 0 0.0
price 0 0.0
longitude 0 0.0
latitude 0 0.0
neighbourhood_group_cleansed 0 0.0
guests_included 0 0.0
extra_people 0 0.0
minimum_nights 0 0.0
maximum_nights 0 0.0
has_availability 0 0.0
availability_30 0 0.0
availability_60 0 0.0
availability_90 0 0.0
room_type 0 0.0
id 0 0.0

As we can see in the table above, few columns have 100% missing data. We will be dropping these columns and upon close observation, it makes sense to drop the columns which have more than 20% missing data.

cols_to_drop = list(num_missing[num_missing.pct_missing > 20].index)
cols_to_drop
['host_acceptance_rate',
 'jurisdiction_names',
 'license',
 'square_feet',
 'monthly_price',
 'weekly_price',
 'notes',
 'access',
 'interaction',
 'house_rules',
 'neighborhood_overview',
 'security_deposit',
 'transit',
 'host_response_rate',
 'host_response_time',
 'space',
 'review_scores_location',
 'review_scores_value',
 'review_scores_checkin',
 'review_scores_accuracy',
 'review_scores_communication',
 'review_scores_cleanliness',
 'review_scores_rating',
 'cleaning_fee',
 'reviews_per_month']
df_dl_clean_df = df_dl_clean_df.drop(cols_to_drop, axis='columns')
print(df_dl_clean_df.shape)
(48864, 45)

At this point, we have not fully dealt with missing values. But before we move forward with handling missing values, let’s first validate that each of the variable is of the correct type, i.e, categorical and boolean type. Representing all string columns as object type is not an effective approach. We need to convert the object types to categorical or boolean if necessary. Another advantage of converting them from object to categorical/boolean is that we can make use of pd.get_dummies() to easily one-hot-encode the variables.

Convert object types to categorical and boolean variables

Shown below is the head of all the object types in the dataset. By inspecting these variables, we can clearly see that some of these object types are actually categorical variables, some are boolean variables and some are text variables like summary and description. Our task is to identify categorical variables and separate them from text variables. Later, further down in the notebook, we will one-hot-encode all the categorical variables.

df_dl_clean_df.select_dtypes(['object']).head()
name summary description host_is_superhost host_verifications host_has_profile_pic host_identity_verified neighbourhood_cleansed neighbourhood_group_cleansed is_location_exact property_type room_type bed_type amenities price extra_people has_availability requires_license instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification
0 Skylit Midtown Castle Find your romantic getaway to this beautiful, ... Find your romantic getaway to this beautiful, ... f ['email', 'phone', 'reviews', 'kba', 'work_ema... t t Midtown Manhattan f Apartment Entire home/apt Real Bed {TV,Wifi,"Air conditioning",Kitchen,"Paid park... $225.00 $0.00 t f f f strict_14_with_grace_period t t
1 THE VILLAGE OF HARLEM....NEW YORK ! NaN WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... f ['email', 'phone', 'google', 'reviews', 'jumio... t t Harlem Manhattan t Apartment Private room Pull-out Sofa {"Cable TV",Internet,Wifi,"Air conditioning",K... $150.00 $20.00 t f f f strict_14_with_grace_period t t
2 Cozy Entire Floor of Brownstone Urban retreat: enjoy 500 s.f. floor in 1899 br... Urban retreat: enjoy 500 s.f. floor in 1899 br... f ['email', 'phone', 'reviews', 'kba'] t t Clinton Hill Brooklyn t Guest suite Entire home/apt Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... $89.00 $0.00 t f f f moderate f f
3 Entire Apt: Spacious Studio/Loft by central park NaN Loft apartment with high ceiling and wood floo... f ['email', 'phone', 'facebook', 'reviews', 'kba'] t t East Harlem Manhattan t Apartment Entire home/apt Real Bed {Internet,Wifi,"Air conditioning",Kitchen,Elev... $80.00 $20.00 t f f f strict_14_with_grace_period t t
4 Large Cozy 1 BR Apartment In Midtown East My large 1 bedroom apartment is true New York ... My large 1 bedroom apartment is true New York ... f ['email', 'phone', 'reviews', 'jumio', 'govern... t f Murray Hill Manhattan f Apartment Entire home/apt Real Bed {TV,"Cable TV",Internet,Wifi,Kitchen,"Buzzer/w... $200.00 $100.00 t f f f strict_14_with_grace_period t t
# collect all variables which need to be dummy encoded.
categorical_types = ['host_is_superhost',
                     'host_has_profile_pic',
                     'host_identity_verified',
                     'is_location_exact',
                     'property_type',
                     'room_type',
                     'bed_type',
                     'has_availability',
                     'requires_license',
                     'instant_bookable',
                     'is_business_travel_ready',
                     'cancellation_policy',
                     'require_guest_profile_picture',
                     'require_guest_phone_verification']
df_dl_clean_df[categorical_types].dtypes
host_is_superhost                   object
host_has_profile_pic                object
host_identity_verified              object
is_location_exact                   object
property_type                       object
room_type                           object
bed_type                            object
has_availability                    object
requires_license                    object
instant_bookable                    object
is_business_travel_ready            object
cancellation_policy                 object
require_guest_profile_picture       object
require_guest_phone_verification    object
dtype: object

As we can see, these columns are all of the type object. Representing them as categories is much more efficient.

df_dl_clean_df[categorical_types] = df_dl_clean_df[categorical_types].apply(lambda x: x.astype('category'),
                                                                                axis='rows')
df_dl_clean_df[categorical_types].dtypes
host_is_superhost                   category
host_has_profile_pic                category
host_identity_verified              category
is_location_exact                   category
property_type                       category
room_type                           category
bed_type                            category
has_availability                    category
requires_license                    category
instant_bookable                    category
is_business_travel_ready            category
cancellation_policy                 category
require_guest_profile_picture       category
require_guest_phone_verification    category
dtype: object

Subset the dataset

Checking for unique values in each category

Let’s check how many unique values each of these columns have. This can be achieved using the pd.Series.nunique method.

# Calculate number of unique values for each label: num_unique_labels
num_unique_values = df_dl_clean_df[categorical_types].apply(pd.Series.nunique, axis='rows')
print(num_unique_values)

# Plot number of unique values for each label
num_unique_values.plot(kind='bar')
# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()
host_is_superhost                    2
host_has_profile_pic                 2
host_identity_verified               2
is_location_exact                    2
property_type                       37
room_type                            3
bed_type                             5
has_availability                     1
requires_license                     2
instant_bookable                     2
is_business_travel_ready             1
cancellation_policy                  6
require_guest_profile_picture        2
require_guest_phone_verification     2
dtype: int64

png

It appears there are far too many property types in this data. Let’s check why that is. Looking at the below numbers, we can see that NYC has many peculiar properties, for instance, Treehouse, boat, Earth house etc. To stay focused, we will restrict our analysis to Apartment, House, Townhouse and Condominium.

df_dl_clean_df.property_type.value_counts()
Apartment                 38605
House                      3846
Townhouse                  1659
Condominium                1495
Loft                       1412
Serviced apartment          505
Guest suite                 363
Hotel                       227
Boutique hotel              190
Other                       118
Bed and breakfast            88
Resort                       72
Hostel                       62
Guesthouse                   56
Bungalow                     38
Villa                        28
Tiny house                   19
Aparthotel                   17
Boat                         13
Camper/RV                    10
Cottage                       7
Tent                          6
Earth house                   4
Cabin                         3
Houseboat                     3
Casa particular (Cuba)        2
Farm stay                     2
Cave                          2
Bus                           2
Barn                          2
Yurt                          2
Castle                        1
Nature lodge                  1
Lighthouse                    1
Dome house                    1
Timeshare                     1
Treehouse                     1
Name: property_type, dtype: int64

Since I am subsetting the dataset, I created a new one, so that we have a copy of the original dataset with all the property types.

df_dl_clean_df.to_csv("detailed_listings_without_subset_partially_cleaned.csv", index=False)
# Subset the data to include few property types
df_airbnb = df_dl_clean_df[df_dl_clean_df.property_type.isin(['Apartment', 'House', 'Townhouse', 'Condominium'])]
# let pandas know that to use a copy instead of a view
df_airbnb = df_airbnb.copy()

# We need to remove unused categories since we are using type category,
# the removed categories are retained with count of 0.
# so, we need to explicitly remove the unused categories.
df_airbnb.loc[:, 'property_type'] = df_airbnb.loc[:,'property_type'].cat.remove_unused_categories()
df_airbnb.property_type.value_counts()
Apartment      38605
House           3846
Townhouse       1659
Condominium     1495
Name: property_type, dtype: int64
df_airbnb.head()
id name summary description host_is_superhost host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified neighbourhood_cleansed neighbourhood_group_cleansed latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities price guests_included extra_people minimum_nights maximum_nights has_availability availability_30 availability_60 availability_90 availability_365 number_of_reviews number_of_reviews_ltm requires_license instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms
0 2595 Skylit Midtown Castle Find your romantic getaway to this beautiful, ... Find your romantic getaway to this beautiful, ... f 5.0 5.0 ['email', 'phone', 'reviews', 'kba', 'work_ema... t t Midtown Manhattan 40.75362 -73.98377 f Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed {TV,Wifi,"Air conditioning",Kitchen,"Paid park... $225.00 2 $0.00 1 1125 t 13 17 31 288 46 12 f f f strict_14_with_grace_period t t 2 1 0 1
1 3647 THE VILLAGE OF HARLEM....NEW YORK ! NaN WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... f 1.0 1.0 ['email', 'phone', 'google', 'reviews', 'jumio... t t Harlem Manhattan 40.80902 -73.94190 t Apartment Private room 2 1.0 1.0 1.0 Pull-out Sofa {"Cable TV",Internet,Wifi,"Air conditioning",K... $150.00 2 $20.00 3 7 t 30 60 90 365 0 0 f f f strict_14_with_grace_period t t 1 0 1 0
3 5022 Entire Apt: Spacious Studio/Loft by central park NaN Loft apartment with high ceiling and wood floo... f 1.0 1.0 ['email', 'phone', 'facebook', 'reviews', 'kba'] t t East Harlem Manhattan 40.79851 -73.94399 t Apartment Entire home/apt 1 1.0 NaN 1.0 Real Bed {Internet,Wifi,"Air conditioning",Kitchen,Elev... $80.00 1 $20.00 10 120 t 0 0 0 0 9 2 f f f strict_14_with_grace_period t t 1 1 0 0
4 5099 Large Cozy 1 BR Apartment In Midtown East My large 1 bedroom apartment is true New York ... My large 1 bedroom apartment is true New York ... f 1.0 1.0 ['email', 'phone', 'reviews', 'jumio', 'govern... t f Murray Hill Manhattan 40.74767 -73.97500 f Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed {TV,"Cable TV",Internet,Wifi,Kitchen,"Buzzer/w... $200.00 2 $100.00 3 21 t 24 33 63 127 75 10 f f f strict_14_with_grace_period t t 1 1 0 0
5 5121 BlissArtsSpace! NaN HELLO EVERYONE AND THANKS FOR VISITING BLISS A... f 1.0 1.0 ['email', 'phone', 'facebook', 'reviews', 'off... t f Bedford-Stuyvesant Brooklyn 40.68688 -73.95596 f Apartment Private room 2 NaN 1.0 1.0 Futon {Wifi,"Air conditioning",Kitchen,"Pets live on... $60.00 1 $30.00 45 730 t 0 0 0 0 49 0 f f f strict_14_with_grace_period f f 1 0 1 0

Missing Value Statistics for the subset

missing_df = missing_statistics(df_airbnb)
missing_df
num_missing pct_missing
summary 1938 4.0
description 797 2.0
bathrooms 55 0.0
beds 38 0.0
bedrooms 25 0.0
host_identity_verified 17 0.0
host_has_profile_pic 17 0.0
host_total_listings_count 17 0.0
host_listings_count 17 0.0
host_is_superhost 17 0.0
name 16 0.0
cancellation_policy 1 0.0
availability_60 0 0.0
availability_90 0 0.0
availability_365 0 0.0
number_of_reviews 0 0.0
number_of_reviews_ltm 0 0.0
id 0 0.0
requires_license 0 0.0
instant_bookable 0 0.0
is_business_travel_ready 0 0.0
has_availability 0 0.0
require_guest_profile_picture 0 0.0
require_guest_phone_verification 0 0.0
calculated_host_listings_count 0 0.0
calculated_host_listings_count_entire_homes 0 0.0
calculated_host_listings_count_private_rooms 0 0.0
availability_30 0 0.0
amenities 0 0.0
maximum_nights 0 0.0
minimum_nights 0 0.0
extra_people 0 0.0
guests_included 0 0.0
price 0 0.0
bed_type 0 0.0
accommodates 0 0.0
room_type 0 0.0
property_type 0 0.0
is_location_exact 0 0.0
longitude 0 0.0
latitude 0 0.0
neighbourhood_group_cleansed 0 0.0
neighbourhood_cleansed 0 0.0
host_verifications 0 0.0
calculated_host_listings_count_shared_rooms 0 0.0

The missing values for columns starting with host seem to have a pattern. Since all these columns have 17 missing values, it could be possible that the data is missing purpose.

# collect all the columns which have missing values
cols_missing_values = list(missing_df[missing_df.num_missing > 0].index)

df_airbnb_missing_values = df_airbnb[cols_missing_values]
df_airbnb_missing_values.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45605 entries, 0 to 48863
Data columns (total 12 columns):
summary                      43667 non-null object
description                  44808 non-null object
bathrooms                    45550 non-null float64
beds                         45567 non-null float64
bedrooms                     45580 non-null float64
host_identity_verified       45588 non-null category
host_has_profile_pic         45588 non-null category
host_total_listings_count    45588 non-null float64
host_listings_count          45588 non-null float64
host_is_superhost            45588 non-null category
name                         45589 non-null object
cancellation_policy          45604 non-null category
dtypes: category(4), float64(5), object(3)
memory usage: 3.3+ MB

Check if the 17 missing values of host columns are are missing together.

host_cols = list(df_airbnb_missing_values.columns[df_airbnb_missing_values.columns.str.contains('host')])
df_airbnb_missing_values[host_cols][df_airbnb_missing_values.host_identity_verified.isnull()]
host_identity_verified host_has_profile_pic host_total_listings_count host_listings_count host_is_superhost
2451 NaN NaN NaN NaN NaN
2715 NaN NaN NaN NaN NaN
5962 NaN NaN NaN NaN NaN
6198 NaN NaN NaN NaN NaN
6454 NaN NaN NaN NaN NaN
8108 NaN NaN NaN NaN NaN
8974 NaN NaN NaN NaN NaN
9642 NaN NaN NaN NaN NaN
10692 NaN NaN NaN NaN NaN
14184 NaN NaN NaN NaN NaN
14207 NaN NaN NaN NaN NaN
14652 NaN NaN NaN NaN NaN
18835 NaN NaN NaN NaN NaN
23104 NaN NaN NaN NaN NaN
23119 NaN NaN NaN NaN NaN
26552 NaN NaN NaN NaN NaN
45837 NaN NaN NaN NaN NaN

Yes, as suspected, these host columns have missing values for the same observations.

Handling Missing Values

  • For columns containing text, we will be replacing them with an empty string.
  • For categorical columns, we will be replacing missing values with the mode.
  • For continuous columns, we will be replacing the missing values with the median.

I will be using sklearn’s SimpleImputer to replace the missing values.

TEXT COLUMNS:

df_airbnb_clean = df_airbnb.copy(deep=True)
df_airbnb_clean.shape
(45605, 45)
# to avoid setcopywithwarning use .copy() - this will tell pandas use a copy and not a view.
df_airbnb_clean.loc[df_airbnb_clean.summary.isna().copy(), 'summary'] = ''
df_airbnb_clean.loc[df_airbnb_clean.description.isna().copy(), 'description'] = ''
df_airbnb_clean.loc[df_airbnb_clean.name.isna().copy(), 'name'] = ''

CATEGORICAL COLUMNS:

from sklearn.impute import SimpleImputer
category_missing_cols = ['host_has_profile_pic', 'host_identity_verified',
                         'host_is_superhost', 'cancellation_policy']

float_missing_cols = ['bathrooms', 'beds', 'bedrooms',
                      'host_total_listings_count',
                      'host_listings_count']

For validation purpose, lets first check the float_missing_cols. Example, check the bathrooms and bedrooms columns, you will notice there are some NaN values in there. If the missing value imputation is successful, then we should expect to see these NaN values replaced by their median values.

# get all the observations for float_missing_cols which contain na.
# we will use np.where to get the indices of these observations.
#index,_ = np.where(df_airbnb_clean[float_missing_cols].isna() == True)
#df_airbnb_clean.iloc[index, :]
# Store these indices, so that we can validate after imputation
#indexes = list(index)
#df_airbnb_clean.iloc[index, :]
def replace_missing_values(cols, df):
    '''
        Takes a list of columns and a dataframe and imputes based on
        the column type. If it is object type, then most_frequent value
        is used for imputation. If it is a float/int type, then the median
        value is used for imputation.
        arguments:
            cols: list of columns
            df : dataframe containing these columns.
        returns:
            df: the imputed dataframe
    '''
    for col in cols:
        if type(df[col].dtype) is pd.core.dtypes.dtypes.CategoricalDtype:
            print("Imputing {} column with most frequent value".format(col))
            mode_imputer = SimpleImputer(strategy='most_frequent')
            df.loc[:, col] = mode_imputer.fit_transform(df[[col]])
        elif df[col].dtype == 'float64' or df[col].dtype == 'int64':
            print("Imputing {} column with median value".format(col))
            median_imputer = SimpleImputer(strategy='median')
            df.loc[:, col] = median_imputer.fit_transform(df[[col]])
        else:
            raise ValueError("Invalid column type")

    return df
df_airbnb_clean = replace_missing_values(float_missing_cols, df_airbnb_clean)
df_airbnb_clean = replace_missing_values(category_missing_cols, df_airbnb_clean)
Imputing bathrooms column with median value
Imputing beds column with median value
Imputing bedrooms column with median value
Imputing host_total_listings_count column with median value
Imputing host_listings_count column with median value
Imputing host_has_profile_pic column with most frequent value
Imputing host_identity_verified column with most frequent value
Imputing host_is_superhost column with most frequent value
Imputing cancellation_policy column with most frequent value

After imputation, we have successfully removed all the missing values.

df_airbnb_clean.isna().sum()
id                                              0
name                                            0
summary                                         0
description                                     0
host_is_superhost                               0
host_listings_count                             0
host_total_listings_count                       0
host_verifications                              0
host_has_profile_pic                            0
host_identity_verified                          0
neighbourhood_cleansed                          0
neighbourhood_group_cleansed                    0
latitude                                        0
longitude                                       0
is_location_exact                               0
property_type                                   0
room_type                                       0
accommodates                                    0
bathrooms                                       0
bedrooms                                        0
beds                                            0
bed_type                                        0
amenities                                       0
price                                           0
guests_included                                 0
extra_people                                    0
minimum_nights                                  0
maximum_nights                                  0
has_availability                                0
availability_30                                 0
availability_60                                 0
availability_90                                 0
availability_365                                0
number_of_reviews                               0
number_of_reviews_ltm                           0
requires_license                                0
instant_bookable                                0
is_business_travel_ready                        0
cancellation_policy                             0
require_guest_profile_picture                   0
require_guest_phone_verification                0
calculated_host_listings_count                  0
calculated_host_listings_count_entire_homes     0
calculated_host_listings_count_private_rooms    0
calculated_host_listings_count_shared_rooms     0
dtype: int64

Clean the columns

Some columns like price are represented as objects since we have a dollar sign. Before we proceed further with our analysis, we need to clean these columns.

Cleaning price and extra_people

import re
df_airbnb_clean['price'] = df_airbnb_clean.price.apply(lambda x:re.sub(r'[$,]', '', x)).astype('float')
df_airbnb_clean['extra_people'] = df_airbnb_clean.extra_people.apply(lambda x:re.sub(r'[$,]', '', x)).astype('float')
df_airbnb_clean[['extra_people', 'price']].plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x7f246e3a9e10>

png

Price has quite many outliers, we will deal with outliers in the next stage.

One-hot encode the categorical variables

category_one_hot_encoding = pd.get_dummies(df_airbnb_clean[categorical_types])
category_one_hot_encoding.head()
host_is_superhost_f host_is_superhost_t host_has_profile_pic_f host_has_profile_pic_t host_identity_verified_f host_identity_verified_t is_location_exact_f is_location_exact_t property_type_Apartment property_type_Condominium property_type_House property_type_Townhouse room_type_Entire home/apt room_type_Private room room_type_Shared room bed_type_Airbed bed_type_Couch bed_type_Futon bed_type_Pull-out Sofa bed_type_Real Bed has_availability_t requires_license_f requires_license_t instant_bookable_f instant_bookable_t is_business_travel_ready_f cancellation_policy_flexible cancellation_policy_moderate cancellation_policy_strict cancellation_policy_strict_14_with_grace_period cancellation_policy_super_strict_30 cancellation_policy_super_strict_60 require_guest_profile_picture_f require_guest_profile_picture_t require_guest_phone_verification_f require_guest_phone_verification_t
0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1
1 1 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1
3 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1
4 1 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1
5 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 1 0 0 1 0 1 0
category_one_hot_encoding.shape
(45605, 36)
df_airbnb_clean.shape
(45605, 45)

We need to merge these two and drop the categorical variables

df_airbnb_clean = pd.concat([df_airbnb_clean, category_one_hot_encoding],
                            axis='columns').drop(categorical_types,
                                                 axis='columns')
df_airbnb_clean.shape
(45605, 67)
df_airbnb_clean.to_csv('airbnb_clean.csv', index=False)

Upload the cleaned file to S3

# upload the file to S3
s3_client.upload_file(Bucket=bucket_name, Filename='airbnb_clean.csv', Key='clean/airbnb_clean.csv')
response = s3_client.list_objects(Bucket=bucket_name)

for file in response['Contents']:
    print(file['Key'])
clean/airbnb_clean.csv
detailed_listings.csv
summary_listings.csv