New York City Airbnb Data Cleaning
Cleaning NYC Airbnb detailed listings dataset
The goal of this notebook is to clean the raw airbnb dataset which resides on S3 in s3://skuchkula-sagemaker-airbnb/
location. The original dataset can be found here: Inside Airbnb. I used the New York City datasets for the month of August 2019.
Note: I have used a SageMaker notebook instance, but you could easily do this on your local environment.
# Import the necessary packages
# data managing and display libs
import pandas as pd
import numpy as np
import os
import io
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
# sagemaker libraries
import boto3
import sagemaker
Get data from S3
Due to the size of the dataset, I had to store them in S3 as github woudln’t allow large files. Using boto3
package we can directly read the dataset from S3 instead of bringing it locally. This is logic is captured in the function get_data_frame
, which takes the bucket-name and file-name inside the bucket and returns a pandas dataframe.
# boto3 client to get S3 data
s3_client = boto3.client('s3')
bucket_name='skuchkula-sagemaker-airbnb'
List the files inside s3://skuchkula-sagemaker-airbnb/
# get a list of objects in the bucket
obj_list=s3_client.list_objects(Bucket=bucket_name)
# print object(s)in S3 bucket
files=[]
for contents in obj_list['Contents']:
files.append(contents['Key'])
print(files)
['detailed_listings.csv', 'summary_listings.csv']
We will be using the detailed_listings.csv
file instead of summary_listings.csv
. The summary dataset is mainly useful for data visualization, whereas, the detailed listings dataset contains much more information about each listing, that we can use to build a machine learning model.
detailed_listings=files[0]
def get_data_frame(bucket_name, file_name):
'''
Takes the location of the dataset on S3 and returns a dataframe.
arguments:
bucket_name: the name of the bucket
file_name: the key inside the bucket
'''
# get an S3 object by passing in the bucket and file name
data_object = s3_client.get_object(Bucket=bucket_name, Key=file_name)
# information is in the "Body" of the object
data_body = data_object["Body"].read()
# read in bytes data
data_stream = io.BytesIO(data_body)
# create a dataframe
df = pd.read_csv(data_stream, header=0, delimiter=",", low_memory=False)
return df
df_detailed_listings = get_data_frame(bucket_name, detailed_listings)
df_detailed_listings.head()
id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | instant_bookable | is_business_travel_ready | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2595 | https://www.airbnb.com/rooms/2595 | 20190806030549 | 2019-08-07 | Skylit Midtown Castle | Find your romantic getaway to this beautiful, ... | - Spacious (500+ft²), immaculate and nicely fu... | Find your romantic getaway to this beautiful, ... | none | Centrally located in the heart of Manhattan ju... | ... | f | f | strict_14_with_grace_period | t | t | 2 | 1 | 0 | 1 | 0.39 |
1 | 3647 | https://www.airbnb.com/rooms/3647 | 20190806030549 | 2019-08-06 | THE VILLAGE OF HARLEM....NEW YORK ! | NaN | WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... | WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... | none | NaN | ... | f | f | strict_14_with_grace_period | t | t | 1 | 0 | 1 | 0 | NaN |
2 | 3831 | https://www.airbnb.com/rooms/3831 | 20190806030549 | 2019-08-06 | Cozy Entire Floor of Brownstone | Urban retreat: enjoy 500 s.f. floor in 1899 br... | Greetings! We own a double-duplex brownst... | Urban retreat: enjoy 500 s.f. floor in 1899 br... | none | Just the right mix of urban center and local n... | ... | f | f | moderate | f | f | 1 | 1 | 0 | 0 | 4.64 |
3 | 5022 | https://www.airbnb.com/rooms/5022 | 20190806030549 | 2019-08-06 | Entire Apt: Spacious Studio/Loft by central park | NaN | Loft apartment with high ceiling and wood floo... | Loft apartment with high ceiling and wood floo... | none | NaN | ... | f | f | strict_14_with_grace_period | t | t | 1 | 1 | 0 | 0 | 0.10 |
4 | 5099 | https://www.airbnb.com/rooms/5099 | 20190806030549 | 2019-08-06 | Large Cozy 1 BR Apartment In Midtown East | My large 1 bedroom apartment is true New York ... | I have a large 1 bedroom apartment centrally l... | My large 1 bedroom apartment is true New York ... | none | My neighborhood in Midtown East is called Murr... | ... | f | f | strict_14_with_grace_period | t | t | 1 | 1 | 0 | 0 | 0.60 |
5 rows × 106 columns
The Detailed listings contains about 106 different features for each of the Airbnb locations. We need to do some EDA to check if we can trim down some features.
print("Detailed Listings row, cols: ", df_detailed_listings.shape)
Detailed Listings row, cols: (48864, 106)
# check how the data types are distributed.
# pandas does its best to interpret the datatype while reading in
# however it is our duty to check if the datatype makes sense
df_detailed_listings.dtypes.value_counts()
object 63
float64 22
int64 21
dtype: int64
There are about 48,864 airbnb listings and each of these listing has associated with it 106 features describing all aspects of the airbnb listing. However, some of these features might be having missing values, whereas some of them might not be relevant to Machine Learning. The data cleaning step will investigate and tackle these problems.
Data Cleaning
The focus of this notebook is on cleaning the data and storing the cleaned version back into S3. The strategy employed for cleaning the data is as follows:
- Drop columns that are not relevant to the problem. Example: URL, host picture etc.
- Find missing values for each column.
- Drop columns which have more than 20% missing data. Make a note of the columns.
- Convert columns to their correct data type.
- Subset the dataset based on certain criterion. Ex: property_type = Apartment/House/Townhouse/Condo
- One-hot-encode the categorical variables.
# Set the display properties so that we can inspect the data
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
When working with a dataset which contains many different types of columns, it is best to break them down by type and inspect the cleanliness of each type. We will start with Object types.
Drop columns
Drop irrelevant Object types
Using select_dtypes
we can filter out the desired type of columns.
df_detailed_listings.select_dtypes(include=['object']).columns
Index(['listing_url', 'last_scraped', 'name', 'summary', 'space',
'description', 'experiences_offered', 'neighborhood_overview', 'notes',
'transit', 'access', 'interaction', 'house_rules', 'picture_url',
'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
'host_response_time', 'host_response_rate', 'host_is_superhost',
'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
'street', 'neighbourhood', 'neighbourhood_cleansed',
'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
'smart_location', 'country_code', 'country', 'is_location_exact',
'property_type', 'room_type', 'bed_type', 'amenities', 'price',
'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee',
'extra_people', 'calendar_updated', 'has_availability',
'calendar_last_scraped', 'first_review', 'last_review',
'requires_license', 'license', 'jurisdiction_names', 'instant_bookable',
'is_business_travel_ready', 'cancellation_policy',
'require_guest_profile_picture', 'require_guest_phone_verification'],
dtype='object')
By inspecting some values of object types, it becomes evident that some columns are just not that useful for our analysis. Example, the urls, last_scraped etc. Shown below are a list of columns that we can safely drop without much loss of information needed for our analysis. I have also displayed the head of the object columns to get a sense of why we can drop these columns.
drop_object_cols = ['listing_url',
'last_scraped',
'experiences_offered',
'picture_url',
'host_url',
'host_name',
'host_since',
'host_location',
'host_about',
'host_thumbnail_url',
'host_picture_url',
'host_neighbourhood',
'street',
'neighbourhood',
'city',
'state',
'zipcode',
'market',
'smart_location',
'country_code',
'country',
'calendar_updated',
'calendar_last_scraped',
'first_review',
'last_review'
]
df_detailed_listings.select_dtypes(include=['object']).head()
listing_url | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | notes | transit | access | interaction | house_rules | picture_url | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_verifications | host_has_profile_pic | host_identity_verified | street | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | city | state | zipcode | market | smart_location | country_code | country | is_location_exact | property_type | room_type | bed_type | amenities | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people | calendar_updated | has_availability | calendar_last_scraped | first_review | last_review | requires_license | license | jurisdiction_names | instant_bookable | is_business_travel_ready | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.airbnb.com/rooms/2595 | 2019-08-07 | Skylit Midtown Castle | Find your romantic getaway to this beautiful, ... | - Spacious (500+ft²), immaculate and nicely fu... | Find your romantic getaway to this beautiful, ... | none | Centrally located in the heart of Manhattan ju... | NaN | Apartment is located on 37th Street between 5t... | Guests have full access to the kitchen, bathro... | I am a Sound Therapy Practitioner and Kundalin... | Make yourself at home, respect the space and t... | https://a0.muscache.com/im/pictures/f0813a11-4... | https://www.airbnb.com/users/show/2845 | Jennifer | 2008-09-09 | New York, New York, United States | A New Yorker since 2000! My passion is creatin... | within a few hours | 90% | f | https://a0.muscache.com/im/users/2845/profile_... | https://a0.muscache.com/im/users/2845/profile_... | Midtown | ['email', 'phone', 'reviews', 'kba', 'work_ema... | t | t | New York, NY, United States | Midtown | Midtown | Manhattan | New York | NY | 10018 | New York | New York, NY | US | United States | f | Apartment | Entire home/apt | Real Bed | {TV,Wifi,"Air conditioning",Kitchen,"Paid park... | $225.00 | $1,995.00 | NaN | $350.00 | $100.00 | $0.00 | a week ago | t | 2019-08-07 | 2009-11-21 | 2019-07-14 | f | NaN | NaN | f | f | strict_14_with_grace_period | t | t |
1 | https://www.airbnb.com/rooms/3647 | 2019-08-06 | THE VILLAGE OF HARLEM....NEW YORK ! | NaN | WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... | WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... | none | NaN | NaN | NaN | NaN | NaN | Upon arrival please have a legibile copy of yo... | https://a0.muscache.com/im/pictures/838341/9b3... | https://www.airbnb.com/users/show/4632 | Elisabeth | 2008-11-25 | New York, New York, United States | Make Up Artist National/ (Website hidden by Ai... | within a day | 100% | f | https://a0.muscache.com/im/users/4632/profile_... | https://a0.muscache.com/im/users/4632/profile_... | Harlem | ['email', 'phone', 'google', 'reviews', 'jumio... | t | t | New York, NY, United States | Harlem | Harlem | Manhattan | New York | NY | 10027 | New York | New York, NY | US | United States | t | Apartment | Private room | Pull-out Sofa | {"Cable TV",Internet,Wifi,"Air conditioning",K... | $150.00 | NaN | NaN | $200.00 | $75.00 | $20.00 | 35 months ago | t | 2019-08-06 | NaN | NaN | f | NaN | NaN | f | f | strict_14_with_grace_period | t | t |
2 | https://www.airbnb.com/rooms/3831 | 2019-08-06 | Cozy Entire Floor of Brownstone | Urban retreat: enjoy 500 s.f. floor in 1899 br... | Greetings! We own a double-duplex brownst... | Urban retreat: enjoy 500 s.f. floor in 1899 br... | none | Just the right mix of urban center and local n... | NaN | B52 bus for a 10-minute ride to downtown Brook... | You will have exclusive use of and access to: ... | We'll be around, but since you have the top fl... | Smoking - outside please; pets allowed but ple... | https://a0.muscache.com/im/pictures/e49999c2-9... | https://www.airbnb.com/users/show/4869 | LisaRoxanne | 2008-12-07 | New York, New York, United States | Laid-back bi-coastal actor/professor/attorney. | within an hour | 90% | f | https://a0.muscache.com/im/users/4869/profile_... | https://a0.muscache.com/im/users/4869/profile_... | Clinton Hill | ['email', 'phone', 'reviews', 'kba'] | t | t | Brooklyn, NY, United States | Brooklyn | Clinton Hill | Brooklyn | Brooklyn | NY | 11238 | New York | Brooklyn, NY | US | United States | t | Guest suite | Entire home/apt | Real Bed | {TV,"Cable TV",Internet,Wifi,"Air conditioning... | $89.00 | $575.00 | $2,100.00 | $500.00 | NaN | $0.00 | today | t | 2019-08-06 | 2014-09-30 | 2019-07-26 | f | NaN | NaN | f | f | moderate | f | f |
3 | https://www.airbnb.com/rooms/5022 | 2019-08-06 | Entire Apt: Spacious Studio/Loft by central park | NaN | Loft apartment with high ceiling and wood floo... | Loft apartment with high ceiling and wood floo... | none | NaN | NaN | NaN | NaN | NaN | Please be considerate when staying in the apar... | https://a0.muscache.com/im/pictures/feb453bd-f... | https://www.airbnb.com/users/show/7192 | Laura | 2009-01-29 | Miami, Florida, United States | I have been a NYer for almost 10 years. I came... | NaN | NaN | f | https://a0.muscache.com/im/users/7192/profile_... | https://a0.muscache.com/im/users/7192/profile_... | East Harlem | ['email', 'phone', 'facebook', 'reviews', 'kba'] | t | t | New York, NY, United States | East Harlem | East Harlem | Manhattan | New York | NY | 10029 | New York | New York, NY | US | United States | t | Apartment | Entire home/apt | Real Bed | {Internet,Wifi,"Air conditioning",Kitchen,Elev... | $80.00 | $600.00 | $1,600.00 | $100.00 | $80.00 | $20.00 | 4 months ago | t | 2019-08-06 | 2012-03-20 | 2018-11-19 | f | NaN | NaN | f | f | strict_14_with_grace_period | t | t |
4 | https://www.airbnb.com/rooms/5099 | 2019-08-06 | Large Cozy 1 BR Apartment In Midtown East | My large 1 bedroom apartment is true New York ... | I have a large 1 bedroom apartment centrally l... | My large 1 bedroom apartment is true New York ... | none | My neighborhood in Midtown East is called Murr... | Read My Full Listing For All Information. New ... | From the apartment is a 10 minute walk to Gran... | I will meet you upon arrival. | I usually check in with guests via text or ema... | • Check-in time is 2PM. • Check-out time is 12... | https://a0.muscache.com/im/pictures/be2fdcf6-e... | https://www.airbnb.com/users/show/7322 | Chris | 2009-02-02 | New York, New York, United States | I'm an artist, writer, traveler, and a native ... | within a few hours | 90% | f | https://a0.muscache.com/im/pictures/user/26745... | https://a0.muscache.com/im/pictures/user/26745... | Flatiron District | ['email', 'phone', 'reviews', 'jumio', 'govern... | t | f | New York, NY, United States | Midtown East | Murray Hill | Manhattan | New York | NY | 10016 | New York | New York, NY | US | United States | f | Apartment | Entire home/apt | Real Bed | {TV,"Cable TV",Internet,Wifi,Kitchen,"Buzzer/w... | $200.00 | NaN | NaN | $300.00 | $125.00 | $100.00 | 4 days ago | t | 2019-08-06 | 2009-04-20 | 2019-07-21 | f | NaN | NaN | f | f | strict_14_with_grace_period | t | t |
Drop irrelevant float64 types
df_detailed_listings.select_dtypes(include=['float64']).columns
Index(['thumbnail_url', 'medium_url', 'xl_picture_url', 'host_acceptance_rate',
'host_listings_count', 'host_total_listings_count', 'latitude',
'longitude', 'bathrooms', 'bedrooms', 'beds', 'square_feet',
'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'reviews_per_month'],
dtype='object')
df_detailed_listings.select_dtypes(include=['float64']).head()
thumbnail_url | medium_url | xl_picture_url | host_acceptance_rate | host_listings_count | host_total_listings_count | latitude | longitude | bathrooms | bedrooms | beds | square_feet | minimum_nights_avg_ntm | maximum_nights_avg_ntm | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | 5.0 | 5.0 | 40.75362 | -73.98377 | 1.0 | 0.0 | 1.0 | NaN | 1.0 | 1125.0 | 95.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 9.0 | 0.39 |
1 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 40.80902 | -73.94190 | 1.0 | 1.0 | 1.0 | NaN | 3.0 | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 40.68514 | -73.95976 | 1.0 | 1.0 | 4.0 | 500.0 | 1.0 | 730.0 | 90.0 | 10.0 | 9.0 | 10.0 | 9.0 | 10.0 | 9.0 | 4.64 |
3 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 40.79851 | -73.94399 | 1.0 | NaN | 1.0 | NaN | 10.0 | 120.0 | 93.0 | 10.0 | 9.0 | 10.0 | 10.0 | 9.0 | 10.0 | 0.10 |
4 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 40.74767 | -73.97500 | 1.0 | 1.0 | 1.0 | NaN | 3.0 | 21.0 | 89.0 | 10.0 | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 | 0.60 |
drop_float_cols = ['thumbnail_url',
'medium_url',
'xl_picture_url',
'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm'
]
Drop irrelevant int64 types
df_detailed_listings.select_dtypes(include=['int64']).columns
Index(['id', 'scrape_id', 'host_id', 'accommodates', 'guests_included',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'availability_30', 'availability_60',
'availability_90', 'availability_365', 'number_of_reviews',
'number_of_reviews_ltm', 'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms'],
dtype='object')
df_detailed_listings.select_dtypes(include=['int64']).head()
id | scrape_id | host_id | accommodates | guests_included | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2595 | 20190806030549 | 2845 | 2 | 2 | 1 | 1125 | 1 | 1 | 1125 | 1125 | 13 | 17 | 31 | 288 | 46 | 12 | 2 | 1 | 0 | 1 |
1 | 3647 | 20190806030549 | 4632 | 2 | 2 | 3 | 7 | 3 | 3 | 7 | 7 | 30 | 60 | 90 | 365 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 3831 | 20190806030549 | 4869 | 3 | 1 | 1 | 730 | 1 | 1 | 730 | 730 | 0 | 1 | 4 | 212 | 274 | 70 | 1 | 1 | 0 | 0 |
3 | 5022 | 20190806030549 | 7192 | 1 | 1 | 10 | 120 | 10 | 10 | 120 | 120 | 0 | 0 | 0 | 0 | 9 | 2 | 1 | 1 | 0 | 0 |
4 | 5099 | 20190806030549 | 7322 | 2 | 2 | 3 | 21 | 3 | 3 | 21 | 21 | 24 | 33 | 63 | 127 | 75 | 10 | 1 | 1 | 0 | 0 |
drop_int_cols = ['scrape_id',
'host_id',
'minimum_minimum_nights',
'maximum_minimum_nights',
'minimum_maximum_nights',
'maximum_maximum_nights'
]
# combine all the columns and drop them
drop_cols = drop_object_cols + drop_float_cols + drop_int_cols
print("Shape of the dataset before dropping: ", df_detailed_listings.shape)
print("Dropping {} columns".format(len(drop_cols)))
df_dl_clean_df = df_detailed_listings.drop(columns=drop_cols)
print("Shape of the dataset after dropping: ", df_dl_clean_df.shape)
Shape of the dataset before dropping: (48864, 106)
Dropping 36 columns
Shape of the dataset after dropping: (48864, 70)
Missing Value Statistics
An important step in data cleaning is identifying columns with missing values. Let’s first calculate the missing value statistics.
def missing_statistics(df):
'''
Calculates missing value statistics for a given dataframe and
returns a dataframe containing number of missing values per column
and the percentage of values missing per column.
arguments:
df: the dataframe for which missing values need to be calculated.
'''
missing_stats = df.isnull().sum().to_frame()
missing_stats.columns = ['num_missing']
missing_stats['pct_missing'] = np.round(100 * (missing_stats['num_missing'] / df.shape[0]))
missing_stats.sort_values(by='num_missing', ascending=False, inplace=True)
return missing_stats
num_missing = missing_statistics(df_dl_clean_df)
num_missing
num_missing | pct_missing | |
---|---|---|
host_acceptance_rate | 48864 | 100.0 |
jurisdiction_names | 48853 | 100.0 |
license | 48842 | 100.0 |
square_feet | 48469 | 99.0 |
monthly_price | 43715 | 89.0 |
weekly_price | 42964 | 88.0 |
notes | 28692 | 59.0 |
access | 21916 | 45.0 |
interaction | 19947 | 41.0 |
house_rules | 18912 | 39.0 |
neighborhood_overview | 17297 | 35.0 |
security_deposit | 17290 | 35.0 |
transit | 16975 | 35.0 |
host_response_rate | 16582 | 34.0 |
host_response_time | 16582 | 34.0 |
space | 13985 | 29.0 |
review_scores_location | 11163 | 23.0 |
review_scores_value | 11161 | 23.0 |
review_scores_checkin | 11158 | 23.0 |
review_scores_accuracy | 11142 | 23.0 |
review_scores_communication | 11136 | 23.0 |
review_scores_cleanliness | 11126 | 23.0 |
review_scores_rating | 11104 | 23.0 |
cleaning_fee | 10584 | 22.0 |
reviews_per_month | 10131 | 21.0 |
summary | 2075 | 4.0 |
description | 843 | 2.0 |
bathrooms | 56 | 0.0 |
beds | 42 | 0.0 |
bedrooms | 27 | 0.0 |
host_identity_verified | 18 | 0.0 |
host_has_profile_pic | 18 | 0.0 |
host_total_listings_count | 18 | 0.0 |
host_listings_count | 18 | 0.0 |
host_is_superhost | 18 | 0.0 |
name | 16 | 0.0 |
cancellation_policy | 1 | 0.0 |
neighbourhood_cleansed | 0 | 0.0 |
is_business_travel_ready | 0 | 0.0 |
instant_bookable | 0 | 0.0 |
require_guest_profile_picture | 0 | 0.0 |
require_guest_phone_verification | 0 | 0.0 |
requires_license | 0 | 0.0 |
calculated_host_listings_count | 0 | 0.0 |
calculated_host_listings_count_entire_homes | 0 | 0.0 |
calculated_host_listings_count_private_rooms | 0 | 0.0 |
host_verifications | 0 | 0.0 |
calculated_host_listings_count_shared_rooms | 0 | 0.0 |
property_type | 0 | 0.0 |
availability_365 | 0 | 0.0 |
number_of_reviews_ltm | 0 | 0.0 |
number_of_reviews | 0 | 0.0 |
accommodates | 0 | 0.0 |
bed_type | 0 | 0.0 |
amenities | 0 | 0.0 |
is_location_exact | 0 | 0.0 |
price | 0 | 0.0 |
longitude | 0 | 0.0 |
latitude | 0 | 0.0 |
neighbourhood_group_cleansed | 0 | 0.0 |
guests_included | 0 | 0.0 |
extra_people | 0 | 0.0 |
minimum_nights | 0 | 0.0 |
maximum_nights | 0 | 0.0 |
has_availability | 0 | 0.0 |
availability_30 | 0 | 0.0 |
availability_60 | 0 | 0.0 |
availability_90 | 0 | 0.0 |
room_type | 0 | 0.0 |
id | 0 | 0.0 |
As we can see in the table above, few columns have 100% missing data. We will be dropping these columns and upon close observation, it makes sense to drop the columns which have more than 20% missing data.
cols_to_drop = list(num_missing[num_missing.pct_missing > 20].index)
cols_to_drop
['host_acceptance_rate',
'jurisdiction_names',
'license',
'square_feet',
'monthly_price',
'weekly_price',
'notes',
'access',
'interaction',
'house_rules',
'neighborhood_overview',
'security_deposit',
'transit',
'host_response_rate',
'host_response_time',
'space',
'review_scores_location',
'review_scores_value',
'review_scores_checkin',
'review_scores_accuracy',
'review_scores_communication',
'review_scores_cleanliness',
'review_scores_rating',
'cleaning_fee',
'reviews_per_month']
df_dl_clean_df = df_dl_clean_df.drop(cols_to_drop, axis='columns')
print(df_dl_clean_df.shape)
(48864, 45)
At this point, we have not fully dealt with missing values. But before we move forward with handling missing values, let’s first validate that each of the variable is of the correct type, i.e, categorical and boolean type. Representing all string columns as object type is not an effective approach. We need to convert the object types to categorical or boolean if necessary. Another advantage of converting them from object to categorical/boolean is that we can make use of pd.get_dummies()
to easily one-hot-encode the variables.
Convert object types to categorical and boolean variables
Shown below is the head of all the object types in the dataset. By inspecting these variables, we can clearly see that some of these object types are actually categorical variables, some are boolean variables and some are text variables like summary and description. Our task is to identify categorical variables and separate them from text variables. Later, further down in the notebook, we will one-hot-encode all the categorical variables.
df_dl_clean_df.select_dtypes(['object']).head()
name | summary | description | host_is_superhost | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood_cleansed | neighbourhood_group_cleansed | is_location_exact | property_type | room_type | bed_type | amenities | price | extra_people | has_availability | requires_license | instant_bookable | is_business_travel_ready | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Skylit Midtown Castle | Find your romantic getaway to this beautiful, ... | Find your romantic getaway to this beautiful, ... | f | ['email', 'phone', 'reviews', 'kba', 'work_ema... | t | t | Midtown | Manhattan | f | Apartment | Entire home/apt | Real Bed | {TV,Wifi,"Air conditioning",Kitchen,"Paid park... | $225.00 | $0.00 | t | f | f | f | strict_14_with_grace_period | t | t |
1 | THE VILLAGE OF HARLEM....NEW YORK ! | NaN | WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... | f | ['email', 'phone', 'google', 'reviews', 'jumio... | t | t | Harlem | Manhattan | t | Apartment | Private room | Pull-out Sofa | {"Cable TV",Internet,Wifi,"Air conditioning",K... | $150.00 | $20.00 | t | f | f | f | strict_14_with_grace_period | t | t |
2 | Cozy Entire Floor of Brownstone | Urban retreat: enjoy 500 s.f. floor in 1899 br... | Urban retreat: enjoy 500 s.f. floor in 1899 br... | f | ['email', 'phone', 'reviews', 'kba'] | t | t | Clinton Hill | Brooklyn | t | Guest suite | Entire home/apt | Real Bed | {TV,"Cable TV",Internet,Wifi,"Air conditioning... | $89.00 | $0.00 | t | f | f | f | moderate | f | f |
3 | Entire Apt: Spacious Studio/Loft by central park | NaN | Loft apartment with high ceiling and wood floo... | f | ['email', 'phone', 'facebook', 'reviews', 'kba'] | t | t | East Harlem | Manhattan | t | Apartment | Entire home/apt | Real Bed | {Internet,Wifi,"Air conditioning",Kitchen,Elev... | $80.00 | $20.00 | t | f | f | f | strict_14_with_grace_period | t | t |
4 | Large Cozy 1 BR Apartment In Midtown East | My large 1 bedroom apartment is true New York ... | My large 1 bedroom apartment is true New York ... | f | ['email', 'phone', 'reviews', 'jumio', 'govern... | t | f | Murray Hill | Manhattan | f | Apartment | Entire home/apt | Real Bed | {TV,"Cable TV",Internet,Wifi,Kitchen,"Buzzer/w... | $200.00 | $100.00 | t | f | f | f | strict_14_with_grace_period | t | t |
# collect all variables which need to be dummy encoded.
categorical_types = ['host_is_superhost',
'host_has_profile_pic',
'host_identity_verified',
'is_location_exact',
'property_type',
'room_type',
'bed_type',
'has_availability',
'requires_license',
'instant_bookable',
'is_business_travel_ready',
'cancellation_policy',
'require_guest_profile_picture',
'require_guest_phone_verification']
df_dl_clean_df[categorical_types].dtypes
host_is_superhost object
host_has_profile_pic object
host_identity_verified object
is_location_exact object
property_type object
room_type object
bed_type object
has_availability object
requires_license object
instant_bookable object
is_business_travel_ready object
cancellation_policy object
require_guest_profile_picture object
require_guest_phone_verification object
dtype: object
As we can see, these columns are all of the type object. Representing them as categories is much more efficient.
df_dl_clean_df[categorical_types] = df_dl_clean_df[categorical_types].apply(lambda x: x.astype('category'),
axis='rows')
df_dl_clean_df[categorical_types].dtypes
host_is_superhost category
host_has_profile_pic category
host_identity_verified category
is_location_exact category
property_type category
room_type category
bed_type category
has_availability category
requires_license category
instant_bookable category
is_business_travel_ready category
cancellation_policy category
require_guest_profile_picture category
require_guest_phone_verification category
dtype: object
Subset the dataset
Checking for unique values in each category
Let’s check how many unique values each of these columns have. This can be achieved using the pd.Series.nunique
method.
# Calculate number of unique values for each label: num_unique_labels
num_unique_values = df_dl_clean_df[categorical_types].apply(pd.Series.nunique, axis='rows')
print(num_unique_values)
# Plot number of unique values for each label
num_unique_values.plot(kind='bar')
# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')
# Display the plot
plt.show()
host_is_superhost 2
host_has_profile_pic 2
host_identity_verified 2
is_location_exact 2
property_type 37
room_type 3
bed_type 5
has_availability 1
requires_license 2
instant_bookable 2
is_business_travel_ready 1
cancellation_policy 6
require_guest_profile_picture 2
require_guest_phone_verification 2
dtype: int64
It appears there are far too many property types in this data. Let’s check why that is. Looking at the below numbers, we can see that NYC has many peculiar properties, for instance, Treehouse, boat, Earth house etc. To stay focused, we will restrict our analysis to Apartment, House, Townhouse and Condominium.
df_dl_clean_df.property_type.value_counts()
Apartment 38605
House 3846
Townhouse 1659
Condominium 1495
Loft 1412
Serviced apartment 505
Guest suite 363
Hotel 227
Boutique hotel 190
Other 118
Bed and breakfast 88
Resort 72
Hostel 62
Guesthouse 56
Bungalow 38
Villa 28
Tiny house 19
Aparthotel 17
Boat 13
Camper/RV 10
Cottage 7
Tent 6
Earth house 4
Cabin 3
Houseboat 3
Casa particular (Cuba) 2
Farm stay 2
Cave 2
Bus 2
Barn 2
Yurt 2
Castle 1
Nature lodge 1
Lighthouse 1
Dome house 1
Timeshare 1
Treehouse 1
Name: property_type, dtype: int64
Since I am subsetting the dataset, I created a new one, so that we have a copy of the original dataset with all the property types.
df_dl_clean_df.to_csv("detailed_listings_without_subset_partially_cleaned.csv", index=False)
# Subset the data to include few property types
df_airbnb = df_dl_clean_df[df_dl_clean_df.property_type.isin(['Apartment', 'House', 'Townhouse', 'Condominium'])]
# let pandas know that to use a copy instead of a view
df_airbnb = df_airbnb.copy()
# We need to remove unused categories since we are using type category,
# the removed categories are retained with count of 0.
# so, we need to explicitly remove the unused categories.
df_airbnb.loc[:, 'property_type'] = df_airbnb.loc[:,'property_type'].cat.remove_unused_categories()
df_airbnb.property_type.value_counts()
Apartment 38605
House 3846
Townhouse 1659
Condominium 1495
Name: property_type, dtype: int64
df_airbnb.head()
id | name | summary | description | host_is_superhost | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood_cleansed | neighbourhood_group_cleansed | latitude | longitude | is_location_exact | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | amenities | price | guests_included | extra_people | minimum_nights | maximum_nights | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | requires_license | instant_bookable | is_business_travel_ready | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2595 | Skylit Midtown Castle | Find your romantic getaway to this beautiful, ... | Find your romantic getaway to this beautiful, ... | f | 5.0 | 5.0 | ['email', 'phone', 'reviews', 'kba', 'work_ema... | t | t | Midtown | Manhattan | 40.75362 | -73.98377 | f | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | {TV,Wifi,"Air conditioning",Kitchen,"Paid park... | $225.00 | 2 | $0.00 | 1 | 1125 | t | 13 | 17 | 31 | 288 | 46 | 12 | f | f | f | strict_14_with_grace_period | t | t | 2 | 1 | 0 | 1 |
1 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | NaN | WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T... | f | 1.0 | 1.0 | ['email', 'phone', 'google', 'reviews', 'jumio... | t | t | Harlem | Manhattan | 40.80902 | -73.94190 | t | Apartment | Private room | 2 | 1.0 | 1.0 | 1.0 | Pull-out Sofa | {"Cable TV",Internet,Wifi,"Air conditioning",K... | $150.00 | 2 | $20.00 | 3 | 7 | t | 30 | 60 | 90 | 365 | 0 | 0 | f | f | f | strict_14_with_grace_period | t | t | 1 | 0 | 1 | 0 |
3 | 5022 | Entire Apt: Spacious Studio/Loft by central park | NaN | Loft apartment with high ceiling and wood floo... | f | 1.0 | 1.0 | ['email', 'phone', 'facebook', 'reviews', 'kba'] | t | t | East Harlem | Manhattan | 40.79851 | -73.94399 | t | Apartment | Entire home/apt | 1 | 1.0 | NaN | 1.0 | Real Bed | {Internet,Wifi,"Air conditioning",Kitchen,Elev... | $80.00 | 1 | $20.00 | 10 | 120 | t | 0 | 0 | 0 | 0 | 9 | 2 | f | f | f | strict_14_with_grace_period | t | t | 1 | 1 | 0 | 0 |
4 | 5099 | Large Cozy 1 BR Apartment In Midtown East | My large 1 bedroom apartment is true New York ... | My large 1 bedroom apartment is true New York ... | f | 1.0 | 1.0 | ['email', 'phone', 'reviews', 'jumio', 'govern... | t | f | Murray Hill | Manhattan | 40.74767 | -73.97500 | f | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | {TV,"Cable TV",Internet,Wifi,Kitchen,"Buzzer/w... | $200.00 | 2 | $100.00 | 3 | 21 | t | 24 | 33 | 63 | 127 | 75 | 10 | f | f | f | strict_14_with_grace_period | t | t | 1 | 1 | 0 | 0 |
5 | 5121 | BlissArtsSpace! | NaN | HELLO EVERYONE AND THANKS FOR VISITING BLISS A... | f | 1.0 | 1.0 | ['email', 'phone', 'facebook', 'reviews', 'off... | t | f | Bedford-Stuyvesant | Brooklyn | 40.68688 | -73.95596 | f | Apartment | Private room | 2 | NaN | 1.0 | 1.0 | Futon | {Wifi,"Air conditioning",Kitchen,"Pets live on... | $60.00 | 1 | $30.00 | 45 | 730 | t | 0 | 0 | 0 | 0 | 49 | 0 | f | f | f | strict_14_with_grace_period | f | f | 1 | 0 | 1 | 0 |
Missing Value Statistics for the subset
missing_df = missing_statistics(df_airbnb)
missing_df
num_missing | pct_missing | |
---|---|---|
summary | 1938 | 4.0 |
description | 797 | 2.0 |
bathrooms | 55 | 0.0 |
beds | 38 | 0.0 |
bedrooms | 25 | 0.0 |
host_identity_verified | 17 | 0.0 |
host_has_profile_pic | 17 | 0.0 |
host_total_listings_count | 17 | 0.0 |
host_listings_count | 17 | 0.0 |
host_is_superhost | 17 | 0.0 |
name | 16 | 0.0 |
cancellation_policy | 1 | 0.0 |
availability_60 | 0 | 0.0 |
availability_90 | 0 | 0.0 |
availability_365 | 0 | 0.0 |
number_of_reviews | 0 | 0.0 |
number_of_reviews_ltm | 0 | 0.0 |
id | 0 | 0.0 |
requires_license | 0 | 0.0 |
instant_bookable | 0 | 0.0 |
is_business_travel_ready | 0 | 0.0 |
has_availability | 0 | 0.0 |
require_guest_profile_picture | 0 | 0.0 |
require_guest_phone_verification | 0 | 0.0 |
calculated_host_listings_count | 0 | 0.0 |
calculated_host_listings_count_entire_homes | 0 | 0.0 |
calculated_host_listings_count_private_rooms | 0 | 0.0 |
availability_30 | 0 | 0.0 |
amenities | 0 | 0.0 |
maximum_nights | 0 | 0.0 |
minimum_nights | 0 | 0.0 |
extra_people | 0 | 0.0 |
guests_included | 0 | 0.0 |
price | 0 | 0.0 |
bed_type | 0 | 0.0 |
accommodates | 0 | 0.0 |
room_type | 0 | 0.0 |
property_type | 0 | 0.0 |
is_location_exact | 0 | 0.0 |
longitude | 0 | 0.0 |
latitude | 0 | 0.0 |
neighbourhood_group_cleansed | 0 | 0.0 |
neighbourhood_cleansed | 0 | 0.0 |
host_verifications | 0 | 0.0 |
calculated_host_listings_count_shared_rooms | 0 | 0.0 |
The missing values for columns starting with host seem to have a pattern. Since all these columns have 17 missing values, it could be possible that the data is missing purpose.
# collect all the columns which have missing values
cols_missing_values = list(missing_df[missing_df.num_missing > 0].index)
df_airbnb_missing_values = df_airbnb[cols_missing_values]
df_airbnb_missing_values.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45605 entries, 0 to 48863
Data columns (total 12 columns):
summary 43667 non-null object
description 44808 non-null object
bathrooms 45550 non-null float64
beds 45567 non-null float64
bedrooms 45580 non-null float64
host_identity_verified 45588 non-null category
host_has_profile_pic 45588 non-null category
host_total_listings_count 45588 non-null float64
host_listings_count 45588 non-null float64
host_is_superhost 45588 non-null category
name 45589 non-null object
cancellation_policy 45604 non-null category
dtypes: category(4), float64(5), object(3)
memory usage: 3.3+ MB
Check if the 17 missing values of host columns are are missing together.
host_cols = list(df_airbnb_missing_values.columns[df_airbnb_missing_values.columns.str.contains('host')])
df_airbnb_missing_values[host_cols][df_airbnb_missing_values.host_identity_verified.isnull()]
host_identity_verified | host_has_profile_pic | host_total_listings_count | host_listings_count | host_is_superhost | |
---|---|---|---|---|---|
2451 | NaN | NaN | NaN | NaN | NaN |
2715 | NaN | NaN | NaN | NaN | NaN |
5962 | NaN | NaN | NaN | NaN | NaN |
6198 | NaN | NaN | NaN | NaN | NaN |
6454 | NaN | NaN | NaN | NaN | NaN |
8108 | NaN | NaN | NaN | NaN | NaN |
8974 | NaN | NaN | NaN | NaN | NaN |
9642 | NaN | NaN | NaN | NaN | NaN |
10692 | NaN | NaN | NaN | NaN | NaN |
14184 | NaN | NaN | NaN | NaN | NaN |
14207 | NaN | NaN | NaN | NaN | NaN |
14652 | NaN | NaN | NaN | NaN | NaN |
18835 | NaN | NaN | NaN | NaN | NaN |
23104 | NaN | NaN | NaN | NaN | NaN |
23119 | NaN | NaN | NaN | NaN | NaN |
26552 | NaN | NaN | NaN | NaN | NaN |
45837 | NaN | NaN | NaN | NaN | NaN |
Yes, as suspected, these host columns have missing values for the same observations.
Handling Missing Values
- For columns containing text, we will be replacing them with an empty string.
- For categorical columns, we will be replacing missing values with the mode.
- For continuous columns, we will be replacing the missing values with the median.
I will be using sklearn’s SimpleImputer
to replace the missing values.
TEXT COLUMNS:
df_airbnb_clean = df_airbnb.copy(deep=True)
df_airbnb_clean.shape
(45605, 45)
# to avoid setcopywithwarning use .copy() - this will tell pandas use a copy and not a view.
df_airbnb_clean.loc[df_airbnb_clean.summary.isna().copy(), 'summary'] = ''
df_airbnb_clean.loc[df_airbnb_clean.description.isna().copy(), 'description'] = ''
df_airbnb_clean.loc[df_airbnb_clean.name.isna().copy(), 'name'] = ''
CATEGORICAL COLUMNS:
from sklearn.impute import SimpleImputer
category_missing_cols = ['host_has_profile_pic', 'host_identity_verified',
'host_is_superhost', 'cancellation_policy']
float_missing_cols = ['bathrooms', 'beds', 'bedrooms',
'host_total_listings_count',
'host_listings_count']
For validation purpose, lets first check the float_missing_cols
. Example, check the bathrooms and bedrooms columns, you will notice there are some NaN values in there. If the missing value imputation is successful, then we should expect to see these NaN values replaced by their median values.
# get all the observations for float_missing_cols which contain na.
# we will use np.where to get the indices of these observations.
#index,_ = np.where(df_airbnb_clean[float_missing_cols].isna() == True)
#df_airbnb_clean.iloc[index, :]
# Store these indices, so that we can validate after imputation
#indexes = list(index)
#df_airbnb_clean.iloc[index, :]
def replace_missing_values(cols, df):
'''
Takes a list of columns and a dataframe and imputes based on
the column type. If it is object type, then most_frequent value
is used for imputation. If it is a float/int type, then the median
value is used for imputation.
arguments:
cols: list of columns
df : dataframe containing these columns.
returns:
df: the imputed dataframe
'''
for col in cols:
if type(df[col].dtype) is pd.core.dtypes.dtypes.CategoricalDtype:
print("Imputing {} column with most frequent value".format(col))
mode_imputer = SimpleImputer(strategy='most_frequent')
df.loc[:, col] = mode_imputer.fit_transform(df[[col]])
elif df[col].dtype == 'float64' or df[col].dtype == 'int64':
print("Imputing {} column with median value".format(col))
median_imputer = SimpleImputer(strategy='median')
df.loc[:, col] = median_imputer.fit_transform(df[[col]])
else:
raise ValueError("Invalid column type")
return df
df_airbnb_clean = replace_missing_values(float_missing_cols, df_airbnb_clean)
df_airbnb_clean = replace_missing_values(category_missing_cols, df_airbnb_clean)
Imputing bathrooms column with median value
Imputing beds column with median value
Imputing bedrooms column with median value
Imputing host_total_listings_count column with median value
Imputing host_listings_count column with median value
Imputing host_has_profile_pic column with most frequent value
Imputing host_identity_verified column with most frequent value
Imputing host_is_superhost column with most frequent value
Imputing cancellation_policy column with most frequent value
After imputation, we have successfully removed all the missing values.
df_airbnb_clean.isna().sum()
id 0
name 0
summary 0
description 0
host_is_superhost 0
host_listings_count 0
host_total_listings_count 0
host_verifications 0
host_has_profile_pic 0
host_identity_verified 0
neighbourhood_cleansed 0
neighbourhood_group_cleansed 0
latitude 0
longitude 0
is_location_exact 0
property_type 0
room_type 0
accommodates 0
bathrooms 0
bedrooms 0
beds 0
bed_type 0
amenities 0
price 0
guests_included 0
extra_people 0
minimum_nights 0
maximum_nights 0
has_availability 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
number_of_reviews 0
number_of_reviews_ltm 0
requires_license 0
instant_bookable 0
is_business_travel_ready 0
cancellation_policy 0
require_guest_profile_picture 0
require_guest_phone_verification 0
calculated_host_listings_count 0
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
dtype: int64
Clean the columns
Some columns like price
are represented as objects since we have a dollar sign. Before we proceed further with our analysis, we need to clean these columns.
Cleaning price and extra_people
import re
df_airbnb_clean['price'] = df_airbnb_clean.price.apply(lambda x:re.sub(r'[$,]', '', x)).astype('float')
df_airbnb_clean['extra_people'] = df_airbnb_clean.extra_people.apply(lambda x:re.sub(r'[$,]', '', x)).astype('float')
df_airbnb_clean[['extra_people', 'price']].plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x7f246e3a9e10>
Price has quite many outliers, we will deal with outliers in the next stage.
One-hot encode the categorical variables
category_one_hot_encoding = pd.get_dummies(df_airbnb_clean[categorical_types])
category_one_hot_encoding.head()
host_is_superhost_f | host_is_superhost_t | host_has_profile_pic_f | host_has_profile_pic_t | host_identity_verified_f | host_identity_verified_t | is_location_exact_f | is_location_exact_t | property_type_Apartment | property_type_Condominium | property_type_House | property_type_Townhouse | room_type_Entire home/apt | room_type_Private room | room_type_Shared room | bed_type_Airbed | bed_type_Couch | bed_type_Futon | bed_type_Pull-out Sofa | bed_type_Real Bed | has_availability_t | requires_license_f | requires_license_t | instant_bookable_f | instant_bookable_t | is_business_travel_ready_f | cancellation_policy_flexible | cancellation_policy_moderate | cancellation_policy_strict | cancellation_policy_strict_14_with_grace_period | cancellation_policy_super_strict_30 | cancellation_policy_super_strict_60 | require_guest_profile_picture_f | require_guest_profile_picture_t | require_guest_phone_verification_f | require_guest_phone_verification_t | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
4 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
5 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
category_one_hot_encoding.shape
(45605, 36)
df_airbnb_clean.shape
(45605, 45)
We need to merge these two and drop the categorical variables
df_airbnb_clean = pd.concat([df_airbnb_clean, category_one_hot_encoding],
axis='columns').drop(categorical_types,
axis='columns')
df_airbnb_clean.shape
(45605, 67)
df_airbnb_clean.to_csv('airbnb_clean.csv', index=False)
Upload the cleaned file to S3
# upload the file to S3
s3_client.upload_file(Bucket=bucket_name, Filename='airbnb_clean.csv', Key='clean/airbnb_clean.csv')
response = s3_client.list_objects(Bucket=bucket_name)
for file in response['Contents']:
print(file['Key'])
clean/airbnb_clean.csv
detailed_listings.csv
summary_listings.csv