New York City Airbnb t-SNE visualization


Dimensionality Reduction using t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique that is particularly well suited to visualizing high-dimensional datasets. t-SNE minimizes the Kullback-Leibler divergence between two distributions: one that measures pairwise similarities of the input objects, and one that measures pairwise similarities of the corresponding low-dimensional points in the embedding.

In this way, t-SNE maps multi-dimensional data to a lower-dimensional space so that data points with similar feature values end up close together, which makes clusters in the data easier to spot. However, the output axes no longer correspond to the input features, so you cannot draw inferences from the output of t-SNE alone. Hence it is mainly a data exploration and visualization technique.
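To make the objective described above more concrete, here is a minimal NumPy sketch of the two similarity distributions and the divergence between them. It is a simplification, not the scikit-learn implementation: it assumes a single fixed Gaussian bandwidth (sigma) for every point, whereas real t-SNE tunes a bandwidth per point from the perplexity setting. The function names and the random toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows in X."""
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def input_similarities(X, sigma=1.0):
    """Gaussian similarities P over all pairs (assumption: one fixed sigma)."""
    p = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(p, 0.0)  # a point is not its own neighbour
    return p / p.sum()

def embedding_similarities(Y):
    """Student-t (one degree of freedom) similarities Q over all pairs."""
    q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(q, 0.0)
    return q / q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q): the cost that t-SNE tries to make small."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Toy data: 20 points in 5 dimensions and a random 3D "embedding"
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = rng.normal(size=(20, 3))

P = input_similarities(X)
Q = embedding_similarities(Y)
cost = kl_divergence(P, Q)
```

t-SNE starts from an embedding like Y and moves the points by gradient descent so that this cost shrinks. The heavy-tailed Student-t kernel in the low-dimensional space is what lets moderately distant points spread apart instead of crowding together.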

We will start with the 50 principal components that we created in the earlier post, New York City Airbnb PCA, and apply t-SNE with 3 components, which we can then use to create a 3D scatter plot of the data points.

Get the data

The principal components created earlier are stored in the airbnb_final.csv file, which we will load to begin our analysis.

import pandas as pd
data = pd.read_csv('airbnb_final.csv')
data.head()
| | price_category | name | id | price | adjusted_price | minimum_nights | bedrooms | bathrooms | neighbourhood_group_cleansed | neighbourhood_cleansed | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
| 0 | high | Skylit Midtown Castle | 2595 | 225.0 | 225.000000 | 1 | 0.0 | 1.0 | Manhattan | Midtown | ... | 0.576904 | 0.466956 | 0.331311 | 0.261779 | -0.330193 | 1.620287 | 0.867739 | -0.798060 | -0.576860 | -0.254925 |
| 1 | medium | THE VILLAGE OF HARLEM....NEW YORK ! | 3647 | 150.0 | 50.000000 | 3 | 1.0 | 1.0 | Manhattan | Harlem | ... | -0.252328 | 0.226731 | 0.269839 | -0.211928 | 0.147831 | 1.354929 | 0.801862 | -0.292820 | -0.804985 | -0.202175 |
| 2 | low | Entire Apt: Spacious Studio/Loft by central park | 5022 | 80.0 | 8.000000 | 10 | 1.0 | 1.0 | Manhattan | East Harlem | ... | 0.193168 | 0.068044 | 0.015844 | 0.197295 | -0.167786 | 1.117572 | 0.749340 | 0.109282 | -0.870940 | -0.278948 |
| 3 | medium | Large Cozy 1 BR Apartment In Midtown East | 5099 | 200.0 | 66.666667 | 3 | 1.0 | 1.0 | Manhattan | Murray Hill | ... | 0.416250 | -0.057388 | 0.073780 | 0.143102 | 0.198163 | 1.440193 | 0.641507 | -0.021390 | -0.912154 | -0.285108 |
| 4 | low | BlissArtsSpace! | 5121 | 60.0 | 1.333333 | 45 | 1.0 | 1.0 | Brooklyn | Bedford-Stuyvesant | ... | -0.214280 | -0.095810 | 0.224063 | 0.083689 | -0.192449 | 0.410582 | 0.017357 | 0.246550 | 0.796558 | 0.307420 |

5 rows × 60 columns

# rename the PC columns
pc_col_names = ["pc_" + item for item in list(data.columns[10:])]
other_col_names = list(data.columns[:10])
data.columns = other_col_names + pc_col_names

Apply t-SNE

from sklearn.manifold import TSNE

# extract the 50 principal components
A = data.iloc[:,10:].values
type(A)
numpy.ndarray
# Dimension reduction with t-SNE
model = TSNE(n_components=3, learning_rate=100, random_state=42)
tsne_features = model.fit_transform(A)

# Construct a t-SNE dataframe with one column per embedding dimension
tsne_df = pd.DataFrame(tsne_features, columns=['TSNE1', 'TSNE2', 'TSNE3'])
tsne_df.shape
(45605, 3)

The tsne_df dataframe contains the 3 t-SNE features for all 45,605 Airbnb listings. We can now combine it with the other columns of the Airbnb dataset to build a 3D scatter plot.
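Since the full run is slow, it can help to rehearse the same steps on a small table first. The sketch below uses synthetic data (60 made-up rows, 5 made-up "pc_" columns, and a lowered perplexity of 10 to suit the small sample). These values are assumptions for illustration, not the settings used on the real dataset. It also selects the component columns by their "pc_" prefix rather than by position, and passes the original index to the new dataframe so that the later concat lines up row by row.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Synthetic stand-in for the real table: one label column plus 5 "pc_" columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(60, 5)),
                   columns=[f"pc_{i}" for i in range(5)])
toy.insert(0, "name", [f"listing_{i}" for i in range(60)])

# Select the component columns by name prefix instead of by position
pc_cols = [c for c in toy.columns if c.startswith("pc_")]

# perplexity must be smaller than the number of samples, so lower it here
toy_model = TSNE(n_components=3, learning_rate=100, perplexity=10, random_state=42)
toy_features = toy_model.fit_transform(toy[pc_cols].values)

# Reuse the original index so that concat aligns rows correctly
toy_tsne = pd.DataFrame(toy_features,
                        columns=["TSNE1", "TSNE2", "TSNE3"],
                        index=toy.index)
toy_final = pd.concat([toy_tsne, toy[["name"]]], axis=1)
```

pd.concat with axis=1 aligns on the index, so if the source dataframe had been filtered or re-sorted beforehand, a mismatched index would silently introduce NaN rows. Passing index=toy.index guards against that.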

data_tsne = data[other_col_names]
tsne_final = pd.concat([tsne_df, data_tsne], axis=1)

# save the result to disk, since t-SNE takes a very long time to run
tsne_final.to_csv('tsne_final.csv', index=False)
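One way to avoid re-running t-SNE by accident is to wrap the save-and-reload pattern in a small caching helper. The function below is a hypothetical sketch (load_or_compute is not part of pandas or scikit-learn): it reads the CSV if it already exists and otherwise calls the supplied function and saves its result.

```python
from pathlib import Path

import pandas as pd

def load_or_compute(cache_path, compute_fn):
    """Return the cached dataframe if the CSV exists, else compute and save it.

    Hypothetical helper: compute_fn is any zero-argument function
    that returns a pandas DataFrame.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path)
    result = compute_fn()
    result.to_csv(cache_path, index=False)
    return result
```

With this in place, the expensive step could be written as a function that builds tsne_final and passed in, for example load_or_compute('tsne_final.csv', build_tsne_final), where build_tsne_final is a name you would define yourself. Delete the CSV whenever the inputs or the t-SNE parameters change, otherwise the stale result is returned.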

tsne_final.head()
| | TSNE1 | TSNE2 | TSNE3 | price_category | name | id | price | adjusted_price | minimum_nights | bedrooms | bathrooms | neighbourhood_group_cleansed | neighbourhood_cleansed |
| 0 | 6.618355 | 18.307888 | 4.037642 | high | Skylit Midtown Castle | 2595 | 225.0 | 225.000000 | 1 | 0.0 | 1.0 | Manhattan | Midtown |
| 1 | -20.100536 | 8.020902 | -1.968155 | medium | THE VILLAGE OF HARLEM....NEW YORK ! | 3647 | 150.0 | 50.000000 | 3 | 1.0 | 1.0 | Manhattan | Harlem |
| 2 | -9.849981 | 16.748266 | 2.556231 | low | Entire Apt: Spacious Studio/Loft by central park | 5022 | 80.0 | 8.000000 | 10 | 1.0 | 1.0 | Manhattan | East Harlem |
| 3 | -2.867686 | 1.036031 | 15.170166 | medium | Large Cozy 1 BR Apartment In Midtown East | 5099 | 200.0 | 66.666667 | 3 | 1.0 | 1.0 | Manhattan | Murray Hill |
| 4 | -8.865001 | -15.556909 | -7.953006 | low | BlissArtsSpace! | 5121 | 60.0 | 1.333333 | 45 | 1.0 | 1.0 | Brooklyn | Bedford-Stuyvesant |

Visualize the data with Plotly Express

import pandas as pd
import plotly.express as px
# load the saved t-SNE results (adjust the path to wherever tsne_final.csv is stored)
tsne_final = pd.read_csv('../data/raw/tsne_final.csv')
plotly_data = tsne_final[(tsne_final.neighbourhood_cleansed == 'Chelsea') & 
                         (tsne_final.minimum_nights <= 3) &
                         (tsne_final.bedrooms == 0)
                        ]
plotly_data.shape
(107, 13)
plotly_data.head()
| | TSNE1 | TSNE2 | TSNE3 | price_category | name | id | price | adjusted_price | minimum_nights | bedrooms | bathrooms | neighbourhood_group_cleansed | neighbourhood_cleansed |
| 173 | -8.011493 | -10.242352 | -9.682402 | low | Chelsea Studio sublet 1 - 2 months | 47370 | 125.0 | 41.666667 | 3 | 0.0 | 1.0 | Manhattan | Chelsea |
| 1082 | -11.707626 | 13.309287 | 1.150806 | high | Beautiful Brand New Chelsea Studio | 515392 | 200.0 | 200.000000 | 1 | 0.0 | 1.0 | Manhattan | Chelsea |
| 2758 | -11.331677 | 16.121357 | 1.707191 | medium | Large Comfortable Studio in Chelsea | 1820858 | 161.0 | 80.500000 | 2 | 0.0 | 1.0 | Manhattan | Chelsea |
| 2838 | -10.474242 | 13.188574 | -4.580487 | medium | Awesome Huge Studio - NYC Center | 1891017 | 200.0 | 66.666667 | 3 | 0.0 | 1.0 | Manhattan | Chelsea |
| 2957 | -12.070560 | 6.332698 | 3.506707 | medium | Luxury studio | 1975999 | 189.0 | 63.000000 | 3 | 0.0 | 1.0 | Manhattan | Chelsea |
# 3D scatter plot of the t-SNE features, colored by price category
fig = px.scatter_3d(plotly_data, x='TSNE1', y='TSNE2', z='TSNE3', color='price_category',
                    hover_name='name', hover_data=['price', 'minimum_nights', 'id'],
                    template='plotly_dark', opacity=0.9, title='Visualizing airbnb locations in feature space',
                    labels={'TSNE1': 'X', 'TSNE2': 'Y', 'TSNE3': 'Z'})

fig.write_html('scatter-3d.html')
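To explore other neighbourhoods, the three conditions used for the Chelsea subset above could be wrapped in a small function. This is a hypothetical helper (filter_listings is not from the original analysis), shown here on a tiny made-up table that has only the three columns the filter needs.

```python
import pandas as pd

def filter_listings(df, neighbourhood, max_minimum_nights, bedrooms):
    """Hypothetical helper: apply the same three conditions as the Chelsea filter."""
    mask = (
        (df["neighbourhood_cleansed"] == neighbourhood)
        & (df["minimum_nights"] <= max_minimum_nights)
        & (df["bedrooms"] == bedrooms)
    )
    return df[mask]

# Made-up rows for illustration only
toy_listings = pd.DataFrame({
    "neighbourhood_cleansed": ["Chelsea", "Chelsea", "Harlem", "Chelsea"],
    "minimum_nights": [3, 30, 1, 1],
    "bedrooms": [0.0, 0.0, 0.0, 1.0],
})

# Studios (0 bedrooms) in Chelsea with a minimum stay of at most 3 nights
studios = filter_listings(toy_listings, "Chelsea", 3, 0)
```

On the real data, filter_listings(tsne_final, 'Chelsea', 3, 0) should select the same rows as the explicit filter, and the result can be passed to px.scatter_3d in place of plotly_data.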