This site is a compilation of projects and blog posts related to Data Science, Data Engineering, Machine Learning, NLP, and Statistics that I have worked on over the past few years.

1. Group Airbnb listings in New York City based on what they offer

An interactive 3-D visualization to find similar Airbnb listings: Using publicly available New York City Airbnb listings data, I gathered, cleaned, and pre-processed the data, then applied linear (PCA) and non-linear (t-SNE) dimensionality reduction to build an interactive visualization that helps users find listings offering similar services.

  • Engineered over 2,000 features from the raw text and numeric data.
  • Developed an interactive data dashboard that lets users pick any neighborhood and find similar listings.
  • Deployed the dashboard to Heroku. Check out the application: (Note: it takes ~15 seconds to boot up the dyno on Heroku.)

Hover over any data point to see more details. You can zoom and pan to find listings that are similar to each other in component space. Here, the X, Y, and Z coordinates are a condensed version of all the features in the dataset, so two points that are close together have similar features.
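The reduction from the full feature matrix down to the 3-D component space can be sketched as below — a minimal sketch with scikit-learn on random stand-in data; the array sizes, perplexity, and intermediate 30-component PCA step are illustrative assumptions, not the project's exact settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # stand-in for the engineered listing features

# Standardize, compress linearly with PCA, then embed non-linearly with t-SNE
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=30, random_state=0).fit_transform(X_std)
X_3d = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(X_pca)

print(X_3d.shape)  # (200, 3) -> the X, Y, Z coordinates plotted in the dashboard
```

Running PCA first is a common speed/denoising trick before t-SNE, which scales poorly with input dimensionality.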

The steps taken to arrive at this output have been thoroughly documented in the blog posts listed below:

FIGURE: Shown below are the results of K-means clustering for the Chelsea neighborhood. Each marker on the map is an Airbnb listing, and its color indicates the cluster it belongs to. Click on any marker to see more details.
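The clustering behind the figure can be sketched as follows (a minimal sketch on random stand-in coordinates; the cluster count is a hypothetical choice, and the real input is each listing's component-space coordinates):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3))  # stand-in for listings' 3-D component-space coordinates

# Assign each listing to one of k clusters; each cluster becomes a marker color on the map
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
print(labels.shape)  # (100,)
```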

Keywords: CRISP-DM, PCA, t-SNE, Plotly, Dash, Heroku, Machine Learning workflow

2. Data pipelines with Apache Airflow

Automate a Data Warehouse ETL process with Apache Airflow : github link Automation is at the heart of data engineering, and Apache Airflow makes it possible to build reusable, production-grade data pipelines that cater to the needs of Data Scientists. In this project, I took on the role of a Data Engineer to:

  • Develop a data pipeline that automates the data warehouse ETL process by building custom Airflow operators that handle the extraction, transformation, validation, and loading of data from S3 -> Redshift -> S3.
  • Build a reusable production-grade data pipeline that incorporates data quality checks and allows for easy backfills.
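The validation step can be sketched in plain Python (Airflow boilerplate omitted; a custom data-quality operator's `execute()` would run a helper like this against Redshift — the check SQL and table names here are purely illustrative):

```python
def run_quality_checks(run_query, checks):
    """Run each SQL check and compare its scalar result against the expectation.

    run_query: callable that executes a SQL statement and returns a single scalar
    checks:    list of {"sql": ..., "expected": ...} dicts
    """
    failures = []
    for check in checks:
        actual = run_query(check["sql"])
        if actual != check["expected"]:
            failures.append((check["sql"], actual, check["expected"]))
    if failures:
        # Failing loudly makes the Airflow task fail, so bad loads never go unnoticed
        raise ValueError(f"Data quality checks failed: {failures}")
    return len(checks)  # number of checks that passed

# Usage with a stubbed query runner standing in for a Redshift connection
checks = [{"sql": "SELECT COUNT(*) FROM songs WHERE songid IS NULL", "expected": 0}]
passed = run_quality_checks(lambda sql: 0, checks)
print(passed)  # 1
```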

Keywords: Apache Airflow, AWS Redshift, Python, Docker compose, ETL, Data Engineering

3. Data Lakes with Apache Spark

Develop an ETL pipeline for a Data Lake : github link As a data engineer, I was tasked with building an ETL pipeline that extracts data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This allows Data Scientists to continue finding insights from the data stored in the Data Lake.

  • Developed Python scripts that use PySpark to wrangle the data loaded from S3.
  • Designed a star schema to store the transformed data back into S3 as partitioned parquet files.
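The star-schema split can be sketched in plain Python (record contents and field names are illustrative; in the project this is done on Spark DataFrames, with the results written back to S3 as partitioned parquet files):

```python
# Raw events as they might arrive from S3 (illustrative records and fields)
raw_events = [
    {"ts": "2018-11-01", "user_id": "u1", "song_id": "s1", "artist_id": "a1", "level": "free"},
    {"ts": "2018-11-02", "user_id": "u1", "song_id": "s2", "artist_id": "a2", "level": "paid"},
]

# Dimension tables: one de-duplicated row per entity
users = {e["user_id"]: {"user_id": e["user_id"], "level": e["level"]} for e in raw_events}
songs = {e["song_id"]: {"song_id": e["song_id"], "artist_id": e["artist_id"]} for e in raw_events}

# Fact table: one row per event, referencing the dimensions by key
songplays = [{"ts": e["ts"], "user_id": e["user_id"], "song_id": e["song_id"]} for e in raw_events]

print(len(users), len(songs), len(songplays))  # 1 2 2
```

Keeping facts narrow and pushing descriptive attributes into dimensions is what lets the parquet output stay compact and query-friendly.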

Keywords: Amazon EMR, Data Lakes, PySpark, Python, Data Wrangling, Data Engineering

4. Build a production-grade data pipeline using Airflow

Convert raw search text into actionable insights : github link A production-grade data pipeline was designed to automate the parsing of user search patterns and analyze user engagement: extract data from S3, apply a series of transformations, load the clean datasets into S3 (Data Lake), and store aggregated data in Redshift (Data Warehouse). To accomplish this, I designed an ETL pipeline using the Airflow framework that will:

  • Incrementally extract data from the source S3 bucket.
  • Apply a series of in-memory transformations.
  • Load the clean dataset back into the destination S3 bucket.
  • Aggregate the data and store the results in a Redshift table.
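The incremental-extraction step (the first bullet) can be sketched as below — a minimal sketch that filters object keys by their last-modified time; the key names are illustrative, and in practice the cutoff would come from the Airflow execution date:

```python
from datetime import datetime, timezone

def incremental_keys(objects, last_run):
    """Return only the S3 object keys modified after the previous successful run."""
    return [key for key, modified in objects if modified > last_run]

# (key, last-modified) pairs as an S3 listing might report them
objects = [
    ("search/2021-01-01.json", datetime(2021, 1, 1, tzinfo=timezone.utc)),
    ("search/2021-01-02.json", datetime(2021, 1, 2, tzinfo=timezone.utc)),
]
last_run = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(incremental_keys(objects, last_run))  # ['search/2021-01-02.json']
```

Processing only new objects per run is also what makes backfills cheap: re-running a past interval touches just that interval's keys.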

Keywords: Python, Apache Airflow, Data Engineering, Redshift, Pandas, Regex

5. Data Streaming Using Kafka

Stream processing data pipeline using Kafka: The Chicago Transit Authority (CTA) is interested in developing a data dashboard that displays system status for its commuters. As their data engineer, I was tasked with building a real-time stream-processing data pipeline that takes the arrival and turnstile events emitted by devices installed by the CTA at each train station. The station data lives in an in-house Postgres database that the pipeline leverages to feed the dashboard. Lastly, the CTA also wishes to display live weather data, which is served through a weather REST endpoint.
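The enrichment at the heart of the pipeline — joining raw station events against the Postgres station metadata before they reach the dashboard — can be sketched in plain Python (station IDs, field names, and values are all illustrative; in the pipeline itself this join runs inside the stream processor):

```python
# Station metadata as it might be pulled from the in-house Postgres database (stubbed)
stations = {40530: {"station_name": "Howard", "line": "red"}}

def enrich(event, stations):
    """Attach station metadata to a raw arrival/turnstile event."""
    meta = stations.get(event["station_id"], {})
    return {**event, **meta}

arrival = {"station_id": 40530, "train_id": "BL001", "direction": "b"}
print(enrich(arrival, stations))
# {'station_id': 40530, 'train_id': 'BL001', 'direction': 'b',
#  'station_name': 'Howard', 'line': 'red'}
```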

Keywords: Python, Kafka, KSQL, Faust, Kafka REST Proxy, Kafka Connect, Data Engineering, Stream Processing