Data Engineering

A collection of Data Engineering projects and blog posts.

1. Data pipelines with Apache Airflow

Automate the Data Warehouse ETL process with Apache Airflow (github link). Automation is at the heart of data engineering, and Apache Airflow makes it possible to build reusable, production-grade data pipelines that cater to the needs of Data Scientists. In this project, I took on the role of a Data Engineer to:

  • Develop a data pipeline that automates the data warehouse ETL by building custom Airflow operators that handle the extraction, transformation, validation and loading of data from S3 -> Redshift -> S3.
  • Build a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills (sketched below).

Keywords: Apache Airflow, AWS Redshift, Python, Docker Compose, ETL, Data Engineering
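
A minimal sketch of what such a custom operator and DAG might look like, assuming Airflow 2.x with the Postgres provider installed; the connection id, table, and checks are illustrative placeholders rather than the repo's actual code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if any of the supplied SQL checks returns an unexpected result."""

    def __init__(self, redshift_conn_id, checks, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.checks = checks  # list of {"sql": ..., "expected": ...} dicts

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for check in self.checks:
            records = hook.get_records(check["sql"])
            if not records or records[0][0] != check["expected"]:
                raise ValueError(f"Data quality check failed: {check['sql']}")
            self.log.info("Data quality check passed: %s", check["sql"])


with DAG(
    dag_id="s3_to_redshift_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # lets Airflow backfill every execution date since start_date
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    # The staging (S3 -> Redshift) and unload (Redshift -> S3) operators would sit here.
    run_quality_checks = DataQualityOperator(
        task_id="run_data_quality_checks",
        redshift_conn_id="redshift",
        checks=[
            {"sql": "SELECT COUNT(*) FROM staging_events WHERE userid IS NULL", "expected": 0},
        ],
    )
```

Setting `catchup=True` together with templated execution dates is what makes backfills a matter of clearing or triggering past DAG runs rather than writing one-off scripts.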

2. Data Lakes with Apache Spark

Develop an ETL pipeline for a Data Lake (github link). As a data engineer, I was tasked with building an ETL pipeline that extracts data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This allows Data Scientists to keep finding insights from the data stored in the Data Lake.

  • Developed Python scripts that use PySpark to wrangle the data loaded from S3.
  • Designed a star schema to store the transformed data back in S3 as partitioned Parquet files (sketched below).

Keywords: AWS EMR, Data Lakes, PySpark, Python, Data Wrangling, Data Engineering
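
A condensed sketch of the ETL flow described above, assuming a SparkSession running on EMR with S3 access configured; the bucket paths and column names are illustrative, not the repo's actual schema:

```python
from pyspark.sql import SparkSession

# Illustrative paths; the real bucket layout and schema live in the repo.
INPUT_PATH = "s3a://source-bucket/raw/*/*.json"
OUTPUT_PATH = "s3a://lake-bucket/analytics"

spark = SparkSession.builder.appName("data-lake-etl").getOrCreate()

# Extract: load the raw JSON files from S3.
events_raw = spark.read.json(INPUT_PATH)

# Transform: project the columns for one dimension table and deduplicate.
events_dim = (
    events_raw
    .select("event_id", "user_id", "event_type", "year", "month")
    .dropDuplicates(["event_id"])
)

# Load: write the dimension back to S3 as partitioned Parquet files.
(
    events_dim.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet(f"{OUTPUT_PATH}/events")
)
```

Partitioning on coarse columns such as year and month lets downstream queries prune whole directories instead of scanning the entire dataset.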

3. Build a production-grade data pipeline using Airflow

Convert raw search text into actionable insights (github link). This project builds a production-grade data pipeline that automates the parsing of user search patterns to analyze user engagement: it extracts data from S3, applies a series of transformations, loads clean datasets into S3 (the Data Lake), and stores aggregated data in Redshift (the Data Warehouse). To accomplish this, I designed an ETL pipeline using the Airflow framework that will:

  • Incrementally extract data from the source S3 bucket.
  • Apply a series of transformations in memory.
  • Load the clean dataset back into the destination S3 bucket.
  • Aggregate the data and store the results in a Redshift table (sketched below).

Keywords: Python, Apache Airflow, Data Engineering, Redshift, Pandas, Regex
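
A minimal sketch of how those four steps might be wired together in Airflow, assuming Airflow 2.x with the Amazon and Postgres providers; the bucket names, object keys, column names, and the `search_term_counts` table are hypothetical placeholders:

```python
import io
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_transform_load(ds, **_):
    """Incrementally extract one day of raw searches, clean it in memory, load it to the lake."""
    s3 = S3Hook(aws_conn_id="aws_default")
    raw = s3.read_key(key=f"raw/searches_{ds}.csv", bucket_name="source-bucket")
    df = pd.read_csv(io.StringIO(raw))

    # In-memory transformation: normalise the search text with a regex.
    df["search_term"] = (
        df["search_term"].str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True)
    )

    # Load the clean dataset into the destination bucket (the Data Lake).
    s3.load_string(
        string_data=df.to_csv(index=False),
        key=f"clean/searches_{ds}.csv",
        bucket_name="destination-bucket",
        replace=True,
    )


def aggregate_to_redshift(ds, **_):
    """Aggregate the clean dataset and store the results in Redshift (the Data Warehouse)."""
    s3 = S3Hook(aws_conn_id="aws_default")
    clean = pd.read_csv(
        io.StringIO(s3.read_key(key=f"clean/searches_{ds}.csv", bucket_name="destination-bucket"))
    )
    counts = clean.groupby("search_term").size().reset_index(name="search_count")

    PostgresHook(postgres_conn_id="redshift").insert_rows(
        table="search_term_counts",
        rows=[(ds, r.search_term, r.search_count) for r in counts.itertuples()],
        target_fields=["search_date", "search_term", "search_count"],
    )


with DAG(
    dag_id="search_insights_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="extract_transform_load", python_callable=extract_transform_load)
    agg = PythonOperator(task_id="aggregate_to_redshift", python_callable=aggregate_to_redshift)
    etl >> agg
```

Keying the S3 objects on the execution date (`ds`) is what makes the extraction incremental: each daily run touches only that day's partition, and re-running a date simply reprocesses the same object.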

4. Data Streaming Using Kafka

Stream processing data pipeline using Kafka: The Chicago Transit Authority (CTA) is interested in developing a data dashboard that displays the system status for its commuters. As their data engineer, I was tasked with building a real-time stream processing data pipeline that ingests the arrival and turnstile events emitted by the devices the CTA has installed at each train station. The station data lives in an in-house PostgreSQL database, which the pipeline also needs to leverage to feed the dashboard. Lastly, the CTA wishes to display live weather data, which is served through a weather REST endpoint.
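
A minimal producer/consumer sketch of the arrival-event portion of such a pipeline, using the kafka-python client against a local broker; the broker address, topic name, and event fields are illustrative, and the full project would additionally need to bring in the PostgreSQL station data and the weather REST endpoint:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"                # assumed local broker
ARRIVALS_TOPIC = "cta.station.arrivals"  # illustrative topic name

# Producer side: a station device emits an arrival event into Kafka.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(
    ARRIVALS_TOPIC,
    {"station_id": 40380, "train_line": "blue", "direction": "inbound"},
)
producer.flush()

# Consumer side: the dashboard service reads the stream of arrival events.
consumer = KafkaConsumer(
    ARRIVALS_TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    consumer_timeout_ms=10_000,  # stop iterating after 10s of silence in this sketch
)
for message in consumer:
    print(f"Arrival at station {message.value['station_id']}: {message.value}")
```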