Second onsite interview

Data Lake based questions

  • Come up with a reasonable explanation of how we make use of a Data Lake within Chase.

One of the most common challenges organizations face with their data lakes is the inability to find, understand, and trust the data they need for business value or to gain a competitive edge. That’s because the data might be gibberish (in its native format)—or even conflicting.

When the data scientist wants to access enterprise data for modeling or to deliver insights for analytics teams, this person is forced to dive into the depths of the data lake and wade through the murkiness of undefined data sets from multiple sources. As data becomes an increasingly important tool for businesses, this scenario is clearly not sustainable in the long run.

To be clear, for businesses to effectively and efficiently maximize the data stored in data lakes, they need to add context to their data by implementing policy-driven processes that classify and identify what information is in the lake, why it’s there, what it means, who owns it, and who is using it.

This can best be accomplished through data governance integrated with a data catalog. Once this is done, the murky data lake will become crystal clear, particularly for the users who need it most.

Avoid data swamp: The potential of big data is virtually limitless. It can help businesses scale more efficiently, gain an advantage over their competitors, enhance customer service, and more. It may seem that the more data an organization has at its fingertips, the better. Yet that’s not necessarily the case—especially if that data is hidden in the data lake with no governance in place. A data lake without data governance will ultimately end up being a collection of disconnected data pools or information silos—just all in one place.

Data dumped into a data lake is not of business value without structure, processes, and rules around the data. Ungoverned, noncataloged data leaves businesses vulnerable.

Users won’t know where the data comes from, where it’s been, with whom they can share it, or if it’s certified. Regulatory and privacy compliance risks are magnified, and data definitions can change without any user’s knowledge. The data could be impossible to analyze or be used inappropriately because there are inaccuracies and/or the data is missing context.

The impact: stakeholders won’t trust results gathered from the data. A lack of data governance transforms a data lake from a business asset to a murky business liability.

Value of a data catalog

The tremendous volume and variety of big data across an enterprise makes it difficult to understand the data’s origin, format, lineage, and how it is organized, classified, and connected. Because data is dynamic, understanding all of its features is essential to its quality, usage, and context. Data governance provides systematic structure and management to data residing in the data lake, making it more accessible and meaningful.

Redshift based questions

  • You are migrating from Postgres to Redshift. How would you design the data warehouse? What are some key considerations? What are the best practices? What are some of the gotchas that we need to consider? Principles of data warehouse design. (See the DDL sketch below.)
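
A hedged sketch of how some of those considerations (distribution style, dist/sort keys, column encodings) show up in Redshift DDL. The tables, columns, and key choices below are invented purely for illustration:

```python
# Hedged sketch: where key Redshift design choices land in the DDL.
# Table, column, and key names are invented for illustration.
FACT_DDL = """
CREATE TABLE fact_page_views (
    event_id    BIGINT IDENTITY(0, 1),
    user_id     VARCHAR(64)  ENCODE zstd,
    page        VARCHAR(256) ENCODE zstd,
    event_time  TIMESTAMP    ENCODE az64,
    views       INTEGER      ENCODE az64
)
DISTSTYLE KEY
DISTKEY (user_id)      -- co-locate rows that are joined on user_id to avoid shuffles
SORTKEY (event_time);  -- range-restricted scans for time-based filters
"""

DIM_DDL = """
CREATE TABLE dim_page (
    page        VARCHAR(256),
    page_group  VARCHAR(64)
)
DISTSTYLE ALL;         -- replicate small dimension tables to every node
"""
```

Gotchas worth revising alongside the DDL: Redshift does not enforce primary/foreign key constraints (they are informational only), there are no Postgres-style indexes, and VACUUM/ANALYZE plus WLM queue configuration matter much more than they do in Postgres.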

SQL

  • Good with this.

Data Modeling

  • Given a problem, how would you go about modeling the data.

Python

  • Just iteration over dictionaries (quick refresher below).
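
A quick refresher sketch (the dictionary contents are made up):

```python
# Common ways to iterate over a Python dictionary.
prices = {"zestimate": 350000, "rent": 1800, "tax": 4200}   # made-up example data

# Keys (iteration follows insertion order in Python 3.7+).
for key in prices:
    print(key)

# Values only.
for value in prices.values():
    print(value)

# Key/value pairs.
for key, value in prices.items():
    print(f"{key} -> {value}")

# Dict comprehension: build a new dict while iterating.
doubled = {key: value * 2 for key, value in prices.items()}

# Don't mutate a dict while iterating over it; iterate over a copy of the keys instead.
for key in list(prices):
    if prices[key] > 100000:
        del prices[key]
```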

ETL

  • Revise Airflow advanced params (see the sketch below).
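
A minimal sketch of the DAG- and task-level parameters worth revising (retries, backoff, SLAs, timeouts, catchup, concurrency limits). The DAG name, schedule, and values are placeholders, and the imports assume Airflow 2.x:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative defaults applied to every task in the DAG.
default_args = {
    "owner": "data-eng",
    "depends_on_past": False,              # don't block on the previous run's task state
    "retries": 3,                          # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "execution_timeout": timedelta(hours=1),
    "email_on_failure": True,
    "email": ["oncall@example.com"],       # placeholder alert address
    "sla": timedelta(hours=2),             # flag task runs that exceed this duration
}

with DAG(
    dag_id="example_advanced_params",      # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 * * * *",         # hourly cron schedule
    catchup=False,                         # don't backfill missed intervals
    max_active_runs=1,                     # only one DAG run at a time
    dagrun_timeout=timedelta(hours=3),
    tags=["etl", "refresher"],
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
```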

Resume and project based questions.

Build some intuition around how to use Spark Streaming.

Spark specific questions

Kafka and Spark Streaming based questions.

How would you build a real-time stream processing application?
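
One way to frame an answer, assuming Kafka as the source and Spark Structured Streaming as the engine. The broker, topic, schema, and paths below are all made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-streaming-sketch").getOrCreate()

# Hypothetical schema for a clickstream event.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

# 1. Ingest: read events from a Kafka topic (broker and topic names are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream-events")
    .load()
)

# 2. Transform: parse the JSON payloads and count page views per minute,
#    using a watermark to bound state for late-arriving events.
events = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")
page_views = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 minute"), "page")
    .count()
)

# 3. Load: write results to a sink, with checkpointing for fault tolerance.
query = (
    page_views.writeStream.outputMode("update")
    .format("console")                                   # swap for a real sink (S3/Delta/etc.)
    .option("checkpointLocation", "/tmp/checkpoints/clickstream")
    .start()
)
query.awaitTermination()
```

Talking points to pair with the sketch: delivery semantics (checkpointing plus idempotent or transactional sinks), handling late data with watermarks, schema management, and monitoring consumer lag.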

Data Warehousing (additional)

Kubernetes (additional)

AWS (additional)

CI/CD and deployment in general

Airflow scheduling and maintenance questions

  • Describe your Airflow environment.

File formats

CAP Theorem

I was asked about the CAP theorem: Consistency, Availability, and Partition tolerance. I was asked to design a system that optimizes for consistency or for availability (given that partitions must be tolerated), and to give specific examples of how I would do it. Which databases would I consider when consistency is the priority? When would I consider using a NoSQL database versus a SQL database?

Concurrency and Multi-threading

I was asked to explain multi-threading concepts.

What is hotspotting?

Sometimes you notice that one of your Spark nodes takes a little longer to run, each and every time. What does this indicate?
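
A task or node that is consistently slower usually points to data skew (hotspotting): one partition or join key carries a disproportionate share of the rows. On Spark 3.x, adaptive query execution’s skew-join handling is often the first thing to try; a manual mitigation is key salting. A hedged PySpark sketch, with the tables and columns invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()

# Hypothetical skewed join: a handful of user_ids dominate the events table.
events = spark.table("events")    # large fact table, skewed on user_id
users = spark.table("users")      # smaller dimension table

NUM_SALTS = 16

# Add a random salt to the skewed side so each hot key spreads across 16 partitions.
salted_events = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the other side once per salt value so every salted key finds a match.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
salted_users = users.crossJoin(salts)

# Join on (user_id, salt) instead of user_id alone, then drop the salt.
joined = salted_events.join(salted_users, ["user_id", "salt"]).drop("salt")
```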

Materialized Views

We have a bunch of materialized views and we are migrating them to Redshift. How would you go about solving that problem?
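
One angle on the answer: Redshift supports materialized views natively (CREATE MATERIALIZED VIEW / REFRESH MATERIALIZED VIEW, with optional auto refresh), so many Postgres views can be ported directly; views that rely on unsupported features can usually be rebuilt as scheduled CREATE TABLE AS jobs instead. A hedged sketch, with the view name, query, and connection details all invented:

```python
import psycopg2  # Redshift speaks the Postgres wire protocol, so the Postgres driver works

# Hypothetical view over a hypothetical base table.
CREATE_MV = """
CREATE MATERIALIZED VIEW daily_page_views
AUTO REFRESH NO                     -- refresh explicitly from the scheduler instead
AS
SELECT page, DATE_TRUNC('day', event_time) AS day, COUNT(*) AS views
FROM clickstream_events
GROUP BY 1, 2;
"""

REFRESH_MV = "REFRESH MATERIALIZED VIEW daily_page_views;"

def run(sql: str) -> None:
    # Connection parameters are placeholders.
    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="...",
    )
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
    finally:
        conn.close()

# One-time creation, then a periodic refresh (e.g. from an Airflow task):
# run(CREATE_MV)
# run(REFRESH_MV)
```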

Case Studies

Case Study 1: Building a big data pipeline to process clickstream data.

Clickstream data is one of the largest and most important datasets within a real-estate company. The data set contains a log of page requests, actions, user clicks, and other web activity from the millions of home shoppers and sellers visiting the real-estate company’s sites every month. The data powers many reporting dashboards and helps us answer complex business questions. Everything from product initiatives and user-behavior analysis to personalization, recommendations, and machine learning depends on this valuable dataset to generate actionable insights.

We’ve always been successful at collecting this data through a pipeline, but due to the sheer size of the data set, making this ETL process efficient and scalable has been a huge area of focus for the team. With close to a billion events per day received via clickstream data, we couldn’t initially scale fast enough to meet the growth in data volumes using our existing system. We sought a distributed platform which would enable the fast parsing and processing of large datasets via a massively parallel processing framework. Ensuring we get this data to the right dashboards in a timely manner is crucial, as internal stakeholders are counting on it to make decisions.

Our process has gone through several enhancements over time and here’s how we built it:

Version 1:

We receive a large batch of raw web activity data in JSON format. On our end, we decided to use Amazon Redshift as our data warehouse and query platform due to its cost-effectiveness and inherent columnar database benefits. Our ETL process involved:

  1. Download the JSONs to local.
  2. Using C# .NET, parse and convert the JSONs to CSVs while performing a number of transformations along the way, such as flattening nested elements, deriving calculated columns, and cleaning up text and formatting. We also split the CSV into multiple parts and compress them to make COPY more efficient.
  3. Upload CSVs to Amazon S3.
  4. Run COPY command in Redshift to copy CSVs from S3 to a Redshift table.

Scheduler: SSIS package via SQL Server Agent.
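
Step 4 is a standard Redshift COPY from an S3 prefix. A hedged sketch of what it might look like for split, gzipped CSV parts; the table, bucket, IAM role, and connection details are placeholders:

```python
import psycopg2  # Redshift accepts Postgres-protocol connections

# All identifiers below (table, bucket, role, host) are placeholders.
COPY_SQL = """
COPY clickstream_events
FROM 's3://example-bucket/clickstream/2024-01-01/part'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP
TIMEFORMAT 'auto'
REGION 'us-west-2';
"""

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(COPY_SQL)  # loads every object under the 'part' prefix in parallel
conn.close()
```

Splitting the CSVs into multiple compressed parts (step 2) is what lets COPY spread the load across the cluster’s slices.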

Version 2:

At this point in our company’s growth, the process started becoming slow due to the increase in data volume. To make it fast again, we merged steps 1, 2, and 3 above into a single step and added multithreading. Here is what it looked like:

  1. Read JSON lines into memory, skipping the download. Perform the transformations on the fly using .NET’s multithreading capabilities, and write split CSVs into S3. Based on our server hardware, the best performance was observed with 8 parallel threads.
  2. Run COPY command in Redshift to copy CSVs from S3 to a Redshift table.

Scheduler: SSIS package via SQL Server Agent.

[Figure: data-pipeline-design]
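
The team’s implementation was C# .NET; below is only a rough Python analogue of the same shape (chunk the JSON lines, transform each chunk into a gzipped CSV part, upload the parts to S3 from a thread pool). Every name in it is invented:

```python
import concurrent.futures
import csv
import gzip
import io
import json

import boto3  # assumes AWS credentials are already configured in the environment

s3 = boto3.client("s3")
BUCKET = "example-bucket"   # placeholder
NUM_THREADS = 8             # the sweet spot the team found on their hardware

def transform(record: dict) -> list:
    # Placeholder transformation: flatten a nested element, derive a column.
    return [record.get("user_id"), record.get("page", ""), len(record.get("events", []))]

def process_chunk(chunk_id: int, lines: list) -> None:
    """Parse one chunk of JSON lines, write a gzipped CSV part, upload it to S3."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in lines:
        writer.writerow(transform(json.loads(line)))
    body = gzip.compress(out.getvalue().encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=f"clickstream/part-{chunk_id:04d}.csv.gz", Body=body)

def run(lines: list, chunk_size: int = 100_000) -> None:
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        for future in [pool.submit(process_chunk, i, c) for i, c in enumerate(chunks)]:
            future.result()  # surface any exceptions
```

One caveat with the analogue: because of the GIL, Python threads mostly help the I/O-bound S3 uploads; a closer match to the CPU-bound .NET parallelism would use multiprocessing.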

Version 3:

Apache Hadoop is introduced into the picture, with the MapReduce jobs written in Java to replace the outdated .NET multi-threaded design. The Hadoop cluster is launched and shut down for every run using command-line (CLI) tools.

  1. Read JSON lines into memory, skipping the download. Perform the transformations on the fly using Hadoop, while writing CSVs into S3 in parallel.
  2. Run COPY command in Redshift to copy CSVs from S3 to a Redshift table.

Scheduler: Powershell+SQL Server based scheduling framework. Command line utilities for Hadoop cluster management.

[Figure: big-data-pipeline]

Hadoop MapReduce is implemented in Java. The Java code is written such that it can support multiple current and future versions of the ETL while requiring as little change to the code as possible. This is done via a series of abstract classes. All versions of the ETL start at BaseETL. From there, Version 1, Version 2, etc. are created; these again are abstract. Any changes made to BaseETLV2 will show up in any version that extends it, which allows us to make a general fix to all version 2s or all version 1s. The “V” class is the actual concrete implementation; it has a ParseJSON function that is called when parsing data.

[Figure: class-diagram]
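
The original hierarchy is Java; the sketch below re-expresses the described pattern in Python with abc purely to illustrate the idea. The class names follow the description, while everything else (parse_json standing in for ParseJSON, the run driver, the concrete class) is guessed:

```python
import json
from abc import ABC, abstractmethod

class BaseETL(ABC):
    """Root of the hierarchy: behavior shared by every ETL version."""

    def run(self, raw_lines):
        # Shared driver logic; each version supplies its own parsing.
        return [self.parse_json(line) for line in raw_lines]

    @abstractmethod
    def parse_json(self, line: str) -> dict:
        """Stand-in for the ParseJSON method described above."""

class BaseETLV1(BaseETL, ABC):
    """Fixes applied here reach every ETL that extends version 1."""

class BaseETLV2(BaseETL, ABC):
    """Fixes applied here reach every ETL that extends version 2."""

class ClickstreamETLV2(BaseETLV2):
    """The concrete 'V' class: the actual implementation for one dataset (hypothetical)."""

    def parse_json(self, line: str) -> dict:
        record = json.loads(line)
        # Dataset-specific flattening and cleanup would go here.
        return record
```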

Version 4:

Our web activity data is a collection of events generated from Zillow Group web platforms and mobile apps. Each web platform or mobile app maps to a separate data set in the source system storing the logs each day.

For example, zillow.com, the Zillow iOS app, and the Zillow Android app each have their own unique dataset ID and corresponding source tables, identified using a dataset ID and date. This data is extracted from the source tables into the Zillow Group Data Lake daily for all datasets.

We set ourselves an audacious goal of building a new pipeline that makes the data available in the Data Lake in Parquet format within 30 minutes of its arrival in the source. The biggest dataset is about 2 TB uncompressed for a single day. In addition to speed, other important decisions such as retries, alerting, and auditing had to be included in the design to make the pipeline reliable. We tried different approaches to get this data into the data lake.

We decided to go with the one that is not only faster and simpler but can also be modularized into separate tasks. We did this in Apache Airflow, which is now our primary choice of ETL scheduler. One of the main advantages of this approach is the ability to retry from a failed step as opposed to re-running the entire pipeline.
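
A hedged sketch of what such a modular DAG might look like; the task breakdown, names, and operators are assumptions rather than the team’s actual pipeline:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

def source_data_landed(**context):
    """Placeholder check: has the source dataset for this run date arrived?"""
    return True

def extract_dataset(**context):
    """Placeholder: pull the day's raw data from the source table."""

def convert_to_parquet(**context):
    """Placeholder: convert the extracted data to Parquet in the data lake."""

def audit_row_counts(**context):
    """Placeholder: compare source and destination counts, alert on mismatch."""

with DAG(
    dag_id="clickstream_to_datalake",      # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    wait_for_source = PythonSensor(
        task_id="wait_for_source",
        python_callable=source_data_landed,
        poke_interval=300,                 # check every 5 minutes
        timeout=60 * 60,                   # give up after an hour
    )
    extract = PythonOperator(task_id="extract", python_callable=extract_dataset)
    to_parquet = PythonOperator(task_id="convert_to_parquet", python_callable=convert_to_parquet)
    audit = PythonOperator(task_id="audit", python_callable=audit_row_counts)

    wait_for_source >> extract >> to_parquet >> audit
```

Because each stage is its own task, a failure in (say) the Parquet conversion can be cleared and retried without re-running the extract.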

Case Study 2: Airflow to easily author and manage ETL pipelines

As one of the essential teams serving millions of web and mobile requests for real-estate information, the Data Science and Engineering (DSE) team at Zillow collects, processes, analyzes, and delivers tons of data every day. It is not only the giant data size but also the continually evolving business needs that make ETL jobs super challenging. The team had been striving to find a platform that could make authoring and managing ETL pipelines much easier, and it was in 2016 that we met Airflow.

Why and what is Airflow?: Airflow, an Apache project open-sourced by Airbnb, is a platform to author, schedule and monitor workflows and data pipelines.

Airflow jobs are described as directed acyclic graphs (DAGs), which define pipelines by specifying: what tasks to run, what dependencies they have, the job priority, how often to run, when to start/stop, what to do on job failures/retries, etc.

Typically, Airflow works in a distributed setting, as shown in the diagram below. The Airflow scheduler schedules jobs according to the schedules/dependencies defined in the DAG, and the Airflow workers pick up and run jobs with their loads properly balanced.

[Figure: airflow]

All job information is stored in the meta DB, which is always updated in a timely manner. The users can monitor their jobs via an Airflow web UI as well as the logs.

Case Study 3: Building a data streaming platform

Zillow produces a lot of data! We are a source of information on 110 million homes in the U.S., as well as multifamily rental properties. Both require a ton of data, in terms of ingestion and storage. Zillow also uses external data sources, including Clickstream data from Google Analytics.

Data from databases: When Zillow was originally formed, we used a lot of databases to store the data, with a cache in front to enable fast searching and quick lookups. Later, we standardized on Amazon S3 as our data lake. We had to overcome the challenge of how to get the data from the databases into the data lake. Initially, we used custom Sqoop jobs to extract data straight out of the tables and put it into S3. While this solved the immediate problem of getting data into S3, it also raised some issues:

  • Firstly, since the Sqoop jobs were being written by developers in the Data Science/Engineering org, who had no idea about the semantics of the tables or how they fit into the product, they had to constantly keep up with changes in the schema; there was basically no way to handle schema evolution.
  • Secondly, the schema of the exported data closely followed the schema of the database (DB), and the DB schema was not necessarily optimized for Data Science/Machine Learning applications.
  • Since the Sqoop export jobs ran every day, and sometimes impacted live, site-facing databases, the DBAs had to create special read-only replicas of those databases. This required more maintenance and overhead. Some of those databases could not be replicated easily, forcing us to read from one-day-old database snapshots.

[Figure: datalake]

Writing directly to the Data Lake: Some product teams wrote code to write data directly to S3. While this empowered product teams to send data directly, it also meant that there was no enforcement of the schema. Some teams wrote JSON, while others wrote text or CSV files. The structures of the files were defined by each team and were inconsistent. When a schema changed, there were no consistent rules for backfilling the historical data.

This also required teams to create and manage their own AWS resources. If they were writing directly to S3, they needed to create appropriate roles and credentials. If they were using Firehose to write to S3, they had to create a Firehose stream in addition to credentials.

Finally, people were not aware of governance and lifecycle policies for the data. For example, if the data contained PII, it was supposed to be encrypted. Likewise, raw data and processed data were supposed to have different lifecycle policies. Teams writing directly to S3 were frequently not aware of these policies.

[Figure: datalake2]

Data Streaming Platform: In order to solve the problem of getting data to the data lake in a consistent form, we developed a streaming platform as a service. Our goal was to build stream processing as a service to power real-time analytics and machine learning applications. The key tenets of this architecture are as follows:

Kafka data pipelines

When people discuss building data pipelines using Apache Kafka, they are usually referring to a couple of use cases.

  • The first is building a data pipeline where Apache Kafka is one of the two endpoints. For example, getting data from Kafka to S3 or getting data from MongoDB into Kafka.
  • The second use case involves building a pipeline between two different systems but using Kafka as an intermediary. An example of this is getting data from Twitter to Elasticsearch by sending the data first from Twitter to Kafka and then from Kafka to Elasticsearch.

How does Kafka fit into the data pipeline?: The main value Kafka provides to data pipelines is its ability to serve as a very large, reliable buffer between various stages in the pipeline, effectively decoupling producers and consumers of data within the pipeline. This decoupling, combined with reliability, security, and efficiency, makes Kafka a good fit for most data pipelines.
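
A minimal sketch of that decoupling with the kafka-python client; the broker address, topic, and consumer group are placeholders, and the producer and consumer would run as independent processes that never talk to each other directly:

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # kafka-python client (one of several options)

# Producer side: the web tier publishes events and moves on; it never needs
# to know who consumes them or how fast they are consumed.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream-events", {"user_id": "u1", "page": "/homes"})
producer.flush()

# Consumer side: a separate process (e.g. the S3 sink) reads at its own pace;
# Kafka buffers the backlog if the consumer falls behind.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="broker:9092",
    group_id="s3-sink",                                    # placeholder consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # in practice: batch the records and write them to S3
```

In production, the Kafka-to-S3 leg is often handled by a Kafka Connect sink connector rather than a hand-written consumer.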

How do organizations look at Kafka?: Some organizations think of Kafka as an end point of a pipeline. They look at problems such as “How do I get data from Kafka to Elastic?”. This is a valid question to ask—especially if there is data you need in Elastic and it is currently in Kafka.

Additional Questions:

Sample Questions:

  • Can you tell us about a problem that you had to solve where you had to wrestle with data to get to the answer?
    • Wrestle with the data.
      • NC school data
      • IMDB scrape the data
      • Gutenberg - various headings; each document had its own header and footer, so I had to write code to iteratively strip them. I was building an NLP pipeline to clean text and scrape text from a website.
      • Data cleaning pipeline.
      • what was the question you were trying to answer using the data?
  • What’s the most complex and/or largest dataset that you’ve had to deal with? *
  • Do you have any experience working with messy data?
    • yes
  • What’s the largest scale that your code has operated in, in any production environment?
    • What role did/your code play?
      • Designed and developed the Cert Tool which.
  • Can you tell us something you learned only through the experience of having built/shipped it? *
  • What are you most proud of having built?
    • Why are you proud of it?
    • Was it challenging? Why was it?
  • Are you familiar with SQL? How would you describe your proficiency?
  • Are you familiar with Python? What other languages?
    • What experience do you have with building ETL pipelines?

Culture:

They look for someone who exhibits intellectual humility, grit, and a desire to learn and grow while making an impact. Someone who has demonstrated the ability to reflect on mistakes they’ve made and how they’ve learned from them, as well as a person who doesn’t take all of the credit when working on teams - you acknowledge when you’ve gotten help.

Sample Culture Questions:

  • What is the most difficult technical challenge you’ve overcome?
  • Have you ever realized halfway through a project that your approach was completely wrong? What did you do when you realized it?
  • What are you hoping to learn in the next 1-2 years in your next role?
  • Is there anything technically you feel weak on that you’d like to improve?
    • It’s always good to have something to improve upon here, no one is perfect.

Do you have any questions?