When we talk about distributed computing, we generally refer to a large computational job executed across a cluster of nodes. Each node is responsible for a set of operations on a subset of the data, and at the end the partial results are combined to get the final answer. But how do the nodes know which task to run, and in what order? Are all nodes equal? And which machine are you interacting with when you run your code?
Most computational frameworks are organized into a master-worker hierarchy: a master node orchestrates the tasks across the cluster, while the worker nodes perform the actual computations. There are four different modes in which you can set up Spark.
- Local Mode: In this case, everything happens on a single machine. So, while we use Spark’s APIs, we don’t really do any distributed computing. Local mode is useful for learning the syntax and for prototyping your project. In most of the notebooks I use, I will run operations in local mode to become comfortable with Spark syntax.
The other three modes are distributed and require a cluster manager. The cluster manager is a separate process that monitors the available resources and makes sure that all machines are responsive during the job. There are three different options for cluster managers:
- Standalone: Spark’s own built-in cluster manager
- YARN: the resource manager from the Hadoop project
- Mesos: an open-source cluster manager originally from UC Berkeley
YARN and Mesos are useful when you are sharing a cluster with a team.
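In practice, the mode is usually selected through the `--master` option of `spark-submit`. A quick sketch of the four variants (the host names, ports, and the script name `my_job.py` are illustrative placeholders; your values will differ):

```shell
# Local mode: run everything on this machine, one thread per core.
spark-submit --master "local[*]" my_job.py

# Standalone mode: connect to Spark's own cluster manager.
spark-submit --master spark://master-host:7077 my_job.py

# YARN: let the Hadoop resource manager allocate executors.
spark-submit --master yarn my_job.py

# Mesos: connect to a Mesos master.
spark-submit --master mesos://mesos-host:5050 my_job.py
```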
In the subsequent posts, we will set up and use our own distributed Spark cluster using Standalone mode. In Spark Standalone we also have a so-called driver process. If you open a Spark shell (either Python or Scala), you are interacting directly with the driver program. It acts as the master and is responsible for scheduling the tasks that the executors will perform.
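This division of labor, a driver splitting work into tasks and executors computing partial results, can be sketched in plain Python with the standard `multiprocessing` module. This is a toy stand-in for the idea, not Spark’s actual scheduler:

```python
from multiprocessing import Pool

def work(partition):
    """An 'executor' task: process one partition of the data."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(100))
    # The 'driver' splits the data into partitions and schedules tasks.
    partitions = [data[i:i + 25] for i in range(0, len(data), 25)]
    with Pool(processes=4) as pool:  # four 'executors'
        partial_results = pool.map(work, partitions)
    # The driver combines the partial results into the final answer.
    total = sum(partial_results)
    print(total)  # 4950, the same as sum(range(100))
```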
Spark Use Cases
Let’s say you want to build a dashboard on top of big data to support a team of analysts. This usually starts with extracting and transforming the data. The final step is to load the results into a database, where a data visualization tool can quickly retrieve them to build an interactive dashboard. This is what we refer to as the ETL (Extract, Transform, Load) process. It is so common that it has become a verb in the industry: ETLing data is the bread and butter of systems like Spark, and is an essential skill for anyone working with big data.
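A minimal sketch of the ETL pattern, using only the Python standard library rather than Spark, and with made-up data and table names:

```python
import csv
import io
import sqlite3

# Extract: read raw sales records (an in-memory CSV here for brevity).
raw = io.StringIO("region,amount\nnorth,10\nsouth,5\nnorth,7\n")
rows = list(csv.DictReader(raw))

# Transform: aggregate revenue per region.
totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

# Load: write the results into a database the dashboard can query quickly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT, total INTEGER)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)", totals.items())
result = dict(conn.execute("SELECT region, total FROM revenue"))
print(result)  # {'north': 17, 'south': 5}
```

With Spark, each of the three stages would run distributed across the cluster instead of in a single process, but the shape of the pipeline is the same.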
The second use case for Spark is training machine learning models on big data. Spark is particularly useful for iterative algorithms, like logistic regression or PageRank, which repeat calculations on the same data over and over with slightly different parameters. Spark is designed to keep this data in memory, which considerably speeds up training. Spark’s strength at these two use cases, general-purpose big data analytics and machine learning, is what makes it king of the big data ecosystem. It achieves this by using the distributed processing techniques of Hadoop MapReduce, but with a more efficient use of memory.
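To see why keeping the data in memory matters, here is a toy logistic-regression-style training loop in plain Python (not Spark’s MLlib implementation; the data and learning rate are made up). The key observation is that every iteration scans the same data set, so re-reading it from disk on each pass would dominate the runtime:

```python
import math

# Tiny (feature, label) data set, held in memory and reused every pass.
data = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(200):  # many passes over the SAME data
    grad_w = grad_b = 0.0
    for x, y in data:
        pred = 1 / (1 + math.exp(-(w * x + b)))
        grad_w += (pred - y) * x
        grad_b += (pred - y)
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)

# After training, points well above the boundary are classified as 1.
print(1 / (1 + math.exp(-(w * 3.0 + b))) > 0.5)  # True
```

Spark caches the distributed data set in the cluster’s collective memory (e.g. via `cache()` or `persist()`), so each of those 200 passes reads from RAM rather than disk.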
The ecosystem includes further components: Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications, and GraphX is Apache Spark’s API for graphs and graph-parallel computation.
You Don’t Always Need Spark
Spark is meant for big data sets that cannot fit on one computer. You don’t need Spark if you are working with smaller data sets. For data sets that fit on your local machine, there are many other tools you can use to manipulate data, such as:
- AWK - a command-line tool for manipulating text files
- R - a programming language and software environment for statistical computing
- Python PyData Stack, which includes pandas, Matplotlib, NumPy, and scikit-learn among other libraries
Sometimes you can still use pandas on a single, local machine even if your data set is a little larger than memory: pandas can read data in chunks. Depending on your use case, you can filter each chunk and write only the relevant parts out to disk.
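A short sketch of that chunked-filtering pattern with pandas’ `read_csv(chunksize=...)` (the data here is generated in memory for illustration; in practice you would read from and write to files on disk):

```python
import io
import pandas as pd

# A small CSV standing in for a file slightly larger than memory.
csv_data = io.StringIO(
    "user,score\n" + "\n".join(f"u{i},{i % 10}" for i in range(1000))
)

# Read 100 rows at a time, keep only the relevant part of each chunk;
# a real pipeline would append each filtered chunk to a file on disk.
high_scores = []
for chunk in pd.read_csv(csv_data, chunksize=100):
    high_scores.append(chunk[chunk["score"] >= 8])

filtered = pd.concat(high_scores)
print(len(filtered))  # 200 rows: scores 8 and 9 from each block of 10
```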
If the data is already stored in a relational database such as MySQL or Postgres, you can leverage SQL to extract, filter and aggregate the data. If you would like to leverage pandas and SQL simultaneously, you can use libraries such as SQLAlchemy, which provides an abstraction layer to manipulate SQL tables with generative Python expressions.
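As a small illustration of pushing the work into the database, here is the same idea with the standard library’s `sqlite3` module standing in for MySQL or Postgres (table and column names are made up):

```python
import sqlite3

# SQLite standing in for MySQL/Postgres: do the filtering and
# aggregation in the database instead of pulling all rows into Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 30.0), ("bob", 45.0)],
)

# Only the aggregated result comes back, not the raw rows.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING total > 25
    ORDER BY customer
"""
print(conn.execute(query).fetchall())  # [('alice', 50.0), ('bob', 50.0)]
```

SQLAlchemy would let you build the same query with Python expressions instead of a raw SQL string, which is convenient when combining it with pandas.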
The most commonly used Python machine learning library is scikit-learn. It has a wide range of algorithms for classification, regression, and clustering, as well as utilities for preprocessing data, tuning model parameters, and evaluating results. However, if you want to use more complex algorithms, like deep learning, you’ll need to look further; TensorFlow and PyTorch are currently popular packages.
Spark has some limitations.

Spark Streaming’s latency is at least 500 milliseconds, since it operates on micro-batches of records instead of processing one record at a time. Native streaming tools such as Flink and Apex can push this latency down and might be more suitable for low-latency applications. These tools can be used for batch computation as well, so if you’re already using one of them for stream processing, there’s no need to add Spark to your stack of technologies.
Another limitation of Spark is its selection of machine learning algorithms. Currently, Spark only supports algorithms that scale linearly with the input data size. In general, deep learning is not available either, though there are many projects that integrate Spark with TensorFlow and other deep learning tools.
Hadoop versus Spark
The Hadoop ecosystem is a slightly older technology than the Spark ecosystem. In general, Hadoop MapReduce is slower than Spark because Hadoop writes data out to disk during intermediate steps. However, many big companies, such as Facebook and LinkedIn, started using Big Data early and built their infrastructure around the Hadoop ecosystem.
While Spark is great for iterative algorithms, there is not much of a performance boost over Hadoop MapReduce when doing simple counting. Migrating legacy code to Spark, especially across hundreds of nodes that are already in production, might not be worth the cost for a small performance boost.
Beyond Spark for Storing and Processing Big Data
Keep in mind that Spark is not a data storage system, and there are a number of tools besides Spark that can be used to process and analyze large datasets.
Sometimes it makes sense to use the power and simplicity of SQL on big data. For these cases, a new class of databases, known as NoSQL and NewSQL, has been developed.
For example, you might hear about newer database storage systems like HBase or Cassandra. There are also distributed SQL engines like Impala and Presto. Many of these technologies use query syntax that you are likely already familiar with based on your experiences with Python and SQL.
In the coming posts you will learn about Spark specifically, but know that many of the skills you already have with SQL, Python, and soon enough, Spark, will also be useful if you end up needing to learn any of these additional big data tools.