Spark is built for large datasets that require a distributed system of multiple computers, and it is optimized to use memory efficiently. In this post, we will learn how Spark uses functional programming and why procedural programming is not well suited to distributed systems.
Manipulating data using Functional Programming
In my experience, one of the hardest parts of learning Spark is becoming familiar with the functional style of programming. Under the hood, Spark is written in a functional programming language called Scala. When you program in a functional language, you end up solving problems in a pretty different way than you would in a general-purpose language like Python. Although Spark is written in Scala, you can use it with other languages, like Java, R, or Python. Here we will be using the Python programming interface, or PySpark for short.
Even when you are using the PySpark API, you will see the functional programming influence of Scala. For example, in the previous post, we counted the number of occurrences of each song in the songs.txt file using the MapReduce programming paradigm. That code went through each record and emitted a tuple with the name of the song and the number 1. These tuples were then shuffled and reduced to sum up the ones for each song name. This is an example of a functional approach to counting songs.
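The same idea can be sketched in plain Python. This is not Spark's implementation, just a minimal illustration of the map and reduce steps, using a made-up song list:

```python
from collections import defaultdict

log_of_songs = ["Despacito", "Havana", "Despacito"]

# Map step: emit a (song, 1) tuple for every record.
mapped = [(song, 1) for song in log_of_songs]

# Shuffle/reduce step: group the tuples by song name and sum up the ones.
play_counts = defaultdict(int)
for song, one in mapped:
    play_counts[song] += one

print(dict(play_counts))  # {'Despacito': 2, 'Havana': 1}
```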
Now imagine doing this with a for-loop in Python using the procedural style. You would start with a counter variable to keep track of the play count for each song. Then you would iterate through all the songs and increment the counter by 1 whenever the song name matched.
If you want to use Spark effectively, then you will have to go beyond the procedural style, and get comfortable with a few tools from functional programming.
Spark uses functional programming, but Python is not a functional programming language. How is it possible that Spark programs can be written in Python? The answer is that the PySpark API is written with functional programming principles in mind.
Why use functional programming?
Before we dive into specific techniques, let’s unpack why Spark uses functional programming in the first place. The core reason is that functional programming is well suited to distributed systems. We know that a distributed system is a bunch of computers working together. Another useful and amusing description of distributed systems comes from the computer science pioneer Leslie Lamport: “You know you have a distributed system when your computer crashes because someone you didn’t know about made a mistake.” Functional programming helps minimize the sorts of mistakes that can cripple an entire distributed system.
Functional programming gets its name from the functions you saw in algebra class. These functions are more strict than your average Python function, because in math, a function can only give you one answer for a given input. Python, on the other hand, lets you write flexible, albeit complex, functions whose output depends on the input and on other state. Also, in math class, when you evaluate a function, you never change the inputs of that function. As weird as it seems, that is exactly what can happen in Python.
Example to illustrate why we should use functional programming
Here’s a concrete example of a potential issue with procedural programming in Python. The code shown below counts up the number of times a specific song was played, assuming the log of songs has been converted to a Python list.
We start by setting a global variable, play_count, to keep track of the number of times the song was played.
The code counts the number of times a song appears in the log_of_songs variable. You’ll notice that the first time you run count_plays("Despacito"), you get the correct count. However, when you run count_plays("Despacito") again, the result is no longer correct. This is because the global variable play_count stores the result outside of the function.
```python
log_of_songs = [
    "Despacito",
    "Nice for what",
    "No tears left to cry",
    "Despacito",
    "Havana",
    "In my feelings",
    "Nice for what",
    "Despacito",
    "All the stars",
]

play_count = 0

def count_plays(song_title):
    global play_count
    for song in log_of_songs:
        if song == song_title:
            play_count = play_count + 1
    return play_count
```
The second time we run this, we get a cumulative result, which is not right. In the distributed world, we sometimes need to recalculate things, so when we re-run the function with the same input, ‘Despacito’, we should get the same result. Here we don’t.
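Calling the function twice in a row makes the bug concrete (a small self-contained repro of the code above):

```python
log_of_songs = ["Despacito", "Nice for what", "No tears left to cry",
                "Despacito", "Havana", "In my feelings",
                "Nice for what", "Despacito", "All the stars"]

play_count = 0

def count_plays(song_title):
    global play_count
    for song in log_of_songs:
        if song == song_title:
            play_count = play_count + 1
    return play_count

first = count_plays("Despacito")   # 3 -- correct
second = count_plays("Despacito")  # 6 -- wrong: the global kept its old value
```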
How might you solve this issue?
So you get rid of the global variable and instead pass play_count as an input to the function. Does this change anything?
```python
play_count = 0

# play_count is not a global variable anymore
def count_plays(song_title, play_count):
    for song in log_of_songs:
        if song == song_title:
            play_count = play_count + 1
    return play_count
```
So this appears to solve the problem of getting the same output given the same input.
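A quick check confirms that the function is now repeatable, with the same input always yielding the same output:

```python
log_of_songs = ["Despacito", "Nice for what", "No tears left to cry",
                "Despacito", "Havana", "In my feelings",
                "Nice for what", "Despacito", "All the stars"]

def count_plays(song_title, play_count):
    for song in log_of_songs:
        if song == song_title:
            play_count = play_count + 1
    return play_count

# No matter how many times we call it, the answer is the same.
print(count_plays("Despacito", 0))  # 3
print(count_plays("Despacito", 0))  # still 3
```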
How would this work with parallel programming?
Spark splits up data onto multiple machines. If your songs list were split onto two machines, Machine A would first need to finish counting and return its result to Machine B. Then Machine B could add its own count to Machine A’s output. However, that isn’t parallel computing: Machine B would have to wait until Machine A finishes. You’ll see in the next posts how Spark solves this issue with the functional programming paradigm.
In Spark, if your data is split onto two different machines, machine A will run a function to count how many times ‘Despacito’ appears on machine A, while machine B simultaneously runs a function to count how many times ‘Despacito’ appears on machine B. After they finish counting individually, they combine their results.
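Spark’s actual machinery is more involved, but the divide-and-combine idea can be sketched in plain Python by splitting the list into two “machines” by hand:

```python
log_of_songs = ["Despacito", "Nice for what", "No tears left to cry",
                "Despacito", "Havana", "In my feelings",
                "Nice for what", "Despacito", "All the stars"]

def count_plays(song_title, songs):
    return sum(1 for song in songs if song == song_title)

# Pretend the log is split across two machines.
machine_a = log_of_songs[:5]   # first half of the data
machine_b = log_of_songs[5:]   # second half of the data

# Each machine counts its own partition independently
# (in Spark, these two calls would run in parallel) ...
partial_a = count_plays("Despacito", machine_a)  # 2
partial_b = count_plays("Despacito", machine_b)  # 1

# ... and then the partial results are combined.
total = partial_a + partial_b  # 3
```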
Pure Functions analogy
Imagine your program is a bread factory, and your function is a specific machine in the factory that makes sourdough bread. The inputs to your function are the ingredients fed into the machine, and every time you run it, it gives you a perfect loaf (your output).
If you were just making some bread in your kitchen, you could change the ingredients slightly, as long as you followed the general recipe. But since your factory needs to mass-produce loaves at a large scale, you need to be a bit more careful and think about the design of your bread-maker (your machine/function). Your bread-maker should be designed so that you can build several copies of it, and these copies can all run smoothly across the factory.
One thing you will need to avoid when designing your bread-maker is unintended side effects. After each day of making loaves, your machine needs to leave the factory in exactly the same state it found it. If it doesn’t, the machines can start interfering with each other. For example, if one faulty bread-maker raises the room temperature in the factory by 1 degree while it runs, it will mess up the other bread-makers. You would end up with a lot of burnt bread just because one machine was faulty. This is what that quote by Leslie Lamport means.
In distributed systems, your functions shouldn’t have side effects on variables (the room temperature, in this case) outside their scope, since these could interfere with the other functions running in your cluster (factory). Another potential issue is contamination of your original ingredients. If you have ever made sourdough bread, you know that you need a mixture of flour, water, and wild yeast known as the starter. Some sourdough bakeries have maintained their starters for decades, and they protect them carefully. Your bread-making machine needs to use these ingredients without ruining them, since the other bread-makers need them too. Bakeries do this with a mother dough that can produce copies of the starter, and they are very careful with this parent dough. In distributed systems, you likewise need to design your functions carefully: whenever a function runs on some input data, it must not alter that data in the process.
If your bread-making machine protects its input ingredients and doesn’t cause any side effects, you will have a smooth, clean operation. Similarly, functions that preserve their inputs and avoid side effects are called pure functions. With pure functions, your Spark code will work well at the scale of big data. Just like bread companies make copies of the starter from their mother dough, every Spark function makes a copy of its input data and never changes the original parent data. Because Spark doesn’t change or mutate its input data, that data is said to be immutable.
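In plain Python terms, a pure function reads its input but returns a new value instead of modifying the input in place. A minimal sketch (the function name and song list are made up for illustration):

```python
def lowercase_titles(songs):
    # Pure: builds and returns a NEW list; the input list is left
    # untouched, just as Spark never mutates its parent data.
    return [song.lower() for song in songs]

log_of_songs = ["Despacito", "Havana"]
cleaned = lowercase_titles(log_of_songs)

print(cleaned)       # ['despacito', 'havana']
print(log_of_songs)  # ['Despacito', 'Havana'] -- unchanged
```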
This all makes sense when you have just one function. But what happens when you have lots of functions in your program?
There are usually many steps when we wrangle data, just as there are many steps in baking bread. In Spark, you chain together multiple functions that each accomplish a small chunk of the work. You will often see a function composed of multiple sub-functions, and for this big function to be pure, each sub-function also needs to be pure. It would seem, then, that Spark needs to make a copy of the input data for each sub-function.
If that were the case, your Spark program would run out of memory very quickly. Fortunately, Spark avoids this by using a functional programming concept called lazy evaluation. Before Spark does anything with the data in your program, it first builds step-by-step directions of which functions and data it will need. These directions are like the recipe for your bread; in Spark, they are called a DAG (directed acyclic graph). Once Spark builds the DAG from your code, it procrastinates, waiting until the last possible moment to actually touch the data.
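Spark’s DAG machinery is its own, but Python generators offer a rough analogy for lazy evaluation: describing the work is separate from doing it. A small sketch (the function and song names here are made up):

```python
events = []

def lazy_lowercase(songs):
    # A generator: each song is processed only when someone asks for it.
    for song in songs:
        events.append(song)       # record when the work actually happens
        yield song.lower()

titles = lazy_lowercase(["Despacito", "Havana"])
# Nothing has run yet -- like a Spark transformation that only extends the DAG.
print(events)        # []
# Consuming the generator -- like a Spark action -- finally triggers the work.
print(list(titles))  # ['despacito', 'havana']
print(events)        # ['Despacito', 'Havana']
```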
We learned why functional programming is needed and the rationale behind pure functions. Lastly, we saw how Spark builds a DAG when we have multiple functions in our program. Next, we will get familiar with functional programming concepts that you will encounter every day in Spark, such as lambda functions and closures. Then we will see how to read in and write out data using common formats like JSON. We will also get comfortable with the Spark environment and learn how to wrangle data with the two most popular Spark APIs: Spark DataFrames and Spark SQL. We will first practice using these APIs on smaller datasets before scaling up to big data. Finally, we will touch on the low-level RDD API, which we will run into from time to time, even when using high-level APIs like DataFrames and Spark SQL.