Data driven decision making part 1
Data-driven decision making using SQL
Get the data
We will be using the MoviesNow
database which can be found here MoviesNow.sql. First launch pgAdmin
and create a new database called moviesnow
. Right click and open Query Tool
and then run the MoviesNow.sql
file. This will create the tables and insert records into them.
%load_ext sql
%sql postgresql://shravan:shravan@127.0.0.1/moviesnow
'Connected: shravan@moviesnow'
%%sql
-- Query the right table in information_schema
SELECT table_name
FROM information_schema.tables
-- Specify the correct table_schema value
WHERE table_schema = 'public';
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
table_name |
---|
movies |
actors |
actsin |
customers |
renting |
Here are the tables in this database
EDA customers table
Let’s first explore the customers table by summarizing customer information. In most business decisions customers are analyzed in groups, such as customers per country or customers per age group. Here’s a peek into the first 5 rows of the customers table.
%%sql
SELECT * FROM customers LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
customer_id | name | country | gender | date_of_birth | date_account_start |
---|---|---|---|---|---|
2 | Wolfgang Ackermann | Austria | male | 1971-11-17 | 2018-10-15 |
3 | Daniela Herzog | Austria | female | 1974-08-07 | 2019-02-14 |
4 | Julia Jung | Austria | female | 1991-01-04 | 2017-11-22 |
5 | Juliane Kirsch | Austria | female | 1977-03-01 | 2018-12-16 |
6 | Rowanne Couperus | Belgium | female | 1994-04-05 | 2018-08-26 |
Count the number of customers born in 80s
%%sql
SELECT
COUNT(customer_id) as num_customers
FROM customers
WHERE date_of_birth BETWEEN '1980-01-01' AND '1990-01-01';
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
num_customers |
---|
33 |
Count the number of customers from Germany
%%sql
SELECT
COUNT(customer_id) as num_customers
FROM customers
WHERE country = 'Germany';
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
num_customers |
---|
0 |
Count the number of countries where MovieNow has customers.
%%sql
SELECT
COUNT(DISTINCT country)
FROM customers;
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
count |
---|
11 |
You aggregated data from the customers
table and figured out that MovieNow has customers in 11 countries and that there are no customers from Germany.
Ratings for a movie
The movie ratings give us insight into the preferences of our customers. Let’s first take a look at the columns and their types.
%%sql
-- Get the column name and data type
SELECT
column_name,
data_type
-- From the system database information schema
FROM INFORMATION_SCHEMA.COLUMNS
-- For the customer table
WHERE table_name = 'renting';
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
column_name | data_type |
---|---|
renting_id | integer |
customer_id | integer |
movie_id | integer |
rating | integer |
date_renting | date |
%%sql
SELECT * FROM renting LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
renting_id | customer_id | movie_id | rating | date_renting |
---|---|---|---|---|
1 | 41 | 8 | None | 2018-10-09 |
2 | 10 | 29 | 10 | 2017-03-01 |
3 | 108 | 45 | 4 | 2018-06-08 |
4 | 39 | 66 | 8 | 2018-10-22 |
5 | 104 | 15 | 7 | 2019-03-18 |
Report summary statistics, such as the minimum, maximum, average, and count, of ratings for the movie with ID 25.
%%sql
SELECT
MIN(rating) as min_rating, -- Calculate the minimum rating and use alias min_rating
MAX(rating) as max_rating, -- Calculate the maximum rating and use alias max_rating
AVG(rating) as avg_rating, -- Calculate the average rating and use alias avg_rating
COUNT(rating) as number_ratings -- Count the number of ratings and use alias number_ratings
FROM renting
WHERE movie_id = 25; -- Select all records of the movie with ID 25
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
min_rating | max_rating | avg_rating | number_ratings |
---|---|---|---|
5 | 10 | 7.5000000000000000 | 8 |
Example: Examining annual rentals
You are asked to provide a report about the development of the company. Specifically, your manager is interested in the total number of movie rentals, the total number of ratings and the average rating of all movies since the beginning of 2019.
%%sql
SELECT
COUNT(*) AS number_renting,
AVG(rating) AS average_rating,
COUNT(rating) AS number_ratings -- Add the total number of ratings here.
FROM renting
WHERE date_renting >= '2019-01-01';
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
number_renting | average_rating | number_ratings |
---|---|---|
159 | 7.9462365591397849 | 93 |
You managed to extract all KPIs from the renting records by working with aggregations, NULL values and dates.
Analyzing Groups
Using the GROUP BY
clause allows us to apply aggregations to groups in a table. To evaluate the success and the potential of a company, it is often desirable to look at groups of customers or groups of products or jointly look at multiple groups. For example, we could be interested in grouping customers by country or gender OR in grouping movies by genre or year of release.
For example, we want to find out the average renting price by genre. First, lets gain an understanding of the different types of genres that we have. A simple group by will reveal the distinct groups.
%%sql
SELECT
genre
FROM movies
GROUP BY genre;
* postgresql://shravan:***@127.0.0.1/moviesnow
8 rows affected.
genre |
---|
Animation |
Other |
Drama |
Art House & International |
Mystery & Suspense |
Comedy |
Action & Adventure |
Science Fiction & Fantasy |
This is equivalent to running a Distinct.
%%sql
SELECT
DISTINCT genre
FROM movies
* postgresql://shravan:***@127.0.0.1/moviesnow
8 rows affected.
genre |
---|
Animation |
Other |
Drama |
Art House & International |
Mystery & Suspense |
Comedy |
Action & Adventure |
Science Fiction & Fantasy |
A GROUP BY
by itself is that is not much of value. Shown below is the visual of how GROUP BY genre
clause creates groups based on the genre. We can then apply any aggregation function to each of these sub-tables.
Shown below is the query that is extended to show the number of movies in each category in addition to showing the Average renting price for that genre.
%%sql
SELECT
genre,
AVG(renting_price) as avg_price,
COUNT(*) as number_movies
FROM movies
GROUP BY genre;
* postgresql://shravan:***@127.0.0.1/moviesnow
8 rows affected.
genre | avg_price | number_movies |
---|---|---|
Animation | 1.7900000000000000 | 2 |
Other | 2.5900000000000000 | 2 |
Drama | 2.1538888888888889 | 36 |
Art House & International | 2.5900000000000000 | 1 |
Mystery & Suspense | 2.1733333333333333 | 6 |
Comedy | 2.2900000000000000 | 10 |
Action & Adventure | 2.4566666666666667 | 3 |
Science Fiction & Fantasy | 2.2445454545454545 | 11 |
The HAVING clause can be added to the GROUP BY statement to further filter the groups based on some criteria. It is only used in conjuction with the group by clause and it CAN include aggregate functions such as sum, count, or avg. For example, we want to keep only those genres that have more than 2 movies.
%%sql
SELECT
genre,
AVG(renting_price) as avg_price,
COUNT(*) as number_movies
FROM movies
GROUP BY genre
HAVING COUNT(*) > 2;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
genre | avg_price | number_movies |
---|---|---|
Drama | 2.1538888888888889 | 36 |
Mystery & Suspense | 2.1733333333333333 | 6 |
Comedy | 2.2900000000000000 | 10 |
Action & Adventure | 2.4566666666666667 | 3 |
Science Fiction & Fantasy | 2.2445454545454545 | 11 |
Example: First account for each country
Conduct an analysis to see when the first customer accounts were created for each country. Create a table with a row for each country and columns for the country name and the date when the first customer account was created.
%%sql
SELECT country, -- For each country report the earliest date when an account was created
MIN(date_account_start) AS first_account
FROM customers
GROUP BY country
ORDER BY first_account;
* postgresql://shravan:***@127.0.0.1/moviesnow
11 rows affected.
country | first_account |
---|---|
France | 2017-01-13 |
Hungary | 2017-01-18 |
Belgium | 2017-01-28 |
Slovenia | 2017-01-31 |
Spain | 2017-02-14 |
Italy | 2017-02-28 |
Poland | 2017-03-03 |
Great Britan | 2017-03-31 |
Denmark | 2017-04-30 |
USA | 2017-09-13 |
Austria | 2017-11-22 |
You learned that the first customer account was created in France.
Example: Average movie ratings
For each movie the average rating, the number of ratings and the number of views has to be reported. Generate a table with meaningful column names.
%%sql
SELECT movie_id,
AVG(rating) AS avg_rating,
COUNT(rating) AS number_ratings,
COUNT(*) AS number_renting
FROM renting
GROUP BY movie_id
ORDER BY avg_rating desc -- Order by average rating in decreasing order
LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
movie_id | avg_rating | number_ratings | number_renting |
---|---|---|---|
11 | None | 0 | 6 |
48 | 10.0000000000000000 | 2 | 5 |
5 | 9.7500000000000000 | 4 | 8 |
42 | 9.6000000000000000 | 5 | 10 |
63 | 9.3333333333333333 | 6 | 11 |
NOTE: The average is null only if all values are null. If only one/two values are NULL, then avg still returns the average of non-null values. It appears that for movie_id 11 all of the ratings were NULL. This is evident by the number_ratings
column which is zero for this movie.
Example: Average rating per customer
Similar to what you just did, you will now look at the average movie ratings, this time for customers. So you will obtain a table with the average rating given by each customer. Further, you will include the number of ratings and the number of movie rentals per customer. You will report these summary statistics only for customers with more than 7 movie rentals and order them in ascending order by the average rating.
%%sql
SELECT customer_id, -- Report the customer_id
AVG(rating), -- Report the average rating per customer
COUNT(rating), -- Report the number of ratings per customer
COUNT(*) -- Report the number of movie rentals per customer
FROM renting
GROUP BY customer_id
HAVING COUNT(*) > 7 -- Select only customers with more than 7 movie rentals
ORDER BY 1 -- Order by the average rating in ascending order
LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
customer_id | avg | count | count_1 |
---|---|---|---|
7 | 9.0000000000000000 | 4 | 9 |
8 | 9.2500000000000000 | 4 | 8 |
21 | 7.3333333333333333 | 6 | 14 |
25 | 7.2000000000000000 | 5 | 10 |
26 | 8.4000000000000000 | 5 | 10 |
Joining
Join queries allow us to combine - or join - data from multiple tables. There are many types of JOINs in SQL, which you should be already aware off. In this post, I will focus on LEFT JOIN. The LEFT JOIN statement is one of the OUTER JOINs that is used to augment one table - the table on the left, with information from another table - the table on the right. We keep all rows from the LEFT table and bring only matching rows from the right table. An identifier is needed to define which rows of the two tables can be matched.
Example Join renting with customers
For many analyses it is necessary to add customer information to the data in the table renting. Augment the table renting
with all columns from the table customers
with a LEFT JOIN.
%%sql
SELECT * -- Join renting with customers
FROM renting as r
LEFT JOIN customers as c
ON r.customer_id = c.customer_id
LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
renting_id | customer_id | movie_id | rating | date_renting | customer_id_1 | name | country | gender | date_of_birth | date_account_start |
---|---|---|---|---|---|---|---|---|---|---|
1 | 41 | 8 | None | 2018-10-09 | 41 | Zara Mitchell | Great Britan | female | 1994-07-08 | 2017-06-12 |
2 | 10 | 29 | 10 | 2017-03-01 | 10 | Arnout Veenhuis | Belgium | male | 1984-07-26 | 2017-01-28 |
3 | 108 | 45 | 4 | 2018-06-08 | 108 | Saúl Tafoya Meraz | Spain | male | 1992-05-15 | 2017-03-13 |
4 | 39 | 66 | 8 | 2018-10-22 | 39 | Amy Haynes | Great Britan | female | 1975-07-28 | 2018-01-19 |
5 | 104 | 15 | 7 | 2019-03-18 | 104 | Dimitri Zambrano Morales | Spain | male | 1992-06-21 | 2017-04-12 |
Select only records from customers coming from Belgium.
%%sql
SELECT * -- Join renting with customers
FROM renting as r
LEFT JOIN customers as c
ON r.customer_id = c.customer_id
WHERE c.country='Belgium'
LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
renting_id | customer_id | movie_id | rating | date_renting | customer_id_1 | name | country | gender | date_of_birth | date_account_start |
---|---|---|---|---|---|---|---|---|---|---|
2 | 10 | 29 | 10 | 2017-03-01 | 10 | Arnout Veenhuis | Belgium | male | 1984-07-26 | 2017-01-28 |
14 | 8 | 29 | None | 2018-08-03 | 8 | Jaëla van den Dolder | Belgium | female | 1990-08-31 | 2018-02-08 |
27 | 7 | 36 | None | 2019-03-14 | 7 | Annelous Sneep | Belgium | female | 1993-11-14 | 2018-05-12 |
32 | 8 | 42 | 10 | 2019-02-13 | 8 | Jaëla van den Dolder | Belgium | female | 1990-08-31 | 2018-02-08 |
70 | 10 | 44 | None | 2017-12-17 | 10 | Arnout Veenhuis | Belgium | male | 1984-07-26 | 2017-01-28 |
Average ratings of customers from Belgium.
%%sql
SELECT AVG(r.rating) -- Average ratings of customers from Belgium
FROM renting AS r
LEFT JOIN customers AS c
ON r.customer_id = c.customer_id
WHERE c.country='Belgium'
LIMIT 5;
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
avg |
---|
8.9000000000000000 |
You calculated the average rating for customers from Belgium, which is 8.9.
Example: Aggregating revenue, rentals and active customers
The management of MovieNow wants to report key performance indicators (KPIs) for the performance of the company in 2018. They are interested in measuring the financial successes as well as user engagement. Important KPIs are, therefore, the profit coming from movie rentals, the number of movie rentals and the number of active customers.
%%sql
SELECT *
FROM renting AS r
JOIN movies AS m -- Choose the correct join statment INNER is same as LEFT here
ON r.movie_id = m.movie_id
LIMIT 10;
* postgresql://shravan:***@127.0.0.1/moviesnow
10 rows affected.
renting_id | customer_id | movie_id | rating | date_renting | movie_id_1 | title | genre | runtime | year_of_release | renting_price |
---|---|---|---|---|---|---|---|---|---|---|
1 | 41 | 8 | None | 2018-10-09 | 8 | Waking Up in Reno | Comedy | 91 | 2002 | 2.59 |
2 | 10 | 29 | 10 | 2017-03-01 | 29 | Two for the Money | Drama | 122 | 2005 | 2.79 |
3 | 108 | 45 | 4 | 2018-06-08 | 45 | Burn After Reading | Drama | 96 | 2008 | 2.39 |
4 | 39 | 66 | 8 | 2018-10-22 | 66 | The Hunger Games | Drama | 142 | 2012 | 1.59 |
5 | 104 | 15 | 7 | 2019-03-18 | 15 | Something's Gotta Give | Comedy | 128 | 2003 | 1.69 |
6 | 50 | 71 | 7 | 2018-10-09 | 71 | Rio 2 | Animation | 101 | 2014 | 1.79 |
7 | 52 | 21 | None | 2018-11-10 | 21 | The Return of the King | Science Fiction & Fantasy | 200 | 2003 | 1.99 |
8 | 73 | 65 | 10 | 2018-06-05 | 65 | Ghost Rider: Spirit of Vengeance | Action & Adventure | 96 | 2012 | 1.79 |
9 | 78 | 2 | None | 2017-09-03 | 2 | Swordfish | Drama | 99 | 2001 | 2.19 |
10 | 121 | 43 | None | 2017-11-08 | 43 | The Invasion | Mystery & Suspense | 99 | 2007 | 2.09 |
%%sql
SELECT
SUM(renting_price), -- Get the revenue from movie rentals
COUNT(*), -- Count the number of rentals
COUNT(DISTINCT customer_id) -- Count the number of customers
FROM renting AS r
LEFT JOIN movies AS m
ON r.movie_id = m.movie_id;
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
sum | count | count_1 |
---|---|---|
1275.72 | 578 | 116 |
From this what we can say is that we generated 1275.72
dollars in revenue by renting 578
movies to 116
customers. So, what if I want to know only for year 2018?
%%sql
SELECT
SUM(renting_price), -- Get the revenue from movie rentals
COUNT(*), -- Count the number of rentals
COUNT(DISTINCT customer_id) -- Count the number of customers
FROM renting AS r
LEFT JOIN movies AS m
ON r.movie_id = m.movie_id
WHERE date_trunc('year', date_renting) = '2018-01-01'
* postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
sum | count | count_1 |
---|---|---|
658.02 | 298 | 93 |
Ok, so here we have it, this tells us that in the year 2018 the company’s revenue was 658.02
dollars from renting 298
movies to 93
customers. Stated another way “You’ve calculated a turnover of 658.02 and found the number of rentals to be 298 and the number of active users to be 93 in 2018.”
Example: Joining with an intermediate table (M:N relationship)
For this we will consider three tables: movies
, actors
, actsin
. Our goal is to “Create a list of actor names and movie titles in which they act. Make sure that each combination of actor and movie appears only once”.
%%sql
SELECT DISTINCT title, -- Create a list of movie titles and actor names
name
FROM actsin
LEFT JOIN movies AS m
ON m.movie_id = actsin.movie_id
LEFT JOIN actors AS a
ON a.actor_id = actsin.actor_id
LIMIT 10;
* postgresql://shravan:***@127.0.0.1/moviesnow
10 rows affected.
title | name |
---|---|
Django Unchained | Jamie Foxx |
The Two Towers | Ian McKellen |
Candy | Abbie Cornish |
The Hunger Games | Liam Hemsworth |
Two for the Money | Matthew McConaughey |
Burn After Reading | George Clooney |
Harry Potter and the Chamber of Secrets | Rupert Grint |
Harry Potter and the Goblet of Fire | Daniel Radcliffe |
The Recruit | Bridget Moynahan |
The Frozen Ground | Nicolas Cage |
Conclusion
In this post, we have seen how simple data-driven insights can be gathered with basic SQL queries of filtering, aggregations and joins. In the next post, we will use the same database and dive into Sub-queries, nested queries, correlated queries, use of set operations like EXISTS, UNION and UNION ALL and some advanced queries with OLAP cubes, grouping sets and rollups.