Data driven decision making part 1

Data-driven decision making using SQL

Get the data

We will be using the MoviesNow database which can be found here MoviesNow.sql. First launch pgAdmin and create a new database called moviesnow. Right click and open Query Tool and then run the MoviesNow.sql file. This will create the tables and insert records into them.

decision-0

%load_ext sql
%sql postgresql://shravan:shravan@127.0.0.1/moviesnow
'Connected: shravan@moviesnow'
%%sql
-- Query the right table in information_schema
SELECT table_name
FROM information_schema.tables
-- Specify the correct table_schema value
WHERE table_schema = 'public';
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
table_name
movies
actors
actsin
customers
renting

Here are the tables in this database

decision

EDA customers table

Let’s first explore the customers table by summarizing customer information. In most business decisions customers are analyzed in groups, such as customers per country or customers per age group. Here’s a peek into the first 5 rows of the customers table.

%%sql
SELECT * FROM customers LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
customer_id name country gender date_of_birth date_account_start
2 Wolfgang Ackermann Austria male 1971-11-17 2018-10-15
3 Daniela Herzog Austria female 1974-08-07 2019-02-14
4 Julia Jung Austria female 1991-01-04 2017-11-22
5 Juliane Kirsch Austria female 1977-03-01 2018-12-16
6 Rowanne Couperus Belgium female 1994-04-05 2018-08-26

Count the number of customers born in 80s

%%sql
SELECT
    COUNT(customer_id) as num_customers
FROM customers
WHERE date_of_birth BETWEEN '1980-01-01' AND '1990-01-01';
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
num_customers
33

Count the number of customers from Germany

%%sql
SELECT
    COUNT(customer_id) as num_customers
FROM customers
WHERE country = 'Germany';
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
num_customers
0

Count the number of countries where MovieNow has customers.

%%sql
SELECT
    COUNT(DISTINCT country)
FROM customers;
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
count
11

You aggregated data from the customers table and figured out that MovieNow has customers in 11 countries and that there are no customers from Germany.

Ratings for a movie

The movie ratings give us insight into the preferences of our customers. Let’s first take a look at the columns and their types.

%%sql
-- Get the column name and data type
SELECT
    column_name,
    data_type
-- From the system database information schema
FROM INFORMATION_SCHEMA.COLUMNS
-- For the customer table
WHERE table_name = 'renting';
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
column_name data_type
renting_id integer
customer_id integer
movie_id integer
rating integer
date_renting date
%%sql
SELECT * FROM renting LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
renting_id customer_id movie_id rating date_renting
1 41 8 None 2018-10-09
2 10 29 10 2017-03-01
3 108 45 4 2018-06-08
4 39 66 8 2018-10-22
5 104 15 7 2019-03-18

Report summary statistics, such as the minimum, maximum, average, and count, of ratings for the movie with ID 25.

%%sql
SELECT
    MIN(rating) as  min_rating, -- Calculate the minimum rating and use alias min_rating
    MAX(rating) as  max_rating, -- Calculate the maximum rating and use alias max_rating
    AVG(rating) as avg_rating, -- Calculate the average rating and use alias avg_rating
    COUNT(rating) as number_ratings -- Count the number of ratings and use alias number_ratings
FROM renting
WHERE movie_id = 25; -- Select all records of the movie with ID 25
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
min_rating max_rating avg_rating number_ratings
5 10 7.5000000000000000 8

Example: Examining annual rentals

You are asked to provide a report about the development of the company. Specifically, your manager is interested in the total number of movie rentals, the total number of ratings and the average rating of all movies since the beginning of 2019.

%%sql
SELECT
	COUNT(*) AS number_renting,
	AVG(rating) AS average_rating,
    COUNT(rating) AS number_ratings -- Add the total number of ratings here.
FROM renting
WHERE date_renting >= '2019-01-01';
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
number_renting average_rating number_ratings
159 7.9462365591397849 93

You managed to extract all KPIs from the renting records by working with aggregations, NULL values and dates.

Analyzing Groups

Using the GROUP BY clause allows us to apply aggregations to groups in a table. To evaluate the success and the potential of a company, it is often desirable to look at groups of customers or groups of products or jointly look at multiple groups. For example, we could be interested in grouping customers by country or gender OR in grouping movies by genre or year of release.

For example, we want to find out the average renting price by genre. First, lets gain an understanding of the different types of genres that we have. A simple group by will reveal the distinct groups.

%%sql
SELECT
    genre
FROM movies
GROUP BY genre;
 * postgresql://shravan:***@127.0.0.1/moviesnow
8 rows affected.
genre
Animation
Other
Drama
Art House & International
Mystery & Suspense
Comedy
Action & Adventure
Science Fiction & Fantasy

This is equivalent to running a Distinct.

%%sql
SELECT
    DISTINCT genre
FROM movies
 * postgresql://shravan:***@127.0.0.1/moviesnow
8 rows affected.
genre
Animation
Other
Drama
Art House & International
Mystery & Suspense
Comedy
Action & Adventure
Science Fiction & Fantasy

A GROUP BY by itself is that is not much of value. Shown below is the visual of how GROUP BY genre clause creates groups based on the genre. We can then apply any aggregation function to each of these sub-tables.

decision-1

Shown below is the query that is extended to show the number of movies in each category in addition to showing the Average renting price for that genre.

%%sql
SELECT
    genre,
    AVG(renting_price) as avg_price,
    COUNT(*) as number_movies
FROM movies
GROUP BY genre;
 * postgresql://shravan:***@127.0.0.1/moviesnow
8 rows affected.
genre avg_price number_movies
Animation 1.7900000000000000 2
Other 2.5900000000000000 2
Drama 2.1538888888888889 36
Art House & International 2.5900000000000000 1
Mystery & Suspense 2.1733333333333333 6
Comedy 2.2900000000000000 10
Action & Adventure 2.4566666666666667 3
Science Fiction & Fantasy 2.2445454545454545 11

The HAVING clause can be added to the GROUP BY statement to further filter the groups based on some criteria. It is only used in conjuction with the group by clause and it CAN include aggregate functions such as sum, count, or avg. For example, we want to keep only those genres that have more than 2 movies.

%%sql
SELECT
    genre,
    AVG(renting_price) as avg_price,
    COUNT(*) as number_movies
FROM movies
GROUP BY genre
HAVING COUNT(*) > 2;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
genre avg_price number_movies
Drama 2.1538888888888889 36
Mystery & Suspense 2.1733333333333333 6
Comedy 2.2900000000000000 10
Action & Adventure 2.4566666666666667 3
Science Fiction & Fantasy 2.2445454545454545 11

Example: First account for each country

Conduct an analysis to see when the first customer accounts were created for each country. Create a table with a row for each country and columns for the country name and the date when the first customer account was created.

%%sql
SELECT country, -- For each country report the earliest date when an account was created
	MIN(date_account_start) AS first_account
FROM customers
GROUP BY country
ORDER BY first_account;
 * postgresql://shravan:***@127.0.0.1/moviesnow
11 rows affected.
country first_account
France 2017-01-13
Hungary 2017-01-18
Belgium 2017-01-28
Slovenia 2017-01-31
Spain 2017-02-14
Italy 2017-02-28
Poland 2017-03-03
Great Britan 2017-03-31
Denmark 2017-04-30
USA 2017-09-13
Austria 2017-11-22

You learned that the first customer account was created in France.

Example: Average movie ratings

For each movie the average rating, the number of ratings and the number of views has to be reported. Generate a table with meaningful column names.

%%sql
SELECT movie_id,
       AVG(rating) AS avg_rating,
       COUNT(rating) AS number_ratings,
       COUNT(*) AS number_renting
FROM renting
GROUP BY movie_id
ORDER BY avg_rating desc -- Order by average rating in decreasing order
LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
movie_id avg_rating number_ratings number_renting
11 None 0 6
48 10.0000000000000000 2 5
5 9.7500000000000000 4 8
42 9.6000000000000000 5 10
63 9.3333333333333333 6 11

NOTE: The average is null only if all values are null. If only one/two values are NULL, then avg still returns the average of non-null values. It appears that for movie_id 11 all of the ratings were NULL. This is evident by the number_ratings column which is zero for this movie.

Example: Average rating per customer

Similar to what you just did, you will now look at the average movie ratings, this time for customers. So you will obtain a table with the average rating given by each customer. Further, you will include the number of ratings and the number of movie rentals per customer. You will report these summary statistics only for customers with more than 7 movie rentals and order them in ascending order by the average rating.

%%sql
SELECT customer_id, -- Report the customer_id
      AVG(rating),  -- Report the average rating per customer
      COUNT(rating),  -- Report the number of ratings per customer
      COUNT(*) -- Report the number of movie rentals per customer
FROM renting
GROUP BY customer_id
HAVING COUNT(*) > 7 -- Select only customers with more than 7 movie rentals
ORDER BY 1 -- Order by the average rating in ascending order
LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
customer_id avg count count_1
7 9.0000000000000000 4 9
8 9.2500000000000000 4 8
21 7.3333333333333333 6 14
25 7.2000000000000000 5 10
26 8.4000000000000000 5 10

Joining

Join queries allow us to combine - or join - data from multiple tables. There are many types of JOINs in SQL, which you should be already aware off. In this post, I will focus on LEFT JOIN. The LEFT JOIN statement is one of the OUTER JOINs that is used to augment one table - the table on the left, with information from another table - the table on the right. We keep all rows from the LEFT table and bring only matching rows from the right table. An identifier is needed to define which rows of the two tables can be matched.

Example Join renting with customers

For many analyses it is necessary to add customer information to the data in the table renting. Augment the table renting with all columns from the table customers with a LEFT JOIN.

%%sql
SELECT * -- Join renting with customers
FROM renting as r
LEFT JOIN customers as c
ON r.customer_id = c.customer_id
LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
renting_id customer_id movie_id rating date_renting customer_id_1 name country gender date_of_birth date_account_start
1 41 8 None 2018-10-09 41 Zara Mitchell Great Britan female 1994-07-08 2017-06-12
2 10 29 10 2017-03-01 10 Arnout Veenhuis Belgium male 1984-07-26 2017-01-28
3 108 45 4 2018-06-08 108 Saúl Tafoya Meraz Spain male 1992-05-15 2017-03-13
4 39 66 8 2018-10-22 39 Amy Haynes Great Britan female 1975-07-28 2018-01-19
5 104 15 7 2019-03-18 104 Dimitri Zambrano Morales Spain male 1992-06-21 2017-04-12

Select only records from customers coming from Belgium.

%%sql
SELECT * -- Join renting with customers
FROM renting as r
LEFT JOIN customers as c
ON r.customer_id = c.customer_id
WHERE c.country='Belgium'
LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
5 rows affected.
renting_id customer_id movie_id rating date_renting customer_id_1 name country gender date_of_birth date_account_start
2 10 29 10 2017-03-01 10 Arnout Veenhuis Belgium male 1984-07-26 2017-01-28
14 8 29 None 2018-08-03 8 Jaëla van den Dolder Belgium female 1990-08-31 2018-02-08
27 7 36 None 2019-03-14 7 Annelous Sneep Belgium female 1993-11-14 2018-05-12
32 8 42 10 2019-02-13 8 Jaëla van den Dolder Belgium female 1990-08-31 2018-02-08
70 10 44 None 2017-12-17 10 Arnout Veenhuis Belgium male 1984-07-26 2017-01-28

Average ratings of customers from Belgium.

%%sql
SELECT AVG(r.rating) -- Average ratings of customers from Belgium
FROM renting AS r
LEFT JOIN customers AS c
ON r.customer_id = c.customer_id
WHERE c.country='Belgium'
LIMIT 5;
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
avg
8.9000000000000000

You calculated the average rating for customers from Belgium, which is 8.9.

Example: Aggregating revenue, rentals and active customers

The management of MovieNow wants to report key performance indicators (KPIs) for the performance of the company in 2018. They are interested in measuring the financial successes as well as user engagement. Important KPIs are, therefore, the profit coming from movie rentals, the number of movie rentals and the number of active customers.

%%sql
SELECT *
FROM renting AS r
JOIN movies AS m -- Choose the correct join statment INNER is same as LEFT here
ON r.movie_id = m.movie_id
LIMIT 10;
 * postgresql://shravan:***@127.0.0.1/moviesnow
10 rows affected.
renting_id customer_id movie_id rating date_renting movie_id_1 title genre runtime year_of_release renting_price
1 41 8 None 2018-10-09 8 Waking Up in Reno Comedy 91 2002 2.59
2 10 29 10 2017-03-01 29 Two for the Money Drama 122 2005 2.79
3 108 45 4 2018-06-08 45 Burn After Reading Drama 96 2008 2.39
4 39 66 8 2018-10-22 66 The Hunger Games Drama 142 2012 1.59
5 104 15 7 2019-03-18 15 Something's Gotta Give Comedy 128 2003 1.69
6 50 71 7 2018-10-09 71 Rio 2 Animation 101 2014 1.79
7 52 21 None 2018-11-10 21 The Return of the King Science Fiction & Fantasy 200 2003 1.99
8 73 65 10 2018-06-05 65 Ghost Rider: Spirit of Vengeance Action & Adventure 96 2012 1.79
9 78 2 None 2017-09-03 2 Swordfish Drama 99 2001 2.19
10 121 43 None 2017-11-08 43 The Invasion Mystery & Suspense 99 2007 2.09
%%sql
SELECT
	SUM(renting_price), -- Get the revenue from movie rentals
	COUNT(*), -- Count the number of rentals
	COUNT(DISTINCT customer_id) -- Count the number of customers
FROM renting AS r
LEFT JOIN movies AS m
ON r.movie_id = m.movie_id;
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
sum count count_1
1275.72 578 116

From this what we can say is that we generated 1275.72 dollars in revenue by renting 578 movies to 116 customers. So, what if I want to know only for year 2018?

%%sql
SELECT
	SUM(renting_price), -- Get the revenue from movie rentals
	COUNT(*), -- Count the number of rentals
	COUNT(DISTINCT customer_id) -- Count the number of customers
FROM renting AS r
LEFT JOIN movies AS m
ON r.movie_id = m.movie_id
WHERE date_trunc('year', date_renting) = '2018-01-01'
 * postgresql://shravan:***@127.0.0.1/moviesnow
1 rows affected.
sum count count_1
658.02 298 93

Ok, so here we have it, this tells us that in the year 2018 the company’s revenue was 658.02 dollars from renting 298 movies to 93 customers. Stated another way “You’ve calculated a turnover of 658.02 and found the number of rentals to be 298 and the number of active users to be 93 in 2018.”

Example: Joining with an intermediate table (M:N relationship)

For this we will consider three tables: movies, actors, actsin. Our goal is to “Create a list of actor names and movie titles in which they act. Make sure that each combination of actor and movie appears only once”.

%%sql
SELECT DISTINCT title, -- Create a list of movie titles and actor names
       name
FROM actsin
LEFT JOIN movies AS m
ON m.movie_id = actsin.movie_id
LEFT JOIN actors AS a
ON a.actor_id = actsin.actor_id
LIMIT 10;
 * postgresql://shravan:***@127.0.0.1/moviesnow
10 rows affected.
title name
Django Unchained Jamie Foxx
The Two Towers Ian McKellen
Candy Abbie Cornish
The Hunger Games Liam Hemsworth
Two for the Money Matthew McConaughey
Burn After Reading George Clooney
Harry Potter and the Chamber of Secrets Rupert Grint
Harry Potter and the Goblet of Fire Daniel Radcliffe
The Recruit Bridget Moynahan
The Frozen Ground Nicolas Cage

Conclusion

In this post, we have seen how simple data-driven insights can be gathered with basic SQL queries of filtering, aggregations and joins. In the next post, we will use the same database and dive into Sub-queries, nested queries, correlated queries, use of set operations like EXISTS, UNION and UNION ALL and some advanced queries with OLAP cubes, grouping sets and rollups.