Postgres and Jupyter notebooks

Installing Postgres and connecting to it from a Jupyter notebook

Goal: I want a Postgres environment where I can create some tables, insert values into them, and run queries. This lets me practice writing SQL to understand some concepts.

STEP 1: First make sure you install homebrew (if you don’t already have it)

The first step is to install Homebrew, the package manager for macOS.

Run command: /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

The first part of the output shows what will be installed under /usr/local/

(base) shravan-~$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
==> This script will install:
/usr/local/bin/brew
/usr/local/share/doc/homebrew
/usr/local/share/man/man1/brew.1
/usr/local/share/zsh/site-functions/_brew
/usr/local/etc/bash_completion.d/brew
/usr/local/Homebrew
==> The following existing directories will be made group writable:
/usr/local/bin
/usr/local/lib
==> The following existing directories will have their owner set to shravan:
/usr/local/bin
/usr/local/lib
==> The following existing directories will have their group set to admin:
/usr/local/bin
/usr/local/lib
==> The following new directories will be created:
/usr/local/etc
/usr/local/include
/usr/local/sbin
/usr/local/share
/usr/local/var
/usr/local/opt
/usr/local/share/zsh
/usr/local/share/zsh/site-functions
/usr/local/var/homebrew
/usr/local/var/homebrew/linked
/usr/local/Cellar
/usr/local/Caskroom
/usr/local/Homebrew
/usr/local/Frameworks

Press RETURN to continue or any other key to abort

Press RETURN to continue ..

==> Installation successful!

==> Homebrew has enabled anonymous aggregate formulae and cask analytics.
Read the analytics documentation (and how to opt-out) here:
  https://docs.brew.sh/Analytics

==> Homebrew is run entirely by unpaid volunteers. Please consider donating:
  https://github.com/Homebrew/brew#donations
==> Next steps:
- Run `brew help` to get started
- Further documentation:
    https://docs.brew.sh

STEP 2: Use brew to install postgres

Run: brew install postgres

This will install Postgres; the last few messages suggest ways to start it.

To have launchd start postgresql now and restart at login:
  brew services start postgresql
Or, if you don't want/need a background service you can just run:
  pg_ctl -D /usr/local/var/postgres start

STEP 3: Start postgres

I don’t want Postgres running in the background every time I start my laptop, so I just start the server once using:

pg_ctl -D /usr/local/var/postgres start

By default, Postgres runs on port 5432.

(base) shravan-~$ pg_ctl -D /usr/local/var/postgres start
waiting for server to start....2019-10-17 11:02:16.746 EDT [25689] LOG:  listening on IPv6 address "::1", port 5432
2019-10-17 11:02:16.746 EDT [25689] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2019-10-17 11:02:16.747 EDT [25689] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-10-17 11:02:16.756 EDT [25690] LOG:  database system was shut down at 2019-10-17 10:55:32 EDT
2019-10-17 11:02:16.760 EDT [25689] LOG:  database system is ready to accept connections
 done
server started
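
When you're done practicing, stop the server the same way (this is the stop counterpart of the start command above):

pg_ctl -D /usr/local/var/postgres stop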

STEP 4: Check existing users

Check what users exist by default: I assumed the installation would create the users postgres and shravan. The way to confirm this is with the psql utility. psql is a utility installed with Postgres that lets you carry out administrative functions without needing to know the actual SQL commands.

Run psql postgres

(base) shravan-~$ psql postgres
psql (11.5)
Type "help" for help.

postgres=#

The above output prompt is the psql command line. We can now enter a command to see what users exist:

Run \du

postgres=# \du
                                   List of roles
 Role name |                         Attributes                         | Member of
-----------+------------------------------------------------------------+-----------
 shravan   | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

postgres=#

Ah!! So the default user postgres is not visible here, but you can see that the role shravan was created by default. (Homebrew's installer runs initdb as your macOS user, so the initial superuser is named after you rather than postgres.) Postgres deals with roles, not users. So your role is shravan, and it has the attributes Superuser, Create role, Create DB, Replication, and Bypass RLS.

Postgres doesn’t actually directly manage users or groups; it directly manages what it calls roles.
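
For instance (a sketch run from the psql prompt; the role names here are made up):

CREATE ROLE readonly;                          -- no LOGIN: behaves like a group
CREATE ROLE alice LOGIN PASSWORD 'secret';     -- LOGIN: behaves like a user
GRANT readonly TO alice;                       -- group membership is just a grant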

STEP 5: Check existing databases

Ok, so now we understand roles and users. The next thing is to understand how to connect to a database. But before that, can we list the existing databases and who owns them?

There is a default database that gets created when you first install Postgres. The name of this database is postgres. That’s why you typed psql postgres when you launched the psql command-line utility: you need to specify which database you are connecting to, and as which user.

So, if you want to connect to the default database postgres as shravan, run this: psql postgres -U shravan

(base) shravan-~$ psql postgres -U shravan
psql (11.5)
Type "help" for help.

postgres=#

Since you are connected as a superuser, the prompt ends in # (non-superusers get > instead).

List existing databases: Using \l we can list all the databases. As you can see, we have a database with name postgres whose owner is shravan.

postgres=# \l
                           List of databases
   Name    |  Owner  | Encoding | Collate | Ctype |  Access privileges
-----------+---------+----------+---------+-------+---------------------
 postgres  | shravan | UTF8     | C       | C     |
 template0 | shravan | UTF8     | C       | C     | =c/shravan         +
           |         |          |         |       | shravan=CTc/shravan
 template1 | shravan | UTF8     | C       | C     | =c/shravan         +
           |         |          |         |       | shravan=CTc/shravan
(3 rows)

postgres=#

STEP 6: Create a new database

Remember, our goal is to create some tables and run queries against them. We don’t want to do this in the default postgres database, so we need to create a new database.

Create a database called interviews:

  • First, connect to the postgres database as shravan: psql postgres -U shravan
  • Second, run create database interviews; and list the databases to see the output
postgres=# create database interviews;
CREATE DATABASE
postgres=# \l
                            List of databases
    Name    |  Owner  | Encoding | Collate | Ctype |  Access privileges
------------+---------+----------+---------+-------+---------------------
 interviews | shravan | UTF8     | C       | C     |
 postgres   | shravan | UTF8     | C       | C     |

NOTE: Since shravan is a superuser, I thought I already had all the permissions to create tables and such; and indeed, superusers bypass privilege checks. Running the GRANT command explicitly just records the privileges in the database's access control list. See the difference now in the Access privileges column.

postgres=# GRANT ALL PRIVILEGES ON DATABASE interviews TO shravan;
GRANT
postgres=# \l
                            List of databases
    Name    |  Owner  | Encoding | Collate | Ctype |  Access privileges
------------+---------+----------+---------+-------+---------------------
 interviews | shravan | UTF8     | C       | C     | =Tc/shravan        +
            |         |          |         |       | shravan=CTc/shravan
 postgres   | shravan | UTF8     | C       | C     |

Alright, we are all set now to connect to this database from our Jupyter notebook!! Launch a Jupyter notebook instance from the conda environment.

STEP 7: Get a connection to the database.

Since we already created a new database using the command line, we can now connect to it. (Below I connect to an employees database, created the same way as interviews above.)

import os
import glob
import psycopg2
import pandas as pd
def connect_database(dbName):
    """
    The function to create database

    Returns:
        cur  : Use the connection to get a cursor that will be used to execute queries.
        conn : The connection towards current connecting database.
    """

    # connect to default database
    conn = psycopg2.connect("host=127.0.0.1 dbname={} user=shravan password=password".format(dbName))
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    return cur, conn
cur, conn = connect_database('employees')

STEP 8: Create the tables and insert data into those tables

The way I like to organize this is to put all the table creation, deletion, and insertion statements into a Python module/script: <dbName>_sql_queries.py

The schema of the employees database looks like:

employees                             projects
+---------------+---------+           +---------------+---------+
| id            | int     |<----+  +->| id            | int     |
| first_name    | varchar |     |  |  | title         | varchar |
| last_name     | varchar |     |  |  | start_date    | date    |
| salary        | int     |     |  |  | end_date      | date    |
| department_id | int     |--+  |  |  | budget        | int     |
+---------------+---------+  |  |  |  +---------------+---------+
                             |  |  |
departments                  |  |  |  employees_projects
+---------------+---------+  |  |  |  +---------------+---------+
| id            | int     |<-+  |  +--| project_id    | int     |
| name          | varchar |     +-----| employee_id   | int     |
+---------------+---------+           +---------------+---------+
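
Here's a minimal sketch of what employee_sql_queries.py might contain. The variable names match what the notebook imports below; the exact DDL is my assumption based on the schema diagram (no constraints, to keep practice inserts simple):

# employee_sql_queries.py -- a sketch; adjust types/constraints as needed

employees_table_drop = "DROP TABLE IF EXISTS employees;"
departments_table_drop = "DROP TABLE IF EXISTS departments;"
projects_table_drop = "DROP TABLE IF EXISTS projects;"
employees_projects_table_drop = "DROP TABLE IF EXISTS employees_projects;"

employees_table_create = ("""CREATE TABLE IF NOT EXISTS employees
    (id int, first_name varchar, last_name varchar, salary int, department_id int);""")
departments_table_create = ("""CREATE TABLE IF NOT EXISTS departments
    (id int, name varchar);""")
projects_table_create = ("""CREATE TABLE IF NOT EXISTS projects
    (id int, title varchar, start_date date, end_date date, budget int);""")
employees_projects_table_create = ("""CREATE TABLE IF NOT EXISTS employees_projects
    (project_id int, employee_id int);""")

employees_table_insert = ("""INSERT INTO employees (id, first_name, last_name, salary, department_id)
    VALUES (%s, %s, %s, %s, %s);""")
departments_table_insert = ("""INSERT INTO departments (id, name) VALUES (%s, %s);""")
projects_table_insert = ("""INSERT INTO projects (id, title, start_date, end_date, budget)
    VALUES (%s, %s, %s, %s, %s);""")
employees_projects_table_insert = ("""INSERT INTO employees_projects (project_id, employee_id)
    VALUES (%s, %s);""")

# lists iterated over by drop_tables() and create_tables() below
drop_table_queries = [employees_table_drop, departments_table_drop,
                      projects_table_drop, employees_projects_table_drop]
create_table_queries = [employees_table_create, departments_table_create,
                        projects_table_create, employees_projects_table_create]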
from employee_sql_queries import *
def drop_tables(cur, conn):
    """
    The function to drop database

    Parameters:
        cur  : The cursor that will be used to execute queries.
        conn : The connection towards current connecting database.
    """

    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """
    The function to drop database

    Parameters:
        cur  : The cursor that will be used to execute queries.
        conn : The connection towards current connecting database.
    """

    for query in create_table_queries:
        cur.execute(query)
        conn.commit()
# drop the tables
drop_tables(cur, conn)

# create the tables
create_tables(cur, conn)

Prepare your data in CSV format

For example, the CSV file for the employees table looks like the one below. This format comes in handy during insertion.

  • employees.csv
    id,first_name,last_name,salary,department_id
    1,John,Smith,20000,1
    2,Ava,Muffinson,10000,5
    3,Cailin,Ninson,30000,2
    4,Mike,Peterson,20000,2
    5,Ian,Peterson,80000,2
    6,John,Mills,50000,3
    

Read in the data

data_path = 'data/employee-dept-project/'
employees = pd.read_csv(os.path.join(data_path, 'employees.csv'))
departments = pd.read_csv(os.path.join(data_path, 'departments.csv'))
projects = pd.read_csv(os.path.join(data_path, 'projects.csv'))
employees_projects = pd.read_csv(os.path.join(data_path, 'employees_projects.csv'))

Insert into table

Iterate over the data frames and insert the rows into the tables. Use itertuples() instead of iterrows(), as it is generally more efficient. Note that itertuples() includes the DataFrame index in each named tuple by default, which would add an extra value when you convert the tuple to a list; for this reason, pass index=False.

for i in employees.itertuples(index=False):
    print(list(i))
[1, 'John', 'Smith', 20000, 1]
[2, 'Ava', 'Muffinson', 10000, 5]
[3, 'Cailin', 'Ninson', 30000, 2]
[4, 'Mike', 'Peterson', 20000, 2]
[5, 'Ian', 'Peterson', 80000, 2]
[6, 'John', 'Mills', 50000, 3]
def insert_rows(df, insert_stmt):
    for row in df.itertuples(index=False):
        cur.execute(insert_stmt, list(row))
insert_rows(employees, employees_table_insert)
insert_rows(departments, departments_table_insert)
insert_rows(projects, projects_table_insert)
insert_rows(employees_projects, employees_projects_table_insert)

Query the data

%load_ext sql
The sql extension is already loaded. To reload it, use:
  %reload_ext sql
%sql postgresql://shravan:shravan@127.0.0.1/employees
'Connected: shravan@employees'
%sql SELECT * FROM employees
 * postgresql://shravan:***@127.0.0.1/employees
6 rows affected.
id first_name last_name salary department_id
1 John Smith 20000 1
2 Ava Muffinson 10000 5
3 Cailin Ninson 30000 2
4 Mike Peterson 20000 2
5 Ian Peterson 80000 2
6 John Mills 50000 3
%sql SELECT * FROM departments
 * postgresql://shravan:***@127.0.0.1/employees
5 rows affected.
id name
1 Reporting
2 Engineering
3 Marketing
4 Biz Dev
5 Silly Walks
%sql SELECT * FROM projects
 * postgresql://shravan:***@127.0.0.1/employees
3 rows affected.
id title start_date end_date budget
1 Build a cool site 2011-10-28 2012-01-26 1000000
2 Update TPS Reports 2011-07-20 2011-10-28 100000
3 Design 3 New Silly Walks 2009-05-11 2009-08-19 100
%sql SELECT * FROM employees_projects
 * postgresql://shravan:***@127.0.0.1/employees
5 rows affected.
project_id employee_id
2 1
3 2
1 3
1 4
1 5

Questions for Employees database

  1. Find the employee with highest salary?
  2. Find the employee with nth highest salary?
  3. Find the employee with highest salary in Engineering department?
  4. Find the employees who are working on “Build a cool site” project?
  5. Find all the duplicate rows in the employees table?

1. Find the employee with the highest salary:

Algorithm:

  • Get the max salary from employees
  • Filter the employees where the employee’s salary is equal to Max salary
%%sql
 SELECT first_name, last_name
 FROM employees
 WHERE salary = (
  SELECT Max(salary) from employees);
 * postgresql://shravan:***@127.0.0.1/employees
1 rows affected.
first_name last_name
Ian Peterson

2. Find the employee with n-th highest salary:

Notes: Here we need to use the concept of a correlated subquery. The basic idea behind a correlated subquery is that the subquery makes use of some column from the main query. The correlated subquery cannot run on its own like normal subqueries do. The purpose of using a correlated subquery is that for each row in the main query, we are going to run the subquery and check whether to keep the row of the main query in the result set or not.

This comes in particularly handy when finding the n-th highest value: we filter with WHERE (n-1) = (subquery), where the subquery counts how many distinct salaries in the table are greater than the current row's salary in the main query. For instance, to get the second highest value, we put (n-1) = 1 in the WHERE clause, as shown here:

%%sql
SELECT *
FROM employees as e1
WHERE (1) = (SELECT COUNT(DISTINCT(salary))
             FROM employees as e2
             WHERE e2.salary > e1.salary);
 * postgresql://shravan:***@127.0.0.1/employees
1 rows affected.
id first_name last_name salary department_id
6 John Mills 50000 3

A way to wrap your head around this query is by substituting values for e1.salary.

Just to remind you of how the table looks

%%sql
SELECT * FROM employees;
 * postgresql://shravan:***@127.0.0.1/employees
6 rows affected.
id first_name last_name salary department_id
1 John Smith 20000 1
2 Ava Muffinson 10000 5
3 Cailin Ninson 30000 2
4 Mike Peterson 20000 2
5 Ian Peterson 80000 2
6 John Mills 50000 3

Shown below is what happens when the first row is read. The subquery uses the value of e1.salary, in this case 20000, and returns the count of distinct salaries greater than that number. DISTINCT is needed so that multiple employees with the same salary aren't counted twice.

%%sql
SELECT COUNT(DISTINCT(salary))
             FROM employees as e2
             WHERE e2.salary > 20000
 * postgresql://shravan:***@127.0.0.1/employees
1 rows affected.
count
3

This is repeated for every row in the employees table, until we find the row(s) with the second highest salary, in this case 50000, at which point the count is 1.

%%sql
SELECT COUNT(DISTINCT(salary))
             FROM employees as e2
             WHERE e2.salary > 50000
 * postgresql://shravan:***@127.0.0.1/employees
1 rows affected.
count
1

The WHERE (1) = count condition in the main query then selects this row into the final result set.

What happens when there are duplicates for the n-th highest salary? Let’s check the case of the 4th highest salary, which is 20000.

%%sql
SELECT COUNT(DISTINCT(salary))
             FROM employees as e2
             WHERE e2.salary > 20000
 * postgresql://shravan:***@127.0.0.1/employees
1 rows affected.
count
3

As you can see, it doesn’t matter that there are multiple rows with the same salary; since we used DISTINCT, the subquery returns the number of distinct salaries greater than 20000, which is 3. Now check the output below: we get back the two rows that satisfy WHERE (3) = count. How did this happen? Well, as I pointed out earlier, we go row by row in the main query and filter out all the rows that don't satisfy the WHERE condition. In this manner, we are left with only the n-th-highest-salary rows.

%%sql
SELECT *
FROM employees as e1
WHERE (3) = (SELECT COUNT(DISTINCT(salary))
             FROM employees as e2
             WHERE e2.salary > e1.salary);
 * postgresql://shravan:***@127.0.0.1/employees
2 rows affected.
id first_name last_name salary department_id
1 John Smith 20000 1
4 Mike Peterson 20000 2
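
As an aside, window functions (covered later) give an alternative that avoids the correlated subquery; here is a sketch using DENSE_RANK, which assigns tied salaries the same rank:

%%sql
SELECT id, first_name, last_name, salary, department_id
FROM (SELECT *, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
      FROM employees) AS ranked
WHERE rnk = 4;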

3. Find the employee with highest salary in Engineering department?
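
I haven't recorded a run for this one; a possible sketch, joining employees to departments (assuming the department is named 'Engineering'):

%%sql
SELECT e.first_name, e.last_name, e.salary
FROM employees AS e
JOIN departments AS d ON e.department_id = d.id
WHERE d.name = 'Engineering'
  AND e.salary = (SELECT MAX(e2.salary)
                  FROM employees AS e2
                  WHERE e2.department_id = e.department_id);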

4. Find the employees who are working on “Build a cool site” project?
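
Again a sketch, going through the employees_projects join table:

%%sql
SELECT e.first_name, e.last_name
FROM employees AS e
JOIN employees_projects AS ep ON e.id = ep.employee_id
JOIN projects AS p ON ep.project_id = p.id
WHERE p.title = 'Build a cool site';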

5. Find duplicate rows based on a particular column

(A first attempt is shown below; note that its subquery is not correlated, so the distinct-salary count is computed once for the whole table rather than per row, and it doesn't actually isolate duplicates. The second query does.)

%%sql
SELECT *
FROM employees as e1
WHERE (1) = (select count(distinct salary) FROM employees )
 * postgresql://shravan:***@127.0.0.1/employees
6 rows affected.
id first_name last_name salary department_id
1 John Smith 20000 1
2 Ava Muffinson 10000 5
3 Cailin Ninson 30000 2
4 Mike Peterson 20000 2
5 Ian Peterson 80000 2
6 John Mills 50000 3
%%sql
SELECT * FROM employees where salary = (
SELECT salary
FROM employees
GROUP BY salary
HAVING count(*) > 1)
 * postgresql://shravan:***@127.0.0.1/employees
2 rows affected.
id first_name last_name salary department_id
1 John Smith 20000 1
4 Mike Peterson 20000 2
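
One caveat: if more than one salary value were duplicated, the subquery above would return multiple rows and the = comparison would fail. A sketch that handles that case with IN:

%%sql
SELECT *
FROM employees
WHERE salary IN (SELECT salary
                 FROM employees
                 GROUP BY salary
                 HAVING count(*) > 1);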

Understanding inner vs outer joins

Joins are used to combine the data from two tables, with the result being a new, temporary table. The temporary table is created based on column(s) that the two tables share, which represent meaningful column(s) of comparison.

The goal is to extract meaningful data from the resulting temporary table.

Joins are performed based on something called a predicate, which specifies the condition to use in order to perform a join.

A join can be either an inner join or an outer join, depending on how one wants the resulting table to look.
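
For example, with the employees and departments tables (a sketch; the predicate here is the equality on department_id), an inner join keeps only employees whose department_id matches a department, while a left outer join also keeps unmatched employees, padding the department columns with NULLs:

%%sql
-- swap LEFT OUTER JOIN for INNER JOIN to drop employees with no matching department
SELECT e.first_name, d.name
FROM employees AS e
LEFT OUTER JOIN departments AS d ON e.department_id = d.id;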



Self join

Find employees who share a department with at least one other employee.

%%sql
SELECT *
FROM employees;
 * postgresql://shravan:***@127.0.0.1/employees
6 rows affected.
id first_name last_name salary department_id
1 John Smith 20000 1
2 Ava Muffinson 10000 5
3 Cailin Ninson 30000 2
4 Mike Peterson 20000 2
5 Ian Peterson 80000 2
6 John Mills 50000 3

%%sql
SELECT Distinct e1.id, e1.first_name, e1.last_name
FROM employees as e1
JOIN employees as e2
ON e1.department_id = e2.department_id
WHERE e1.id != e2.id
 * postgresql://shravan:***@127.0.0.1/employees
3 rows affected.
id first_name last_name
5 Ian Peterson
4 Mike Peterson
3 Cailin Ninson
%%sql
SELECT distinct e1.id, e1.first_name, e1.last_name
FROM employees as e1, employees as e2
WHERE e1.department_id = e2.department_id
AND e1.id != e2.id
 * postgresql://shravan:***@127.0.0.1/employees
3 rows affected.
id first_name last_name
5 Ian Peterson
4 Mike Peterson
3 Cailin Ninson

Let’s create a temp table, temp_emp, to understand self-joins.

Example 1: Find all employees that live in the same city as “Joe”

%%sql
DROP TABLE IF EXISTS temp_emp;
CREATE TABLE IF NOT EXISTS temp_emp(employee_name varchar, employee_location varchar);
 * postgresql://shravan:***@127.0.0.1/employees
Done.
Done.





temp_emp_table_insert = ("""INSERT INTO temp_emp(employee_name, employee_location)
                 VALUES (%s, %s);""")
employee_names = ['Joe', 'Sunil', 'Alex', 'Albert', 'Jack']
employee_location = ['New York', 'India', 'Russia', 'Canada', 'New York']
temp_emp_df = pd.DataFrame({'employee_names': employee_names, 'employee_location': employee_location})
insert_rows(temp_emp_df, temp_emp_table_insert)
%%sql
select * from temp_emp;
 * postgresql://shravan:***@127.0.0.1/employees
5 rows affected.
employee_name employee_location
Joe New York
Sunil India
Alex Russia
Albert Canada
Jack New York
%%sql
SELECT e1.employee_name
FROM temp_emp as e1, temp_emp as e2
WHERE e1.employee_location = e2.employee_location AND e2.employee_name = 'Joe';
 * postgresql://shravan:***@127.0.0.1/employees
2 rows affected.
employee_name
Joe
Jack
%%sql
SELECT e1.employee_name, e2.employee_name
FROM temp_emp as e1, temp_emp as e2
WHERE e1.employee_location = e2.employee_location AND e2.employee_name = 'Joe'
 * postgresql://shravan:***@127.0.0.1/employees
2 rows affected.
employee_name employee_name_1
Joe Joe
Jack Joe

Here is the same result using a subquery instead of a self-join:

%%sql
SELECT e1.employee_name
FROM temp_emp as e1
WHERE e1.employee_location = (
SELECT employee_location
FROM temp_emp
WHERE employee_name = 'Joe')
 * postgresql://shravan:***@127.0.0.1/employees
2 rows affected.
employee_name
Joe
Jack

Self joins using the Employee Manager problem

  • Create the table
  • Create the insert string
  • Create the data
  • Call the insert function
  • Write your queries
%%sql
DROP TABLE IF EXISTS emp_manager;
CREATE TABLE IF NOT EXISTS emp_manager(emp_id int, name varchar, manager_id int);
 * postgresql://shravan:***@127.0.0.1/employees
Done.
Done.





emp_manager_table_insert = ("""INSERT INTO emp_manager(emp_id, name, manager_id)
                 VALUES (%s, %s, %s);""")
emp_id = [1, 2, 4, 10]
name = ['Sam', 'Harry', 'Manager', 'AnotherManager']
manager_id = [10, 4, 0, 0]
emp_manager = pd.DataFrame({'emp_id': emp_id, 'name': name, 'manager_id': manager_id})
insert_rows(emp_manager, emp_manager_table_insert)
%%sql
SELECT *
FROM emp_manager;
 * postgresql://shravan:***@127.0.0.1/employees
4 rows affected.
emp_id name manager_id
1 Sam 10
2 Harry 4
4 Manager 0
10 AnotherManager 0
%%sql
SELECT e2.name, e1.name as Manager
FROM emp_manager e1, emp_manager e2
WHERE e1.emp_id = e2.manager_id
 * postgresql://shravan:***@127.0.0.1/employees
2 rows affected.
name manager
Harry Manager
Sam AnotherManager
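
Note that the inner self-join above drops the two managers (their manager_id of 0 matches no emp_id). A sketch with a left join that keeps them, showing NULL for their manager:

%%sql
SELECT e2.name, e1.name AS manager
FROM emp_manager AS e2
LEFT JOIN emp_manager AS e1 ON e1.emp_id = e2.manager_id;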


Questions for Sales

  1. Suppose we want to retrieve the names of all salespeople that have more than one order, using the salesperson and orders tables shown below. You can assume that each salesperson only has one ID.
%%sql
SELECT name
FROM orders as o
LEFT JOIN salesperson as s
ON o.sales_id = s.id
GROUP BY o.sales_id, name
HAVING COUNT(*) > 1
 * postgresql://shravan:***@127.0.0.1/interviews
2 rows affected.
name
Dan
Bob
%%sql
SELECT *
FROM orders as o
LEFT JOIN salesperson as s
ON o.sales_id = s.id
 * postgresql://shravan:***@127.0.0.1/interviews
7 rows affected.
order_number order_date cust_id sales_id amount id name age salary
10 8/2/96 4 2 540 2 Bob 34 44000
20 1/30/99 4 8 1800 8 Ken 57 115000
30 7/14/95 9 1 460 1 Abe 61 140000
40 1/29/98 7 2 2400 2 Bob 34 44000
50 2/3/98 6 7 600 7 Dan 41 52000
60 3/2/98 6 7 720 7 Dan 41 52000
70 5/6/98 9 7 150 7 Dan 41 52000
%%sql
SELECT sales_id
FROM orders
GROUP BY sales_id
HAVING COUNT(sales_id) > 1
 * postgresql://shravan:***@127.0.0.1/interviews
2 rows affected.
sales_id
2
7
%%sql
SELECT name
FROM (SELECT sales_id
    FROM orders
    GROUP BY sales_id
    HAVING COUNT(sales_id) > 1) as subquery
LEFT JOIN salesperson as s
ON subquery.sales_id = s.id
 * postgresql://shravan:***@127.0.0.1/interviews
2 rows affected.
name
Bob
Dan
%%sql
SELECT name
FROM orders, salesperson
WHERE orders.sales_id = salesperson.id
GROUP BY orders.sales_id, name
HAVING COUNT(name) > 1
 * postgresql://shravan:***@127.0.0.1/interviews
2 rows affected.
name
Dan
Bob
  2. Here is the problem: find the largest order amount for each salesperson and the associated order number, along with the customer to whom that order belongs.
%%sql
SELECT sales_id, Max(amount) as MaxOrder
FROM orders
GROUP BY sales_id
 * postgresql://shravan:***@127.0.0.1/interviews
4 rows affected.
sales_id maxorder
2 2400
7 720
1 460
8 1800
%%sql
SELECT *
FROM (SELECT sales_id, Max(amount) as MaxOrder
        FROM orders
        GROUP BY sales_id) as subquery
 * postgresql://shravan:***@127.0.0.1/interviews
4 rows affected.
sales_id maxorder
2 2400
7 720
1 460
8 1800
%%sql
SELECT order_number,amount, s.name, cust_id
FROM (SELECT sales_id, Max(amount) as MaxOrder
        FROM orders
        GROUP BY sales_id) as subquery
JOIN orders as o
ON subquery.sales_id = o.sales_id
JOIN salesperson as s
ON o.sales_id = s.id
WHERE subquery.maxorder = amount
 * postgresql://shravan:***@127.0.0.1/interviews
4 rows affected.
order_number amount name cust_id
20 1800 Ken 4
30 460 Abe 9
40 2400 Bob 7
60 720 Dan 6

%%sql
SELECT *
FROM orders
 * postgresql://shravan:***@127.0.0.1/interviews
7 rows affected.
order_number order_date cust_id sales_id amount
10 8/2/96 4 2 540
20 1/30/99 4 8 1800
30 7/14/95 9 1 460
40 1/29/98 7 2 2400
50 2/3/98 6 7 600
60 3/2/98 6 7 720
70 5/6/98 9 7 150

Window functions

The need for window functions arises from one of SQL’s limitations that we have come across time and again: when you use GROUP BY, every column in the SELECT must be either a grouping column or an aggregate.


A related limitation is that we can’t compare aggregate values with non-aggregate columns in the same query. We can work around this using a window function.

Window functions are a class of functions that perform calculations on a result set that has already been generated, also referred to as a window.

You can use window functions to perform aggregate calculations without having to group your data.

  • The OVER() clause allows you to pass an aggregate function down a data set
  • The OVER() clause offers significant benefits over subqueries in select – namely, your queries will run faster
  • The OVER() clause has a wide range of additional functions and clauses you can include with it.

The PARTITION BY clause allows you to calculate separate “windows” based on the columns you want to divide your results by. For example, you can create a single column that calculates an overall average per group (say, the average goals scored per season, in a soccer data set).
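
For example, on the employees table we already have, a window function can put each employee's salary next to their department's average without a GROUP BY (a sketch; the alias dept_avg is mine):

%%sql
SELECT first_name, salary,
       AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
FROM employees;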

import pandas as pd
olympics = pd.read_csv('https://assets.datacamp.com/production/repositories/3804/datasets/cab9ee3db38ef20757cbe48aaff61c7db00f6acd/summer.csv')
olympics.head()
Year City Sport Discipline Athlete Country Gender Event Medal
0 1896 Athens Aquatics Swimming HAJOS Alfred HUN Men 100M Freestyle Gold
1 1896 Athens Aquatics Swimming HERSCHMANN Otto AUT Men 100M Freestyle Silver
2 1896 Athens Aquatics Swimming DRIVAS Dimitrios GRE Men 100M Freestyle For Sailors Bronze
3 1896 Athens Aquatics Swimming MALOKINIS Ioannis GRE Men 100M Freestyle For Sailors Gold
4 1896 Athens Aquatics Swimming CHASAPIS Spiridon GRE Men 100M Freestyle For Sailors Silver
%%sql
DROP TABLE IF EXISTS summer_medals;
CREATE TABLE IF NOT EXISTS summer_medals(year int, city varchar, sport varchar, discipline varchar,
                                        athlete varchar, country varchar, gender varchar, event varchar,
                                        medal varchar);
 * postgresql://shravan:***@127.0.0.1/employees
Done.
Done.





summer_medals_table_insert = ("""INSERT INTO summer_medals(year, city, sport, discipline,
athlete, country, gender, event, medal) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s);""")
insert_rows(olympics, summer_medals_table_insert)
%%sql
SELECT COUNT(*)
FROM summer_olympics;
 * postgresql://shravan:***@127.0.0.1/employees
(psycopg2.errors.UndefinedTable) relation "summer_olympics" does not exist
LINE 2: FROM summer_olympics;
             ^

[SQL: SELECT COUNT(*)
FROM summer_olympics;]
(Background on this error at: http://sqlalche.me/e/f405)
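
The problem is just the table name: it was created as summer_medals, not summer_olympics. The corrected query (output not recorded here) would be:

%%sql
SELECT COUNT(*)
FROM summer_medals;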


Appendix

!pip install psycopg2
Collecting psycopg2
  Downloading https://files.pythonhosted.org/packages/5c/1c/6997288da181277a0c29bc39a5f9143ff20b8c99f2a7d059cfb55163e165/psycopg2-2.8.3.tar.gz (377kB)
     |████████████████████████████████| 378kB 3.5MB/s eta 0:00:01
Building wheels for collected packages: psycopg2
  Building wheel for psycopg2 (setup.py) ... done
  Stored in directory: /Users/shravan/Library/Caches/pip/wheels/48/06/67/475967017d99b988421b87bf7ee5fad0dad789dc349561786b
Successfully built psycopg2
Installing collected packages: psycopg2
Successfully installed psycopg2-2.8.3
!pip install ipython-sql
Collecting ipython-sql
  Downloading https://files.pythonhosted.org/packages/ab/df/427e7cf05ffc67e78672ad57dce2436c1e825129033effe6fcaf804d0c60/ipython_sql-0.3.9-py2.py3-none-any.whl
Collecting prettytable (from ipython-sql)
  Downloading https://files.pythonhosted.org/packages/ef/30/4b0746848746ed5941f052479e7c23d2b56d174b82f4fd34a25e389831f5/prettytable-0.7.2.tar.bz2
Requirement already satisfied: ipython-genutils>=0.1.0 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython-sql) (0.2.0)
Requirement already satisfied: ipython>=1.0 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython-sql) (7.6.1)
Requirement already satisfied: six in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython-sql) (1.12.0)
Requirement already satisfied: sqlalchemy>=0.6.7 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython-sql) (1.3.5)
Collecting sqlparse (from ipython-sql)
  Downloading https://files.pythonhosted.org/packages/ef/53/900f7d2a54557c6a37886585a91336520e5539e3ae2423ff1102daf4f3a7/sqlparse-0.3.0-py2.py3-none-any.whl
Requirement already satisfied: jedi>=0.10 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.13.3)
Requirement already satisfied: pygments in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (2.4.2)
Requirement already satisfied: backcall in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.1.0)
Requirement already satisfied: appnope; sys_platform == "darwin" in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.1.0)
Requirement already satisfied: pexpect; sys_platform != "win32" in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (4.7.0)
Requirement already satisfied: setuptools>=18.5 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (41.0.1)
Requirement already satisfied: traitlets>=4.2 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (4.3.2)
Requirement already satisfied: decorator in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (4.4.0)
Requirement already satisfied: pickleshare in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (0.7.5)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from ipython>=1.0->ipython-sql) (2.0.9)
Requirement already satisfied: parso>=0.3.0 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from jedi>=0.10->ipython>=1.0->ipython-sql) (0.5.0)
Requirement already satisfied: ptyprocess>=0.5 in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from pexpect; sys_platform != "win32"->ipython>=1.0->ipython-sql) (0.6.0)
Requirement already satisfied: wcwidth in /Users/shravan/anaconda3/envs/etl/lib/python3.7/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython>=1.0->ipython-sql) (0.1.7)
Building wheels for collected packages: prettytable
  Building wheel for prettytable (setup.py) ... done
  Stored in directory: /Users/shravan/Library/Caches/pip/wheels/80/34/1c/3967380d9676d162cb59513bd9dc862d0584e045a162095606
Successfully built prettytable
Installing collected packages: prettytable, sqlparse, ipython-sql
Successfully installed ipython-sql-0.3.9 prettytable-0.7.2 sqlparse-0.3.0

Upgrading postgres on mac

So, I tried to start Postgres on my Mac after some time away, and it asked me to upgrade. Fine, I said, let me upgrade. But little did I know that it needed a data migration: all my DBs were created under version 11, and it asked me to run brew postgresql-upgrade-database

(base) shravan-postgres# brew postgresql-upgrade-database
==> brew install postgresql@11
==> Downloading https://homebrew.bintray.com/bottles/postgresql%4011-11.9.mojave.bottle.tar.gz
==> Downloading from https://d29vzk4ow07wi7.cloudfront.net/816c4f41cc68aa079c3b8840a414ace69006bff6914160a8bfe86d5365cf84eb?response-content-disposition=attachment%3Bfilename%3D%22postgresql%4011-11.9.
######################################################################## 100.0%
==> Pouring postgresql@11-11.9.mojave.bottle.tar.gz
==> /usr/local/Cellar/postgresql@11/11.9/bin/initdb --locale=C -E UTF-8 /usr/local/var/postgresql@11
==> Caveats
To migrate existing data from a previous major version of PostgreSQL run:
  brew postgresql-upgrade-database

This formula has created a default database cluster with:
  initdb --locale=C -E UTF-8 /usr/local/var/postgres
For more details, read:
  https://www.postgresql.org/docs/11/app-initdb.html

postgresql@11 is keg-only, which means it was not symlinked into /usr/local,
because this is an alternate version of another formula.

If you need to have postgresql@11 first in your PATH run:
  echo 'export PATH="/usr/local/opt/postgresql@11/bin:$PATH"' >> /Users/shravan/.bash_profile

For compilers to find postgresql@11 you may need to set:
  export LDFLAGS="-L/usr/local/opt/postgresql@11/lib"
  export CPPFLAGS="-I/usr/local/opt/postgresql@11/include"


To have launchd start postgresql@11 now and restart at login:
  brew services start postgresql@11
Or, if you don't want/need a background service you can just run:
  pg_ctl -D /usr/local/var/postgresql@11 start
==> Summary
🍺  /usr/local/Cellar/postgresql@11/11.9: 3,197 files, 36.2MB
==> Upgrading postgresql data from 11 to 12...
waiting for server to start....2020-08-22 21:11:06.172 EDT [22453] LOG:  listening on IPv6 address "::1", port 5432
2020-08-22 21:11:06.172 EDT [22453] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2020-08-22 21:11:06.172 EDT [22453] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-08-22 21:11:06.222 EDT [22454] LOG:  database system was interrupted; last known up at 2020-02-25 18:03:12 EST
2020-08-22 21:11:06.879 EDT [22454] LOG:  database system was not properly shut down; automatic recovery in progress
2020-08-22 21:11:06.889 EDT [22454] LOG:  redo starts at 0/8F467F0
2020-08-22 21:11:06.889 EDT [22454] LOG:  invalid record length at 0/8F468D0: wanted 24, got 0
2020-08-22 21:11:06.889 EDT [22454] LOG:  redo done at 0/8F46898
2020-08-22 21:11:06.906 EDT [22453] LOG:  database system is ready to accept connections
 done
server started
waiting for server to shut down....2020-08-22 21:11:07.116 EDT [22453] LOG:  received fast shutdown request
2020-08-22 21:11:07.117 EDT [22453] LOG:  aborting any active transactions
2020-08-22 21:11:07.118 EDT [22453] LOG:  background worker "logical replication launcher" (PID 22460) exited with exit code 1
2020-08-22 21:11:07.118 EDT [22455] LOG:  shutting down
2020-08-22 21:11:07.122 EDT [22453] LOG:  database system is shut down
 done
server stopped
==> Moving postgresql data from /usr/local/var/postgres to /usr/local/var/postgres.old...
==> Creating database...
The files belonging to this database system will be owned by user "shravan".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /usr/local/var/postgres ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... America/New_York
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    /usr/local/opt/postgresql/bin/pg_ctl -D /usr/local/var/postgres -l logfile start

==> Migrating and upgrading data...
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for tables WITH OIDS                               ok
Checking for invalid "sql_identifier" user columns          ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows in the new cluster                        ok
Deleting files from new pg_xact                             ok
Copying old pg_xact to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      ok
Setting frozenxid and minmxid counters in new cluster       ok
Restoring global objects in the new cluster                 ok
Restoring database schemas in the new cluster
                                                            ok
Copying user relation files
                                                            ok
Setting next OID for new cluster                            ok
Sync data directory to disk                                 ok
Creating script to analyze new cluster                      ok
Creating script to delete old cluster                       ok

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
    ./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh
==> Upgraded postgresql data from 11 to 12!
==> Your postgresql 11 data remains at /usr/local/var/postgres.old
(base) shravan-postgres#

READ the output carefully. It migrated all the databases and took a backup of the old cluster under /usr/local/var/postgres.old.

Now start postgres like so:

(base) shravan-postgres# /usr/local/opt/postgresql/bin/pg_ctl -D /usr/local/var/postgres -l logfile start
waiting for server to start.... done
server started
(base) shravan-postgres#