Retrieve data from API and store it in Postgres

7 minute read

Retrieve data from API and store it in a Postgres Table.

Frame.io is a cloud-based collaboration tool that allows users to share content with teams and collaborators to get feedback and approvals in real-time. The flow is as follows: a user signs up for an account, creates a project(s), and invites team members and collaborators to the platform. All user activity is captured internally in a table called audit_logs, which is a collection of activities on the account. One of our goals at Frame.io is to connect disparate sources of data to understand users and better capture opportunities that increase engagement, retention and revenue on the platform.

Frame.io’s Data is structured as follows:

  • A company has a paid enterprise Account
  • Accounts have Team(s)
  • Teams have Projects and Team Members
  • Projects have Files (also known as “Assets”)
  • Projects also have Collaborators who are 3rd party contributors. They leave Comments but can’t actually upload content.

Problem Description

Goal is to develop a Python script that makes a GET request to Frame.io’s Audit Logs API endpoint to retrieve Audit data for a specific account.

We’ll be using a dummy Account already populated with data for you to work with.

  • Account ID: d1688b1e-d748-56fe-95d2-7686a63f08f9
  • Token: bbi-u-DX98D5W456o-xdlkjfdlkjasldjfkasdjklfassdfadasdfas

NOTE: Get your own account_id and token by signing up to their API. The above won’t work, because it is just random info

The script should:

  1. Retrieve Audit data from 2020-01-21 to 2020-01-29
  2. Parse the JSON response payload
  3. Write the data to a Postgres table

The table should represent Distinct Users who have, in the provided date range, done any of the following:

  • created at least one Asset
  • created at least one Comment

The table should contain the User’s id, and some way of indicating whether they have only created Asset(s), created Comment(s), or both.

Working with APIs using requests

It is easy to load JSON data when given a path or URL to a JSON file. Here we will look at how we can work with APIs - which are a common source of JSON data. An API is a defined way for an application to communicate with other programs. They let programmers get data from an application without having to know about the application’s database architecture.

One thing to note about APIs is that they are shared resources, and they often limit how much data you can get in a specified timeframe. Using an API to get data is like using a catalog to order products. The catalog shows what’s available and provides order instructions. You then send a properly formatted order to the right address and get back what you asked for.

Similarly, an API provides an endpoint to send requests to, and also provides documentation to describe what a request should look like, such as which parameters to include.

While there are python libraries geared towards popular APIs, we will use requests library. requests lets users send and get data from any URL, so it is not tied to any particular API. The function to retrieve data from the URL is requests.get(). This takes a string of the URL to get data from, has optional keywords arguments that are useful for working with APIs.

  • The params keyword, lets you pass a dictionary of parameter names and values to customize API requests.
  • The headers keyword, also takes a dictionary of names and values. If the API you are using requires a user authentication key, it would be passed in the header.

The result is a response object, containing data and metadata. We need to use the response object’s JSON method to get just the data. Importantly, the JSON method returns a dictionary, which our pd.read_json() cannot parse - it expects a strings not dictionary.

Thus, in order to load the data into a data frame, we need to use pd.DataFrame() instead.

Here, we will use requests to work with frame.io API.

Frame.io API Description

API Structure: The Frame.io API supports common concepts like rate limiting, pagination for resource collections, versioning, and errors. This section describes the specifics of each.

Organization and style: The API is organized around common REST principles. All requests should be made over SSL. All request and response bodies, including errors, are encoded in JSON.

Unless otherwise specified, API methods generally conform to the following:

  • Properties without a value will use null instead of being undefined.
  • “Snake Case” is used for attribute names (e.g. first_name)
  • Timestamps are rendered in ISO-8601 format (e.g. 2016-02-03T16:38:46.985Z)

Path conventions:

  • Accounts
    • Teams
      • Projects
        • Assets
          • Comments

Generally, resource paths in the Frame.io API will follow the above hierarchy model, to within one parent level. Where intuitive, the API supports standalone resource paths for objects that are strictly owned, logically speaking.

For example, the Frame.io API supports both of the following paths:

  • GET /accounts/:id/teams – returns all the Teams for an Account.
  • GET /teams – returns all the Teams for the calling User. Another example: Comments don’t mean much outside their Asset context, so Comment collection and creation methods sit within the Asset scope. However, if updating or deleting a comment, the Asset context isn’t as meaningful, and is omitted from the resource path:

  • GET /assets/:id/comments
  • POST /assets/:id/comments
  • PUT /comments/:id
  • DELETE /comments/:id

Setup Authentication

Authentication is required for making API calls with the Frame.io API. The Frame.io API offers two choices for authentication. You can use developer tokens, or you can use the OAuth2.0 authorization code flow, which is referred to as an OAuth App in the context of the Frame.io API. All actions using the API, whether conducted through either type of authentication, are controlled by scopes.

Which Authentication Type is Best for My Project?: In general, you’ll want to use a Developer Token for your project. Developer Tokens are associated with an individual user’s account, have the same access permissions as the user who created them, and are restricted by scopes. These are great when you want to do something with a specific project or actions within a particular team from the perspective of an individual user. They are faster to get started with because you don’t have to implement a full OAuth2 Authorization Code flow to retrieve them.

OAuth Apps (Authorization Code Flow) are associated with applications. If you need to retrieve an access token outside of the context of a user, this is ideal. This is typically used by clients to access resources about themselves rather than to access a user’s resources.

Developer Token: A developer token is a security token that allows access to whatever scopes and privileges you set up for it. Anyone in possession of the bearer token will have all the privileges set on the token.

Quick word about Scopes: If you are using a Developer Token then scopes are set and assigned to the access token (Developer Token).

Audit Logs API

Our goal is to write a script that makes a GET request to Frame.io’s Audit Logs API endpoint to retrieve Audit data for a specific account. Frame.io supports deep Audit Logs for most events taken in the application.

Frame.io supports and maintains Audit Logs for the vast majority of activities taken in its applications. This includes both basic CRUD on core resources, and some special abstractions (e.g. AssetVersioned).

Scope and Permissions: Only Account Admins may access the Audit Logs for an Account, and all calls to the Audit Logs endpoint must be scoped to an account_id as follows:

GET https://api.frame.io/v2/accounts/:id/audit_logs

As mentioned in the problem description section, we are given these details:

We’ll be using a dummy Account already populated with data for you to work with.

  • Account ID: d1688b1e-d748-56fe-95d2-7686a63f08f9
  • Token: bbi-u-DX98D5W456o-xdlkjfdlkjasldjfkasdjklfassdfadasdfas

Code

The code for this project can be found here: github_link

Project layout

.
├── README.md
├── frameio
│   ├── __init__.py
│   ├── api_client.py
│   ├── db_connection.py
│   └── sql_queries.py
├── configs.cfg
├── requirements.txt
├── etl.py
└── validation.py

How to run this project

  • Step 1: git clone https://github.com/shravan-kuchkula/api-to-postgres.git OR unzip the zip file.
  • Step 2: Create a database in postgres. Ex createdb -h localhost -p 5432 -U postgres frameiodb
  • Step 3: cd api-to-postgres
  • Step 4: Update configs.cfg file with database connection details and API access key
  • Step 5: Run pip install virtualenv
  • Step 6: Run virtualenv venv
  • Step 7: Run . venv/bin/activate
  • Step 8: Run pip install -r requirements.txt
  • Step 9: Run python etl.py
  • Step 10: Run python validation.py

Validation Output

(etl) shravan-api-to-postgres$ python validation.py
Executing: b'SELECT COUNT(*) FROM users;'
Output:
(125,)
Executing: b'SELECT COUNT(DISTINCT user_id) FROM users;'
Output:
(125,)
Executing: b'SELECT * FROM users LIMIT 10'
Output:
('e2f9ddf5-f9e5-4279-ba83-850e98f1de49', 'no', 'yes')
('3b83a844-ab0d-4881-968e-66d73bb8a99a', 'yes', 'no')
('442c913e-e767-47b5-b100-78693a6678a8', 'no', 'yes')
('3ecba4d7-1bd0-46a9-b526-a6724527b843', 'no', 'yes')
('fe6fee15-9d1f-41c3-98b3-34b0497c686a', 'no', 'yes')
('76d63f2d-0cf0-40ab-8b32-b63ff32da3d3', 'no', 'yes')
('658f6100-9359-45c9-bbdf-3ceaf355351a', 'no', 'yes')
('9c7bae2e-5987-4209-a145-41c584d465d9', 'no', 'yes')
('c13010a3-7d55-4267-acdc-a39b8b015a90', 'no', 'yes')
('89cce5dd-caee-4c9f-9ca3-d174d3cd380b', 'yes', 'no')
Executing: b"SELECT COUNT(user_id) FROM users WHERE is_asset = 'yes';"
Output:
(96,)
Executing: b"SELECT COUNT(user_id) FROM users WHERE is_comment = 'yes';"
Output:
(29,)
Executing: b"SELECT COUNT(user_id) FROM users WHERE is_comment = 'yes' AND is_asset = 'yes';"
Output:
(0,)
(etl) shravan-api-to-postgres$