Exploratory Data Analysis of JSON data using PostgreSQL
Introduction
Exploratory Data Analysis (EDA) is one of the first steps in any type of analysis. By asking the right questions, we can gain an understanding of our data and uncover any data-related issues. When data is stored in JSON format inside the database, we can use PostgreSQL’s built-in operators and functions to query the JSON data directly.
Data Description
The schema mainly contains 4 tables:
provider_json (npi bigint, record json): This table contains the provider data.
insurances (uuid uuid, display text): This table contains the unique insurance plans.
specialties (uuid uuid, display text): This table contains the unique specialties.
provider_puf_data (npi bigint, source text, address json, plans json, created_at timestamp): This table contains a couple of snapshots of data that is routinely ingested, with records from January and June.
Questions
%load_ext sql
%sql postgresql://<user-name>:<passwd>@server.com/dbname
'Connected: user@dbname'
1. What is the average age of the providers in provider_json?
%%sql
SELECT
FLOOR(AVG((record ->> 'age')::int)) as age
FROM provider_json;
* postgresql://read_user:***@server/dbname
1 rows affected.
age |
---|
57 |
2. What is the most popular specialty of the providers in provider_json?
%%sql
SELECT
sp.display as most_popular_specialty
FROM (SELECT
-- Unnest the array to create a row for each specialty
json_array_elements_text(record -> 'specialties') as specialty
FROM provider_json) as s
JOIN specialties as sp
-- Cast to UUID type before joining
ON s.specialty::uuid = sp.uuid
GROUP BY 1
ORDER BY COUNT(*) DESC
LIMIT 1
* postgresql://read_user:***@server/dbname
1 rows affected.
most_popular_specialty |
---|
Internal Medicine |
3. Providers practicing at locations with high confidence score
Question: You’ll notice that within each object of the locations array in the provider_json record field, there is a “confidence” score. This represents our model’s estimate of the probability that a provider is indeed practicing at this location. We allow scores between 0 and 5. How many providers have at least 2 locations with a confidence of 4 or higher?
%%sql
SELECT
COUNT(DISTINCT provider) as num_providers_high_conf_score
FROM (SELECT
provider
-- Sub query to unnest the entries of locations array
FROM (SELECT
record->>'npi' as provider,
json_array_elements(record->'locations') as location
FROM provider_json) as l
-- Only keep locations that meet the confidence score criteria
WHERE (location->>'confidence')::int>=4
GROUP BY provider
-- Only keep providers that meet the minimum number of locations criteria
HAVING COUNT(location)>=2) as num_p
* postgresql://read_user:***@server/dbname
1 rows affected.
num_providers_high_conf_score |
---|
2054 |
4. Number of unique insurances at most popular location
Question: In all provider records, you’ll see a field called “insurances”. This is the unique list of all insurance plans a provider accepts (we represent each plan as a UUID, which connects to the insurances table). For now, let’s assume a provider accepts all of these insurance plans at every location they practice at. Find the total number of unique insurance plans accepted by all providers at the most popular location in this data set. (Popular = the most providers practice there.)
%%sql
-- Setup the location cte
WITH location_cte as
(
SELECT
npi as provider,
json_array_elements(record->'locations')->>'uuid' as location
FROM provider_json
),
-- Setup the popular locations cte
popular_location as
(
SELECT
location
-- Use the location_cte
FROM location_cte as l
GROUP BY location
-- Order by number of providers at each location
ORDER BY COUNT(provider) DESC
LIMIT 1
),
-- Setup the providers at the most popular location
providers_at_popular_location as
(
SELECT
provider
-- Use the popular_location to filter location_cte
FROM location_cte
WHERE location IN (SELECT location FROM popular_location)
)
-- Get count of distinct insurances for providers at popular location
SELECT
COUNT(distinct ins.insurances::text) as total_unique_insurances_at_most_popular_location
FROM providers_at_popular_location as pop_prov
LEFT JOIN (SELECT
record->>'npi' as provider,
json_array_elements(record->'insurances') as insurances
FROM provider_json) as ins
ON pop_prov.provider::text = ins.provider
* postgresql://read_user:***@server/dbname
1 rows affected.
total_unique_insurances_at_most_popular_location |
---|
361 |
5. Neglected provider fields
Which provider fields in provider_json are the most neglected? How would you go about figuring this out beyond this small sample?
Assuming that we consider a field neglected when it does not have a value, we can build this logic into the query by making use of the json_each_text() function to get (key, value) pairs for each top-level field in the record. Next, we use the split_part() function to extract the key and value into their own columns. For the sake of readability, I created a provider_key_value cte and then used a simple CASE WHEN to encode rows that contain an '' or [] as 1, indicating that they are ignored. Finally, we can group by key and sum up the num_ignored values for each key.
By using this logic, age is the most neglected field.
%%sql
-- Setup the provider_fields
WITH provider_fields as (
SELECT
json_each_text(record) as field
FROM provider_json
), provider_key_value as (
SELECT
split_part(split_part(field::text, ',', 1), '(', 2) as key,
split_part(split_part(field::text, ',', 2), ')', 1) as value
FROM provider_fields
)
SELECT
key,
SUM(CASE WHEN value != '' AND value != '[]' THEN 0 ELSE 1 END) as num_ignored
FROM provider_key_value
GROUP BY key
ORDER BY num_ignored desc;
* postgresql://read_user:***@server/dbname
16 rows affected.
key | num_ignored |
---|---|
age | 6597 |
degrees | 4276 |
insurances | 3186 |
educations | 3138 |
ratings_avg | 3052 |
languages | 2997 |
gender | 2366 |
middle_name | 2059 |
online_profiles | 2012 |
locations | 60 |
provider_types | 6 |
specialties | 6 |
first_name | 1 |
ratings_count | 0 |
npi | 0 |
last_name | 0 |
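As a sanity check, the same neglected-field tally can be sketched in plain Python, assuming the records have been loaded as dicts. The sample records below are made up for illustration, not taken from the data set:

```python
from collections import Counter

# Hypothetical sample records mimicking the top-level fields of provider_json
records = [
    {"npi": 1, "age": "", "degrees": [], "first_name": "Ann"},
    {"npi": 2, "age": 45, "degrees": ["MD"], "first_name": "Bob"},
    {"npi": 3, "age": "", "degrees": [], "first_name": ""},
]

# A field counts as "neglected" when its value is an empty string or empty list,
# matching the CASE WHEN in the SQL above
num_ignored = Counter()
for record in records:
    for key, value in record.items():
        num_ignored[key] += 1 if value in ("", []) else 0

# Most neglected fields first (here: age and degrees)
for key, count in num_ignored.most_common():
    print(key, count)
```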
6. Top three sources/brokers with most added/removed rows
Question: Who are the top 3 sources/brokers from provider_puf_data that had the most rows added or dropped between January and June? NOTE: If a source didn’t exist in January or June, treat that source like it has 0 rows for that date.
%%sql
-- Setup the jan_brokers cte
WITH jan_brokers as(
SELECT
source,
COUNT(*) as jan_count
FROM provider_puf_data
-- Only keep the jan rows
WHERE date_trunc('month', created_at) = '2019-01-01'
GROUP BY source
),
-- Setup the jun_brokers cte
jun_brokers as(
SELECT
source,
COUNT(*) as jun_count
FROM provider_puf_data
-- Only keep the jun rows
WHERE date_trunc('month', created_at) = '2019-06-01'
GROUP BY source
)
-- Query to calculate top three brokers with biggest difference
SELECT
-- As we are using a full-outer join, we want to keep the non-null entries
CASE WHEN jan_brokers.source IS NOT NULL THEN jan_brokers.source
ELSE jun_brokers.source END as brokers,
-- Take the absolute difference to account for added/removed rows
ABS(COALESCE(jan_count, 0) - COALESCE(jun_count, 0)) as diff
FROM jan_brokers
FULL JOIN jun_brokers
ON jan_brokers.source = jun_brokers.source
ORDER BY diff DESC
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
brokers | diff |
---|---|
http://mydental.guardianlife.com/secure/json/index.json | 798 |
https://www.dentegra.com/hcx/cms-data-index.json | 604 |
https://api.centene.com/ambetter/reference/cms-data-index.json | 237 |
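The FULL JOIN with COALESCE above boils down to an absolute difference over the union of sources. A minimal Python sketch of that pattern, using made-up counts:

```python
# Hypothetical per-source row counts for the two snapshots
jan_counts = {"source_a": 100, "source_b": 50}
jun_counts = {"source_b": 80, "source_c": 40}

# The union of keys plays the role of the FULL JOIN;
# dict.get(source, 0) plays the role of COALESCE(count, 0)
diffs = {
    source: abs(jan_counts.get(source, 0) - jun_counts.get(source, 0))
    for source in jan_counts.keys() | jun_counts.keys()
}

# Sort sources by absolute change, largest first
top = sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # source_a leads with an absolute change of 100
```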
7. NPIs with most newly added addresses
Question: Using only the “address” string in the address column of provider_puf_data, which NPIs had the most new addresses added between January and June, and how many new addresses were added? (The top 3 NPIs will do.)
%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jan_address
FROM provider_puf_data
-- Only keep records in january
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jun_address
FROM provider_puf_data
-- Only keep records in june
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
SELECT
npi,
jan_address
FROM jan_addresses
GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
SELECT
npi,
jun_address
FROM jun_addresses
GROUP BY npi, jun_address
),
-- Setup query to calculate new addresses added in june
newly_added_addresses as (
SELECT *
FROM jun_unique_addresses
EXCEPT
SELECT *
FROM jan_unique_addresses
)
-- Query to display top three npi with most number of new addresses
SELECT
npi,
count(*) as num_new
FROM newly_added_addresses
GROUP BY npi
ORDER BY num_new desc
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
npi | num_new |
---|---|
1801806518 | 73 |
1457791816 | 68 |
1265411953 | 57 |
8. NPIs with most number of addresses removed
Question: Now the opposite of the above question, which NPIs saw the most addresses removed between January and June? (the top 3 NPIs will do)
%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jan_address
FROM provider_puf_data
-- Only keep records in january
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jun_address
FROM provider_puf_data
-- Only keep records in june
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
SELECT
npi,
jan_address
FROM jan_addresses
GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
SELECT
npi,
jun_address
FROM jun_addresses
GROUP BY npi, jun_address
),
-- Setup query to calculate addresses that got removed in june
removed_addresses as (
SELECT *
FROM jan_unique_addresses
EXCEPT
SELECT *
FROM jun_unique_addresses
)
-- Query to display top three npi with most number of addresses removed
SELECT
npi,
COUNT(*) as num_removed
FROM removed_addresses
GROUP BY npi
ORDER BY num_removed desc
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
npi | num_removed |
---|---|
1659334894 | 65 |
1407379845 | 44 |
1073694808 | 30 |
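The EXCEPT clauses in exercises 7 and 8 are plain set differences over (npi, address) pairs. The same bookkeeping can be sketched in Python with hypothetical addresses:

```python
from collections import Counter

# Hypothetical (npi, normalized address) pairs for the two snapshots
jan = {(1, "1 MAIN ST"), (1, "2 OAK AVE"), (2, "9 ELM ST")}
jun = {(1, "1 MAIN ST"), (1, "3 PINE RD"), (3, "7 HILL DR")}

added = jun - jan    # jun EXCEPT jan: addresses new in June
removed = jan - jun  # jan EXCEPT jun: addresses dropped by June

# Tally per NPI, mirroring the final GROUP BY npi
num_new = Counter(npi for npi, _ in added)
num_removed = Counter(npi for npi, _ in removed)
print(num_new.most_common(3), num_removed.most_common(3))
```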
9. Change in plans
Question: How did PUF plans within the plans field of provider_puf_data change from January to June? This is intentionally a bit open ended :)
Answer: I came up with two questions:
a. Which networks saw the greatest change from January to June?
%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
SELECT
npi,
-- Since every npi has at least one plan, we can use the first index
plans->0->>'network_tier' as jan_network
FROM provider_puf_data
-- Only keep records for the month of January
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
SELECT
npi,
-- Since every npi has at least one plan, we can use the first index
plans->0->>'network_tier' as jun_network
FROM provider_puf_data
-- Only keep records for the month of June
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
SELECT
jan_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jan_num_providers_in_network
FROM jan_networks
GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
SELECT
jun_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jun_num_providers_in_network
FROM jun_networks
GROUP BY jun_network
)
-- Query to calculate the difference in providers per network
SELECT
-- As we are doing a full outer join, pick a non-null value
CASE WHEN jan.jan_network IS NOT NULL THEN jan.jan_network
ELSE jun.jun_network END as network,
-- Take the absolute difference between num providers
ABS(COALESCE(jan_num_providers_in_network, 0) -
COALESCE(jun_num_providers_in_network, 0)) as difference
FROM jan_npi_per_network as jan
-- Perform full outer join to get non-matching rows from both sides
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
ORDER BY difference desc
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
network | difference |
---|---|
HERITAGE | 103 |
PPO | 70 |
NEIGHBORHOOD | 70 |
b. List the networks that did not add any new providers from January to June.
%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
SELECT
npi,
-- Since every npi has at least one plan, we can use the first index
plans->0->>'network_tier' as jan_network
FROM provider_puf_data
-- Only keep records for the month of January
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
SELECT
npi,
-- Since every npi has at least one plan, we can use the first index
plans->0->>'network_tier' as jun_network
FROM provider_puf_data
-- Only keep records for the month of June
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
SELECT
jan_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jan_num_providers_in_network
FROM jan_networks
GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
SELECT
jun_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jun_num_providers_in_network
FROM jun_networks
GROUP BY jun_network
)
-- Query to find networks present in January but with no providers in June
SELECT
jan.jan_network as networks_with_no_providers
FROM jan_npi_per_network as jan
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
WHERE jun.jun_network is NULL
* postgresql://read_user:***@server/dbname
13 rows affected.
networks_with_no_providers |
---|
ACCESS |
BLUE-FOCUSCARE |
CD-CONNECTION-DENTAL--PREFI |
CR--CAREINTON-PREFIX-DJ |
DC--DENTALGUARD-CONNECT-ADMIN |
DG--DENTALGUARD-PREFERRED-P |
DM-DENTEMAX-PREFIX-DW |
ELITE |
HERITAGE-PLUS |
NEIGHBORHOOD |
PIMAFOCUS |
SK-STRATOSE-FSB-MAVAREST-PPO |
SP-AETNA-PPO-PREFIX-DSW |
10. How to address the address column?
If you look closely at the address strings from exercises 7/8, you’ll notice a lot of redundant addresses that are just slightly different. If we could merge these addresses, we could get much better resolution on which addresses are actually changing between January and June. Given the data you have here, how would you accomplish this? What if you could use any tools available? How would you begin architecting this for now and the future?
A simple way to merge these fields into one is by using the Postgres string functions CONCAT, UPPER, and TRIM, and grouping by provider. However, this does not handle the case where we have multiple representations of the same address, shown in the cell after this.
%%sql
SELECT
npi,
COUNT(DISTINCT TRIM(UPPER(CONCAT(address->>'address', ' ',
address->>'address_2', ' ',
address->>'city', ' ',
address->>'state', ' ',
address->>'zip'
)))) as unique_addresses
FROM provider_puf_data
GROUP BY npi
ORDER BY unique_addresses desc
LIMIT 10;
* postgresql://read_user:***@server/dbname
10 rows affected.
npi | unique_addresses |
---|---|
1659334894 | 448 |
1538375241 | 256 |
1659763092 | 238 |
1821510181 | 230 |
1457751075 | 212 |
1972540664 | 209 |
1114191335 | 178 |
1730203100 | 175 |
1205822723 | 172 |
1366824997 | 147 |
%%sql
SELECT
UPPER(address->>'address') as street,
UPPER(address->>'city') as city,
UPPER(address->>'state') as state,
address->>'zip' as zipcode
FROM provider_puf_data
WHERE npi = '1538375241' and address->>'zip' = '80023'
GROUP BY street, city, state, zipcode
* postgresql://read_user:***@server/dbname
4 rows affected.
street | city | state | zipcode |
---|---|---|---|
14422 ORCHARD PARKWAY | WESTMINSTER | CO | 80023 |
14422 ORCHARD PARKWAY SUITE 200 | WESTMINSTER | CO | 80023 |
14422 ORCHARD PKWY | WESTMINSTER | CO | 80023 |
14422 ORCHARD PKY #200 | WESTMINSTER | CO | 80023 |
All 4 of these records should be counted as 1 address. One way forward is to further break the street into house_no and street_name, but this does not address cases where PARKWAY is written as PKWY. An alternative is to employ text similarity by enabling the fuzzystrmatch extension in Postgres, calculating the pair-wise text similarity of addresses for each provider, and discarding near-duplicate addresses with high similarity.
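As a rough sketch of the text-similarity idea (using Python's standard-library difflib here instead of fuzzystrmatch, purely for illustration), near-duplicate street strings score highly and can be greedily collapsed into clusters:

```python
from difflib import SequenceMatcher

# The four variant spellings from the result table above
streets = [
    "14422 ORCHARD PARKWAY",
    "14422 ORCHARD PARKWAY SUITE 200",
    "14422 ORCHARD PKWY",
    "14422 ORCHARD PKY #200",
]

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters to total characters, in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

# Greedily assign each street to the first cluster it closely matches;
# the 0.7 threshold is an assumption chosen for this toy example
clusters = []
for street in streets:
    for cluster in clusters:
        if similarity(street, cluster[0]) >= 0.7:
            cluster.append(street)
            break
    else:
        clusters.append([street])

print(len(clusters))
```

A production approach would normalize abbreviations (PARKWAY vs PKWY) with an address-parsing library such as libpostal before comparing.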
11. How long did it take?
Question: How long did it take to complete the exercise? (To be fair to candidates who are time constrained, we like to be aware of this.)
It took me about 10 hours overall.