Exploratory Data Analysis of JSON data using PostgreSQL

Introduction

Exploratory Data Analysis (EDA) is one of the first steps in any kind of analysis. By asking the right questions, we can build an understanding of our data and uncover data-related issues early. When data is stored in JSON format inside the database, we can use PostgreSQL's built-in operators and functions to query the JSON directly.

Data Description

The schema contains four main tables.

  • provider_json (npi bigint, record json): This table contains provider data.
  • insurances (uuid uuid, display text): This table contains the unique insurance plans.
  • specialties (uuid uuid, display text): This table contains the unique specialties.
  • provider_puf_data (npi bigint, source text, address json, plans json, created_at timestamp): This table contains a couple of snapshots of routinely ingested data. The records here were ingested in January and June.
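Since record, address, and plans are json columns, PostgreSQL's JSON operators apply directly: `->` extracts a JSON object field or array element, `->>` extracts it as text, and json_array_elements_text() unnests a JSON array. A quick illustrative sketch (the field names anticipate the examples later in this post):

%%sql
SELECT
    record ->> 'npi' as npi,                                   -- field as text
    record -> 'locations' -> 0 ->> 'uuid' as first_location,   -- chained drill-down into nested JSON
    json_array_elements_text(record -> 'specialties') as specialty  -- one row per array element
FROM provider_json
LIMIT 5;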

Questions

%load_ext sql
%sql postgresql://<user-name>:<passwd>@server.com/dbname
'Connected: user@dbname'

1. What is the average age of the providers in provider_json?

%%sql
SELECT
    FLOOR(AVG((record ->> 'age')::int)) as age
FROM provider_json;
 * postgresql://read_user:***@server/dbname
1 rows affected.
age
57
2. What is the most popular specialty?

%%sql
SELECT
    sp.display as most_popular_specialty
FROM (SELECT
        -- Unnest the array to create a row for each specialty
        json_array_elements_text(record -> 'specialties') as specialty
    FROM provider_json) as s
JOIN specialties as sp
-- Cast to UUID type before joining
ON s.specialty::uuid = sp.uuid
GROUP BY 1
ORDER BY COUNT(*) DESC
LIMIT 1
 * postgresql://read_user:***@server/dbname
1 rows affected.
most_popular_specialty
Internal Medicine

3. Providers practicing at locations with high confidence score

Question: You’ll notice that within the locations objects of each provider_json record, there exists a “confidence” score. This represents our model’s estimate of the probability that a provider is indeed practicing at this location. We allow scores between 0 and 5. How many providers have at least 2 locations with a confidence of 4 or higher?

%%sql
SELECT
    COUNT(DISTINCT provider) as num_providers_high_conf_score
FROM (SELECT
            provider
      -- Sub query to unnest the entries of locations array
      FROM (SELECT
                record->>'npi' as provider,
                json_array_elements(record->'locations') as location
            FROM provider_json) as l
      -- Only keep locations that meet the confidence score criteria
      WHERE (location->>'confidence')::int>=4
      GROUP BY provider
      -- Only keep providers that meet the minimum number of locations criterion
      HAVING COUNT(location)>=2) as num_p
 * postgresql://read_user:***@server/dbname
1 rows affected.
num_providers_high_conf_score
2054

4. Unique insurance plans at the most popular location

Question: In all provider records, you’ll see a field called “insurances”. This is the unique list of all insurance plans a provider accepts (represented as UUIDs that connect to the insurances table). For now, let’s assume a provider accepts all of these insurance plans at every location they practice at. Find the total number of unique insurance plans accepted by all providers at the most popular location in this data set. (Popular = the most providers practice there.)

%%sql
-- Setup the location cte
WITH location_cte as
(
    SELECT
        npi as provider,
        json_array_elements(record->'locations')->>'uuid' as location
    FROM provider_json
),
-- Setup the popular locations cte
popular_location as
(
    SELECT
        location
    -- Use the location_cte
    FROM location_cte as l
    GROUP BY location
    -- Order by number of providers at each location
    ORDER BY COUNT(provider) DESC
    LIMIT 1  
),
-- Setup the providers at the most popular location
providers_at_popular_location as
(
    SELECT
        provider
    -- Use the popular_location to filter location_cte
    FROM location_cte
    WHERE location IN (SELECT location FROM popular_location)  
)
-- Get count of distinct insurances for providers at popular location
SELECT
    COUNT(distinct ins.insurances::text) as total_unique_insurances_at_most_popular_location
FROM providers_at_popular_location as pop_prov
LEFT JOIN (SELECT
            record->>'npi' as provider,
            json_array_elements(record->'insurances') as insurances
           FROM provider_json) as ins
ON pop_prov.provider::text = ins.provider
 * postgresql://read_user:***@server/dbname
1 rows affected.
total_unique_insurances_at_most_popular_location
361

5. Neglected provider fields

Which provider fields in provider_json are the most neglected? How would you go about figuring this out beyond this small sample?

Assuming that we consider a field neglected when it does not have a value, we can build this logic into the query using the json_each_text() function to get a (key, value) pair for each top-level field in the record. Next, we use the split_part() function to extract the key and value into their own columns. For readability, I created a provider_key_value CTE and then used a simple CASE WHEN to flag rows whose value is an empty string ('') or an empty array ('[]') as 1, indicating that the field was ignored. Finally, we group by key and sum the num_ignored values for each key.

By using this logic, age is the most neglected field.

%%sql
-- Setup the provider_fields
WITH provider_fields as (
    SELECT
        json_each_text(record) as field
    FROM provider_json
), provider_key_value as (
    SELECT
        split_part(split_part(field::text, ',', 1), '(', 2) as key,
        split_part(split_part(field::text, ',', 2), ')', 1) as value
    FROM provider_fields
)
SELECT
    key,
    SUM(CASE WHEN value != '' AND value != '[]' THEN 0 ELSE 1 END) as num_ignored
FROM provider_key_value
GROUP BY key
ORDER BY num_ignored desc;
 * postgresql://read_user:***@server/dbname
16 rows affected.
key num_ignored
age 6597
degrees 4276
insurances 3186
educations 3138
ratings_avg 3052
languages 2997
gender 2366
middle_name 2059
online_profiles 2012
locations 60
provider_types 6
specialties 6
first_name 1
ratings_count 0
npi 0
last_name 0
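As a side note, parsing the record text with split_part() is fragile if a value happens to contain a comma. An equivalent formulation (a sketch, not run against this data set) avoids the string surgery by calling json_each_text() in the FROM clause as a lateral set-returning function:

%%sql
SELECT
    f.key,
    -- Flag empty strings and empty arrays as ignored
    SUM(CASE WHEN f.value IS NULL OR f.value IN ('', '[]') THEN 1 ELSE 0 END) as num_ignored
FROM provider_json,
     LATERAL json_each_text(record) as f(key, value)
GROUP BY f.key
ORDER BY num_ignored DESC;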

6. Top three sources/brokers with most added/removed rows

Question: Who are the top 3 sources/brokers from provider_puf_data that had the most rows added or dropped between January and June? NOTE: If a source didn’t exist in January or June, treat that source as having 0 rows for that date.

%%sql
-- Setup the jan_brokers cte
WITH jan_brokers as(
    SELECT
        source,
        COUNT(*) as jan_count
    FROM provider_puf_data
    -- Only keep the jan rows
    WHERE date_trunc('month', created_at) = '2019-01-01'
    GROUP BY source
),
-- Setup the jun_brokers cte
jun_brokers as(
    SELECT
        source,
        COUNT(*) as jun_count
    FROM provider_puf_data
    -- Only keep the jun rows
    WHERE date_trunc('month', created_at) = '2019-06-01'
    GROUP BY source
)
-- Query to calculate top three brokers with biggest difference
SELECT
    -- As we are using a full outer join, keep the non-null source
    COALESCE(jan_brokers.source, jun_brokers.source) as brokers,
    -- Take the absolute difference to account for added/removed rows
    ABS(COALESCE(jan_count, 0) - COALESCE(jun_count, 0)) as diff
FROM jan_brokers
FULL JOIN jun_brokers
ON jan_brokers.source = jun_brokers.source
ORDER BY diff DESC
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
brokers diff
http://mydental.guardianlife.com/secure/json/index.json 798
https://www.dentegra.com/hcx/cms-data-index.json 604
https://api.centene.com/ambetter/reference/cms-data-index.json 237

7. NPIs with most newly added addresses

Question: Using only the “address” string in the address column of provider_puf_data, which NPIs had the most new addresses added between January and June, and how many new addresses were added? (The top 3 NPIs will do.)

%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jan_address
    FROM provider_puf_data
    -- Only keep records in january
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jun_address
    FROM provider_puf_data
    -- Only keep records in june
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
    SELECT
        npi,
        jan_address
    FROM jan_addresses
    GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
    SELECT
        npi,
        jun_address
    FROM jun_addresses
    GROUP BY npi, jun_address
),
-- Setup query to calculate new addresses added in june
newly_added_addresses as (
    SELECT *
    FROM jun_unique_addresses
    EXCEPT
    SELECT *
    FROM jan_unique_addresses
)
-- Query to display top three npi with most number of new addresses
SELECT
    npi,
    count(*) as num_new
FROM newly_added_addresses
GROUP BY npi
ORDER BY num_new desc
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
npi num_new
1801806518 73
1457791816 68
1265411953 57

8. NPIs with most number of addresses removed

Question: Now for the opposite of the question above: which NPIs saw the most addresses removed between January and June? (The top 3 NPIs will do.)

%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jan_address
    FROM provider_puf_data
    -- Only keep records in january
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jun_address
    FROM provider_puf_data
    -- Only keep records in june
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
    SELECT
        npi,
        jan_address
    FROM jan_addresses
    GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
    SELECT
        npi,
        jun_address
    FROM jun_addresses
    GROUP BY npi, jun_address
),
-- Setup query to calculate addresses that got removed in june
removed_addresses as (
    SELECT *
    FROM jan_unique_addresses
    EXCEPT
    SELECT *
    FROM jun_unique_addresses  
)
-- Query to display top three npi with most number of addresses removed
SELECT
    npi,
    COUNT(*) as num_removed
FROM removed_addresses
GROUP BY npi
ORDER BY num_removed desc
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
npi num_removed
1659334894 65
1407379845 44
1073694808 30

9. Change in plans

Question: How did PUF plans within the plans field of provider_puf_data change from January to June? This is intentionally a bit open ended :)

Answer: I came up with two questions:

a. Which networks saw the greatest change from January to June?

%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
   SELECT
        npi,
        -- Since every npi has at least one plan, we can use the first index
        plans->0->>'network_tier' as jan_network
    FROM provider_puf_data
    -- Only keep records for the month of January
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
    SELECT
        npi,
        -- Since every npi has at least one plan, we can use the first index
        plans->0->>'network_tier' as jun_network
    FROM provider_puf_data
    -- Only keep records for the month of June
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
    SELECT
        jan_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jan_num_providers_in_network
    FROM jan_networks
    GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
    SELECT
        jun_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jun_num_providers_in_network
    FROM jun_networks
    GROUP BY jun_network
)
-- Query to calculate the difference in providers per network
SELECT
    -- As we are doing a full outer join, pick the non-null network
    COALESCE(jan.jan_network, jun.jun_network) as network,
    -- Take the absolute difference between num providers
    ABS(COALESCE(jan_num_providers_in_network, 0) -
       COALESCE(jun_num_providers_in_network, 0)) as difference
FROM jan_npi_per_network as jan
-- Perform full outer join to get non-matching rows from both sides
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
ORDER BY difference desc
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
network difference
HERITAGE 103
PPO 70
NEIGHBORHOOD 70

b. List the networks that did not add any new providers from January to June.

%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
   SELECT
        npi,
        -- Since every npi has at least one plan, we can use the first index
        plans->0->>'network_tier' as jan_network
    FROM provider_puf_data
    -- Only keep records for the month of January
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
    SELECT
        npi,
        -- Since every npi has at least one plan, we can use the first index
        plans->0->>'network_tier' as jun_network
    FROM provider_puf_data
    -- Only keep records for the month of June
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
    SELECT
        jan_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jan_num_providers_in_network
    FROM jan_networks
    GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
    SELECT
        jun_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jun_num_providers_in_network
    FROM jun_networks
    GROUP BY jun_network
)
-- Query to find January networks with no providers in June
SELECT
    jan.jan_network as networks_with_no_providers
FROM jan_npi_per_network as jan
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
WHERE jun.jun_network is NULL
 * postgresql://read_user:***@server/dbname
13 rows affected.
networks_with_no_providers
ACCESS
BLUE-FOCUSCARE
CD-CONNECTION-DENTAL--PREFI
CR--CAREINTON-PREFIX-DJ
DC--DENTALGUARD-CONNECT-ADMIN
DG--DENTALGUARD-PREFERRED-P
DM-DENTEMAX-PREFIX-DW
ELITE
HERITAGE-PLUS
NEIGHBORHOOD
PIMAFOCUS
SK-STRATOSE-FSB-MAVAREST-PPO
SP-AETNA-PPO-PREFIX-DSW

10. How to address the address column?

If you look closely at the address strings in exercises 7/8, you’ll notice a lot of redundant addresses that are just slightly different. If we could merge these addresses, we could get much better resolution on which addresses are actually changing between January and June. Given the data you have here, how would you accomplish this? What if you could use any tools available? How would you begin architecting this for now and the future?

A simple way to merge these fields into one is to use the Postgres string functions CONCAT, UPPER, and TRIM, and group by provider. However, this does not handle the case where we have multiple representations of the same address, as demonstrated in the second cell below.

%%sql
SELECT
    npi,
    COUNT(DISTINCT TRIM(UPPER(CONCAT(address->>'address', ' ',
           address->>'address_2', ' ',
           address->>'city', ' ',
           address->>'state', ' ',
           address->>'zip'
    )))) as unique_addresses
FROM provider_puf_data
GROUP BY npi
ORDER BY unique_addresses desc
LIMIT 10;
 * postgresql://read_user:***@server/dbname
10 rows affected.
npi unique_addresses
1659334894 448
1538375241 256
1659763092 238
1821510181 230
1457751075 212
1972540664 209
1114191335 178
1730203100 175
1205822723 172
1366824997 147
%%sql
SELECT
    UPPER(address->>'address') as street,
    UPPER(address->>'city') as city,
    UPPER(address->>'state') as state,
    address->>'zip' as zipcode
FROM provider_puf_data
WHERE npi = '1538375241' and address->>'zip' = 80023::text
GROUP BY street, city, state, zipcode
 * postgresql://read_user:***@server/dbname
4 rows affected.
street city state zipcode
14422 ORCHARD PARKWAY WESTMINSTER CO 80023
14422 ORCHARD PARKWAY SUITE 200 WESTMINSTER CO 80023
14422 ORCHARD PKWY WESTMINSTER CO 80023
14422 ORCHARD PKY #200 WESTMINSTER CO 80023

All 4 of these records should be counted as 1 address. One option is to further split the street into house_no and street_name, but this still does not handle cases where PARKWAY is written as PKWY.
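One hedged sketch of that idea: normalize common abbreviations with regexp_replace() before grouping. The abbreviation list below is illustrative, not exhaustive:

%%sql
SELECT
    npi,
    -- \m and \M are POSIX word-boundary markers; PKY and PKWY both become PARKWAY
    regexp_replace(
        regexp_replace(UPPER(address->>'address'), '\mPKW?Y\M', 'PARKWAY', 'g'),
        '\mSTE\M', 'SUITE', 'g') as normalized_street
FROM provider_puf_data
LIMIT 10;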

An alternative is to employ text similarity: enable the fuzzystrmatch extension in Postgres, compute the pairwise edit distance between each provider’s addresses, and merge addresses that are nearly identical.
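A minimal sketch of that approach, assuming the fuzzystrmatch extension can be installed; the edit-distance threshold of 5 is an arbitrary illustration:

%%sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT
    a.npi,
    a.address->>'address' as address_1,
    b.address->>'address' as address_2
FROM provider_puf_data as a
JOIN provider_puf_data as b
  -- Compare each pair of a provider's addresses exactly once
  ON a.npi = b.npi
 AND a.address->>'address' < b.address->>'address'
-- Pairs within a small edit distance are candidates for merging
WHERE levenshtein(UPPER(a.address->>'address'), UPPER(b.address->>'address')) <= 5;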

11. How long did it take?

Question: How long did it take to complete the exercise? (To be fair to candidates who are time constrained, we like to be aware of this.)

It took me about 10 hours overall.