Exploratory Data Analysis of JSON data using PostgreSQL

18 minute read

Introduction

Doing Exploratory Data Analysis (EDA) is one of the first steps in doing any type of analysis. By asking the right type of questions, we can gain an understanding of our data and uncover any data related issues if any. When data is stored in JSON format inside the database, we can use PostgreSQL’s built in operators and functions to directly query the JSON data. In this post, we will take a look at how we can query JSON using SQL.

Data Description

The database schema contains 4 tables.

  • provider_json (npi bigint, record json): This table contains provider data.
  • insurances (uuid uuid, display text): This table contains the unique insurance plans.
  • specialties (uuid uuid, display text): This table contains the unique specialties.
  • provider_puf_data (npi bigint, source text, address json, plans json, created_at timestamp): This table contains a couple snapshots of data that is routinely ingested. The data contains records that were ingested in January and June.
%load_ext sql
%sql postgresql://<user>:<passwd>@host/dbname
'Connected: read_user@interview'

provider_json table

Let’s first look at the first few rows of provider_json table. As you can see the record data is stored in JSON format. This could be a response from an API that we are just storing into the table. The advantange of storing data in raw JSON format is that we don’t need to think about parsing the JSON before-hand and storing that data into multiple tables. We will store them in the denormalized form and write queries against this data.

%%sql
SELECT * FROM provider_json LIMIT 2;
 * postgresql://read_user:***@server/dbname
2 rows affected.
npi record
1669554267 {'npi': '1669554267', 'first_name': 'Hyung', 'middle_name': 'Soon', 'last_name': 'Park', 'age': 83, 'gender': 'm', 'ratings_count': 1, 'ratings_avg': 2.0, 'degrees': [], 'specialties': ['c741df78-8460-46f6-a2ce-93de36eb7205'], 'languages': [], 'educations': [], 'insurances': [], 'provider_types': ['Doctor'], 'locations': [{'uuid': '8d96e755-65d3-470f-baaf-c20c73d673bb', 'name': None, 'address': '2296 Opitz Blvd # 255, Woodbridge, VA 22191, US', 'address_details': {'street': '2296 Opitz Blvd # 255', 'address_line_1': '2296 Opitz Blvd', 'address_line_2': '# 255', 'city': 'Woodbridge', 'state': 'VA', 'zip': '22191'}, 'latitude': 38.63739, 'longitude': -77.28514129999999, 'google_maps_link': 'https://www.google.com/maps/@38.63739,-77.28514129999999?q=2296%20Opitz%20Blvd%20%23%20255%2C%20Woodbridge%2C%20VA%2022191%2C%20US', 'phone_numbers': [{'phone': '7036704986', 'details': 'primary'}], 'confidence': 2}], 'online_profiles': [{'url': 'https://www.healthgrades.com/physician/dr-hyung-park-x3syc'}]}
1467524520 {'npi': '1467524520', 'first_name': 'Rohit', 'middle_name': None, 'last_name': 'Bhaskar', 'age': 57, 'gender': 'm', 'ratings_count': 35, 'ratings_avg': 9.07, 'degrees': ['MD'], 'specialties': ['dff54cb3-c9b0-43a0-8ca8-26c7c05bee83', '1de33770-eb1c-47fa-ab3e-f9a4ab924d9d', '76126f5d-723a-40bc-b4c1-7dcda477797a', '475b20fc-c188-4ca1-8e3a-b0614b8f8ce2'], 'languages': ['english'], 'educations': [{'education': {'name': 'Cornell University', 'uuid': 'c4588ea7-567d-43dc-a0e5-67937fb77f5c'}, 'type': 'Undergraduate School', 'year': 1982}, {'education': {'name': 'Baylor College of Medicine', 'uuid': 'b2ed4e48-6cb0-42d2-8804-7557b6452e8d'}, 'type': 'Residency', 'year': None}, {'education': {'name': 'Icahn School of Medicine at Mount Sinai', 'uuid': '16e70085-6b81-4581-b903-dce9d9acf74a'}, 'type': 'Medical School', 'year': None}, {'education': {'name': 'Baylor College of Medicine', 'uuid': 'b2ed4e48-6cb0-42d2-8804-7557b6452e8d'}, 'type': 'Internship', 'year': None}], 'insurances': ['5ad8ff7d-5227-46d7-a392-d28665610257', '3a56d135-dded-48c3-90db-a4f5d9da9786', 'dc059078-9115-488f-bdf6-e9106822225a', 'daaa1ac3-eb9d-424b-a249-8d7c51d39ce1', 'ae0b1dd0-7356-41ee-8df4-871a52451eb7', 'd280371e-f9f1-42a2-aed5-fcb0fe89c219', '3e1ac59e-223a-469d-bb30-63f6988e4134', '5da15840-6f46-4978-8747-8be580d48079', '9eb760da-dade-475d-9390-0c998ff50fcc', 'ad3204b8-1355-48f4-9ee3-194fc5d970f1', '2d5231c0-2d29-4026-9138-5dcbd0591404', '6ab9d1d3-b969-4352-9fac-68b6494cbe7b', '67f5d31d-88f6-4f7a-9b1d-74667ca37f08', '90e00746-69ed-430e-98f9-f637abcbe144', 'b10209f5-96d9-4bbc-8fda-38590460a34a', 'cfb5d792-2d33-444f-a1e8-5487de63e9a0', '51a98931-aace-42c1-8890-66ce53eb1971', 'a6e318c6-cf38-4d65-8bd0-6976b647974b', 'ba8cc746-665a-4925-8f5f-3b213dad238d', '36f607e8-2785-411e-8384-f17195f09468', 'cbbc9743-5cdf-4df2-92ca-fcde6a536160', 'a003a89c-eecb-470b-938d-1b97225f3d81', '791d592b-d6c0-4576-99d0-d78dac8496a1', 'f6a0b45a-042e-4886-82c6-b3b9e086cf83', '57ecd933-4bb8-4289-9080-a8259a6d683c', '6d2b08ec-2f67-4a5b-ac89-642f15e032b8', '749b0b00-ddf8-4cef-bc5b-5694cccbad0f', 'a7edd72c-4010-448e-8624-c92f997fb2a2', '730a2f48-0cb6-49b8-a5bb-807534ef0f24', '9d5266c8-a7b3-4d27-ab96-3a524b4cc491', 'f7627ef2-7d1d-4974-b934-2fdb29c0ff4f', '69af8b6c-06b3-4b66-91c5-ede90be026d8', '96292303-3e6e-4758-a851-f049d7956345', '61a6872e-3c89-4c93-b9b2-ca6ca1db7a8e', '2fdd9e5c-840a-40f1-a575-8ac6ec6094eb', 'd4357c57-76d9-4fda-94b6-16ca8869cd45', 'b1085d28-d17c-42c1-8448-fc56f5bf8ff8', 'ebf487b7-36a0-4ba9-84f6-c2f5f004e06a', '5b7ba09b-9a42-454a-84a6-bc1ba7449379', '84d17e89-55e7-4a08-8468-f52780b926e4', '2e7f2d0a-308e-43d4-9bfb-227a7571550f', 'e364db55-c32d-4e9d-957b-84bfbf62dfea', 'b7f67654-891f-4152-acc8-def976e30826', 'c94db5a7-e5f7-4d1f-9ff9-7ba8e683b57d', 'f2466b1a-782f-4cac-a204-cacb50b8f322', '6417d256-4fdb-4a60-903b-8f42b0405653', '4a1b53fe-79e5-464c-83d9-14143306c02c', 'eecb0179-615f-42cd-9fc0-8f7570f4de63', '0bcbfcff-0e3f-47bc-a32e-a52add577d5b', '9d4642de-8626-4889-8315-9e7ec4f1cd1f', '71e54292-85b7-4cff-a1f0-e06e92f64b5e', '672921ed-9c68-45a5-b808-3df8261d8ff0', 'de8b20b6-a2e3-460a-9018-9d89c009cef8', '40e72184-ff85-456b-a794-35967bc8cfb2', '4d893957-8038-409e-8c96-2e5edd626d45', 'b70e5e27-f14b-4a83-84d8-bb711b70886f', 'd122f267-a255-4eed-afa1-8252e2cbdc17', '1d76bc02-2621-4041-914a-9b9e28babeff', 'd308de64-b0a3-4a50-96b1-4aa813794a1d', '49841d97-6ede-48af-841f-8548d2930e18', 'ee9db5f9-de43-4c3e-a156-d5ba8ba1f6b3', '931810bc-cd6d-4bb6-a77a-7c979fbac3fd', 'c5734165-3e62-4f14-9559-1e5caec18179', 'a0dd1e6d-ccf8-4f09-bbd1-e4fcbdf438ac', 'c1ba934d-cf6c-462d-af28-9cc92ab24937', '4c388b98-8aca-4c28-bd43-74ab84d438f2', '01047fb0-4036-4dcd-81e0-62e6cef4e3e8', '1c27a9d5-3412-41f7-bde2-87eba4abcb5e', '9161f466-8e7e-493d-bd61-d647755dbc53', '46db02d5-a040-4ca5-9510-ef0591feca22', '3a4b2c17-4235-4982-b275-4018c5283ab4', 'ed3f6f88-a4ad-4b56-9d31-8d086ddd1521', 'dae8c52c-b0e0-4688-9078-6e5dcef9de03', '87842763-d404-418e-a265-fea0ef2cac50', '413e5c40-5f12-47c2-9142-4ee27ff61372', 'd54100c0-7990-4638-b57d-79df580f192f', '5b20ceac-5463-4255-b55d-c1a639cc3cc7', '239f0c59-4f28-43ab-9fde-ded6ad8e6cfe', '837c4cd1-4324-44eb-8af3-5bcf2622d8be', '1ec09209-7e22-4698-981d-9cbb150c5746', 'd1c4b24d-2136-4e8d-8497-8ff1fced234d', 'f67a5513-7c63-4243-afcf-84684202f901', '7dca5f3b-e88c-4c32-858f-f21f6fb1ff85', 'a21318a9-a835-4e15-87b2-da893031b6ce', '0486ad3e-5eb2-4589-8383-24bfcec56b47', 'faa0d1f5-388e-4bc4-b60f-baeab55a9930', '4762f4c5-411c-4c86-8af4-324355546ab9', '6a76f092-a95b-40ed-9bbc-c5f5dffb6495', '1038bfd5-c637-493f-95c7-f294b8a84fd6', '81d688b5-d3c2-4c83-8572-b82423525f5b', '05ac5752-da26-4347-a3d2-8b13b008fd70', 'd61b4d0a-4897-4403-afd6-d84fc909caa6', '5c1b23af-e735-408d-94c3-c0146b19b180', '2b8503a8-9ed9-4481-9dd3-23beeee297e9', 'e560745b-6da7-4b5f-b147-7b4f1fea3cd8', '2040857e-d4ad-417d-92ab-52e3c95e5cba', '6584c17c-b804-4f94-ad34-cd8e63f76162', '79cff8b6-cf98-4dcf-8b54-383fa2c43689', 'ea604298-bb16-4d84-b76a-ce6d7c8c3262', 'd4805da1-cb4a-420a-95a4-daee6f7f0fd9', '5278b602-0ea2-4716-af69-9b1560a56ef6', 'b8581687-2b4e-4f2f-91a9-9128f5bec540', '94eaf60d-67fa-4081-a37e-048dfa107f36', '787fa83d-b044-4c30-a7dd-50551c9badb1', '4c3c10e6-6371-463a-af5f-3bb5212aa05e', 'a0587784-a863-47c3-84ae-0dfda49a4be0', '80a261e0-4d0f-4e91-8a70-0c761fbe40ca'], 'provider_types': ['Doctor'], 'locations': [{'uuid': '0decdff6-1703-4d4a-9821-caad045c2f70', 'name': None, 'address': '3941 J St # 260, Sacramento, CA 95819, US', 'address_details': {'street': '3941 J St # 260', 'address_line_1': '3941 J St', 'address_line_2': '# 260', 'city': 'Sacramento', 'state': 'CA', 'zip': '95819'}, 'latitude': 38.5706552, 'longitude': -121.4535451, 'google_maps_link': 'https://www.google.com/maps/@38.5706552,-121.4535451?q=3941%20J%20St%20%23%20260%2C%20Sacramento%2C%20CA%2095819%2C%20US', 'phone_numbers': [{'phone': '9167362323', 'details': 'primary'}, {'phone': '9169845318', 'details': 'secondary'}], 'confidence': 4}, {'uuid': '7974eec5-2760-4642-9ce8-b2401f33c61f', 'name': None, 'address': '1580 Creekside Dr # 250, Folsom, CA 95630, US', 'address_details': {'street': '1580 Creekside Dr # 250', 'address_line_1': '1580 Creekside Dr', 'address_line_2': '# 250', 'city': 'Folsom', 'state': 'CA', 'zip': '95630'}, 'latitude': 38.6687282, 'longitude': -121.1479608, 'google_maps_link': 'https://www.google.com/maps/@38.6687282,-121.1479608?q=1580%20Creekside%20Dr%20%23%20250%2C%20Folsom%2C%20CA%2095630%2C%20US', 'phone_numbers': [{'phone': '9169845318', 'details': 'primary'}], 'confidence': 1}, {'uuid': '3840e421-0e5b-498b-b937-2eb75ef3e09b', 'name': None, 'address': '8120 Timberlake Way # 108, Sacramento, CA 95823, US', 'address_details': {'street': '8120 Timberlake Way # 108', 'address_line_1': '8120 Timberlake Way', 'address_line_2': '# 108', 'city': 'Sacramento', 'state': 'CA', 'zip': '95823'}, 'latitude': 38.4620284, 'longitude': -121.4144321, 'google_maps_link': 'https://www.google.com/maps/@38.4620284,-121.4144321?q=8120%20Timberlake%20Way%20%23%20108%2C%20Sacramento%2C%20CA%2095823%2C%20US', 'phone_numbers': [{'phone': '9166885551', 'details': 'primary'}], 'confidence': 1}, {'uuid': '48a62df8-f2f4-4320-905a-c67600804be6', 'name': 'Mercy Hospital of Folsom', 'address': '1650 Creekside Dr, Folsom, CA 95630, US', 'address_details': {'street': '1650 Creekside Dr', 'address_line_1': '1650 Creekside Dr', 'address_line_2': None, 'city': 'Folsom', 'state': 'CA', 'zip': '95630'}, 'latitude': 38.670333, 'longitude': -121.1458461, 'google_maps_link': 'https://www.google.com/maps/@38.670333,-121.1458461?q=1650%20Creekside%20Dr%2C%20Folsom%2C%20CA%2095630%2C%20US', 'phone_numbers': [{'phone': '9169837400', 'details': 'primary'}], 'confidence': 1}, {'uuid': 'f41bc9cd-e982-4022-9427-3a6bad872a2d', 'name': None, 'address': '7500 Hospital Dr, Sacramento, CA 95823, US', 'address_details': {'street': '7500 Hospital Dr', 'address_line_1': '7500 Hospital Dr', 'address_line_2': None, 'city': 'Sacramento', 'state': 'CA', 'zip': '95823'}, 'latitude': 38.4638056, 'longitude': -121.4164937, 'google_maps_link': 'https://www.google.com/maps/@38.4638056,-121.4164937?q=7500%20Hospital%20Dr%2C%20Sacramento%2C%20CA%2095823%2C%20US', 'phone_numbers': [{'phone': '8006774491', 'details': 'primary'}], 'confidence': 1}], 'online_profiles': [{'url': 'https://www.healthgrades.com/physician/dr-rohit-bhaskar-2hspx'}, {'url': 'https://www.vitals.com/doctors/Dr_Rohit_Bhaskar.html'}]}

Each record is made up of the following top-level keys. Some of them are of JSON array type.

%%sql
SELECT
    json_object_keys(record)
FROM
    provider_json
LIMIT 16;
 * postgresql://read_user:***@server/dbname
16 rows affected.
json_object_keys
npi
first_name
middle_name
last_name
age
gender
ratings_count
ratings_avg
degrees
specialties
languages
educations
insurances
provider_types
locations
online_profiles

insurances table

%%sql
SELECT * FROM insurances LIMIT 2;
 * postgresql://read_user:***@server/dbname
2 rows affected.
uuid display
5f854c4f-f4b2-45cb-ac7f-dfb9f1c0edba Highmark Blue Cross Blue Shield - WV - Medicare Advantage - PPO
5d624d7b-8670-4e50-b2b1-5a067f7f96f4 Blue Cross Blue Shield of Michigan - Kansas City Basic and Standard

specialties table

%%sql
SELECT * FROM specialties LIMIT 2;
 * postgresql://read_user:***@server/dbname
2 rows affected.
uuid display
2a6ea574-686a-4c60-b215-508c7c192203 Foster Care Agency
fa710988-0955-422d-9ffd-0924c2b9e16c Home Health Aide Agency

provider_puf_data table

This table contains two json columns: address and plans. Our goal is to explore both of these columns to uncover some interesting trends. Secondly, we have the created_at timestamp column, which lets us do comparisons over time.

%%sql
SELECT * FROM provider_puf_data LIMIT 2;
 * postgresql://read_user:***@server/dbname
2 rows affected.
npi source address plans created_at
1538375241 https://api.humana.com/v1/cms/index.json {'zip': '80015', 'city': 'Aurora', 'phone': '3036174488', 'state': 'CO', 'address': '16900 E Quincy Ave', 'address_2': 'Ste B'} [{'years': [2019], 'plan_id': '11469KY0010001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '11469KY0010002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '15980FL0160002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '30613MO0560001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '37001GA0540002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44965LA0590001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '46958UT0470002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '48963MS0510002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '62294MI0420001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630005', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '91604WI0490002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '97596OH0580003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}] 2019-06-25 00:00:00
1538375241 https://api.humana.com/v1/cms/index.json {'zip': '80015', 'city': 'Aurora', 'phone': '7148458890', 'state': 'CO', 'address': '16900 E Quincy Ave'} [{'years': [2019], 'plan_id': '11469KY0010001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '11469KY0010002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '15980FL0160002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '30613MO0560001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '37001GA0540001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '37001GA0540002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44965LA0590001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '46958UT0470002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '48963MS0510001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '48963MS0510002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '62294MI0420001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '67499PA0010002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '68303IL0690001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '68303IL0690002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '82120TN0630002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '82120TN0630003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630005', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '91604WI0490001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '91604WI0490002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '97596OH0580002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '97596OH0580003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}] 2019-01-08 00:00:00

Questions

1. What is the average age of the providers in providers_json?

%%sql
SELECT
    FLOOR(AVG((record ->> 'age')::int)) as age
FROM provider_json;
 * postgresql://read_user:***@server/dbname
1 rows affected.
age
57

In order to calculate the most popular specialty, we first need to expand the JSON array inside the record field, then we need to perform a join with the specialty UUID and then GROUP BY and find the most popular specialty.

%%sql
SELECT
    sp.display as most_popular_specialty
FROM (SELECT
        -- Unnest the array to create a row for each specialty
        json_array_elements_text(record -> 'specialties') as specialty
    FROM provider_json) as s
JOIN specialties as sp
-- Cast to UUID type before joining
ON s.specialty::uuid = sp.uuid
GROUP BY 1
ORDER BY COUNT(*) DESC
LIMIT 1
 * postgresql://read_user:***@server/dbname
1 rows affected.
most_popular_specialty
Internal Medicine

3. Providers practicing at locations with high confidence score

Question: You’ll notice within all locations objects of the providers_json record field, there exists a “confidence” score. This is a representation of our model deciding the probability that a provider is indeed practicing at this location. We allow scores between 0 and 5. How many providers have at least 2 locations with confidence of 4 or higher?

This question requires us to filter records based on the values inside the JSON record.

%%sql
SELECT
    COUNT(DISTINCT provider) as num_providers_high_conf_score
FROM (SELECT
            provider
      -- Sub query to unnest the entries of locations array
      FROM (SELECT
                record->>'npi' as provider,
                json_array_elements(record->'locations') as location
            FROM provider_json) as l
      -- Only keep locations that meet the confidence score criteria
      WHERE (location->>'confidence')::int>=4
      GROUP BY provider
      -- Only keep locations that meet minumum number of locations criteria
      HAVING COUNT(location)>=2) as num_p
 * postgresql://read_user:***@server/dbname
1 rows affected.
num_providers_high_conf_score
2054

Question: In all provider records, you’ll see a field called “insurances”. This is the unique list of all insurance plans a provider accepts (we represent this as UUID which connects to the insurances table). For now, let’s assume a provider accepts all of these insurance plans at all locations they practice at. Find the total number of unique insurance plans accepted by all providers at the most ​popular​ location of this data set. (Popular = the most providers practice there)

Similar in concept to the earlier query, but we need to layer this step by step using CTEs until we finally reach our solution.

%%sql
-- Setup the location cte
WITH location_cte as
(
    SELECT
        npi as provider,
        json_array_elements(record->'locations')->>'uuid' as location
    FROM provider_json
),
-- Setup the popular locations cte
popular_location as
(
    SELECT
        location
    -- Use the location_cte
    FROM location_cte as l
    GROUP BY location
    -- Order by number of providers at each location
    ORDER BY COUNT(provider) DESC
    LIMIT 1  
),
-- Setup the providers at the most popular location
providers_at_popular_location as
(
    SELECT
        provider
    -- Use the popular_location to filter location_cte
    FROM location_cte
    WHERE location IN (SELECT location FROM popular_location)  
)
-- Get count of distinct insurances for providers at popular location
SELECT
    COUNT(distinct ins.insurances::text) as total_unique_insurances_at_most_popular_location
FROM providers_at_popular_location as pop_prov
LEFT JOIN (SELECT
            record->>'npi' as provider,
            json_array_elements(record->'insurances') as insurances
           FROM provider_json) as ins
ON pop_prov.provider::text = ins.provider
 * postgresql://read_user:***@server/dbname
1 rows affected.
total_unique_insurances_at_most_popular_location
361

5. Neglected provider fields

Which provider fields in provider_json are the most neglected? How would you go about figuring this out beyond this small sample?

Assuming that we consider a field as neglected when it does not have a value, we can build this logic into the query by making use of json_each_text() function to get (key,value) pairs for each top-level field in the record. Next, we use the split_part() function to extract the key and value into their own columns. For the sake of readability, I created provider_key_value cte and then used a simple CASE WHEN to encode rows that contain an '' or [] as 1 - indicating that they are ignored. Finally, we can group by key and sum up the num_ignored values for each key.

By using this logic, age is the most neglected field.

%%sql
-- Setup the provider_fields
WITH provider_fields as (
    SELECT
        json_each_text(record) as field
    FROM provider_json
), provider_key_value as (
    SELECT
        split_part(split_part(field::text, ',', 1), '(', 2) as key,
        split_part(split_part(field::text, ',', 2), ')', 1) as value
    FROM provider_fields
)
SELECT
    key,
    SUM(CASE WHEN value != '' AND value != '[]' THEN 0 ELSE 1 END) as num_ignored
FROM provider_key_value
GROUP BY key
ORDER BY num_ignored desc;
 * postgresql://read_user:***@server/dbname
16 rows affected.
key num_ignored
age 6597
degrees 4276
insurances 3186
educations 3138
ratings_avg 3052
languages 2997
gender 2366
middle_name 2059
online_profiles 2012
locations 60
provider_types 6
specialties 6
first_name 1
ratings_count 0
npi 0
last_name 0

6. Top three sources/brokers with most added/removed rows

Question: Who are the top 3 sources/brokers from ​provider_puf_data​ which had the most rows added or dropped between January and June. NOTE: If a source didn’t exist in January or June, treat that source like it has 0 rows for that date.

We turn our focus now to the second table with JSON data. In this case, we are going to find the brokers who added/removed rows from January to June.

%%sql
-- Setup the jan_brokers cte
WITH jan_brokers as(
    SELECT
        source,
        COUNT(*) as jan_count
    FROM provider_puf_data
    -- Only keep the jan rows
    WHERE date_trunc('month', created_at) = '2019-01-01'
    GROUP BY source
),
-- Setup the jun_brokers cte
jun_brokers as(
    SELECT
        source,
        COUNT(*) as jun_count
    FROM provider_puf_data
    -- Only keep the jun rows
    WHERE date_trunc('month', created_at) = '2019-06-01'
    GROUP BY source
)
-- Query to calculate top three brokers with biggest difference
SELECT
    -- As we are using a full-outer join, we want to keep the non-null entries
    CASE WHEN jan_brokers.source IS NOT NULL THEN jan_brokers.source
    ELSE jun_brokers.source END as brokers,
    -- Take the absoulte difference to account for added/removed rows
    ABS(COALESCE(jan_count, 0) - COALESCE(jun_count, 0)) as diff
FROM jan_brokers
FULL JOIN jun_brokers
ON jan_brokers.source = jun_brokers.source
ORDER BY diff DESC
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
brokers diff
http://mydental.guardianlife.com/secure/json/index.json 798
https://www.dentegra.com/hcx/cms-data-index.json 604
https://api.centene.com/ambetter/reference/cms-data-index.json 237

7. NPIs with most newly added addresses

Using only the “address” string in the address column of provider_puf_data, which NPIs had the most new addresses added between January and June and how many new addresses were added? (the top 3 NPIs will do)

Understanding how many addresses are added/removed can be crucial in keeping up with provider data for the business to show accurate details.

%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jan_address
    FROM provider_puf_data
    -- Only keep records in january
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jun_address
    FROM provider_puf_data
    -- Only keep records in june
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
    SELECT
        npi,
        jan_address
    FROM jan_addresses
    GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
    SELECT
        npi,
        jun_address
    FROM jun_addresses
    GROUP BY npi, jun_address
),
-- Setup query to calculate new addresses added in june
newly_added_addresses as (
    SELECT *
    FROM jun_unique_addresses
    EXCEPT
    SELECT *
    FROM jan_unique_addresses
)
-- Query to display top three npi with most number of new addresses
SELECT
    npi,
    count(*) as num_new
FROM newly_added_addresses
GROUP BY npi
ORDER BY num_new desc
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
npi num_new
1801806518 73
1457791816 68
1265411953 57

8. NPIs with most number of addresses removed

Question: Now the opposite of the above question, which NPIs saw the most addresses removed between January and June? (the top 3 NPIs will do)

%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jan_address
    FROM provider_puf_data
    -- Only keep records in january
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
    SELECT
        npi,
        -- Convert to uppercase before comparison
        trim(upper(address->>'address')) as jun_address
    FROM provider_puf_data
    -- Only keep records in june
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
    SELECT
        npi,
        jan_address
    FROM jan_addresses
    GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
    SELECT
        npi,
        jun_address
    FROM jun_addresses
    GROUP BY npi, jun_address
),
-- Setup query to calculate addresses that got removed in june
removed_addresses as (
    SELECT *
    FROM jan_unique_addresses
    EXCEPT
    SELECT *
    FROM jun_unique_addresses  
)
-- Query to display top three npi with most number of addresses removed
SELECT
    npi,
    COUNT(*) as num_removed
FROM removed_addresses
GROUP BY npi
ORDER BY num_removed desc
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
npi num_removed
1659334894 65
1407379845 44
1073694808 30

9. Change in plans

Question: How did PUF plans within the plans field of ​provider_puf_data​ change from January to June? This is intentionally a bit open ended :)

Answer: I came up with two questions:

a. Which networks saw the greatest change from January to June?

%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
   SELECT
        npi,
        -- Since every npi has atleast one plan, we can use first index
        plans->0->>'network_tier' as jan_network
    FROM provider_puf_data
    -- Only keep records for the month of January
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
    SELECT
        npi,
        -- Since every npi has atleast one plan, we can use first index
        plans->0->>'network_tier' as jun_network
    FROM provider_puf_data
    -- Only keep records for the month of June
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
    SELECT
        jan_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jan_num_providers_in_network
    FROM jan_networks
    GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
    SELECT
        jun_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jun_num_providers_in_network
    FROM jun_networks
    GROUP BY jun_network
)
-- Query to calculate the difference in providers per network
SELECT
    -- As we are doing a full outer join, pick a non-null value
    CASE WHEN jan.jan_network IS NOT NULL THEN jan.jan_network
    ELSE jun.jun_network END as network,
    -- Take the absolute difference between num providers
    ABS(COALESCE(jan_num_providers_in_network, 0) -
       COALESCE(jun_num_providers_in_network, 0)) as difference
FROM jan_npi_per_network as jan
-- Perform full outer join to get non-matching rows from both sides
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
ORDER BY difference desc
LIMIT 3;
 * postgresql://read_user:***@server/dbname
3 rows affected.
network difference
HERITAGE 103
PPO 70
NEIGHBORHOOD 70

b. List the networks that did not add any new providers from January to June.

%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
   SELECT
        npi,
        -- Since every npi has atleast one plan, we can use first index
        plans->0->>'network_tier' as jan_network
    FROM provider_puf_data
    -- Only keep records for the month of January
    WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
    SELECT
        npi,
        -- Since every npi has atleast one plan, we can use first index
        plans->0->>'network_tier' as jun_network
    FROM provider_puf_data
    -- Only keep records for the month of June
    WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
    SELECT
        jan_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jan_num_providers_in_network
    FROM jan_networks
    GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
    SELECT
        jun_network,
        -- Only count unique npi
        COUNT(DISTINCT npi) as jun_num_providers_in_network
    FROM jun_networks
    GROUP BY jun_network
)
-- Query to calculate networks with no providers
SELECT
    jan.jan_network as networks_with_no_providers
FROM jan_npi_per_network as jan
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
WHERE jun.jun_network is NULL
 * postgresql://read_user:***@server/dbname
13 rows affected.
networks_with_no_providers
ACCESS
BLUE-FOCUSCARE
CD-CONNECTION-DENTAL--PREFI
CR--CAREINTON-PREFIX-DJ
DC--DENTALGUARD-CONNECT-ADMIN
DG--DENTALGUARD-PREFERRED-P
DM-DENTEMAX-PREFIX-DW
ELITE
HERITAGE-PLUS
NEIGHBORHOOD
PIMAFOCUS
SK-STRATOSE-FSB-MAVAREST-PPO
SP-AETNA-PPO-PREFIX-DSW

10. How to address the address column?

If you look closely at the address strings during exercises 7/8, you’ll notice a lot of redundant addresses that are just slightly different. If we could merge these addresses we could get much better resolution on which addresses are actually changing between January and June. Given the data you have here, how would you accomplish this? What if you could use any tools available? How would you begin architecting this for now and the future.

A simple way is to merge these fields into one is by using Postgres string functions: CONCAT, UPPER and TRIM and grouping them by provider. However, this does not handle the case when we have multiple representations of the same address, shown in the cell after this.

%%sql
SELECT
    npi,
    COUNT(DISTINCT TRIM(UPPER(CONCAT(address->>'address', ' ',
           address->>'address_2', ' ',
           address->>'city', ' ',
           address->>'state', ' ',
           address->>'zip'
    )))) as unique_addresses
FROM provider_puf_data
GROUP BY npi
ORDER BY unique_addresses desc
LIMIT 10;
 * postgresql://read_user:***@server/dbname
10 rows affected.
npi unique_addresses
1659334894 448
1538375241 256
1659763092 238
1821510181 230
1457751075 212
1972540664 209
1114191335 178
1730203100 175
1205822723 172
1366824997 147
%%sql
SELECT
    UPPER(address->>'address') as street,
    UPPER(address->>'city') as city,
    UPPER(address->>'state') as state,
    address->>'zip' as zipcode
FROM provider_puf_data
WHERE npi = '1538375241' and address->>'zip' = 80023::text
GROUP BY street, city, state, zipcode
 * postgresql://read_user:***@server/dbname
4 rows affected.
street city state zipcode
14422 ORCHARD PARKWAY WESTMINSTER CO 80023
14422 ORCHARD PARKWAY SUITE 200 WESTMINSTER CO 80023
14422 ORCHARD PKWY WESTMINSTER CO 80023
14422 ORCHARD PKY #200 WESTMINSTER CO 80023

All these 4 records should be counted as 1 address. One way to further break the street into house_no and street_name but this does not address cases where PARKWAY is written as PKWY.

An alternative way, is to employ text similarity by enabling the extension fuzzystrmatch in Postgres and calculate the pair-wise text similarity for each provider and discard addresses with high similarity.

Conclusion

A lot of APIs send data in the JSON format and exploring JSON data is a useful skill to have. There are many tools that will help you work with JSON data, but when working with JSON data stored inside a SQL database we have an option of reading that into memory and use a package like pandas or write our own python module that reads data from the table and pre-processes it. If you don’t want to write your logic to work with JSON data in your program, then you can rely on Postgres’ built-in operators and functions to work with JSON.