Exploratory Data Analysis of JSON data using PostgreSQL
Introduction
Doing Exploratory Data Analysis (EDA) is one of the first steps in doing any type of analysis. By asking the right type of questions, we can gain an understanding of our data and uncover any data related issues if any. When data is stored in JSON format inside the database, we can use PostgreSQL’s built in operators and functions to directly query the JSON data. In this post, we will take a look at how we can query JSON using SQL.
Data Description
The database schema contains 4 tables.
provider_json (npi bigint, record json):
This table contains provider data.insurances (uuid uuid, display text):
This table contains the unique insurance plans.specialties (uuid uuid, display text):
This table contains the unique specialties.provider_puf_data (npi bigint, source text, address json, plans json, created_at timestamp):
This table contains a couple snapshots of data that is routinely ingested. The data contains records that were ingested in January and June.
%load_ext sql
%sql postgresql://<user>:<passwd>@host/dbname
'Connected: read_user@interview'
provider_json
table
Let’s first look at the first few rows of provider_json
table. As you can see the record data is stored in JSON format. This could be a response from an API that we are just storing into the table. The advantange of storing data in raw JSON format is that we don’t need to think about parsing the JSON before-hand and storing that data into multiple tables. We will store them in the denormalized form and write queries against this data.
%%sql
SELECT * FROM provider_json LIMIT 2;
* postgresql://read_user:***@server/dbname
2 rows affected.
npi | record |
---|---|
1669554267 | {'npi': '1669554267', 'first_name': 'Hyung', 'middle_name': 'Soon', 'last_name': 'Park', 'age': 83, 'gender': 'm', 'ratings_count': 1, 'ratings_avg': 2.0, 'degrees': [], 'specialties': ['c741df78-8460-46f6-a2ce-93de36eb7205'], 'languages': [], 'educations': [], 'insurances': [], 'provider_types': ['Doctor'], 'locations': [{'uuid': '8d96e755-65d3-470f-baaf-c20c73d673bb', 'name': None, 'address': '2296 Opitz Blvd # 255, Woodbridge, VA 22191, US', 'address_details': {'street': '2296 Opitz Blvd # 255', 'address_line_1': '2296 Opitz Blvd', 'address_line_2': '# 255', 'city': 'Woodbridge', 'state': 'VA', 'zip': '22191'}, 'latitude': 38.63739, 'longitude': -77.28514129999999, 'google_maps_link': 'https://www.google.com/maps/@38.63739,-77.28514129999999?q=2296%20Opitz%20Blvd%20%23%20255%2C%20Woodbridge%2C%20VA%2022191%2C%20US', 'phone_numbers': [{'phone': '7036704986', 'details': 'primary'}], 'confidence': 2}], 'online_profiles': [{'url': 'https://www.healthgrades.com/physician/dr-hyung-park-x3syc'}]} |
1467524520 | {'npi': '1467524520', 'first_name': 'Rohit', 'middle_name': None, 'last_name': 'Bhaskar', 'age': 57, 'gender': 'm', 'ratings_count': 35, 'ratings_avg': 9.07, 'degrees': ['MD'], 'specialties': ['dff54cb3-c9b0-43a0-8ca8-26c7c05bee83', '1de33770-eb1c-47fa-ab3e-f9a4ab924d9d', '76126f5d-723a-40bc-b4c1-7dcda477797a', '475b20fc-c188-4ca1-8e3a-b0614b8f8ce2'], 'languages': ['english'], 'educations': [{'education': {'name': 'Cornell University', 'uuid': 'c4588ea7-567d-43dc-a0e5-67937fb77f5c'}, 'type': 'Undergraduate School', 'year': 1982}, {'education': {'name': 'Baylor College of Medicine', 'uuid': 'b2ed4e48-6cb0-42d2-8804-7557b6452e8d'}, 'type': 'Residency', 'year': None}, {'education': {'name': 'Icahn School of Medicine at Mount Sinai', 'uuid': '16e70085-6b81-4581-b903-dce9d9acf74a'}, 'type': 'Medical School', 'year': None}, {'education': {'name': 'Baylor College of Medicine', 'uuid': 'b2ed4e48-6cb0-42d2-8804-7557b6452e8d'}, 'type': 'Internship', 'year': None}], 'insurances': ['5ad8ff7d-5227-46d7-a392-d28665610257', '3a56d135-dded-48c3-90db-a4f5d9da9786', 'dc059078-9115-488f-bdf6-e9106822225a', 'daaa1ac3-eb9d-424b-a249-8d7c51d39ce1', 'ae0b1dd0-7356-41ee-8df4-871a52451eb7', 'd280371e-f9f1-42a2-aed5-fcb0fe89c219', '3e1ac59e-223a-469d-bb30-63f6988e4134', '5da15840-6f46-4978-8747-8be580d48079', '9eb760da-dade-475d-9390-0c998ff50fcc', 'ad3204b8-1355-48f4-9ee3-194fc5d970f1', '2d5231c0-2d29-4026-9138-5dcbd0591404', '6ab9d1d3-b969-4352-9fac-68b6494cbe7b', '67f5d31d-88f6-4f7a-9b1d-74667ca37f08', '90e00746-69ed-430e-98f9-f637abcbe144', 'b10209f5-96d9-4bbc-8fda-38590460a34a', 'cfb5d792-2d33-444f-a1e8-5487de63e9a0', '51a98931-aace-42c1-8890-66ce53eb1971', 'a6e318c6-cf38-4d65-8bd0-6976b647974b', 'ba8cc746-665a-4925-8f5f-3b213dad238d', '36f607e8-2785-411e-8384-f17195f09468', 'cbbc9743-5cdf-4df2-92ca-fcde6a536160', 'a003a89c-eecb-470b-938d-1b97225f3d81', '791d592b-d6c0-4576-99d0-d78dac8496a1', 'f6a0b45a-042e-4886-82c6-b3b9e086cf83', '57ecd933-4bb8-4289-9080-a8259a6d683c', '6d2b08ec-2f67-4a5b-ac89-642f15e032b8', '749b0b00-ddf8-4cef-bc5b-5694cccbad0f', 'a7edd72c-4010-448e-8624-c92f997fb2a2', '730a2f48-0cb6-49b8-a5bb-807534ef0f24', '9d5266c8-a7b3-4d27-ab96-3a524b4cc491', 'f7627ef2-7d1d-4974-b934-2fdb29c0ff4f', '69af8b6c-06b3-4b66-91c5-ede90be026d8', '96292303-3e6e-4758-a851-f049d7956345', '61a6872e-3c89-4c93-b9b2-ca6ca1db7a8e', '2fdd9e5c-840a-40f1-a575-8ac6ec6094eb', 'd4357c57-76d9-4fda-94b6-16ca8869cd45', 'b1085d28-d17c-42c1-8448-fc56f5bf8ff8', 'ebf487b7-36a0-4ba9-84f6-c2f5f004e06a', '5b7ba09b-9a42-454a-84a6-bc1ba7449379', '84d17e89-55e7-4a08-8468-f52780b926e4', '2e7f2d0a-308e-43d4-9bfb-227a7571550f', 'e364db55-c32d-4e9d-957b-84bfbf62dfea', 'b7f67654-891f-4152-acc8-def976e30826', 'c94db5a7-e5f7-4d1f-9ff9-7ba8e683b57d', 'f2466b1a-782f-4cac-a204-cacb50b8f322', '6417d256-4fdb-4a60-903b-8f42b0405653', '4a1b53fe-79e5-464c-83d9-14143306c02c', 'eecb0179-615f-42cd-9fc0-8f7570f4de63', '0bcbfcff-0e3f-47bc-a32e-a52add577d5b', '9d4642de-8626-4889-8315-9e7ec4f1cd1f', '71e54292-85b7-4cff-a1f0-e06e92f64b5e', '672921ed-9c68-45a5-b808-3df8261d8ff0', 'de8b20b6-a2e3-460a-9018-9d89c009cef8', '40e72184-ff85-456b-a794-35967bc8cfb2', '4d893957-8038-409e-8c96-2e5edd626d45', 'b70e5e27-f14b-4a83-84d8-bb711b70886f', 'd122f267-a255-4eed-afa1-8252e2cbdc17', '1d76bc02-2621-4041-914a-9b9e28babeff', 'd308de64-b0a3-4a50-96b1-4aa813794a1d', '49841d97-6ede-48af-841f-8548d2930e18', 'ee9db5f9-de43-4c3e-a156-d5ba8ba1f6b3', '931810bc-cd6d-4bb6-a77a-7c979fbac3fd', 'c5734165-3e62-4f14-9559-1e5caec18179', 'a0dd1e6d-ccf8-4f09-bbd1-e4fcbdf438ac', 'c1ba934d-cf6c-462d-af28-9cc92ab24937', '4c388b98-8aca-4c28-bd43-74ab84d438f2', '01047fb0-4036-4dcd-81e0-62e6cef4e3e8', '1c27a9d5-3412-41f7-bde2-87eba4abcb5e', '9161f466-8e7e-493d-bd61-d647755dbc53', '46db02d5-a040-4ca5-9510-ef0591feca22', '3a4b2c17-4235-4982-b275-4018c5283ab4', 'ed3f6f88-a4ad-4b56-9d31-8d086ddd1521', 'dae8c52c-b0e0-4688-9078-6e5dcef9de03', '87842763-d404-418e-a265-fea0ef2cac50', '413e5c40-5f12-47c2-9142-4ee27ff61372', 'd54100c0-7990-4638-b57d-79df580f192f', '5b20ceac-5463-4255-b55d-c1a639cc3cc7', '239f0c59-4f28-43ab-9fde-ded6ad8e6cfe', '837c4cd1-4324-44eb-8af3-5bcf2622d8be', '1ec09209-7e22-4698-981d-9cbb150c5746', 'd1c4b24d-2136-4e8d-8497-8ff1fced234d', 'f67a5513-7c63-4243-afcf-84684202f901', '7dca5f3b-e88c-4c32-858f-f21f6fb1ff85', 'a21318a9-a835-4e15-87b2-da893031b6ce', '0486ad3e-5eb2-4589-8383-24bfcec56b47', 'faa0d1f5-388e-4bc4-b60f-baeab55a9930', '4762f4c5-411c-4c86-8af4-324355546ab9', '6a76f092-a95b-40ed-9bbc-c5f5dffb6495', '1038bfd5-c637-493f-95c7-f294b8a84fd6', '81d688b5-d3c2-4c83-8572-b82423525f5b', '05ac5752-da26-4347-a3d2-8b13b008fd70', 'd61b4d0a-4897-4403-afd6-d84fc909caa6', '5c1b23af-e735-408d-94c3-c0146b19b180', '2b8503a8-9ed9-4481-9dd3-23beeee297e9', 'e560745b-6da7-4b5f-b147-7b4f1fea3cd8', '2040857e-d4ad-417d-92ab-52e3c95e5cba', '6584c17c-b804-4f94-ad34-cd8e63f76162', '79cff8b6-cf98-4dcf-8b54-383fa2c43689', 'ea604298-bb16-4d84-b76a-ce6d7c8c3262', 'd4805da1-cb4a-420a-95a4-daee6f7f0fd9', '5278b602-0ea2-4716-af69-9b1560a56ef6', 'b8581687-2b4e-4f2f-91a9-9128f5bec540', '94eaf60d-67fa-4081-a37e-048dfa107f36', '787fa83d-b044-4c30-a7dd-50551c9badb1', '4c3c10e6-6371-463a-af5f-3bb5212aa05e', 'a0587784-a863-47c3-84ae-0dfda49a4be0', '80a261e0-4d0f-4e91-8a70-0c761fbe40ca'], 'provider_types': ['Doctor'], 'locations': [{'uuid': '0decdff6-1703-4d4a-9821-caad045c2f70', 'name': None, 'address': '3941 J St # 260, Sacramento, CA 95819, US', 'address_details': {'street': '3941 J St # 260', 'address_line_1': '3941 J St', 'address_line_2': '# 260', 'city': 'Sacramento', 'state': 'CA', 'zip': '95819'}, 'latitude': 38.5706552, 'longitude': -121.4535451, 'google_maps_link': 'https://www.google.com/maps/@38.5706552,-121.4535451?q=3941%20J%20St%20%23%20260%2C%20Sacramento%2C%20CA%2095819%2C%20US', 'phone_numbers': [{'phone': '9167362323', 'details': 'primary'}, {'phone': '9169845318', 'details': 'secondary'}], 'confidence': 4}, {'uuid': '7974eec5-2760-4642-9ce8-b2401f33c61f', 'name': None, 'address': '1580 Creekside Dr # 250, Folsom, CA 95630, US', 'address_details': {'street': '1580 Creekside Dr # 250', 'address_line_1': '1580 Creekside Dr', 'address_line_2': '# 250', 'city': 'Folsom', 'state': 'CA', 'zip': '95630'}, 'latitude': 38.6687282, 'longitude': -121.1479608, 'google_maps_link': 'https://www.google.com/maps/@38.6687282,-121.1479608?q=1580%20Creekside%20Dr%20%23%20250%2C%20Folsom%2C%20CA%2095630%2C%20US', 'phone_numbers': [{'phone': '9169845318', 'details': 'primary'}], 'confidence': 1}, {'uuid': '3840e421-0e5b-498b-b937-2eb75ef3e09b', 'name': None, 'address': '8120 Timberlake Way # 108, Sacramento, CA 95823, US', 'address_details': {'street': '8120 Timberlake Way # 108', 'address_line_1': '8120 Timberlake Way', 'address_line_2': '# 108', 'city': 'Sacramento', 'state': 'CA', 'zip': '95823'}, 'latitude': 38.4620284, 'longitude': -121.4144321, 'google_maps_link': 'https://www.google.com/maps/@38.4620284,-121.4144321?q=8120%20Timberlake%20Way%20%23%20108%2C%20Sacramento%2C%20CA%2095823%2C%20US', 'phone_numbers': [{'phone': '9166885551', 'details': 'primary'}], 'confidence': 1}, {'uuid': '48a62df8-f2f4-4320-905a-c67600804be6', 'name': 'Mercy Hospital of Folsom', 'address': '1650 Creekside Dr, Folsom, CA 95630, US', 'address_details': {'street': '1650 Creekside Dr', 'address_line_1': '1650 Creekside Dr', 'address_line_2': None, 'city': 'Folsom', 'state': 'CA', 'zip': '95630'}, 'latitude': 38.670333, 'longitude': -121.1458461, 'google_maps_link': 'https://www.google.com/maps/@38.670333,-121.1458461?q=1650%20Creekside%20Dr%2C%20Folsom%2C%20CA%2095630%2C%20US', 'phone_numbers': [{'phone': '9169837400', 'details': 'primary'}], 'confidence': 1}, {'uuid': 'f41bc9cd-e982-4022-9427-3a6bad872a2d', 'name': None, 'address': '7500 Hospital Dr, Sacramento, CA 95823, US', 'address_details': {'street': '7500 Hospital Dr', 'address_line_1': '7500 Hospital Dr', 'address_line_2': None, 'city': 'Sacramento', 'state': 'CA', 'zip': '95823'}, 'latitude': 38.4638056, 'longitude': -121.4164937, 'google_maps_link': 'https://www.google.com/maps/@38.4638056,-121.4164937?q=7500%20Hospital%20Dr%2C%20Sacramento%2C%20CA%2095823%2C%20US', 'phone_numbers': [{'phone': '8006774491', 'details': 'primary'}], 'confidence': 1}], 'online_profiles': [{'url': 'https://www.healthgrades.com/physician/dr-rohit-bhaskar-2hspx'}, {'url': 'https://www.vitals.com/doctors/Dr_Rohit_Bhaskar.html'}]} |
Each record is made up of the following top-level keys. Some of them are of JSON array type.
%%sql
SELECT
json_object_keys(record)
FROM
provider_json
LIMIT 16;
* postgresql://read_user:***@server/dbname
16 rows affected.
json_object_keys |
---|
npi |
first_name |
middle_name |
last_name |
age |
gender |
ratings_count |
ratings_avg |
degrees |
specialties |
languages |
educations |
insurances |
provider_types |
locations |
online_profiles |
insurances
table
%%sql
SELECT * FROM insurances LIMIT 2;
* postgresql://read_user:***@server/dbname
2 rows affected.
uuid | display |
---|---|
5f854c4f-f4b2-45cb-ac7f-dfb9f1c0edba | Highmark Blue Cross Blue Shield - WV - Medicare Advantage - PPO |
5d624d7b-8670-4e50-b2b1-5a067f7f96f4 | Blue Cross Blue Shield of Michigan - Kansas City Basic and Standard |
specialties
table
%%sql
SELECT * FROM specialties LIMIT 2;
* postgresql://read_user:***@server/dbname
2 rows affected.
uuid | display |
---|---|
2a6ea574-686a-4c60-b215-508c7c192203 | Foster Care Agency |
fa710988-0955-422d-9ffd-0924c2b9e16c | Home Health Aide Agency |
provider_puf_data
table
This table contains two json columns: address
and plans
. Our goal is to explore both of these columns to uncover some interesting trends. Secondly, we have the created_at
timestamp column, which lets us do comparisons over time.
%%sql
SELECT * FROM provider_puf_data LIMIT 2;
* postgresql://read_user:***@server/dbname
2 rows affected.
npi | source | address | plans | created_at |
---|---|---|---|---|
1538375241 | https://api.humana.com/v1/cms/index.json | {'zip': '80015', 'city': 'Aurora', 'phone': '3036174488', 'state': 'CO', 'address': '16900 E Quincy Ave', 'address_2': 'Ste B'} | [{'years': [2019], 'plan_id': '11469KY0010001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '11469KY0010002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '15980FL0160002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '30613MO0560001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '37001GA0540002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44965LA0590001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '46958UT0470002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '48963MS0510002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '62294MI0420001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630005', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '91604WI0490002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '97596OH0580003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}] | 2019-06-25 00:00:00 |
1538375241 | https://api.humana.com/v1/cms/index.json | {'zip': '80015', 'city': 'Aurora', 'phone': '7148458890', 'state': 'CO', 'address': '16900 E Quincy Ave'} | [{'years': [2019], 'plan_id': '11469KY0010001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '11469KY0010002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '15980FL0160002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '30613MO0560001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '37001GA0540001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '37001GA0540002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44580AL0380002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '44965LA0590001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '46958UT0470002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '48963MS0510001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '48963MS0510002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '62294MI0420001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '63141TX0760002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '66105AZ0620003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '67499PA0010002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '68303IL0690001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '68303IL0690002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '68303IL0690004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '82120TN0630002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '82120TN0630003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630004', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '82120TN0630005', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '91604WI0490001', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '91604WI0490002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2018], 'plan_id': '97596OH0580002', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}, {'years': [2019], 'plan_id': '97596OH0580003', 'network_tier': 'PREFERRED', 'plan_id_type': 'HIOS-PLAN-ID'}] | 2019-01-08 00:00:00 |
Questions
1. What is the average age of the providers in providers_json
?
%%sql
SELECT
FLOOR(AVG((record ->> 'age')::int)) as age
FROM provider_json;
* postgresql://read_user:***@server/dbname
1 rows affected.
age |
---|
57 |
2. What is the most popular specialty of the providers in providers_json
?
In order to calculate the most popular specialty, we first need to expand the JSON array inside the record
field, then we need to perform a join with the specialty UUID and then GROUP BY and find the most popular specialty.
%%sql
SELECT
sp.display as most_popular_specialty
FROM (SELECT
-- Unnest the array to create a row for each specialty
json_array_elements_text(record -> 'specialties') as specialty
FROM provider_json) as s
JOIN specialties as sp
-- Cast to UUID type before joining
ON s.specialty::uuid = sp.uuid
GROUP BY 1
ORDER BY COUNT(*) DESC
LIMIT 1
* postgresql://read_user:***@server/dbname
1 rows affected.
most_popular_specialty |
---|
Internal Medicine |
3. Providers practicing at locations with high confidence score
Question: You’ll notice within all locations objects of the providers_json record field, there exists a “confidence” score. This is a representation of our model deciding the probability that a provider is indeed practicing at this location. We allow scores between 0 and 5. How many providers have at least 2 locations with confidence of 4 or higher?
This question requires us to filter records based on the values inside the JSON record.
%%sql
SELECT
COUNT(DISTINCT provider) as num_providers_high_conf_score
FROM (SELECT
provider
-- Sub query to unnest the entries of locations array
FROM (SELECT
record->>'npi' as provider,
json_array_elements(record->'locations') as location
FROM provider_json) as l
-- Only keep locations that meet the confidence score criteria
WHERE (location->>'confidence')::int>=4
GROUP BY provider
-- Only keep locations that meet minumum number of locations criteria
HAVING COUNT(location)>=2) as num_p
* postgresql://read_user:***@server/dbname
1 rows affected.
num_providers_high_conf_score |
---|
2054 |
4. Number of unique insurances at most popular location
Question: In all provider records, you’ll see a field called “insurances”. This is the unique list of all insurance plans a provider accepts (we represent this as UUID which connects to the insurances table). For now, let’s assume a provider accepts all of these insurance plans at all locations they practice at. Find the total number of unique insurance plans accepted by all providers at the most popular location of this data set. (Popular = the most providers practice there)
Similar in concept to the earlier query, but we need to layer this step by step using CTEs until we finally reach our solution.
%%sql
-- Setup the location cte
WITH location_cte as
(
SELECT
npi as provider,
json_array_elements(record->'locations')->>'uuid' as location
FROM provider_json
),
-- Setup the popular locations cte
popular_location as
(
SELECT
location
-- Use the location_cte
FROM location_cte as l
GROUP BY location
-- Order by number of providers at each location
ORDER BY COUNT(provider) DESC
LIMIT 1
),
-- Setup the providers at the most popular location
providers_at_popular_location as
(
SELECT
provider
-- Use the popular_location to filter location_cte
FROM location_cte
WHERE location IN (SELECT location FROM popular_location)
)
-- Get count of distinct insurances for providers at popular location
SELECT
COUNT(distinct ins.insurances::text) as total_unique_insurances_at_most_popular_location
FROM providers_at_popular_location as pop_prov
LEFT JOIN (SELECT
record->>'npi' as provider,
json_array_elements(record->'insurances') as insurances
FROM provider_json) as ins
ON pop_prov.provider::text = ins.provider
* postgresql://read_user:***@server/dbname
1 rows affected.
total_unique_insurances_at_most_popular_location |
---|
361 |
5. Neglected provider fields
Which provider fields in provider_json are the most neglected? How would you go about figuring this out beyond this small sample?
Assuming that we consider a field as neglected when it does not have a value, we can build this logic into the query by making use of json_each_text()
function to get (key,value)
pairs for each top-level field in the record
. Next, we use the split_part()
function to extract the key and value into their own columns. For the sake of readability, I created provider_key_value
cte and then used a simple CASE WHEN
to encode rows that contain an ''
or []
as 1 - indicating that they are ignored. Finally, we can group by key and sum up the num_ignored
values for each key.
By using this logic, age is the most neglected field.
%%sql
-- Setup the provider_fields
WITH provider_fields as (
SELECT
json_each_text(record) as field
FROM provider_json
), provider_key_value as (
SELECT
split_part(split_part(field::text, ',', 1), '(', 2) as key,
split_part(split_part(field::text, ',', 2), ')', 1) as value
FROM provider_fields
)
SELECT
key,
SUM(CASE WHEN value != '' AND value != '[]' THEN 0 ELSE 1 END) as num_ignored
FROM provider_key_value
GROUP BY key
ORDER BY num_ignored desc;
* postgresql://read_user:***@server/dbname
16 rows affected.
key | num_ignored |
---|---|
age | 6597 |
degrees | 4276 |
insurances | 3186 |
educations | 3138 |
ratings_avg | 3052 |
languages | 2997 |
gender | 2366 |
middle_name | 2059 |
online_profiles | 2012 |
locations | 60 |
provider_types | 6 |
specialties | 6 |
first_name | 1 |
ratings_count | 0 |
npi | 0 |
last_name | 0 |
6. Top three sources/brokers with most added/removed rows
Question: Who are the top 3 sources/brokers from provider_puf_data which had the most rows added or dropped between January and June. NOTE: If a source didn’t exist in January or June, treat that source like it has 0 rows for that date.
We turn our focus now to the second table with JSON data. In this case, we are going to find the brokers who added/removed rows from January to June.
%%sql
-- Setup the jan_brokers cte
WITH jan_brokers as(
SELECT
source,
COUNT(*) as jan_count
FROM provider_puf_data
-- Only keep the jan rows
WHERE date_trunc('month', created_at) = '2019-01-01'
GROUP BY source
),
-- Setup the jun_brokers cte
jun_brokers as(
SELECT
source,
COUNT(*) as jun_count
FROM provider_puf_data
-- Only keep the jun rows
WHERE date_trunc('month', created_at) = '2019-06-01'
GROUP BY source
)
-- Query to calculate top three brokers with biggest difference
SELECT
-- As we are using a full-outer join, we want to keep the non-null entries
CASE WHEN jan_brokers.source IS NOT NULL THEN jan_brokers.source
ELSE jun_brokers.source END as brokers,
-- Take the absoulte difference to account for added/removed rows
ABS(COALESCE(jan_count, 0) - COALESCE(jun_count, 0)) as diff
FROM jan_brokers
FULL JOIN jun_brokers
ON jan_brokers.source = jun_brokers.source
ORDER BY diff DESC
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
brokers | diff |
---|---|
http://mydental.guardianlife.com/secure/json/index.json | 798 |
https://www.dentegra.com/hcx/cms-data-index.json | 604 |
https://api.centene.com/ambetter/reference/cms-data-index.json | 237 |
7. NPIs with most newly added addresses
Using only the “address” string in the address column of provider_puf_data, which NPIs had the most new addresses added between January and June and how many new addresses were added? (the top 3 NPIs will do)
Understanding how many addresses are added/removed can be crucial in keeping up with provider data for the business to show accurate details.
%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jan_address
FROM provider_puf_data
-- Only keep records in january
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jun_address
FROM provider_puf_data
-- Only keep records in june
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
SELECT
npi,
jan_address
FROM jan_addresses
GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
SELECT
npi,
jun_address
FROM jun_addresses
GROUP BY npi, jun_address
),
-- Setup query to calculate new addresses added in june
newly_added_addresses as (
SELECT *
FROM jun_unique_addresses
EXCEPT
SELECT *
FROM jan_unique_addresses
)
-- Query to display top three npi with most number of new addresses
SELECT
npi,
count(*) as num_new
FROM newly_added_addresses
GROUP BY npi
ORDER BY num_new desc
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
npi | num_new |
---|---|
1801806518 | 73 |
1457791816 | 68 |
1265411953 | 57 |
8. NPIs with most number of addresses removed
Question: Now the opposite of the above question, which NPIs saw the most addresses removed between January and June? (the top 3 NPIs will do)
%%sql
-- Setup jan_addresses cte
WITH jan_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jan_address
FROM provider_puf_data
-- Only keep records in january
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup jun_addresses cte
jun_addresses as (
SELECT
npi,
-- Convert to uppercase before comparison
trim(upper(address->>'address')) as jun_address
FROM provider_puf_data
-- Only keep records in june
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup unique addresses for each npi in january
jan_unique_addresses as (
SELECT
npi,
jan_address
FROM jan_addresses
GROUP BY npi, jan_address
),
-- Setup unique addresses for each npi in june
jun_unique_addresses as (
SELECT
npi,
jun_address
FROM jun_addresses
GROUP BY npi, jun_address
),
-- Setup query to calculate addresses that got removed in june
removed_addresses as (
SELECT *
FROM jan_unique_addresses
EXCEPT
SELECT *
FROM jun_unique_addresses
)
-- Query to display top three npi with most number of addresses removed
SELECT
npi,
COUNT(*) as num_removed
FROM removed_addresses
GROUP BY npi
ORDER BY num_removed desc
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
npi | num_removed |
---|---|
1659334894 | 65 |
1407379845 | 44 |
1073694808 | 30 |
9. Change in plans
Question: How did PUF plans within the plans field of provider_puf_data change from January to June? This is intentionally a bit open ended :)
Answer: I came up with two questions:
a. Which networks saw the greatest change from January to June?
%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
SELECT
npi,
-- Since every npi has atleast one plan, we can use first index
plans->0->>'network_tier' as jan_network
FROM provider_puf_data
-- Only keep records for the month of January
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
SELECT
npi,
-- Since every npi has atleast one plan, we can use first index
plans->0->>'network_tier' as jun_network
FROM provider_puf_data
-- Only keep records for the month of June
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
SELECT
jan_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jan_num_providers_in_network
FROM jan_networks
GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
SELECT
jun_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jun_num_providers_in_network
FROM jun_networks
GROUP BY jun_network
)
-- Query to calculate the difference in providers per network
SELECT
-- As we are doing a full outer join, pick a non-null value
CASE WHEN jan.jan_network IS NOT NULL THEN jan.jan_network
ELSE jun.jun_network END as network,
-- Take the absolute difference between num providers
ABS(COALESCE(jan_num_providers_in_network, 0) -
COALESCE(jun_num_providers_in_network, 0)) as difference
FROM jan_npi_per_network as jan
-- Perform full outer join to get non-matching rows from both sides
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
ORDER BY difference desc
LIMIT 3;
* postgresql://read_user:***@server/dbname
3 rows affected.
network | difference |
---|---|
HERITAGE | 103 |
PPO | 70 |
NEIGHBORHOOD | 70 |
b. List the networks that did not add any new providers from January to June.
%%sql
-- Setup the jan_networks cte
WITH jan_networks as (
SELECT
npi,
-- Since every npi has atleast one plan, we can use first index
plans->0->>'network_tier' as jan_network
FROM provider_puf_data
-- Only keep records for the month of January
WHERE date_trunc('month', created_at) = '2019-01-01'
),
-- Setup the jun_networks cte
jun_networks as (
SELECT
npi,
-- Since every npi has atleast one plan, we can use first index
plans->0->>'network_tier' as jun_network
FROM provider_puf_data
-- Only keep records for the month of June
WHERE date_trunc('month', created_at) = '2019-06-01'
),
-- Setup query to calculate number of npi per network in jan
jan_npi_per_network as (
SELECT
jan_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jan_num_providers_in_network
FROM jan_networks
GROUP BY jan_network
),
-- Setup query to calculate number of npi per network in jun
jun_npi_per_network as (
SELECT
jun_network,
-- Only count unique npi
COUNT(DISTINCT npi) as jun_num_providers_in_network
FROM jun_networks
GROUP BY jun_network
)
-- Query to calculate networks with no providers
SELECT
jan.jan_network as networks_with_no_providers
FROM jan_npi_per_network as jan
FULL JOIN jun_npi_per_network as jun
ON jan.jan_network = jun.jun_network
WHERE jun.jun_network is NULL
* postgresql://read_user:***@server/dbname
13 rows affected.
networks_with_no_providers |
---|
ACCESS |
BLUE-FOCUSCARE |
CD-CONNECTION-DENTAL--PREFI |
CR--CAREINTON-PREFIX-DJ |
DC--DENTALGUARD-CONNECT-ADMIN |
DG--DENTALGUARD-PREFERRED-P |
DM-DENTEMAX-PREFIX-DW |
ELITE |
HERITAGE-PLUS |
NEIGHBORHOOD |
PIMAFOCUS |
SK-STRATOSE-FSB-MAVAREST-PPO |
SP-AETNA-PPO-PREFIX-DSW |
10. How to address the address column?
If you look closely at the address strings during exercises 7/8, you’ll notice a lot of redundant addresses that are just slightly different. If we could merge these addresses we could get much better resolution on which addresses are actually changing between January and June. Given the data you have here, how would you accomplish this? What if you could use any tools available? How would you begin architecting this for now and the future.
A simple way is to merge these fields into one is by using Postgres string functions: CONCAT
, UPPER
and TRIM
and grouping them by provider. However, this does not handle the case when we have multiple representations of the same address, shown in the cell after this.
%%sql
SELECT
npi,
COUNT(DISTINCT TRIM(UPPER(CONCAT(address->>'address', ' ',
address->>'address_2', ' ',
address->>'city', ' ',
address->>'state', ' ',
address->>'zip'
)))) as unique_addresses
FROM provider_puf_data
GROUP BY npi
ORDER BY unique_addresses desc
LIMIT 10;
* postgresql://read_user:***@server/dbname
10 rows affected.
npi | unique_addresses |
---|---|
1659334894 | 448 |
1538375241 | 256 |
1659763092 | 238 |
1821510181 | 230 |
1457751075 | 212 |
1972540664 | 209 |
1114191335 | 178 |
1730203100 | 175 |
1205822723 | 172 |
1366824997 | 147 |
%%sql
SELECT
UPPER(address->>'address') as street,
UPPER(address->>'city') as city,
UPPER(address->>'state') as state,
address->>'zip' as zipcode
FROM provider_puf_data
WHERE npi = '1538375241' and address->>'zip' = 80023::text
GROUP BY street, city, state, zipcode
* postgresql://read_user:***@server/dbname
4 rows affected.
street | city | state | zipcode |
---|---|---|---|
14422 ORCHARD PARKWAY | WESTMINSTER | CO | 80023 |
14422 ORCHARD PARKWAY SUITE 200 | WESTMINSTER | CO | 80023 |
14422 ORCHARD PKWY | WESTMINSTER | CO | 80023 |
14422 ORCHARD PKY #200 | WESTMINSTER | CO | 80023 |
All these 4 records should be counted as 1 address. One way to further break the street into house_no
and street_name
but this does not address cases where PARKWAY is written as PKWY.
An alternative way, is to employ text similarity by enabling the extension fuzzystrmatch
in Postgres and calculate the pair-wise text similarity for each provider and discard addresses with high similarity.
Conclusion
A lot of APIs send data in the JSON format and exploring JSON data is a useful skill to have. There are many tools that will help you work with JSON data, but when working with JSON data stored inside a SQL database we have an option of reading that into memory and use a package like pandas
or write our own python module that reads data from the table and pre-processes it. If you don’t want to write your logic to work with JSON data in your program, then you can rely on Postgres’ built-in operators and functions to work with JSON.