Snake_Byte #34: Joining NPI with NUCC using Apache Spark


In case you missed it, my last Snake_Byte looked at the CMS NPI dataset. If you followed along at home, you might have noticed the final JSON document for each healthcare provider only had specialty "taxonomy codes" with no name or label. In this Snake_Byte, we'll revisit the CMS NPI dataset "ETL", but this time we'll join it with the NUCC taxonomy to obtain the specialty name for each provider.

In addition, for this Snake_Byte, we'll use Apache Spark in Python (pyspark) within a jupyter notebook. First, we'll set up our environment and download the datasets. Next, we'll start up our pyspark notebook, load the NPI and NUCC DataFrames, and then perform a join on these two datasets. Finally, we'll demonstrate a few DataFrame APIs and export our results to JSON.

Development Environment

Before getting started we'll need to set up our environment. For this project I am running Mac OS X 10.11.6, El Capitan. (YMMV if you are running another OS.)

* Spark:
For this post we are using version 2.1.0. Download the tarball, decompress and cd into "spark-2.1.0-bin-hadoop2.7" directory.

* Python virtualenv:

* Jupyter notebook:

* Launch pyspark jupyter notebook:

The Data

* Download the CMS NPI dataset from here: Latest version is from July 2017. Decompress the zip file.

* Download the NUCC taxonomy from here: Latest version is 17.1 (

* Remember where you downloaded and decompressed these files. You'll need to replace "path-to" in the gists below with your local path.

ETL: extract-transform-load

Load up the NUCC taxonomy into a DataFrame:

As we can see the taxonomy has 851 rows. For this post, we are only interested in the Code, Classification, and Specialization columns. So, we'll select these headers into another DataFrame, and add aliases. These aliases will make the join syntax much nicer:

Load up the CMS NPI file into a DataFrame:

The CMS NPI has nearly 5.3M rows now, and it took about 25 seconds to traverse and count the entire dataset. For now, we'll limit ourselves to just the first taxonomy code, Healthcare Provider Taxonomy Code_1, and provider last_name and organization_name fields.

Do the JOIN

Join condition is on taxonomy_code. Note also that we're specifying the keyword argument, how='left_outer'. This ensures that all documents from the NPI dataset will be included even if there is no taxonomy_code.


Now that we have meaningful specialty labels, we can perform some analysis across the entire NPI dataset. For example, let's find the top 10 specialty occurrences and order by count descending:

Export JSONL

Now let's export the whole thing to JSONL (one JSON document per line).

By default, spark will export multiple files, one per partition:

If instead we want just one very large file we'll need to glom these files together. (Look into the cat API in bash or HDFS.) Alternatively if we have enough RAM we can just coalesce to 1 partition prior to calling saveAsTextFile.

Let's take a look at this single file:

Now we have one very large JSONL file (part-00000) that can be imported into a document storage and/or analyzed in subsequent ETL pipelines. For example, we can grep the JSONL file to find NPIs for a given specialty:

Next Steps

There are several next steps we can take:
* Extend our processing to include all provider taxonomy codes (up to 15)
* Create a stand-alone script, runnable with spark-submit
* Join with other CMS datasets
* Import into document store (mongodb, elasticsearch)

In conclusion, I hope you've seen it's straightforward to ETL, join, and analyze data using Apache Spark. At PokitDok, when working with large datasets from CMS and other trading partners it is an indispensable tool.

About jeffhoekman

At PokitDok Inc. Jeff focuses on data and search capabilities and enjoys working with open source software such as elasticsearch and Apache Spark. In his free time Jeff enjoys kite and paddle surfing and eating out in Charleston!

View All Posts