Today, we're going to look at the Python csv module and the CMS NPI dataset. If you'd like to follow along coding in Python/IPython, browse to http://download.cms.gov/nppes/NPI_Files.html and download the file under "Full Replacement Monthly File" (direct link: http://download.cms.gov/nppes/NPPES_Data_Dissemination_January_2017.zip).
Decompress the zip archive. On Mac OS X you may need The Unarchiver to help (http://wakaba.c3.cx/s/apps/unarchiver.html).
After unarchiving you should have the 'npidata_20050523-20170108.csv' file. Run wc (word count) on the command line to check the record count. Currently this dataset has ~5.1 million records!
```shell
wc -l npidata_20050523-20170108.csv
```
We will extract the headers, perform some basic normalization of the keys, generate python dicts for each row, and output to JSONL (one JSON document per line), for subsequent processing.
Now, we would like to cleanse and normalize our headers so they form nice looking JSON documents. These cleansed field names will make it easier to query and process the documents, and to insert them into a database when we're ready for that.
For now, we'll convert the header field names to lowercase, strip punctuation, and convert spaces to underscores. In addition, we'll strip the redundant "Provider " and "Healthcare Provider " prefixes from the field names.
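The cleansing rules above can be sketched as a small function. The name clean_header and the exact regexes are illustrative assumptions, not the post's original code:

```python
import re

def clean_header(field):
    """Cleanse one NPI CSV header: strip redundant prefixes,
    lowercase, remove punctuation, and convert spaces to underscores."""
    # strip the longer prefix first so "Healthcare Provider ..." fields
    # don't end up as "healthcare_..."
    for prefix in ('Healthcare Provider ', 'Provider '):
        if field.startswith(prefix):
            field = field[len(prefix):]
            break
    field = field.lower()
    field = re.sub(r'[^\w\s]', '', field)      # strip punctuation
    return re.sub(r'\s+', '_', field.strip())  # spaces -> underscores

# A few real NPI header names run through the function:
raw_headers = [
    'NPI',
    'Provider Organization Name (Legal Business Name)',
    'Healthcare Provider Taxonomy Code_1',
]
clean_headers = [clean_header(h) for h in raw_headers]
# e.g. 'Provider First Name' -> 'first_name'
```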
Next, let's load the first data row from the NPI csv using csv.DictReader and inspect a full document.
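A minimal sketch of that step, using a tiny inline sample in place of the real npidata_20050523-20170108.csv (the sample values are made up; in the real run you'd pass an open file handle instead of the StringIO):

```python
import csv
import io

# Stand-in for the NPI CSV: original header row plus one fake data row.
sample = io.StringIO(
    'NPI,Provider Last Name (Legal Name),Provider First Name\n'
    '1234567893,SMITH,JANE\n'
)

# Cleansed header names computed earlier in the post.
clean_headers = ['npi', 'last_name_legal_name', 'first_name']

# Passing fieldnames= makes DictReader use our cleansed keys instead of
# the file's own header row, which we then skip explicitly.
reader = csv.DictReader(sample, fieldnames=clean_headers)
next(reader)        # skip the original header row
row = next(reader)  # first data row as a dict with cleansed keys
print(row)
```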
As we can see, our clean_headers are being used; however, there are several empty string values (''). We'll define a simple dict comprehension to filter these out, and finally run the full ETL (extract-transform-load) from CSV to JSONL.
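A sketch of the filter and the CSV-to-JSONL loop, again with StringIO standing in for the real input and output files (the helper name remove_empty is an assumption; swap the StringIO objects for open() calls on the npidata CSV and an output path for the real run):

```python
import csv
import io
import json

def remove_empty(d):
    """Dict comprehension dropping empty-string values, so each JSON
    document only carries populated fields."""
    return {k: v for k, v in d.items() if v != ''}

# Stand-in for the real files; headers here are already cleansed and
# the data values are made up.
infile = io.StringIO('npi,first_name,fax\n1234567893,JANE,\n')
outfile = io.StringIO()

# One JSON document per CSV row -> JSONL.
for row in csv.DictReader(infile):
    outfile.write(json.dumps(remove_empty(row)) + '\n')

jsonl = outfile.getvalue()  # note: no 'fax' key, since it was empty
```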
This may take a while to run depending on how fast your machine is ;). (On my 2.8GHz MacBook Pro it took about 8 minutes). In the end you should have a large JSONL file with the results!
Why JSONL and not regular JSON? JSONL is more convenient for large batch processing, because the file can be split into arbitrary chunks and processed in multiple threads or processes. We'll demonstrate this in the future when using Python multiprocessing tools and Apache Spark.
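To see why that splitting works, here's a minimal illustration (the sample lines are made up): each JSONL line is a complete JSON document, so a chunk of lines can be parsed on its own, with no context from the rest of the file.

```python
import json

# Any contiguous slice of a JSONL file is independently parseable;
# a worker pool (e.g. multiprocessing.Pool.map) could take one chunk each.
chunk = [
    '{"npi": "1234567893", "first_name": "JANE"}',
    '{"npi": "1245319599", "first_name": "JOHN"}',
]
docs = [json.loads(line) for line in chunk]
```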
After going through this exercise and further inspecting the JSON documents, what are some additional steps we could take?
1. use multi-processing to speed up the entire process
2. group related items into dicts, lists, or a list of dicts (licenses, other_provider_identifiers, taxonomy_codes, addresses, authorized_official, etc.)
3. parse dates into python datetime objects, resulting in ISO-8601 formatted dates in the final JSON
4. phone/fax number formatting
5. consistency checking on address and license states, against known or expected values
6. additional value "cleansing" including stripping punctuation, address normalization
7. what else?
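As a taste of items 3 and 4 above, here are two hypothetical helpers. Both names are illustrative, and the MM/DD/YYYY date format is an assumption about the NPPES date fields:

```python
from datetime import datetime

def to_iso(date_str):
    """Item 3 sketch: parse a date like '05/23/2005' into a datetime
    and emit it as an ISO-8601 string (assumed MM/DD/YYYY input)."""
    return datetime.strptime(date_str, '%m/%d/%Y').date().isoformat()

def format_phone(digits):
    """Item 4 sketch: naive US 10-digit phone formatting; assumes a
    digits-only input string, and passes anything else through as-is."""
    if len(digits) == 10:
        return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'
    return digits
```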
In future posts we'll look at some of these issues above as well as using a database to search across the entire collection.