Dirty data ETL in healthcare: because IBM EBCDIC is still a thing.

EBC... what?

We recently received a bulk dump of fixed-width claims data files from a large payer, encoded using IBM EBCDIC, so it was time to put on our data janitor hats. After figuring out exactly what the heck EBCDIC was (a character encoding used by IBM mainframes, for those of you too young to remember those days), the next step was to re-encode the files into something a bit more modern, like UTF-8. Luckily, Python makes it easy to decode and encode strings.
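Python's codecs machinery ships with the common EBCDIC code pages, so the round trip is a one-liner per file. A minimal sketch, assuming code page 500 (the payer's files may use a different page, such as cp037, so the codepage parameter here is an assumption):

```python
def ebcdic_to_utf8(src_path, dst_path, codepage="cp500"):
    """Re-encode a file from EBCDIC (code page 500 by default) to UTF-8."""
    with open(src_path, "rb") as src:
        text = src.read().decode(codepage)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)

# The same decode/encode round trip on a raw byte string:
raw = b"\xc8\x85\x93\x93\x96"   # "Hello" in EBCDIC cp500
print(raw.decode("cp500"))      # -> Hello
```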

The files in the dump were organized roughly as follows:

  • 24 member data files where each row has a member_id
  • 1 provider file where each row has a provider_id
  • 1 employer file where each row has an employer_id
  • 12 claim service line files where each row has a claim_id, member_id, provider_id, and employer_id

What tool to use?

Now that we had the text files in a more sensible encoding, it was time to join them all into a unified and unique set. Sed, awk, cut, etc. have long been the go-to tools for cleaning, munging, and joining text files on *nix systems, but what do you use when dealing with 2TB of text files on HDFS? (Py)Spark is quickly becoming our tool of choice for such situations.

First steps to igniting a Spark

Our end goal was to have a single DataFrame where each row contains the member data, provider data, employer data, and all of the claim service lines for a single claim. Our first step in getting the data into the Spark world was to convert the fixed-width text files to Spark SQL DataFrames and persist them as Parquet files on HDFS. DataFrames allow us to easily manipulate, transform, and join data using either SQL or an API that will be very familiar to users of R or pandas. The Spark team has also spent a lot of time and energy on the Catalyst optimizer to make Spark SQL execution fast and efficient. Parquet is a columnar storage format for the Hadoop ecosystem that allows Catalyst to make further performance optimizations. Converting the text files to Parquet files is relatively straightforward in PySpark.

Grouping Service Lines

Now we needed to group all of the claim service lines for each claim.

The grouping produces a DataFrame where each row contains a claim_id and an array of that claim's service lines.

Hmmm… too many claims

Once we unioned the 12 grouped claim service line DataFrames, we noticed rows with duplicate claim ids. Spark makes de-duplication easy too.

Bringing it all together

All that is left now is to join our four DataFrames into a single one.

It wasn’t all roses…

Adoption of Spark is growing like wildfire, and it is the most active of all the Apache projects, with a very active and responsive mailing list. While the above method for ETL'ing awful EBCDIC into a single Spark DataFrame is very straightforward, we did run into several issues running the process on our Mesos cluster. 95% of our issues were resolved by reading the Configuration Guide, and the other 5% were fixed by adding more nodes to our cluster.

About Colin Alstad

Colin is a data scientist at PokitDok where he works on insurance claims and healthcare pricing analysis. He's a recovering mathematician, currently in relapse, studying algebraic topology and its applications.
