Good Dataset Engineering

Avoiding Garbage In Garbage Out:

A lot of the dirty work in data science is ETL -- the process of putting data through a meat grinder. Data may present itself in the form of a public or purchased dataset, and it almost never lines up with the way you want to query, index, or present it.

Input datasets are sometimes clean but more often messy or non-uniform. They could be in various formats: CSV, Excel, fixed-width, or a zip file of many different things. The one thing you can count on is that you can’t count on much of anything from input sources. You don’t control the format or the quality, and you can’t guarantee that they will never change. If your input dataset updates periodically, the quality or format might worsen, but even a supposed improvement in format or quality could break your ETL process. How can you assert some sort of rigor around unrigorous input? How can you make those assertions programmatic and automated? There is a simple process for avoiding garbage-in-garbage-out in data science.

To frame the problem a bit more, most modern data transformation tools like Apache Spark or Python notebook tools are bred for interactive development. They’re meant to make it fast and easy to get started, simple to share. Never confuse these things with being production-ready. Would your organization ever feel good about deploying some code to production that had no tests? Spark and notebooks should be considered prototyping unless proven otherwise. There is nothing wrong with prototyping, but you want some safety mechanisms before considering something production-ready.

Safety mechanisms should assert high-level things: does the output match the expected format? How can we programmatically make this assertion? The simplest way is to write a test that checks the output against a schema. The schema doesn’t have to be anything formal -- don’t get caught up in bikeshedding schema formats! It is way more important to automate the output assertion itself. A simple unit test can take the whole dataset or a sample. It can check some fields or all fields, or make one particular assertion, such as a single field being unique. The semantics are up to you: make them meaningful for your dataset.
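As a rough illustration of how informal such a schema can be, here is a minimal sketch in plain Python. The field names, the `OUTPUT_SCHEMA` dictionary, and the `schema_errors` helper are all made up for this example; the only assumption is that the pipeline's output can be read as a list of dictionaries.

```python
# Hypothetical informal schema: field name -> expected Python type.
OUTPUT_SCHEMA = {"patient_id": str, "visit_date": str, "charge_cents": int}


def schema_errors(rows, schema, unique_field=None):
    """Return a list of human-readable problems; an empty list means the rows conform."""
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        extra = row.keys() - schema.keys()
        if missing:
            errors.append(f"row {i}: missing fields {sorted(missing)}")
        if extra:
            errors.append(f"row {i}: unexpected fields {sorted(extra)}")
        # Check the type of every field that is present.
        for field, expected in schema.items():
            if field in row and not isinstance(row[field], expected):
                actual = type(row[field]).__name__
                errors.append(f"row {i}: {field} is {actual}, expected {expected.__name__}")
        # Optionally assert that one field is unique across all rows.
        if unique_field is not None and unique_field in row:
            value = row[unique_field]
            if value in seen:
                errors.append(f"row {i}: duplicate {unique_field} {value!r}")
            seen.add(value)
    return errors


def test_output_matches_schema():
    # In a real test these rows would come from the pipeline output or a sample of it.
    rows = [
        {"patient_id": "p1", "visit_date": "2016-01-04", "charge_cents": 1250},
        {"patient_id": "p2", "visit_date": "2016-01-05", "charge_cents": 800},
    ]
    assert schema_errors(rows, OUTPUT_SCHEMA, unique_field="patient_id") == []
```

Returning a list of problems instead of stopping at the first one is a design choice: when a check fails, the full list tells you whether one row is off or the whole dataset has shifted.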

Treat your pipelines as if they were APIs

Consider an HTTP API: when you slam an API with bad data, you inevitably receive a 400 or even 500 status code. You immediately know something has failed. On the other hand, when an ETL process without safety mechanisms is given bad input, you can get bad output, and worse yet, this can be silent. No status codes, no sad face, just silent garbage. The easiest way to ward this off is to make schema assertions about your input. The input itself is usually not under your control, but that certainly doesn’t prevent you from constraining it. A dataset that gets updated and dumped monthly can break. Humans make mistakes. Joe Schmoe in accounting could have dropped a field from the output. Heck, you could have an entire file missing. If you don’t want holes in the output data, write some code to assert things about your input. Assert that a schema holds programmatically before running the ETL job. You cannot programmatically fix bad input, but catching it is trivial and vital in preventing corruption from propagating downstream. A pre-flight check that fails for an ETL job is analogous to an HTTP status 400. Even if the input dataset is enriched (that is, it gains novel fields), you want to catch this with a failed check so that you can get human eyes on why it failed before proceeding with the ETL pipeline.
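A pre-flight check along these lines might look like the sketch below. It assumes the input arrives as CSV with a header row; the column names, `PreflightError`, and `preflight_check` are hypothetical names chosen for illustration, not part of any library.

```python
import csv
import io

# Hypothetical columns we expect in the monthly dump.
EXPECTED_COLUMNS = ["account_id", "month", "amount"]


class PreflightError(Exception):
    """Raised when the input does not match expectations; the ETL job should not start."""


def preflight_check(csv_file, expected_columns):
    """Validate the header and row widths of an open CSV file before any ETL work."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        raise PreflightError("input is empty: no header row")
    missing = [c for c in expected_columns if c not in header]
    novel = [c for c in header if c not in expected_columns]
    # Dropped fields and new fields both fail: a human should look before we proceed.
    if missing or novel:
        raise PreflightError(f"header mismatch: missing={missing}, novel={novel}")
    for row in reader:
        if len(row) != len(header):
            raise PreflightError(
                f"line {reader.line_num}: expected {len(header)} values, got {len(row)}"
            )


# Usage: an in-memory file stands in for the real input here.
sample = io.StringIO("account_id,month,amount\n17,2016-01,42.50\n")
preflight_check(sample, EXPECTED_COLUMNS)  # passes silently; raises on bad input
```

With a real file, `open()` raising `FileNotFoundError` already covers the missing-file case, as long as nothing downstream swallows the exception.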

If the ETL pipeline is relatively complex with many steps, but only two steps are important, then put unit tests around those milestones. Validate that the milestone steps have the expected input and output. Again you can use a schema or a simple function, but the complication with testing an interim step is that you have to be able to run that step in isolation. You don’t want to run steps A -> Z if all you’re checking is step M. Does step M have an entrypoint? Or did you build a chain of calls that always starts with step A? This requires forethought, but I’ve found that it results in more composable pipelines and in reuse between different pipelines.
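One way to give every step its own entrypoint is to write each step as a plain function from rows to rows and keep the chaining in a separate place. The sketch below is only illustrative; the step names and the `id` and `state` fields are invented for the example.

```python
def strip_whitespace(rows):
    """Step: trim surrounding whitespace from every string value."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_blank_ids(rows):
    """Step: discard rows whose 'id' is empty."""
    return [row for row in rows if row["id"]]


def uppercase_state(rows):
    """Step: normalize the 'state' field to upper case."""
    return [{**row, "state": row["state"].upper()} for row in rows]


# The pipeline is just an ordered list of steps; other pipelines can reuse any of them.
PIPELINE = [strip_whitespace, drop_blank_ids, uppercase_state]


def run_pipeline(rows, steps=PIPELINE):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


def test_drop_blank_ids():
    # The milestone step is exercised directly, without running the steps before it.
    rows = [{"id": "7", "state": "sc"}, {"id": "", "state": "nc"}]
    assert drop_blank_ids(rows) == [{"id": "7", "state": "sc"}]
```

Because no step calls the next one itself, a test can hand `drop_blank_ids` hand-built input that looks like the previous step's output, which is the isolation the paragraph above asks for.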

Data science is an engineering discipline. Treat your pipelines as if they were an API. Write tests, write small readable functions, find appropriate abstractions, then write tests for those. Calling some code a “script” does not earn it a pass on engineering discipline.

About Ghadi Shayban

I work on backend data engineering at PokitDok, and I'm passionate about bringing health care informatics a couple decades forward. If I were stranded on a desert island and had a laptop, I'd open up Emacs and a Clojure REPL. When not on a computer keyboard, I can be found on a piano keyboard, playing around Charleston, SC.
