A few weeks ago, we published a post detailing our internal graph domain specific language (DSL) which we built to query our transaction database. Since then, we have been performing some load and performance testing within our production environment. Specifically, we tested the performance of our OLTP persistence model and the latency of our OLAP entity extraction jobs which use our DSL. In this post, we detail the timing and volume statistics around the load process, entity extraction traversals, and model persistence. Additionally, this post contains an interactive example of the data we persisted and outlines our testing process.

The PokitDok APIs persist the X12 transactions into our Titan graph; an interactive sample of X12 transaction data is below. For the performance tests described in this post, we persisted varying amounts of different X12 transaction types into the graph database. The graph below shows three different graph structures for the transactions used in these tests: 270, 271, and 837 transactions. Open up the image in a new tab to explore the data further.

OLTP Transaction Graphs

We designed this partition of our graph to function as an OLTP database to handle the heavy write use case of our healthcare APIs. From these transactions, we use our domain specific language (DSL) to traverse, extract, and persist entity models for OLAP processing. Given this pipeline, we created a series of tests to track the load statistics for persistence and entity extraction from the API call to DSL usage.

Our API infrastructure for real time entity extraction, model persistence, and matching is outlined in Figure 1. We persist an X12 API call into the graph database via our dynamic JSONLoader (open-source release coming soon). We also kick off a job that uses our graph DSL to extract and persist the entity models observed in that transaction. Lastly, we relate the entity model to the transaction in which it was observed. As we merge duplicate entities, this edge relates the merged representation of an entity to all transactions in which it was observed.
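To illustrate the tree persistence step, the following plain-Python sketch (not our actual JSONLoader, whose open-source release is pending) walks a nested JSON payload and emits one vertex per node and one edge per parent-child link:

```python
# Toy sketch of persisting a nested JSON payload as a tree of
# vertices and edges (NOT the actual JSONLoader implementation).

def load_json_as_tree(payload):
    """Walk a nested dict/list and return (vertices, edges).

    Each node gets an integer id; every parent/child link becomes
    one edge, so a single payload always yields a tree in which
    len(edges) == len(vertices) - 1.
    """
    vertices, edges = [], []

    def visit(node, parent_id):
        node_id = len(vertices)
        value = node if not isinstance(node, (dict, list)) else type(node).__name__
        vertices.append({"id": node_id, "value": value})
        if parent_id is not None:
            edges.append((parent_id, node_id))
        if isinstance(node, dict):
            for child in node.values():
                visit(child, node_id)
        elif isinstance(node, list):
            for child in node:
                visit(child, node_id)

    visit(payload, None)
    return vertices, edges


# A minimal eligibility-style payload (field names illustrative only)
sample = {"subscriber": {"first_name": "JANE", "last_name": "DOE"},
          "trading_partner_id": "MOCKPAYER"}
v, e = load_json_as_tree(sample)
print(len(v), len(e))  # → 5 4, a tree: edges = vertices - 1
```

The same walk over a full 270, 271, or 837 payload produces the per-transaction trees described below.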

Figure 1: Graph API process across a streaming OLAP architecture

To gather load statistics about this model, we spun up a pipeline that hit our graph with varying API load scenarios while also monitoring the entity extraction and persistence jobs. Each individual run persists varying ratios of simulated eligibility requests, eligibility responses, and healthcare claims. An individual X12 transaction in these tests contains up to 177 vertices (check out our previous post on the structure of X12 JSON). Because of the nested structure of the incoming JSON payload, each transaction is persisted as a tree; a 177-vertex transaction therefore contains 176 edges. A fundamental theorem in graph theory states that a tree with |V| vertices contains exactly |E| = |V| - 1 edges. Check out a version of the proof here.

Across this simulation, Table 1 lists the persisted vertex and edge counts against the number of simulated API calls, which varies in each run. Applying the above theorem, we can use the total number of persisted vertices in each load test to calculate the number of edges in the graph. One adjustment is required: every individual persisted transaction is its own disconnected sub-tree. As such, the final number of edges after persisting T total API calls is |V| - 1 - T.

Table 1: Persistence statistics per load test

Total API Calls | Total Vertices | Total Edges | Persistence Time (seconds)
820 | 145,870 | 145,049 | 71.695
7,380 | 1,312,830 | 1,305,449 | 628.43
8,200 | 1,458,700 | 1,450,499 | 671.83
65,600 | 11,669,600 | 11,603,999 | 5,447.71
82,000 | 14,587,000 | 14,504,999 | 6,154.65
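The edge counts in Table 1 follow directly from the formula above; a quick sanity check in Python:

```python
# Verify Table 1: after persisting T disconnected transaction trees
# totaling |V| vertices, the reported edge count is |E| = |V| - 1 - T.

def expected_edges(total_vertices, total_calls):
    return total_vertices - 1 - total_calls

# (total API calls, total vertices, total edges) rows from Table 1
table_1 = [
    (820, 145_870, 145_049),
    (7_380, 1_312_830, 1_305_449),
    (8_200, 1_458_700, 1_450_499),
    (65_600, 11_669_600, 11_603_999),
    (82_000, 14_587_000, 14_504_999),
]

for calls, vertices, edges in table_1:
    assert expected_edges(vertices, calls) == edges
print("all Table 1 rows consistent")
```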

While persisting the API calls into the graph database, we also extract the entity information using DSL traversals over the X12 transaction graph. Table 2 outlines the number of entities traversed, extracted, and persisted by this subroutine. Further, we calculated the wall clock time across the total number of trials, starting from the API call through the round-trip persistence of the entity to the Cassandra backend and the resulting graph query that verified entity extraction completion.

Table 2: Entity extraction statistics per load test

Total API Calls | Payor Entities Extracted | Consumer Entities Extracted | Provider Entities Extracted | Extraction Time via the DSL (seconds)
820 | 1,170 | 1,430 | 570 | 127.76
7,380 | 10,530 | 12,870 | 5,130 | 727.96
8,200 | 11,700 | 14,300 | 5,700 | 1,224.84
65,600 | 93,600 | 114,400 | 45,600 | 6,031.04
82,000 | 117,000 | 143,000 | 57,000 | 16,820.50

We ran the jobs on standalone machines: one running Ubuntu 14.04 with an Intel Core i5-5300U CPU @ 2.3 GHz and 16 GB RAM, and one MacBook Pro running OS X 10.10.5 with a 2.5 GHz Intel Core i7 and 16 GB of 1600 MHz DDR3 RAM. We did not use a cluster in these runs.

Our JSONLoader and DSL entity extraction jobs perform well under the heavy write load of our APIs. In a future post, we will release the results of our load tests on our EC2 cluster in AWS and examine the p99 latency. In the meantime, if you would like to inspect the setup of our current environment, you can grab our Docker container, which comes preloaded with our Gremlin-Python library and graph infrastructure. Lastly, we are extracting the PokitDok-specific components of the JSONLoader in preparation for an open-source release.

We are currently converting our DSL from TinkerPop 2.5.0 to TinkerPop 3.2. Specifically, we are following the development and conversation around DSLs on this ticket. Once the conversion is complete, we will move on to using the GraphComputer available in TP3 for improved OLAP processing.

Graphs are an integral part of the big data movement. Graph databases allow us to store and query data in a more inter-connected and natural way. As such, the healthcare domain presents an ideal opportunity to utilize graph connections between various entities such as subscribers, providers, and payers to model a subscriber’s eligibility, filed claims, scheduled procedures and/or other transactions. Using a graph database allows us to discover the various social and demographic relationships to be used in machine learning pipelines for identifying and predicting fraudulent claims, incorrect payments, readmission rates, etc.

To utilize a graph database, we first created a graph model to store X12 transactions as vertices and edges. The interactive tree graph below shows the complexity of the 700+ page X12 claims specification. For example, it is set to show the depth required to find a subscriber's name in these transactions.

One of the goals of this post is to highlight the complexity of healthcare's native data format and showcase the usefulness of domain specific languages. To that end, the second interactive visualization in this post shows the same data as the one above. (For more on the structure of X12 in healthcare, check out a previous post that visualized and broke down the use of the X12 exchange standard in healthcare.) Once the data is in the database, the domain specific language (DSL) we designed makes data retrieval from our Titan database mimic the semantics of our API.

In the rest of this post, we showcase our patent-pending work around our graph technologies. Specifically, we step through our graph model and the design of our DSL as a finite state machine, and conclude with some example graph queries for data retrieval.


Titan is a scalable, transactional graph database for storing and querying very large graphs in a distributed fashion. Titan supports the TinkerPop stack, allowing the use of components such as Cassandra for storing large volumes of data and Gremlin for querying data via graph traversals.

While Titan uses the Gremlin query language for traversals across a graph database, there are cases where we want to avoid repetitive queries and/or retrieve structured entities in a more intuitive way. For this purpose, we have created a Groovy/Gremlin-based health graph (HG) domain specific language (DSL) that allows healthcare domain entities such as subscribers, claims, providers, their addresses, member IDs, claim info, and/or eligibility info to be extracted through the DSL. The DSL is accessible from client scripts in Python or Groovy, or via TinkerPop's Gremlin console. The PokitDok Data Science team has also created a custom build of Titan to further customize our graph database technologies.

A Primer on DSLs

A DSL (Domain Specific Language) is a programming language designed to express solutions to problems in a particular domain, with deliberately limited scope and capabilities. In this case, the objective of the DSL is to query and retrieve healthcare data from a graph database using simple, intuitive directives with proper semantics.

Developing a DSL has several advantages. Through the abstraction layer, the semantics of the underlying data can be represented easily, and more meaningful queries can be executed to extract structured data from the graph. By using the DSL, users can concentrate on their data analysis algorithms rather than on data retrieval and query setup.

Our DSL is well defined: we created a healthcare semantic model and traverse the underlying domain data through a linguistic abstraction on top of it. In our case, the abstraction layer sits on top of the health graph data layer in Cassandra and speaks the dialect of the user, abstracting away repetitive Gremlin traversals. The DSL captures the high-level semantics of the healthcare domain entities as well as their inter-related data and information. The Semantic Model is the object model that DSL scripts populate, and it is designed around what will be done with the information those scripts retrieve. A clear Semantic Model allows us to test the semantics and the parsing of the DSL separately: we can test the semantics by populating the Semantic Model directly and executing tests against the model. The figure below shows the different layers of our DSL development.

DSL Layers

The different layers of development for our domain specific language

Design of the State Machine & Semantic Model for HG-DSL

Our DSL comprises a semantic model of the domain's embedded entities, a state machine that drives entity retrieval via Gremlin traversals, and a set of DSL functions on top as an abstraction. The DSL is exposed to client applications through an API.

This design allows all of these components to evolve independently while remaining loosely coupled as layers and microservices. We can change the behavior of a DSL function by changing the state machine and its semantic model, or vice versa, without breaking anything else. For example, if the “patient” DSL were modified to include more detailed information in the retrieval from the patient vertex, that could be done without changing the DSL API. This gives us flexibility in a domain where changing data requirements can pose problems for query semantics.
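The state machine idea can be sketched minimally as follows. This is illustrative Python, not the production HG-DSL (which is Groovy/Gremlin), and the states and transitions are hypothetical:

```python
# Minimal finite state machine sketch for an entity-retrieval flow.
# States and transitions are hypothetical illustrations, not the
# production HG-DSL state machine.

TRANSITIONS = {
    ("start", "claims"): "claim",
    ("claim", "patient"): "patient",
    ("claim", "provider"): "provider",
    ("patient", "address"): "address",
}

class RetrievalMachine:
    def __init__(self):
        self.state = "start"
        self.path = []

    def step(self, directive):
        # Only transitions defined in the table are legal; anything
        # else is a malformed query and fails fast.
        key = (self.state, directive)
        if key not in TRANSITIONS:
            raise ValueError(f"'{directive}' is not valid from state '{self.state}'")
        self.state = TRANSITIONS[key]
        self.path.append(directive)
        return self  # chaining mirrors the fluent DSL syntax

m = RetrievalMachine().step("claims").step("patient").step("address")
print(m.state)  # → address
```

Because every DSL directive is just a transition in the table, changing what a directive retrieves means editing the table (and the semantic model behind it) without touching the API surface.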

We follow a top-down methodology, developing from the semantic model to the design of the syntactical query model, as recommended in Semantics Driven DSL Design. We decomposed the healthcare domain and defined the entities that make up the domain and their inter-relationships before developing the syntactical query and traversal representation of those semantic entities and relationships. The figure below shows part of our domain semantic model and a use-case representation of a state machine for retrieving an entity from the semantic model.

Semantic Model and State Machine for Retrieval


Our DSL is divided into different healthcare sub-domains, such as the X12 formats and their entities, and healthcare data such as pricing, procedures, and providers. From the API perspective, we use nested closures within DSL scripts, which provide inline nesting, deferred evaluation, and limited-scope variables.
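The nested-closure pattern can be illustrated in Python (the production DSL uses Groovy closures; every name below is hypothetical): the inner block builds up a query description that only runs when the outer closure invokes it.

```python
# Illustration of nested closures with deferred evaluation, as used
# conceptually in DSL scripts. Groovy closures are the real mechanism;
# Python lambdas stand in here, and all names are hypothetical.

def claims(block):
    # Outer closure: establishes a limited-scope query context,
    # then defers to the nested block to refine it.
    context = {"entity": "claim", "filters": []}
    block(context)
    return context

def by_provider(npi):
    # Returns an inner closure; nothing executes until the outer
    # closure calls it with the shared context.
    return lambda ctx: ctx["filters"].append(("provider_npi", npi))

query = claims(by_provider("0000000000"))
print(query["filters"])  # the filter was applied only when claims() ran
```

Inline nesting keeps the refinement next to the directive it modifies, while the context dict stays scoped to a single query.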

The following interactive diagram presents our Claims DSL Graph API. To highlight the simplicity of a well-designed DSL, the graph is set to show the same data as the first interactive graph in this post.

To compare the traversals side by side, the following block depicts a few examples of retrieving persisted claim data using standard Gremlin queries versus the Claims graph DSL. It shows how some of the horrendous queries needed to retrieve healthcare entities from the X12 exchange standard can be dramatically simplified with the DSL.

Gremlin Queries vs. DSL Queries
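To give a flavor of the contrast in plain Python (the actual Gremlin and HG-DSL queries run against Titan; the toy in-memory "graph", schema, and method names below are purely illustrative), here is the same lookup written by hand versus through a thin fluent wrapper:

```python
# Toy contrast between a raw, hand-written traversal and a DSL-style
# wrapper over the same in-memory "graph". Schema and method names
# are hypothetical, purely for illustration.

GRAPH = [
    {"label": "claim", "id": "c1", "subscriber": {"last_name": "DOE"}},
    {"label": "claim", "id": "c2", "subscriber": {"last_name": "ROE"}},
]

# Raw style: the caller must know the storage layout and walk it by hand.
raw = [v["subscriber"]["last_name"] for v in GRAPH if v["label"] == "claim"]

# DSL style: the layout knowledge lives behind a fluent interface.
class HealthGraph:
    def claims(self):
        self._results = [v for v in GRAPH if v["label"] == "claim"]
        return self

    def subscriber_last_names(self):
        return [v["subscriber"]["last_name"] for v in self._results]

dsl = HealthGraph().claims().subscriber_last_names()
assert raw == dsl
print(dsl)  # → ['DOE', 'ROE']
```

In the real system the raw side is a multi-step Gremlin traversal over the X12 tree, so the gap between the two styles is far wider than this sketch suggests.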

The health graph is represented as an ordered pair G = (N, E) comprising a set of nodes (N) and a set of edges (E). We use this mathematical representation and the properties of our graph to define various data analysis algorithms and their behavior. With this design, we can change the ingress data model and correspondingly change the semantics of the entities and information retrieved without any disruption to client applications. The formal model and properties of our DSL are still in progress, and its semantics are not yet defined in a declarative language formalism. We have also used domain-specific knowledge of the underlying data structures to improve query performance; these optimizations occur beneath the abstraction layer, within the DSL implementation.

With this DSL, we can query our graph database for testing and analytics with the same semantic structure as the PokitDok API. This establishes seamless usage for data exchange across the internal PokitDok infrastructure (patents pending). To fully test the functionality of this software, we simulated the persistence of up to 30 million vertices in our database and tested the retrieval of data via our DSL. In a future blog post, we will release the timing statistics for the persistence and retrieval of data in this system along with information on our usage of Hadoop in this infrastructure.
