HealthGraph Domain Specific Language: A Semantic Data Retrieval Layer

Graphs are an integral part of the big data movement. Graph databases allow us to store and query data in a more interconnected and natural way. The healthcare domain presents an ideal opportunity to use graph connections between entities such as subscribers, providers, and payers to model a subscriber's eligibility, filed claims, scheduled procedures, and other transactions. A graph database also lets us discover the social and demographic relationships that can feed machine learning pipelines for identifying and predicting fraudulent claims, incorrect payments, readmission rates, and more.

To utilize a graph database, we first created a graph model that stores X12 transactions as vertices and edges. The interactive tree graph below shows the complexity of the 700+ page X12 claims specification. For example, it is set to show the depth required to find a subscriber's name in these transactions.
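To make that depth concrete, here is a minimal sketch (not our actual graph model) of a fragment of an X12 claim stored as vertices and edges in a plain adjacency map. The loop and segment names are illustrative stand-ins, loosely following the 837 specification's nesting:

```python
# Hypothetical fragment of an X12 837 claim as vertices and edges.
# Names are illustrative only; the real specification nests far deeper.
edges = {
    "transaction": ["billing_provider_loop"],
    "billing_provider_loop": ["subscriber_loop"],
    "subscriber_loop": ["subscriber_name_segment"],
    "subscriber_name_segment": ["subscriber_name"],
}

def depth_to(graph, start, target, depth=0):
    """Depth-first search returning the number of hops from start to target."""
    if start == target:
        return depth
    for child in graph.get(start, []):
        found = depth_to(graph, child, target, depth + 1)
        if found is not None:
            return found
    return None

print(depth_to(edges, "transaction", "subscriber_name"))  # 4 hops
```

Even this toy version takes four hops to reach the subscriber's name; the full specification requires considerably more.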

One goal of this post is to highlight the complexity of healthcare's native data format and to showcase the usefulness of domain specific languages. To that end, the second interactive visualization in this post shows the same data as the one above. (For more on the structure of X12 in healthcare, see our previous post, which visualized and broke down the use of the X12 exchange standard in healthcare.) Once the data is in the database, our domain specific language (DSL) makes data retrieval from our Titan database mimic the semantics of our API.

In the rest of this post, we showcase our patent-pending work around graph technologies. Specifically, we step through our graph model and the design of our DSL as a finite state machine, and conclude with some example graph queries for data retrieval.

Titan-Rexster-Blueprints-Gremlin

Titan is a scalable, transactional graph database for storing and querying very large graphs in a distributed fashion. Titan supports the TinkerPop stack, allowing the use of components such as Cassandra for storing large volumes of data and Gremlin for querying data via graph traversals.

While Titan utilizes the Gremlin query language for traversals across a graph database, there are cases where we want to avoid repeating queries or would like to retrieve structured entities in a more intuitive way. For this purpose, we have created a Groovy/Gremlin-based health graph (HG) domain specific language (DSL) that allows healthcare domain entities (such as subscribers, claims, providers, their addresses, member IDs, claim info, and eligibility info) to be extracted directly. The DSL is accessible from client scripts in Python or Groovy, or via TinkerPop's Gremlin console. The PokitDok Data Science team has also created a custom build of Titan to allow further customization of our graph database technologies.

A Primer on DSLs

A DSL (domain specific language) is a programming language designed to express solutions to problems in a particular domain, with deliberately limited scope and capabilities. In our case, the objective of the DSL is to query and retrieve healthcare data from a graph database using simple, intuitive directives and syntax with well-defined semantics.

There are several advantages to developing a DSL. Through the abstraction layer, the semantics of the underlying data can be represented cleanly, and more meaningful queries can be executed to extract structured data from the graph. Users can then concentrate on their data analysis algorithms rather than on data retrieval and query setup.
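The abstraction advantage can be sketched in a few lines. This toy example (in Python, not our actual Groovy/Gremlin DSL, with made-up vertex and edge names) contrasts a raw hop-by-hop traversal with a DSL-style wrapper that names the domain intent:

```python
# Toy in-memory graph: (vertex, edge_label) -> destination.
graph = {
    ("claim:1", "has_subscriber"): "subscriber:9",
    ("subscriber:9", "has_name"): "JANE DOE",
}

def traverse(vertex, *edge_labels):
    """Follow a chain of edge labels from a starting vertex."""
    for label in edge_labels:
        vertex = graph[(vertex, label)]
    return vertex

# Raw query: the caller must know every edge label and its order.
raw = traverse("claim:1", "has_subscriber", "has_name")

# DSL wrapper: the domain intent is captured once, behind a readable name.
def subscriber_name(claim_id):
    return traverse(claim_id, "has_subscriber", "has_name")

print(subscriber_name("claim:1"))  # JANE DOE
```

The analyst calls subscriber_name and never touches edge labels; if the underlying model changes, only the wrapper is updated.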

Our DSL is backed by a healthcare semantic model that is traversed through a linguistic abstraction layered on top. In our case, that abstraction sits above the health graph data layer in Cassandra and speaks the dialect of the user, hiding repetitive Gremlin traversals. The DSL captures the high-level semantics of the healthcare domain entities as well as their interrelated data and information. The semantic model is the object model that the DSL populates, and its design is driven by what will be done with the information from a DSL script. Keeping the semantic model distinct lets us test the semantics and the parsing of the DSL separately: we can test the semantics by populating the model directly and executing tests against it. The figure below shows the different layers of our DSL development.

DSL Layers

The different layers of development for our domain specific language

Design of the State Machine & Semantic Model for HG-DSL

Our DSL comprises a semantic model of the embedded domain entities, a state machine that emulates entity-retrieval actions as Gremlin traversals, and a set of DSL functions layered on top as an abstraction. The DSL is exposed to client applications through an API.

This design allows all of these components to evolve independently while remaining loosely coupled layers and microservices. We can change the behavior of a DSL function by changing the state machine and the corresponding semantic model, or vice versa, without breaking anything. For example, if the "patient" DSL were modified to include more detailed information when retrieving from the patient vertex, it could be done without changing the DSL API. This gives us flexibility in a domain where changing data requirements can complicate query semantics.
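A toy state machine in the spirit of this retrieval design might look like the following Python sketch. The states, tokens, and transitions are invented for illustration; they are not PokitDok's actual machine. The point is that the public stepping API stays fixed while the transition table (and the traversal it emulates) can change underneath:

```python
class RetrievalStateMachine:
    """Toy FSM: each DSL token advances the machine toward a retrieval."""

    # (current_state, token) -> next_state; hypothetical names throughout.
    TRANSITIONS = {
        ("start", "claim"): "at_claim",
        ("at_claim", "patient"): "at_patient",
        ("at_patient", "name"): "done",
    }

    def __init__(self):
        self.state = "start"
        self.path = []  # tokens consumed so far, i.e. the emulated traversal

    def step(self, token):
        key = (self.state, token)
        if key not in self.TRANSITIONS:
            raise ValueError(f"no transition for {token!r} from {self.state!r}")
        self.state = self.TRANSITIONS[key]
        self.path.append(token)
        return self  # allow fluent chaining, as a DSL would

machine = RetrievalStateMachine().step("claim").step("patient").step("name")
print(machine.state)  # done
print(machine.path)   # ['claim', 'patient', 'name']
```

Enriching the "patient" retrieval would mean editing TRANSITIONS (and the model it populates), while callers keep issuing the same step sequence.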

We follow a top-down methodology, developing from the semantic model to the design of the syntactical query model, as recommended in Semantics Driven DSL Design. We decomposed the healthcare domain and defined the entities that make up the domain, along with their interrelationships, before developing the syntactical query and traversal representation of those semantic entities and relationships. The figure below shows part of our domain semantic model and a use-case representation of a state machine for retrieving an entity from it.

Semantic Model and State Machine for Retrieval


HG DSL API

Our DSL is divided into different healthcare sub-domains, such as X12 formats and their entities, and healthcare data such as pricing, procedures, and providers. From the API perspective, we use nested closures within DSL scripts, which provide inline nesting, deferred evaluation, and limited-scope variables.
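The nested-closure style can be sketched in Python (the real DSL uses Groovy closures; the function names here are hypothetical). The blocks nest inline, evaluation is deferred until the query is actually run, and each block's variables stay scoped to its own closure:

```python
def claim(block):
    """Outer closure: returns a deferred query builder for a claim entity."""
    def build():
        spec = {"entity": "claim"}
        block(spec)  # deferred evaluation: the nested block runs only here
        return spec
    return build

def subscriber(**fields):
    """Inner closure: its fields remain scoped to this block."""
    def apply(spec):
        spec["subscriber"] = fields
    return apply

# Inline nesting: the subscriber block is written inside the claim block.
query = claim(subscriber(first_name="JANE", last_name="DOE"))

# Nothing has executed yet; calling query() evaluates the nested closures.
print(query())
```

Calling `query()` yields `{'entity': 'claim', 'subscriber': {'first_name': 'JANE', 'last_name': 'DOE'}}`; until then the script is just a description of the retrieval.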

The following interactive diagram presents our Claims DSL graph API. To fully highlight the simplicity of a well-designed DSL, the graph is set to show the same data as the first interactive graph in this post.

To compare the traversals side by side, the following block shows a few examples of retrieving persisted claim data using standard Gremlin queries versus the Claims graph DSL. It illustrates how some of the unwieldy queries needed to retrieve healthcare entities from the X12 exchange standard can be dramatically simplified by the DSL.

Gremlin Queries vs. DSL Queries
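For readers without the interactive comparison handy, here is an illustrative stand-in for the contrast. The real queries run against Titan in Gremlin/Groovy; this toy uses an in-memory edge list and made-up labels. The raw form chains generic hops, while the DSL form reads like the domain:

```python
# Toy edge list: (source, edge_label, destination); names are hypothetical.
EDGES = [
    ("claim:1", "claim_subscriber", "subscriber:9"),
    ("subscriber:9", "subscriber_address", "address:3"),
    ("address:3", "city", "CHARLESTON"),
]

def out(vertex, label):
    """All vertices reachable from `vertex` over edges labeled `label`."""
    return [dst for src, lbl, dst in EDGES if src == vertex and lbl == label]

# Raw traversal: every hop and edge label spelled out by the caller.
raw = out(out(out("claim:1", "claim_subscriber")[0], "subscriber_address")[0], "city")[0]

class ClaimQuery:
    """Fluent DSL-style wrapper that hides the hop-by-hop traversal."""
    def __init__(self, vertex):
        self.vertex = vertex
    def subscriber(self):
        return ClaimQuery(out(self.vertex, "claim_subscriber")[0])
    def address(self):
        return ClaimQuery(out(self.vertex, "subscriber_address")[0])
    def city(self):
        return out(self.vertex, "city")[0]

dsl = ClaimQuery("claim:1").subscriber().address().city()
print(raw, dsl)  # CHARLESTON CHARLESTON
```

Both forms return the same city, but only the DSL version survives a change to the underlying edge labels without touching every caller.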

The health graph is represented as an ordered pair G = (N, E), comprising a set of nodes (N) and a set of edges (E). We use this mathematical representation and the properties of our graph to define various data analysis algorithms and their behavior. With this design we can change the ingress data model, and correspondingly the semantics of the entities and information retrieved, without any disruption to client applications. The formal model and properties of our DSL are still a work in progress, and its semantics are not yet defined in a declarative formalism. We have also used domain-specific knowledge of the underlying data structures to improve query performance; these optimizations occur beneath the abstraction layer, within the DSL implementation.
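Concretely, the ordered-pair formulation can be written out for a tiny health graph (hypothetical node names), and algorithms defined over N and E directly, independent of the storage layer:

```python
# G = (N, E): nodes and edges for a tiny, made-up health graph.
N = {"subscriber:9", "claim:1", "provider:4"}
E = {("claim:1", "subscriber:9"), ("claim:1", "provider:4")}

def degree(node):
    """Number of edges incident to a node in G = (N, E)."""
    return sum(1 for u, v in E if node in (u, v))

print(degree("claim:1"))  # 2
```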

With this DSL, we can query our graph database for testing and analytics with the same semantic structure as the PokitDok API. This enables seamless data exchange across the internal PokitDok infrastructure (patents pending). To fully test the functionality of this software, we simulated the persistence of up to 30 million vertices in our database and tested the retrieval of data via our DSL. In a future blog post, we will release timing statistics for the persistence and retrieval of data in this system, along with information on our usage of Hadoop in this infrastructure.
