Denise Gosnell, PhD

About

Dr. Denise Gosnell is a Technology Evangelist at PokitDok, specializing in blockchain, machine learning applications of graph analytics, and data science. She joined in 2014 as a Data Scientist, bringing her domain expertise in applied graph theory to extract insight from the trenches of healthcare databases and build products to bring the industry into the 21st century.

Prior to PokitDok, Dr. Gosnell earned her Ph.D. in Computer Science from the University of Tennessee. Her research on how our online interactions leave behind unique identifiers that form a “social fingerprint” led to presentations at major conferences from San Diego to London and drew the interest of such tech industry giants as Microsoft Research and Apple. Additionally, she was a leader in addressing the underrepresentation of women in her field and founded a branch of Sheryl Sandberg’s Lean In Circles.

Popular

The PokitDok HealthGraph

Why We Love Healthcare In Graphs While the front-end team has been busy putting together version 3 of our healthcare marketplace, the data science team...

The top 3 things we love most about the PokitDok HealthGraph

We are building the world's largest health graph In our last post, we introduced the latest thing the Data Science team is building for PokitDok:...

Online Recommendation Systems: What it Really Means to be 'Somebody Like You' - Part 1

PokitBlog is starting a new trend called Technical Tuesdays #techtuesdays (Throwback Thursdays are so yesterday). In honor of our amazing technical and data science teams...

Recent Posts

AWS Presents: Managing a Healthcare System with Blockchain

Earlier this month, I was in New York City at the AWS Pop-up Loft in SoHo - our Co-Founder & CTO Ted Tanner, Jr. was...

DocRank: Computer Science Capstone Project with the College of Charleston

Throughout the 2017 spring semester, PokitDok sponsored a capstone project through the Computer Science Department at the College of Charleston. PokitDok provided a curated dataset...

Notes from the Field: Consensus 2017

We just returned from spending a few days in the Big Apple at Consensus 2017. Produced by CoinDesk, this has arguably become one of the...

Snake_Byte #28: Term x Document Matrices via Sklearn's DictVectorizer

One of the places which I frequently visit for fun, open source data sets is Kaggle. It is the place that data scientists visit when...

DC Blockchain Codeathon

In mid March, Ted Tanner (CTO and co-founder of PokitDok) and I took off to Washington, DC to support and sponsor a blockchain code-a-thon organized...

Snake_Bytes #5: Vector Distances

Last time, we took a look at the application of the dot product in a simple classifier and near the end there, Bryan set the stage...

A few weeks ago, we published a post detailing the internal graph domain-specific language (DSL) that we built to query our transaction database. Since then, we have been performing load and performance testing within our production environment. Specifically, we tested the performance of our OLTP persistence model and the latency of the OLAP entity extraction jobs that use our DSL. In this post, we detail the timing and volume statistics around the load process, entity extraction traversals, and model persistence. This post also contains an interactive example of the data we persisted and outlines our testing process.

The PokitDok APIs persist X12 transactions into our Titan graph; an interactive sample of X12 transaction data is below. For the performance tests described in this post, we persisted varying volumes of different X12 transaction types into the graph database. The graph below shows three different graph structures for the transactions used in these tests: 270, 271, and 837 transactions. Open the image in a new tab to explore the data further.

OLTP Transaction Graphs

We designed this partition of our graph to function as an OLTP database that handles the heavy write use case of our healthcare APIs. From these transactions, we use our domain-specific language (DSL) to traverse, extract, and persist entity models for OLAP processing. Given this pipeline, we created a series of tests to track the load statistics for persistence and entity extraction, from the initial API call through DSL usage.

Our API infrastructure for real-time entity extraction, model persistence, and matching is outlined in Figure 1. We persist an X12 API call into a graph database via our dynamic JSONLoader (open source coming soon). We then kick off a job that uses our graph DSL to extract and persist the entity models observed in the transaction. Lastly, we relate each entity model to the transaction in which it was observed; as we merge duplicate entities, this edge relates the merged representation of an entity to all transactions in which it has been observed (a sketch of this step follows Figure 1).

Figure 1: Graph API process across a streaming OLAP architecture
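To make this flow concrete, here is a minimal sketch of the extract-and-relate step. It uses the official Apache TinkerPop gremlinpython client rather than our internal Gremlin-Python DSL, and the vertex label ('consumer'), property keys, edge label ('observed_in'), and server endpoint are illustrative placeholders, not our actual schema.

```python
# Minimal, illustrative sketch only: official gremlinpython client, not the
# PokitDok DSL. Labels, property keys, and the endpoint are hypothetical.
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

g = Graph().traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

def persist_consumer(tx_id, consumer):
    """Upsert a consumer entity and relate it to the transaction it came from."""
    # Upsert: reuse the merged entity vertex if we have seen this consumer
    # before, otherwise create a new one.
    entity = (g.V()
                .hasLabel('consumer')
                .has('member_id', consumer['member_id'])
                .fold()
                .coalesce(__.unfold(),
                          __.addV('consumer')
                            .property('member_id', consumer['member_id'])
                            .property('name', consumer['name']))
                .next())

    # Relate the (possibly merged) entity to the transaction vertex in which
    # it was observed.
    g.V(entity.id).addE('observed_in').to(__.V(tx_id)).iterate()
    return entity
```

Because the upsert always resolves to the merged entity vertex, each new transaction simply adds another observed_in edge, which is how a single entity ends up related to every transaction in which it appears.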

To gather load statistics about this model, we spun up a pipeline to hit our graph with varying API load scenarios while also monitoring the entity extraction and persistence jobs. Each individual run persists a mix of simulated eligibility requests, eligibility responses, and healthcare claims. An individual X12 transaction contains up to 177 vertices (check out our previous post on the structure of X12 JSON). Each transaction is persisted as a tree because of the structure of the incoming JSON payload, and therefore contains at most 176 edges. A fundamental theorem in graph theory states that if a tree has a total of |V| vertices, then the number of edges it contains, |E|, is |V| - 1. Check out a version of the proof here.
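To see why each persisted payload forms a tree, the sketch below walks a toy, simplified X12-style JSON document the way a loader might: one vertex per JSON node and one edge per parent-child link. The payload and counting logic are stand-ins for the real JSONLoader, but the resulting counts always satisfy |E| = |V| - 1.

```python
# Simplified stand-in for the JSONLoader: walk a toy JSON payload and count
# one vertex per node and one edge per parent-child link.
toy_270 = {
    "eligibility": {
        "member": {"first_name": "Jane", "last_name": "Doe"},
        "provider": {"npi": "0000000000"},
        "trading_partner_id": "MOCKPAYER",
    }
}

def count_tree(node):
    """Return (vertex_count, edge_count) for a nested dict/list payload."""
    vertices, edges = 1, 0  # count the current node itself
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        children = []
    for child in children:
        v, e = count_tree(child)
        vertices += v
        edges += e + 1  # one edge from this node down to each child
    return vertices, edges

v, e = count_tree(toy_270)
print(v, e)        # 8 vertices, 7 edges for this toy payload
assert e == v - 1  # a tree on |V| vertices always has |V| - 1 edges
```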

Table 1 lists the persisted vertex and edge counts for each run of this simulation, along with the number of simulated API calls, which varies per run. By applying the above theorem, we can use the total number of persisted vertices in each load test to calculate the number of edges in the graph. The one additional consideration is that every individual persisted transaction is a disconnected sub-tree. As such, the final number of edges after persisting T total API calls is |V| - 1 - T (a quick check of this relationship against Table 1 follows the table).

Table 1: Persisted vertex and edge counts and persistence time for each load test

Total API Calls | Total Vertices | Total Edges | Persistence Time (seconds)
820             | 145,870        | 145,049     | 71.695
7,380           | 1,312,830      | 1,305,449   | 628.43
8,200           | 1,458,700      | 1,450,499   | 671.83
65,600          | 11,669,600     | 11,603,999  | 5,447.71
82,000          | 14,587,000     | 14,504,999  | 6,154.65
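As a sanity check, the edge counts in Table 1 can be reproduced directly from the vertex counts and the number of API calls using the relationship above:

```python
# Recompute the Table 1 edge counts as |V| - 1 - T, where T is the number
# of persisted API calls in each run.
runs = [
    # (api_calls, total_vertices, total_edges)
    (820, 145870, 145049),
    (7380, 1312830, 1305449),
    (8200, 1458700, 1450499),
    (65600, 11669600, 11603999),
    (82000, 14587000, 14504999),
]

for calls, vertices, edges in runs:
    assert vertices - 1 - calls == edges
```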

While persisting the API calls into the graph database, we also extract the entity information using DSL traversals over the X12 transaction graph. Table 2 outlines the number of entities traversed, extracted, and persisted by this subroutine. Further, we calculated the wall-clock time across the total number of trials, measured from the API call through the round-trip persistence of the entity to the Cassandra backend and the final graph query that verified extraction had completed (a rough throughput calculation from these numbers follows Table 2).

Table 2: Entities extracted by type and extraction time for each load test

Total API Calls | Payor Entities Extracted | Consumer Entities Extracted | Provider Entities Extracted | Extraction Time via the DSL (seconds)
820             | 1,170                    | 1,430                       | 570                         | 127.76
7,380           | 10,530                   | 12,870                      | 5,130                       | 727.96
8,200           | 11,700                   | 14,300                      | 5,700                       | 1,224.84
65,600          | 93,600                   | 114,400                     | 45,600                      | 6,031.04
82,000          | 117,000                  | 143,000                     | 57,000                      | 16,820.50
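For a rough sense of throughput (a back-of-the-envelope calculation from Table 2, not an additional benchmark), the snippet below converts the extraction times into entities extracted per second for each run:

```python
# Back-of-the-envelope throughput from Table 2: total entities extracted per
# second of wall-clock extraction time in each run.
runs = [
    # (api_calls, payors, consumers, providers, extraction_seconds)
    (820, 1170, 1430, 570, 127.76),
    (7380, 10530, 12870, 5130, 727.96),
    (8200, 11700, 14300, 5700, 1224.84),
    (65600, 93600, 114400, 45600, 6031.04),
    (82000, 117000, 143000, 57000, 16820.50),
]

for calls, payors, consumers, providers, seconds in runs:
    total = payors + consumers + providers
    print("%6d API calls: %7d entities in %9.2f s (%.1f entities/sec)"
          % (calls, total, seconds, total / seconds))
```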

We ran the jobs on standalone machines: one running Ubuntu 14.04 with an Intel Core i5-5300U CPU @ 2.3 GHz and 16 GB of RAM, and a MacBook Pro running OS X 10.10.5 with a 2.5 GHz Intel Core i7 and 16 GB of 1600 MHz DDR3 RAM. We did not use a cluster in these runs.

Our JSONLoader and DSL entity extraction jobs perform well under the heavy write load of our APIs. In a future post, we will release the results of our load tests on our EC2 cluster in AWS and examine the p99 latency. In the meantime, if you would like to inspect the setup of our current environment, you can grab our Docker container, which comes preloaded with our Gremlin-Python library and graph infrastructure. Lastly, we are extracting the PokitDok-specific components of the JSONLoader and preparing an open source release.

We are currently converting our DSL from TinkerPop 2.5.0 to TinkerPop 3.2. Specifically, we are following the development and conversation around DSLs on this ticket. Once the conversion is complete, we will move on to using the GraphComputer available in TP3 for improved OLAP processing.

2016: The Year of the Graph

Insights from the inaugural independent Graph Day Conference The inaugural Graph Day, an independent graph conference, was held on Jan 17th, 2016 in Austin, TX....

TinkerPop and Titan from a Python State of Mind: A Tutorial via Docker

The PokitDok data science team uses many components in the TinkerPop stack, along with the Titan graph database. Let it be known, though, that we’re...

Custom Titan Build to integrate with Cloudera 5.5

The PokitDok Data Science team uses the latest stable build of Titan with Hadoop2 as our primary transactional graph database. We built our graph database to...

Top 100K PokitDok Providers

Recently, we took a look at the top 100,000 providers from the PokitDok Marketplace and viewed our database's structure according to each provider's primary and...