TinkerPop and Titan from a Python State of Mind: A Tutorial via Docker

The PokitDok data science team uses many components in the TinkerPop stack, along with the Titan graph database. Let it be known, though, that we’re a serious Python shop. As a team, we wanted to do data analytics and not have to context switch between all the languages that are required to stand up this graph database. There was a desire to continue to use Python syntax when defining graph schema using the management system, performing graph traversals, building recommendation systems and so on, but the TinkerPop and Titan stacks run on the JVM.

Previously, we open sourced the work we've been doing to help engineers and data scientists use Python to work within TinkerPop and Titan from a Python state of mind. Recently, we have added to it. Specifically, we wrapped up the library and infrastructure into a Docker container, along with a synthetic dataset to play with, for a talk and demo that Brian Corbin and Denise Gosnell gave at PyData in New York. To supplement our example, our front end engineers Jonathan Fann and Joe Wright whipped up an interactive visualization of the example dataset in D3. The large purple vertices depict the doctors in the sample data set. The small grey vertices are the consumers and their connections indicate whether or not a consumer viewed and/or scheduled with the doctor.

If you prefer, you can skip straight to the code and example queries by downloading and setting up our container via either option A or B below. Otherwise, read on to learn about the provided data set, prepared traversals, and unanswered questions.

Take me to the code:

With Docker (recommended):

A1. Download Docker.

A2. Open your Docker Quickstart Terminal and run the command below; this command downloads the infrastructure, installs the packages, and pulls the graph data for you.

docker run -i -t pokitdok/gremlin-python-test-drive

Or, via GitHub:

B. Fork our repository and follow the set-up instructions.

Onward to the data, traversals, and unanswered questions:

The Docker container we set up for experimenting with gremlin-python traversals comes preloaded with a graph stack and a sample graph. Specifically, once you fire up your Docker environment by following the steps above, your set-up will load Titan 0.5.3 with a Cassandra back-end, Elasticsearch, TinkerPop 2.30 and our Gremlin-Python library. We included a custom-built graph whose vertices are 1,000 synthetic consumers and 25 doctors (we emphasize synthetic to over-communicate that this is not real data). The directed edges connect consumers to doctors and are labeled either viewed or scheduled_with.
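To make the shape of this data concrete, here is a minimal plain-Python sketch of the same model: typed vertices and directed, labeled edges from consumers to doctors. The vertex ids, the "type" key, and the tiny edge list are assumptions made up for illustration; the sketch does not use Titan, Rexster, or gremlin-python, and the real graph is far larger.

```python
from collections import Counter

# Hypothetical miniature of the sample graph. The ids and the "type" key
# are illustrative assumptions, not the property names used in the container.
vertices = {
    "c1": {"type": "consumer"},
    "c2": {"type": "consumer"},
    "c3": {"type": "consumer"},
    "d1": {"type": "doctor"},
    "d2": {"type": "doctor"},
}

# Directed edges stored as (consumer, label, doctor) triples.
edges = [
    ("c1", "viewed", "d1"),
    ("c1", "scheduled_with", "d1"),
    ("c2", "viewed", "d1"),
    ("c2", "viewed", "d2"),
    ("c3", "viewed", "d2"),
]

# Count vertices by type and edges by label.
vertex_counts = Counter(props["type"] for props in vertices.values())
edge_counts = Counter(label for _, label, _ in edges)

print(dict(vertex_counts))
print(dict(edge_counts))
```

Storing edges as triples keeps the direction (consumer to doctor) and the label together, which is all the later counting examples need.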

The visualization above provides an interactive exploration for the example data set available in our Docker container; the interactive graph contains half of the doctors from this dataset and their adjacencies. You can toggle the viewed edges and explore the connectivity and components of the graph.

In addition to the interactive visualization, you can query and traverse the sample graph from the Rexster console. Once your Docker container is running, the example below demonstrates how to open the graph, count the vertices by type, and count the edges by type:

We set up this graph to mimic the behavior we capture on our consumer-facing marketplace application, which uses all of the APIs on our platform. This graph contains a synthetic network of consumers viewing and scheduling appointments with doctors. For example:

To properly model these adjacencies according to the behavior we view internally, we have connected this dataset in a scale-free manner. Specifically, the probability of a vertex having degree x, where a vertex's degree is its total number of adjacencies, is modeled as p(e(i,j)) = α⋅x^(-β), where α and β ∈ [1,4].

Open question number 1: which values for alpha and beta were used for this mock dataset? Tweet your answers to @DeniseKGosnell, or make a pull request to our repository with the code that fits the edge set to a scale-free distribution.
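One possible starting point for such a fit, sketched under stated assumptions: take the list of vertex degrees, build the empirical degree distribution, and run an ordinary least-squares regression of log p(x) on log x. The slope then estimates -β and the intercept estimates log α. The function name fit_power_law and the toy degree list are hypothetical, and log-log regression is only one simple technique (maximum-likelihood estimators are generally considered more reliable on noisy tails); this is not necessarily how the dataset was generated.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Estimate (alpha, beta) in p(x) = alpha * x**(-beta).

    Uses least squares on the log-log empirical degree distribution.
    Degrees of zero are ignored because log(0) is undefined.
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    total = sum(counts.values())

    # One (log x, log p(x)) point per distinct degree value.
    xs = [math.log(d) for d in counts]
    ys = [math.log(c / total) for c in counts.values()]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Toy degree list whose counts follow 400 * x**(-2) exactly (made-up data).
degrees = [1] * 400 + [2] * 100 + [4] * 25 + [5] * 16 + [10] * 4 + [20] * 1
alpha, beta = fit_power_law(degrees)
print(alpha, beta)
```

On this toy input the fit recovers a β close to 2, because the counts were constructed to follow an exact inverse-square law. To try it on the sample graph, you would replace the toy list with the degree of each vertex pulled from the container.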

In addition to the mathematical connectivity between the vertices in the provided graph, we also set up a few other properties. Once you fire up your Docker container and are in the Rexster console, the following command will give you a list of the properties available on a doctor vertex. You can substitute in consumer for the vertex type and get an idea of the queryable properties for the consumer vertices, as well.

Open question number 2: What are the average ages of the consumers and doctors in the provided data set? What is the variance and how do these statistics change as you slice the data by specialty?
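Once the vertex properties have been pulled out of the graph, the statistics themselves take only a few lines of Python. The sketch below groups records by a key and computes the mean and population variance of age per group. The property names "age" and "specialty", the helper age_stats_by, and the records are all assumptions for illustration; check the property list in the container for the actual names.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical doctor records; the real property names may differ.
doctors = [
    {"specialty": "cardiology", "age": 40},
    {"specialty": "cardiology", "age": 50},
    {"specialty": "pediatrics", "age": 35},
    {"specialty": "pediatrics", "age": 45},
    {"specialty": "pediatrics", "age": 55},
]


def age_stats_by(records, key):
    """Return {group: (mean age, population variance of age)} per value of key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["age"])
    return {group: (mean(ages), pvariance(ages)) for group, ages in groups.items()}


stats = age_stats_by(doctors, "specialty")
for specialty, (avg_age, var_age) in sorted(stats.items()):
    print(specialty, avg_age, var_age)
```

The sketch uses population variance (pvariance) because the sample graph is the whole population of interest; swap in statistics.variance if you want the sample estimate instead.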

We set up this example graph to demonstrate the power and simplicity of graph traversals for recommendations. Specifically, when we construct bipartite graphs to represent the interactions between two silos of entities, we can take a walk across the edges and count the number of occurrences of a particular entity to derive a recommendation. Within this container, let us take a look at how to rank the providers according to how many consumers have viewed them.
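As a rough illustration of that counting walk outside of Gremlin, the plain-Python sketch below ranks doctors by the number of distinct consumers with a viewed edge to them. The edge triples, vertex ids, and the helper rank_doctors are made up for illustration; this is not the gremlin-python API, nor the traversal shipped in the container.

```python
from collections import Counter

# Hypothetical (consumer, label, doctor) edges; ids are illustrative only.
edges = [
    ("c1", "viewed", "d1"),
    ("c2", "viewed", "d1"),
    ("c2", "viewed", "d1"),  # repeat view by the same consumer
    ("c3", "viewed", "d1"),
    ("c1", "viewed", "d2"),
    ("c3", "viewed", "d2"),
    ("c1", "viewed", "d3"),
    ("c2", "scheduled_with", "d3"),
]


def rank_doctors(edges, label="viewed"):
    """Rank doctors by how many distinct consumers reach them via `label` edges."""
    # De-duplicate so a consumer who viewed a doctor twice counts once.
    pairs = {(consumer, doctor) for consumer, edge_label, doctor in edges
             if edge_label == label}
    counts = Counter(doctor for _, doctor in pairs)
    # Sort by descending count, breaking ties by doctor id for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_doctors(edges)
print(ranking)
```

Passing label="scheduled_with" gives the analogous ranking by appointments, which is one way to compare how the two edge types order the same doctors.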

In addition to the example above, we stepped through other versions of a graph-based recommendation system in a previous post and provided more sample traversals to rank the doctors of this dataset in our documentation. We defined a bipartite graph and a graph-traversal-based recommendation system in our slides from PyData. Let us know if you have any feature requests for the sample data set and send us some pull requests to showcase your analytics.
