Custom Titan Build to integrate with Cloudera 5.5

The PokitDok Data Science team uses the latest stable build of Titan with Hadoop2 as our primary transactional graph database. We built it as an OLTP database to house transactions from across the PokitDok stack. However, our API team’s existing infrastructure uses Cloudera 5.5's CDH 5 containers, which come with Hadoop 2.6.0. As a result, we ran into multiple dependency errors when integrating the existing Titan and CDH 5 infrastructure for our API and Data Science teams. We saw throughout the Titan community that a few others have run into the same issue, so we have released our custom build back to the community.

Specifically, we released an update to the Titan 0.5.3 build which excludes the transitive dependency on Hadoop 2.2 across the build and updates the properties to pull in Hadoop 2.6.0. This established backwards compatibility and enabled us to integrate the Titan and CDH 5 platforms. We use this custom build with Cassandra and have not confirmed whether it breaks HBase compatibility. (Most of the Maven build changes are due to conflicting Hadoop version dependencies with HBase 0.98, to which Titan 0.5 is pinned.)

If you are interested in accessing this setup, check out our repository for this customization. Also, you can view all of the changes made against the vanilla Titan 0.5.3 build definition, which is useful for adapting this customization for your Titan setup.

To make a short story long: PokitDok’s API infrastructure is set up with Cloudera 5.5's CDH 5 containers and Hadoop 2.6.0. Titan is built with Hadoop 1 and Hadoop 2 compatibility, but Hadoop 2 is not backwards compatible with Hadoop 1. In order to update our Titan DB to use the existing Hadoop 2.6.0 from CDH 5, we had to identify all of the transitive dependency issues across the build, which set us off on a whack-a-mole search for POM dependencies. Aren’t you glad that you don’t have to go through this?

Before we added the exclusions to get Hadoop 2.6.0 to build, Maven was issuing multiple errors about conflicting versions of transitive Hadoop dependencies throughout the Titan project. We went through the POM files iteratively to move the build to a version of Hadoop that is compatible with CDH 5. For example, look at titan-all/pom.xml in our MR. The HBase client in this build is a dependency declared in titan-all/pom.xml. However, this client has a dependency which conflicts with Hadoop 2.6.0 (the version that we wanted). So, we introduced a new property to pull in Hadoop 2.6.0. After this change, we found a cascade of first-level dependencies on the HBase client, each of which caused transitive dependency conflicts throughout the build (enter the iterative whack-a-mole build dependency process). We would build Titan, find a different conflicting transitive dependency through Maven errors, update the resulting POM property, and repeat. Build, fix, repeat. Fun.
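As a rough illustration of the kind of change described above, here is a minimal POM sketch. It is not copied from our MR: the property name hadoop.cdh.version is hypothetical, hbase.version is assumed to be defined elsewhere in the build, and the exact list of Hadoop artifacts that need excluding is an assumption that depends on what your dependency tree reports.

```xml
<!-- Illustrative fragment only; names marked "hypothetical" are not from the actual build. -->
<project>
  <properties>
    <!-- Hypothetical property name used to select the Hadoop version shipped with CDH 5. -->
    <hadoop.cdh.version>2.6.0</hadoop.cdh.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <!-- Assumption: hbase.version is set elsewhere to the HBase 0.98 release in use. -->
      <version>${hbase.version}</version>
      <exclusions>
        <!-- Keep hbase-client from dragging in its own (older) Hadoop artifacts. -->
        <!-- Assumption: extend this list with whatever the Maven errors point to. -->
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-auth</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Declare the wanted Hadoop version explicitly so it replaces the excluded one. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.cdh.version}</version>
    </dependency>
  </dependencies>
</project>
```

One way to find the next mole to whack is to run mvn dependency:tree -Dincludes=org.apache.hadoop in the affected module, which lists only the dependency paths that pull in Hadoop artifacts and shows which first-level dependency is responsible for each.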

Most of the other items in the MR are exclusions that were added for the various Titan subprojects (Titan ES, etc.). At a high level, the conflicts all came through some dependency path through Titan's HBase module: the version of HBase that Titan depends on has a dependency on Hadoop 2.2.

Where are we going with all of this? We are currently working on speeding up the bulk load of JSON straight from sequenced HDFS files into the Titan graph DB. More on this to come. We expect it to be more insightful than updating transitive POM dependencies, but someone had to do it.

About Denise Gosnell, PhD

Dr. Gosnell, a driving member of the PokitDok Data Science team since 2014, has brought her research in applied graph theory to help architect the graph database while also serving as an analytics thought leader. Her work with the Data Science team aims to extract insight from the trenches of hidden data in healthcare and build products to bring the industry into the 21st century. She also helps organize the local chapter of Charleston Data Analytics, a Meetup PokitDok now sponsors, and has represented PokitDok's Data Science Team at numerous conferences, including PyData, KDD (Knowledge Discovery & Data Mining), and the inaugural GraphDay.

Prior to PokitDok, she earned her Ph.D. in Computer Science from the University of Tennessee, where she founded a branch of Sheryl Sandberg's Lean In Circle. The goal of this impressive organization is to guide women interested in computer science careers, as TechCrunch noted, and Denise has done that and more.

1 comment

    • Rama on February 15, 2016 at 4:19 am

    Have you had experience getting gremlin to work with Rexster on Hadoop2?
