Custom Titan Build to Integrate with Cloudera 5.5

The PokitDok Data Science team uses the latest stable build of Titan with Hadoop2 as our primary transactional graph database. We built our graph database to house all of the transactions from across the PokitDok stack in an OLTP database. However, our API team's existing infrastructure uses Cloudera 5.5's CDH 5 containers, which come with Hadoop 2.6.0. As a result, we ran into multiple dependency errors when integrating the existing Titan and CDH 5 infrastructure for our API and Data Science teams. We saw throughout the Titan community that a few others have run into the same issue, so we have released our custom build back to the community.

Specifically, we released an update to the Titan 0.5.3 build which transitively excludes the requirement of Hadoop 2.2 across the build and updates the properties to pull in Hadoop 2.6.0. This established backwards compatibility and enabled us to integrate the Titan and CDH 5 platforms. We use this custom build with Cassandra and have not confirmed whether it breaks HBase compatibility. (Most of the Maven build changes are due to conflicting Hadoop version dependencies with HBase 0.98, to which Titan 0.5 is pinned.)
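
For illustration, pinning a Hadoop version through a Maven property might look roughly like the fragment below. This is a sketch, not an excerpt from the custom build: the property name `hadoop2.version` is an assumption, and the real build definition may name or organize it differently.

```xml
<!-- Illustrative fragment only; the property name is an assumption,
     not taken from the Titan POM. -->
<properties>
  <hadoop2.version>2.6.0</hadoop2.version>
</properties>

<!-- Pin the version once so every module resolves the same Hadoop artifact. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop2.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Because `dependencyManagement` also governs the versions of transitive dependencies, a pin like this applies wherever the artifact is pulled in, which is why a single property change can ripple across the whole build.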

If you are interested in accessing this setup, check out our repository for this customization. Also, you can view all of the changes made against the vanilla Titan 0.5.3 build definition, which is useful for adapting this customization for your Titan setup.

To make a short story long: PokitDok's API infrastructure is set up with Cloudera 5.5's CDH 5 containers and Hadoop 2.6.0. Titan is built with Hadoop 1 and Hadoop 2 compatibility, but Hadoop 2 is not backwards compatible with Hadoop 1. In order to update our Titan DB to use the existing Hadoop 2.6.0 from CDH 5, we had to identify all of the transitive dependency issues across the build, which set us off on a whack-a-mole search for POM dependencies. Aren't you glad that you don't have to go through this?

Before we added the exclusions to get Hadoop 2.6.0 to build, Maven was issuing multiple errors about conflicting versions of transitive Hadoop dependencies throughout the Titan project. We iteratively went through the POM files and changed them to update the build to a version of Hadoop that is compatible with CDH 5. For example, look at titan-all/pom.xml in our MR. The build defined in titan-all/pom.xml depends on the HBase client. However, this client has a dependency which conflicts with Hadoop 2.6.0 (the version that we wanted). So, we introduced a new property to pull in Hadoop 2.6.0. After this change, we found a cascade of first-level dependencies on the HBase client which caused transitive dependency conflicts throughout the build (enter the iterative whack-a-mole build dependency process). We would build Titan, find a different conflicting transitive dependency through Maven errors, update the resulting POM property, and repeat. Build, fix, repeat. Fun.
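
One round of that build, fix, repeat loop can be sketched as a Maven exclusion. The fragment below is a hedged illustration rather than a copy of the MR: the `hbase.version` and `hadoop2.version` property names are hypothetical, and the excluded artifact is chosen only as an example of a transitive Hadoop dependency.

```xml
<!-- Illustrative sketch; property names and the excluded artifact are
     assumptions, not copied from the actual titan-all/pom.xml changes. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
  <exclusions>
    <!-- Stop the HBase client from dragging in its own Hadoop version. -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Declare the wanted Hadoop version explicitly instead. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop2.version}</version>
</dependency>
```

Running `mvn dependency:tree` after each such change is one way to see which module still resolves the older Hadoop artifacts and therefore needs the next exclusion.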

Most of the other items in the MR are exclusions added for the various Titan subprojects (Titan ES, etc.). At a high level, the conflicts all arose through some dependency path via Titan's HBase support: the version of HBase that Titan depends on has a dependency on Hadoop 2.2.

Where are we going with all of this? We are currently working on speeding up the bulk load of JSON straight from sequenced HDFS files into the Titan graph DB. More on this to come. We expect it to be more insightful than updating transitive POM dependencies, but someone had to do it.

About Denise Gosnell, PhD

Dr. Denise Gosnell is a Technology Evangelist at PokitDok, specializing in blockchain, machine learning applications of graph analytics, and data science. She joined in 2014 as a Data Scientist, bringing her domain expertise in applied graph theory to extract insight from the trenches of healthcare databases and build products to bring the industry into the 21st century.

Prior to PokitDok, Dr. Gosnell earned her Ph.D. in Computer Science from the University of Tennessee. Her research on how our online interactions leave behind unique identifiers that form a “social fingerprint” led to presentations at major conferences from San Diego to London and drew the interest of such tech industry giants as Microsoft Research and Apple. Additionally, she was a leader in addressing the underrepresentation of women in her field and founded a branch of Sheryl Sandberg’s Lean In Circles.

1 comment

    • Rama on February 15, 2016 at 4:19 am

    Have you had experience getting gremlin to work with Rexster on Hadoop2?
