Data42: March 2014

Long time no see. I really love Python and the Hadoop ecosystem, but Hadoop is all Java based, so it is not always easy to use Hadoop from Python. There are some approaches to interoperability between Python and Java; the Jython interpreter is one of the most remarkable, and it is also what ships with Apache Pig by default. Nevertheless, Jython is always lagging behind Python (I think it only supports Python 2.5), and I have also found some problems when importing external libraries, even pure Python ones, at least in the standalone version shipped with Pig. You also lose access to all the cool C-based libraries available in the reference CPython implementation.
So I was very happy to see that CPython is now supported for UDFs in the new Pig 0.12.0. This opens a whole world of possibilities, and in particular I was thinking it would be very nice to use HBase from a CPython UDF in a Pig job, following the "HBase as a shared resource" pattern for MapReduce and HBase interactions. With this and other possible applications in mind (e.g. calling HBase from a Python Storm bolt), I decided to do some research on accessing HBase from CPython. Finally, I came up with the idea of using JPype as the interface between Python and the Java client classes for HBase.

The approach in JPype is different from Jython's in that, instead of implementing Python in Java, the idea is to start a JVM and call Java from Python. Hence, to get an HBase driver for CPython I would only have to call the Java driver through JPype, implementing a light wrapper for ease of use. For now I'm just in the proof-of-concept phase, but at least I've been able to make a simple connection to HBase from CPython. So let's go for it!

First we have to install JPype, which is available through pip and is in any case very easy to install by hand. Then we can import the jpype module from our Python code and access the HBase Java driver classes through the jpype.JClass Python class. For this little experiment (all the code is available on GitHub) I first created a simple HBase table with this bash script:

#!/bin/bash

TABLE_NAME='test_hbase_py_client'

hbase shell <<END
create '${TABLE_NAME}', 'info', 'visits'
put '${TABLE_NAME}', 'john', 'info:age', 42
put '${TABLE_NAME}', 'mary', 'info:age', 26
put '${TABLE_NAME}', 'john', 'visits:amazon.com', 5
put '${TABLE_NAME}', 'john', 'visits:google.es', 2
put '${TABLE_NAME}', 'mary', 'visits:amazon.com', 4
put '${TABLE_NAME}', 'mary', 'visits:facebook.com', 2
list
scan '${TABLE_NAME}'
exit
END

The goal now is to write a CPython program that scans that table. JPype is a very simple library: you only have to start a JVM through a call to jpype.startJVM, and then you can easily access Java objects through simple calls like the following:

HTablePoolClass = jpype.JClass("org.apache.hadoop.hbase.client.HTablePool")
connection_pool = HTablePoolClass()

Here we access the Java class HTablePool and store it in a variable, so we can instantiate it from Python using the usual notation for object creation, calling the constructors as defined in Java. JPype is smart enough to perform most of the necessary type conversions between Python and Java automatically, and to choose the right version of overloaded methods. On the other hand, sadly JPype is not the most active project in the world, and sometimes strange exceptions arise. In particular, when you instantiate a class A that depends on a class B that is not available in the classpath, JPype raises an exception saying that A is not found, when the real problem is that B is missing. To solve this, I just added all the jars related to Hadoop or HBase to the classpath when creating the JVM:

import glob
import jpype

_jvm_lib_path = "/usr/java/jdk1.6.0_32/jre/lib/amd64/server/libjvm.so"
cp_dirs = '/usr/lib/hadoop/client-0.20:/usr/lib/hadoop/lib:/usr/lib/hadoop:/usr/lib/hadoop/client:/usr/lib/hbase/lib/:/usr/lib/hbase/'
cp_jars_str = ":".join(set(jar for cp_dir in cp_dirs.split(':') for jar in glob.iglob(cp_dir + "/*.jar")))

jpype.startJVM(_jvm_lib_path, "-ea","-Djava.class.path=" + cp_jars_str)
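The classpath construction above can be pulled into a small helper, which makes it easier to check before starting the JVM. This is only a sketch: the names build_classpath and missing_dirs are mine, not part of JPype, and using os.pathsep instead of a hard-coded ':' is an assumption that keeps the separator right on other platforms.

```python
import glob
import os


def build_classpath(cp_dirs):
    """Return a classpath string with every .jar found directly in cp_dirs."""
    jars = set()  # a set removes jars listed twice via repeated directories
    for cp_dir in cp_dirs:
        jars.update(glob.glob(os.path.join(cp_dir, "*.jar")))
    # Sorting makes the result reproducible; os.pathsep is ':' on Linux.
    return os.pathsep.join(sorted(jars))


def missing_dirs(cp_dirs):
    """Return the directories that do not exist, to spot typos early."""
    return [cp_dir for cp_dir in cp_dirs if not os.path.isdir(cp_dir)]
```

The resulting string can then be passed to the JVM as "-Djava.class.path=" + build_classpath(cp_dirs.split(':')). Checking missing_dirs first may help with the misleading "class not found" errors described above, since a mistyped directory silently contributes no jars.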

After that everything worked fine for me with JPype, as you can see in the rest of the program below, in which I just create a connection to HBase, open a table, and perform a full scan. The only remarkable detail is the use of the function iterate_iterable() to traverse Java Iterable objects as Python generators.

def iterate_iterable(iterable):
    iterator = iterable.iterator()
    while iterator.hasNext():
        yield iterator.next()

test_table_name = 'test_hbase_py_client'

try:
    HTablePoolClass = jpype.JClass("org.apache.hadoop.hbase.client.HTablePool")
    connection_pool = HTablePoolClass()
    test_table = connection_pool.getTable(test_table_name)
    BytesClass = jpype.JClass("org.apache.hadoop.hbase.util.Bytes")
    ScanClass = jpype.JClass("org.apache.hadoop.hbase.client.Scan")
    scan_all = ScanClass()
    # getScanner() returns a Java ResultScanner
    result_scanner = test_table.getScanner(scan_all)
    # A plain "for result in result_scanner" fails with:
    # TypeError: 'org.apache.hadoop.hbase.client.ClientScanner' object is not iterable
    print '\n'*2, '-'*30
    print 'Scanning table "{table_name}"'.format(table_name=test_table_name)
    for result in iterate_iterable(result_scanner):
        print "row id:", result.getRow()
        for key_val in iterate_iterable(result.list()):
            print "\t", "family : {family}, qual : {qual}, value : {value}".format(family = key_val.getFamily(), qual = key_val.getQualifier(), value = BytesClass.toString(key_val.getValue()).encode('ascii', 'ignore'))
    print '-'*30, '\n'*2
    test_table.close()
except jpype.JavaException as ex:
    print 'exception', ex.javaClass(), ex.message()
    print 'stacktrace:', ex.stacktrace()
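To see what iterate_iterable() relies on without starting a JVM, here is a minimal sketch with two stand-in classes that mimic the iterator() / hasNext() / next() methods of a Java Iterable. FakeJavaIterable and FakeJavaIterator are hypothetical names invented for this illustration; they are not part of JPype or HBase.

```python
class FakeJavaIterator(object):
    """Stand-in for java.util.Iterator: exposes hasNext() and next()."""

    def __init__(self, items):
        self._items = items
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._items)

    def next(self):
        item = self._items[self._pos]
        self._pos += 1
        return item


class FakeJavaIterable(object):
    """Stand-in for java.lang.Iterable: exposes iterator()."""

    def __init__(self, items):
        self._items = list(items)

    def iterator(self):
        return FakeJavaIterator(self._items)


def iterate_iterable(iterable):
    # Same generator as in the program above: it only needs the three
    # Java-style methods, so any object providing them will do.
    iterator = iterable.iterator()
    while iterator.hasNext():
        yield iterator.next()


# The row ids are illustrative, taken from the test table created earlier.
row_ids = list(iterate_iterable(FakeJavaIterable(["john", "mary"])))
```

Because the generator is lazy, rows are pulled from the underlying iterator one at a time, which is presumably what you want for a full table scan.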

I have only tested it on my Cloudera Quickstart CDH4.4.0, so please tell me if you have any problems.
There are other CPython clients for HBase, like pyhbase and hbase-thrift. pyhbase looks like an abandoned project, and it doesn't work with CDH4, at least in the tests I performed. I haven't tested hbase-thrift, but I don't like the idea of having the Thrift gateway as a bottleneck for connections to the HBase cluster. Anyway, I think the technique of wrapping a Java driver with JPype is interesting because it can be applied to other databases, and it would be easy to keep the driver up to date by updating the underlying jars when needed.

I hope you enjoyed the post!

Sunday, March 2, 2014

Talking to HBase from Python with JPype