Wednesday, August 6, 2014

Simple SVG charts with HBase REST service, Flask and Pygal

A little less conversation and a little more action in this post. I wanted to have a flexible way to define simple charts for small HBase tables. Maybe using HBase for small data might sound crazy, but by doing so we can take advantage of its flexible NoSQL schema. So let’s exploit HBase's REST service (see also here) for this, first of all we have to launch the service with the following command:
[cloudera@localhost ~]$ hbase rest start -ro -p 9998
14/08/06 14:49:33 INFO util.VersionInfo: HBase 0.94.6-cdh4.4.0
....
That starts the HBase REST server in read only mode, and serving at port 9998. This service is started by default in some distributions like HDP2. Once the service is started, the idea is defining a mapping from service responses to some charts. For that we might use matplotlib combined with Flask like this:
@app.route('/barchart.png')
def plotOne():
    fig = Figure()
    axis = fig.add_subplot(1, 1, 1)
    axis.bar(range(len(values)), values)

    canvas = FigureCanvas(fig)
    output = StringIO.StringIO()
    canvas.print_png(output)
    response = make_response(output.getvalue())
    response.mimetype = 'image/png'
    return response
But I wanted something simpler, and I found Pygal: it offers nice SVG charts with a high level interface, some fancy animations, and Flask integration. The first chart is easy as pie(chart):
values = [2, 1, 0, 2, 5, 7]
@app.route('/barchart.svg')
def graph_something():
     bar_chart = pygal.Bar(style=DarkSolarizedStyle)
     bar_chart.add('Values', values)
     return bar_chart.render_response()




Now with some creative URL routing in Flask we can define moderately complex graphs just in the URL, thanks to the suffix globbing of HBase's REST service, and by using a simple html table in the Jinja2 template. Autorefresh is obtained simply with a <meta http-equiv="refresh" content="{{refresh_rate}}"> element in the template. So we get


for the URL http://localhost:9999/hbase/charts/localhost:9998/test_hbase_py_client/width/1500/cols/2/refresh/500/bar/Sites%20Visited/visits/bar/Info/info/keys/* , assuming a table created in hbase shell as
create 'test_hbase_py_client', 'info', 'visits'
put '${TABLE_NAME}', 'john', 'info:age', 42
put '${TABLE_NAME}', 'mary', 'info:age', 26
put '${TABLE_NAME}', 'john', 'visits:amazon.com', 5
put '${TABLE_NAME}', 'john', 'visits:google.es', 2
put '${TABLE_NAME}', 'mary', 'visits:amazon.com', 4
put '${TABLE_NAME}', 'mary', 'visits:facebook.com', 2
list
scan '${TABLE_NAME}'
exit

The main idea for the mapping into a barchart is that each HBase row corresponds to a group of bars (a color in the chart), and that given a column family the quals for that column are the values in the x-axis, while the cell values correspond to the values for the y-axis. If several rows are specified then all the bars groups are displayed together with a different color per row key. 
Besides, to allow several charts the number of columns is specified followed by a sequence of chart specifications, which are triples (chart type, chart title, column family). Hence the URL http://localhost:9999/hbase/charts/localhost:9998/test_hbase_py_client/width/1500/cols/2/refresh/500/bar/Sites%20Visited/visits/bar/Info/info/keys/* means "read from the table test_hbase_py_client at the server localhost:9998; the chart table will be 1500 pixels width; use two columns and refresh the whole chart each 500 seconds; the first chart is a bar chart with tittle 'Sites Visited' and takes the values from the column family 'visit', the second chart is a bar chart titled 'Info' that reads from the column family 'info'; use all the keys found in the table". This URL mapping was implemented by combining Flask standard routing primitives with a custom URL converter (extending werkzeug.routing.BaseConverter)

For a more elaborate example, take a look at this simple Spark Streaming program (so simple it would be called script if it was written in Python ...), that populates an HBase table with a sliding window of one minute containing the mention count in Twitter for some musicians. 


As usual, you can find all the code for the post in my github repo, where you can see that the chart service is a single Python script. Now all that is left is extending the Python service to cover all the different types of Pygal charts, and calling a web designer so the chart page stops looking like a web page from the dotcom era.


We are hiring!

If you have enjoyed this post, you are interested in Big Data technologies, and you have a solid experience as a Java developer, take a look to this open position at my company.