Hadoop Dynamic File System as Liferay Store

At this year's Liferay North American Symposium I did a talk on Big Data.

The goal of the talk was to illustrate some cases where Liferay and the broad notion of "Big Data" intersect.

Most importantly, I showed an example that stores Documents & Media portlet files in Hadoop's extremely scalable and fault-tolerant Dynamic File System (a.k.a. HDFS).
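Under the hood, a Store implementation like this mostly just maps Liferay's (companyId, repositoryId, fileName) coordinates onto HDFS paths and streams bytes through Hadoop's FileSystem API. Here is a minimal sketch of that plumbing; it is not the code from the repo, and the class name, path layout, and NameNode URI are assumptions:

    // A minimal sketch of the HDFS plumbing a Liferay Store implementation
    // can delegate to. Only the Hadoop FileSystem calls are real API; the
    // class name, path layout, and NameNode URI are illustrative assumptions.
    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFileStore {

        private final FileSystem fileSystem;

        public HdfsFileStore() throws Exception {
            // Assumed NameNode address; adjust to your cluster.
            fileSystem = FileSystem.get(
                URI.create("hdfs://localhost:9000"), new Configuration());
        }

        // Store one Documents & Media file under a company/repository keyed path.
        public void addFile(
                long companyId, long repositoryId, String fileName, InputStream is)
            throws Exception {

            FSDataOutputStream out = fileSystem.create(
                toPath(companyId, repositoryId, fileName), true);

            try {
                IOUtils.copyBytes(is, out, 4096, false);
            }
            finally {
                out.close();
            }
        }

        // Stream a stored file back out of HDFS.
        public InputStream getFileAsStream(
                long companyId, long repositoryId, String fileName)
            throws Exception {

            return fileSystem.open(toPath(companyId, repositoryId, fileName));
        }

        protected Path toPath(long companyId, long repositoryId, String fileName) {
            return new Path(
                "/liferay/document_library/" + companyId + "/" + repositoryId +
                    "/" + fileName);
        }
    }

A real implementation would extend Liferay's BaseStore and override the full set of file operations (update, delete, versioning, and so on), delegating each one to calls like these.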

The same example code also demonstrates how you might tee off the indexed text of these documents, storing a copy in a separate HDFS location where it can be consumed as input by a MapReduce calculation to extract insight from it.
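The tee itself can be as simple as writing the extracted text into a staging directory at indexing time. A sketch, assuming a hypothetical hook point and directory layout (not necessarily how the example does it):

    // A sketch of "teeing" the text extracted during indexing into a staging
    // directory in HDFS, where later MapReduce jobs can pick it up as input.
    // The directory name and the hook point are assumptions.
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IndexTee {

        // Assumed staging directory that MapReduce jobs will read as input.
        private static final String INPUT_DIR = "/liferay/index_input";

        // Call this wherever the document's text has just been extracted for
        // the search index (e.g. from an indexer post-processor).
        public void tee(FileSystem fileSystem, long fileEntryId, String text)
            throws Exception {

            Path path = new Path(INPUT_DIR + "/" + fileEntryId + ".txt");

            FSDataOutputStream out = fileSystem.create(path, true);

            try {
                out.write(text.getBytes("UTF-8"));
            }
            finally {
                out.close();
            }
        }
    }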

It demonstrates how to use the MapReduce API from a Java client, all the way from creating the job, through submitting it for processing, to (basically) monitoring its progress. The actual logic applied in the example is trivial; the most important part is showing how you could use the APIs to make Liferay and Hadoop talk.
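In outline, the client side of that flow looks something like the following sketch, written against the Hadoop 1.x mapreduce API; the host names, paths, and the trivial word count are placeholders rather than the example's actual logic:

    // A sketch of driving MapReduce from a Java client: configure the remote
    // cluster, build a job, submit it, and poll its progress. Host names,
    // paths, and the word count logic are placeholders.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobClientExample {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Point the client at the remote single-node cluster (assumed values).
            conf.set("fs.default.name", "hdfs://namenode:9000");
            conf.set("mapred.job.tracker", "namenode:9001");

            Job job = new Job(conf, "document-word-count");

            job.setJarByClass(JobClientExample.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/liferay/index_input"));
            FileOutputFormat.setOutputPath(job, new Path("/liferay/index_output"));

            // Send the job to the cluster without blocking.
            job.submit();

            // (Basic) progress monitoring.
            while (!job.isComplete()) {
                System.out.printf(
                    "map %.0f%%, reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);

                Thread.sleep(2000);
            }

            System.out.println(job.isSuccessful() ? "done" : "failed");
        }

        public static class TokenMapper
            extends Mapper<Object, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);

            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {

                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            protected void reduce(
                    Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

                int sum = 0;

                for (IntWritable value : values) {
                    sum += value.get();
                }

                context.write(key, new IntWritable(sum));
            }
        }
    }

Calling job.waitForCompletion(true) would do the submit-and-monitor loop in one blocking call, but polling explicitly makes it clearer what the client is actually doing.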

The code is on GitHub as a standalone Liferay SDK (based on trunk but easily adaptable to earlier versions):

https://github.com/rotty3000/liferay-plugins-sdk-hadoop

Please feel free to fork it or use it as an example.

[update] I should also add a few links to resources I used in setting up Hadoop in hybrid mode (a single-node cluster with access from a remote client, where my Liferay code acts as the remote client):

That's neat! It would be a great feature to pair with BI tools.
Is there anything about using Liferay with Cassandra?
Sorry, I didn't do any work with Cassandra, unfortunately. However, there is nothing preventing anyone from taking on the challenge. I doubt it's very difficult.
How close to production is the ability to store docs and media in HDFS? We have a new project starting and would like to leverage all of the Liferay security and logic, but want to store the media and docs in a very scalable manner.
Just this weekend, I started a Liferay Store implementation that uses Cassandra.
It's not finished, but it is able to store and retrieve document library files.
It's based on Kundera, so in theory it should work with MongoDB and HBase as well.
@Bill, not close. However, it's stable, open source, and you are more than welcome to try it.

@Carlos, Awesome! Will you be open sourcing this project?
Hey Ray,

That is a great example of how Liferay can interact with some of the new technologies out there. I would like to add a couple of things to your entry:

- The D in HDFS is for distributed, not dynamic.

- For those interested in using this in a real system: I would discourage it. Using "raw" HDFS to serve images/docs/etc. (usually small files) is generally not a good idea because it puts a lot of pressure on the NameNode, slowing down your system. Other implementations like HBase (built on top of HDFS) can satisfy this kind of need; see the sketch below.

If someone needs to do some kind of analytics, MapReduce, etc. with tons of data, maybe a mixed NoSQL + HDFS approach could serve; but that is a different story.
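To illustrate the HBase route: because HBase packs many small values into a few large HDFS files, a document can be stored as a single cell and fetched by row key without creating one NameNode entry per document. A minimal sketch with the classic HTable client, where the table, column family, and qualifier names are made up:

    // A minimal sketch of storing and fetching a small file as a single
    // HBase cell, instead of one HDFS file per document. The table name
    // ("doclib"), column family ("f"), and qualifier ("data") are made up.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDocStore {

        private final HTable table;

        public HBaseDocStore() throws Exception {
            // Reads hbase-site.xml from the classpath for cluster settings.
            Configuration conf = HBaseConfiguration.create();

            table = new HTable(conf, "doclib");
        }

        // Write the whole file as one cell, keyed by its path.
        public void addFile(String path, byte[] bytes) throws Exception {
            Put put = new Put(Bytes.toBytes(path));

            put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), bytes);

            table.put(put);
        }

        // Read it back by row key.
        public byte[] getFile(String path) throws Exception {
            Result result = table.get(new Get(Bytes.toBytes(path)));

            return result.getValue(Bytes.toBytes("f"), Bytes.toBytes("data"));
        }
    }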

Great post Ray!

Migue
I tried to build https://github.com/rotty3000/liferay-plugins-sdk-hadoop, but it failed.
According to the documentation it is compatible with Liferay 6.1, but the compiled .war is not.
How can I make it compatible with Liferay 6.1?