At this year's Liferay North American Symposium I did a talk on Big Data.
The goal of the talk was to illustrate some cases where Liferay and the broad notion of "Big Data" intersect.
Most importantly, I showed an example that stores Documents & Media portlet files in Hadoop's highly scalable and fault-tolerant Distributed File System (a.k.a. HDFS).
The same example code also demonstrates how you might tee off the indexed content of those documents and store a copy in a separate HDFS location, where it can be consumed as input for a MapReduce calculation that extracts insight from it.
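To give a sense of the storage piece, here is a minimal sketch of writing a portal file stream into HDFS with Hadoop's FileSystem API. The NameNode URI and the path layout under /liferay are assumptions for illustration only; the actual plugin wires this through its own configuration and Liferay's document store hooks.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDocumentStore {

	// Write a Documents & Media file stream into HDFS. The NameNode URI and
	// the /liferay/documents path layout are assumptions for this sketch.
	public static void store(InputStream fileStream, long companyId, String fileName)
		throws Exception {

		Configuration conf = new Configuration();

		FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

		Path target = new Path("/liferay/documents/" + companyId + "/" + fileName);

		// Create (or overwrite) the target file in HDFS.
		FSDataOutputStream out = fs.create(target, true);

		// Copy the portal's stream into HDFS and close both streams.
		IOUtils.copyBytes(fileStream, out, 4096, true);
	}
}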
The example also demonstrates how to use the MapReduce API from a Java client, all the way from creating a job, through submitting it for processing, to (basically) monitoring its progress. The actual logic applied in the example is trivial; the important part is showing how you could use the APIs to make Liferay and Hadoop talk.
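For the MapReduce side, a rough sketch of that client-side lifecycle (against the Hadoop 1.x API referenced below) might look like the following. The cluster addresses, input/output paths, and the use of the identity Mapper and Reducer are placeholders; the real example plugs in its own job logic.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RemoteJobClient {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();

		// Assumed cluster addresses; point these at your NameNode and JobTracker.
		conf.set("fs.default.name", "hdfs://localhost:9000");
		conf.set("mapred.job.tracker", "localhost:9001");

		Job job = new Job(conf, "liferay-document-analysis");

		job.setJarByClass(RemoteJobClient.class);

		// Identity Mapper/Reducer as placeholders for the real job logic.
		job.setMapperClass(Mapper.class);
		job.setReducerClass(Reducer.class);

		job.setOutputKeyClass(LongWritable.class);
		job.setOutputValueClass(Text.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		// Input is the HDFS location where the indexed documents were teed off.
		FileInputFormat.addInputPath(job, new Path("/liferay/index"));
		FileOutputFormat.setOutputPath(job, new Path("/liferay/output"));

		// Submit asynchronously, then poll for progress.
		job.submit();

		while (!job.isComplete()) {
			System.out.printf(
				"map %.0f%% reduce %.0f%%%n",
				job.mapProgress() * 100, job.reduceProgress() * 100);

			Thread.sleep(5000);
		}

		System.out.println("Job succeeded: " + job.isSuccessful());
	}
}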
The code is on GitHub as a standalone Liferay Plugins SDK (based on trunk, but easily adaptable to earlier versions):
https://github.com/rotty3000/liferay-plugins-sdk-hadoop
Please feel free to fork it or use it as an example.
[update] I should also add a few links to resources I used to set up Hadoop in hybrid mode (a single-node cluster accessed from a remote client, where my Liferay code acts as the remote client):
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- http://hadoop.apache.org/docs/r1.0.3/mapred_tutorial.html
- http://hadoop.apache.org/docs/r1.0.3/single_node_setup.html
- http://blog.rajeevsharma.in/2009/06/using-hdfs-in-java-0200.html
- http://www.cs.brandeis.edu//~cs147a/lab/hadoop-troubleshooting/


