Building a federated search indexer against diverse data sources on top of Liferay

In brief, OpenSearch allows publishing search results in a format suitable for syndication and aggregation. Federated search is a simultaneous search of multiple online databases or web resources, and it is an emerging feature of automated, web-based library and information retrieval systems. The portal implements federated search based on the OpenSearch standard.

Adapted from the Liferay development cookbook Liferay Portal Systems Development (covering Liferay Portal 6.1 and above, coming out soon).

This article addresses an approach – building a federated search indexer against diverse data sources on top of Liferay. First of all, let’s consider a use case. A capital group manages many companies (each company has many users as members), hundreds of millions of accounts (the entity Account) and hundreds of millions of related documents (meta-data and real content).

The following diagram shows the ER models of the current enterprise database (DB2, for example). In one row, the entity Seller could have many Portfolio entities; the entity Portfolio could have many Account entities associated; the entity Account could have many Document entities associated. In another row, the entity Company could have many User entities; the entity User could have many DocumentType entities associated; the entity DocumentType could have many Document entities associated.

The following is a list of main requirements:

  • Keep the current enterprise database running AS IS for different existing applications;
  • Leverage the portal for user registration and membership management;
  • Leverage the portal RBAC for authorization;
  • Search documents based on accounts’ info and document meta-data;
  • Audit users’ activities, track downloads and send update notifications.

Solution overview

The following diagram shows an example: DB2 has database Schema I and Schema II, while the portal has its own database and schema. Schema I will be mapped to Plugin I in the portal, and Schema II will be mapped to Plugin II.

The schema Status will be used to store update results – that is, whenever the system updates account and/or document meta-data, it records the change as a row. This schema is mapped to the Status plugin in the portal.

In summary, the solution would be able to provide the following features:

  • Enhancement of the service-builder for diverse data sources in plugins
  • Read-only approach – data lookup
  • The pull/push approach, using a trigger to update the indexer
  • Scheduling – check update status via the scheduler
  • Federated search indexer
  • Audit users' activities, record downloads and set up email notifications

Diverse data source support in plugins through the service-builder

First of all, let’s see how to support diverse data sources in plugins through the service-builder. The wiki articles Connect to a Database with Plugins SDK and Extend Tables in Another Database have addressed how to connect to a different database with the Plugins SDK manually. Here we are focusing on how to support diverse data sources in plugins through the service-builder directly.

This new feature and the fix patch have been addressed in the ticket LPS-22552, "Ability to connect different data sources in plugins through the service-builder". The following steps show the main idea.

  • Predefine a file with JDBC settings, such as jdbc.properties
  • Predefine the template file spring-ext-xml.ftl in order to generate the Ext Spring configuration ext-spring.xml
  • Update the class ServiceBuilder.java to generate jdbc.properties and ext-spring.xml in plugins.

After applying the fix patch, the service-builder now supports diverse data sources in plugins.

Example

Let’s take a closer look at a real example. The portal (6.0.6 for example; you can also use 6.1 or 6.0 EE) uses MySQL as its default database. In DB2, there is an entity called DocumentType. We now need to bring the document type data into the portal.

  • Apply the fix patch in the portal
  • Build the service.xml in a plugin as follows

<service-builder package-path="com.liferay.mdb">
    <namespace>MDB</namespace>
    <entity name="DocumentType" table="DOCUMENTTYPE" uuid="false" local-service="true" remote-service="true" data-source="mdbDataSource">
        <!-- PK fields -->
        <column name="documentTypeId" type="long" primary="true" />
        <!-- Other fields -->
        <column name="name" type="String" />
        <column name="description" type="String" />
    </entity>
</service-builder>

  • Use the service-builder to generate all services.
  • Configure the target database (through jdbc.properties) in the plugin

## DB2
jdbc.mdb.driverClassName=com.ibm.db2.jcc.DB2Driver
jdbc.mdb.url=jdbc:db2://192.168.2.138:50000/mcm:deferPrepares=false;fullyMaterializeInputStreams=true;fullyMaterializeLobData=true;progressiveLocators=2;progressiveStreaming=2;
jdbc.mdb.username=lportal
jdbc.mdb.password=lportal

  • Deploy the plugin
  • Manually create the database table DOCUMENTTYPE in DB2 and insert sample data, as follows

CREATE TABLE DOCUMENTTYPE (
    documentTypeId bigint not null primary key,
    name VARCHAR(512),
    description VARCHAR(512)
);
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1001,'Type A', 'Type A');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1002,'Type B', 'Type B');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1003,'Type C', 'Type C');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1004,'Type D', 'Type D');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1005,'Type E', 'Type E');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1006,'Type F', 'Type F');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1007,'Type G', 'Type G');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1008,'Type H', 'Type H');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1009,'Type I', 'Type I');
insert into DOCUMENTTYPE (documentTypeId, name, description) values (1010,'Type J', 'Type J');

The following screenshot shows the results of the plugin in the portal.

The pull/push data approach

The entities DocumentType and Company have hundreds of rows, so these rows can be retrieved through the Liferay API directly. For example:

int count = 0;
List<DocumentType> list = Collections.synchronizedList(new ArrayList<DocumentType>());

try {
    count = DocumentTypeLocalServiceUtil.getDocumentTypesCount();
    list = DocumentTypeLocalServiceUtil.getDocumentTypes(0, count);
}
catch (Exception e) {
    // log the failure instead of silently swallowing it (_log is a Liferay Log)
    _log.error(e, e);
}

This is the beauty of the service-builder.

The entities Account and Document have hundreds of millions of rows. Thus these rows could be retrieved through the Liferay API for first-time indexing only, not for checking updates. Whenever Account or Document rows get updated in DB2, a trigger records the updates in the table Status.

In the portal, define a scheduler in the plugin that checks the table Status on a regular basis (for example, every minute). Whenever it finds rows in the table Status, it retrieves these rows (and related entities such as Account and Document) from DB2, updates the same in the portal's indexer, and then removes these rows from the table Status.
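
To make this concrete, the following is a minimal sketch of such a scheduler job. It assumes a Status entity generated by the service-builder in this plugin; the names Status, StatusLocalServiceUtil and the reindexFromStatus() helper are hypothetical and stand in for the plugin's generated services.

import java.util.List;

import com.liferay.mdb.model.Status;
import com.liferay.mdb.service.StatusLocalServiceUtil;
import com.liferay.portal.kernel.log.Log;
import com.liferay.portal.kernel.log.LogFactoryUtil;
import com.liferay.portal.kernel.messaging.Message;
import com.liferay.portal.kernel.messaging.MessageListener;

public class StatusMessageListener implements MessageListener {

    public void receive(Message message) {
        try {
            // Drain all pending update records from the table Status
            int count = StatusLocalServiceUtil.getStatusesCount();

            List<Status> statuses = StatusLocalServiceUtil.getStatuses(0, count);

            for (Status status : statuses) {
                // Pull the changed Account / Document rows from DB2 and
                // refresh the portal index (hypothetical helper)
                reindexFromStatus(status);

                // Remove the processed row so it is not handled twice
                StatusLocalServiceUtil.deleteStatus(status.getStatusId());
            }
        }
        catch (Exception e) {
            _log.error(e, e);
        }
    }

    protected void reindexFromStatus(Status status) throws Exception {
        // Look up the related entities and update the search index here
    }

    private static Log _log = LogFactoryUtil.getLog(StatusMessageListener.class);
}

In a plugin, such a listener would typically be registered through a scheduler-entry (with a cron trigger) in liferay-portlet.xml.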

Building federated search indexer

Based on the entities Seller, Portfolio, Account, Company, User, DocumentType and Document, plus the View permission in the portal, build a federated indexer for the plugin. From then on, you will be able to search documents via the portal's default search engine.
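
As a rough illustration, the following sketch shows how one portal search document could merge meta-data from the diverse DB2 entities. It assumes the 6.0-era search API (DocumentImpl and SearchEngineUtil.updateDocument(companyId, document)); the mdb model getters and the portlet ID are hypothetical.

import com.liferay.portal.kernel.search.Document;
import com.liferay.portal.kernel.search.DocumentImpl;
import com.liferay.portal.kernel.search.Field;
import com.liferay.portal.kernel.search.SearchEngineUtil;

public class MdbIndexerUtil {

    // Hypothetical portlet ID of this search plugin
    public static final String PORTLET_ID = "mdb_search";

    public static void indexDocument(
            long companyId, com.liferay.mdb.model.Document mdbDocument,
            com.liferay.mdb.model.Account account,
            com.liferay.mdb.model.DocumentType documentType)
        throws Exception {

        Document doc = new DocumentImpl();

        // The UID ties the search hit back to this plugin's portlet
        doc.addUID(PORTLET_ID, mdbDocument.getDocumentId());

        doc.addKeyword(Field.COMPANY_ID, companyId);
        doc.addKeyword(Field.PORTLET_ID, PORTLET_ID);
        doc.addKeyword(Field.ENTRY_CLASS_PK, mdbDocument.getDocumentId());

        // Searchable meta-data drawn from the diverse DB2 entities
        doc.addText(Field.TITLE, mdbDocument.getName());
        doc.addText(Field.CONTENT, mdbDocument.getDescription());
        doc.addKeyword("documentType", documentType.getName());
        doc.addKeyword("accountId", account.getAccountId());

        SearchEngineUtil.updateDocument(companyId, doc);
    }
}

Because the UID and ENTRY_CLASS_PK point back to the plugin's entities, the portal's default search portlet could render hits from DB2 side by side with native portal content.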

Since all entities (from DB2) are mounted into the portal, you can audit user activities, record downloads and set up email notifications whenever needed.
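
For example, a download-tracking helper might look like the following sketch. DownloadEventLocalServiceUtil is a hypothetical service-builder entity of this plugin, and the sender address is a placeholder.

import java.util.Date;

import javax.mail.internet.InternetAddress;

import com.liferay.mail.service.MailServiceUtil;
import com.liferay.mdb.service.DownloadEventLocalServiceUtil;
import com.liferay.portal.kernel.mail.MailMessage;
import com.liferay.portal.model.User;

public class DownloadTracker {

    public static void trackDownload(User user, long documentId)
        throws Exception {

        // Audit trail: who downloaded which document, and when
        // (hypothetical generated service)
        DownloadEventLocalServiceUtil.addDownloadEvent(
            user.getUserId(), documentId, new Date());

        // Simple update notification by email
        MailMessage mailMessage = new MailMessage(
            new InternetAddress("no-reply@example.com"),
            new InternetAddress(user.getEmailAddress()),
            "Download recorded",
            "User " + user.getFullName() + " downloaded document " +
                documentId + ".",
            false);

        MailServiceUtil.sendEmail(mailMessage);
    }
}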

Summary

The approach – building a federated search indexer against diverse data sources on top of Liferay – would be useful when integrating an existing enterprise database while keeping the existing applications running. And the new feature (LPS-22552) would be a nice improvement to the capabilities of the service-builder.
