To reindex or not to reindex, that's the question

A little while ago I ran into a strange problem at a customer. We had written a hook containing a small REST service that accepts a multipart POST to import a document together with some related metadata and permissions. This service seemed to work fine and was used to import thousands of documents, but every now and then we received a call or an email from an end user complaining that they couldn't see a document that had been imported. When we checked the document library of the group it was supposed to be in, we were able to find the document and verify that it was uploaded correctly, that it had the correct metadata and that the permissions were also correct.
 
 
So why wasn't it showing up then? We quickly determined that the cause had to be the index (Lucene in our case) in some way or another. A first look at our code didn't turn up anything specific. We were using a couple of Liferay APIs to add the document to a certain group and to set some expandos and permissions, but nothing too special or out of the ordinary (or so we thought...). After some further digging we finally found a small indicator: some documents in the index had an empty or wrongly filled groupRoleId field. When we then manually uploaded the same document over the existing one (mimicking a reindex for that specific document), without changing the expandos or permissions, the document would show up correctly again. In the end we found that running the following script, which only reindexes the permissions for a document, would also do the trick:
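import com.liferay.portlet.documentlibrary.model.*;
import com.liferay.portal.kernel.search.*;

// Primary key of the file entry whose permission fields need reindexing.
// Declared outside the try block so the catch block can reference it too.
String entryClassPK = "192363";

try {
   String className = DLFileEntry.class.getName();

   // Rewrites only the permission-related fields of the indexed document
   SearchEngineUtil.updatePermissionFields(className, entryClassPK);

   out.println("Reindexed: " + entryClassPK);
} catch (Exception e) {
   out.println("Failed to reindex: " + entryClassPK);
   e.printStackTrace();
}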

So now we definitely knew that the indexing of the permissions was the issue, and after creating a JMeter load test I was even able to reproduce the problem consistently. When I ran the test, about 1 or 2 documents in a run of 1000 would have an incorrect groupRoleId field value (empty, or only a groupId without a connecting dash and roleId). This is when we decided to open a support ticket, as we couldn't find the cause of the problem and thought it might be a bug.

With the friendly help of the Liferay support guys (thank you very much again!) we found out that we were actually working under an incorrect assumption about Liferay's inner workings. As it turns out: when you call multiple Liferay APIs one after another, like we were doing in our REST service, outside of Liferay's transactions, you might need to use TransactionCommitCallbackUtil to register a callback that has to run after the initial API call has finished. In our case this was setting the correct permissions after the document was added. Even after working with Liferay all these years you still learn something new!
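
To make the ordering idea more concrete, here is a minimal, self-contained sketch in plain Java. It does not use Liferay at all: the class CommitCallbackSketch and its methods registerCallback and runInTransaction are hypothetical stand-ins that only mimic the concept of "run this after the transaction has committed". In a real hook you would register the callback through Liferay's own utility class instead, and the exact class name and signature depend on your Liferay version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class CommitCallbackSketch {

    // Hypothetical stand-in for the per-thread list of callbacks that a
    // transaction manager keeps until the current transaction commits.
    private static final ThreadLocal<List<Callable<?>>> CALLBACKS =
        ThreadLocal.withInitial(ArrayList::new);

    // Records the order in which things happen, for demonstration only.
    private static final List<String> EVENTS = new ArrayList<>();

    // Queue work that must only run once the surrounding transaction is done.
    public static void registerCallback(Callable<?> callback) {
        CALLBACKS.get().add(callback);
    }

    // Simulated transaction: do the work, "commit", then run the callbacks.
    public static void runInTransaction(Runnable work) throws Exception {
        CALLBACKS.get().clear();
        work.run();
        EVENTS.add("commit");
        for (Callable<?> callback : new ArrayList<>(CALLBACKS.get())) {
            callback.call();
        }
        CALLBACKS.get().clear();
    }

    // Mimics the import: add the document first, set permissions after commit.
    public static List<String> importDocument() throws Exception {
        EVENTS.clear();
        runInTransaction(() -> {
            EVENTS.add("addFileEntry");
            // Assumption: the permission update is the step to defer.
            registerCallback(() -> {
                EVENTS.add("setPermissions");
                return null;
            });
        });
        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) throws Exception {
        // prints [addFileEntry, commit, setPermissions]
        System.out.println(importDocument());
    }
}
```

The point of the sketch is only the order of events: the permission step is registered while the document is being added, but it runs after the commit, so whatever reads the document afterwards (the indexer, in our case) sees a consistent state.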

We then changed our REST service to use this utility class (for which there is precious little information or documentation to be found), but we needed something to verify that it now works correctly. For that I created a small standalone Java application, which can be run on the command line or in a cron job, that runs a query against a Lucene index and writes out a report. The code can be found here: https://github.com/planetsizebrain/index-checker.
 
It takes the following parameters (in order):
  • A Lucene index directory, in our case the Liferay data/lucene directory (you only need to provide the root, it will find the subdirectories itself). The index will also be opened as read-only and our tests suggest you can run it against a running Liferay's index.
  • The Lucene query to run (wildcards are turned on and allowed), e.g.: "entryClassName:com.liferay.portlet.documentlibrary.model.DLFileEntry AND visible:true AND (*:* AND NOT groupRoleId:*-*)"
  • The field of the result that you want inspected, e.g.: groupRoleId
  • The field values of the matching Lucene docs you want printed in the logs, e.g.: "entryClassPK,groupId,title"
You can control where the log will be written by providing the following JVM parameter: -Dlog.directory=<directory-where-to-log> (for which we chose a directory that we can access via LFM). This will produce a log file that contains something like the following lines:
be.planetsizebrain.lucene.IndexChecker - Adding index directory '/path/to/liferay/data/lucene/10155' to search
be.planetsizebrain.lucene.IndexChecker - Adding index directory '/path/to/liferay/data/lucene/0' to search
be.planetsizebrain.lucene.IndexChecker - Found 5 possible incorrect documents, checking 'groupRoleId' field...
be.planetsizebrain.lucene.IndexChecker - Found document with empty value for 'groupRoleId': (entryClassPK: 11547), (groupId: 10916), (title: Code Complete 2nd Edition)
be.planetsizebrain.lucene.IndexChecker - Found document with empty value for 'groupRoleId': (entryClassPK: 11578), (groupId: 10916), (title: Liferay In Action)
be.planetsizebrain.lucene.IndexChecker - Found document with empty value for 'groupRoleId': (entryClassPK: 994268), (groupId: 12301), (title: Javascript - The Good Parts)
be.planetsizebrain.lucene.IndexChecker - Found document with empty value for 'groupRoleId': (entryClassPK: 600652), (groupId: 12355), (title: The Mythical Man-Month)
be.planetsizebrain.lucene.IndexChecker - Found document with empty value for 'groupRoleId': (entryClassPK: 995458), (groupId: 12355), (title: Effective Java 2nd Edition)
be.planetsizebrain.lucene.IndexChecker - Done. Found 0 incorrect and 5 empty entries
In our case, running this against an index with roughly 350K documents and -Xmx set to 64 MB took about 10 seconds.
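
The distinction the log makes between "empty" and "incorrect" groupRoleId values can be sketched roughly as follows. This is not the actual code of the index checker, just an illustration of the kind of check involved. The class name GroupRoleIdCheck, the Status enum and the assumption that a valid value looks like a numeric groupId, a dash and a numeric roleId are mine.

```java
import java.util.regex.Pattern;

public class GroupRoleIdCheck {

    public enum Status { OK, EMPTY, INCORRECT }

    // Assumed format of a valid value: <groupId>-<roleId>, both numeric,
    // e.g. "10916-10165". Adjust if your index stores something different.
    private static final Pattern GROUP_ROLE_ID = Pattern.compile("\\d+-\\d+");

    // Classifies a single field value read from an index document.
    public static Status classify(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Status.EMPTY;
        }
        return GROUP_ROLE_ID.matcher(value.trim()).matches()
            ? Status.OK
            : Status.INCORRECT;
    }

    public static void main(String[] args) {
        // A valid value, an empty value and a groupId without dash and roleId
        String[] samples = {"10916-10165", "", "10916"};
        for (String sample : samples) {
            System.out.println("'" + sample + "' -> " + classify(sample));
        }
    }
}
```

A value consisting of only a groupId, which is the second failure mode we saw in the load test, ends up as INCORRECT, while a missing or blank value ends up as EMPTY.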
 
Good luck trying this out and modifying it to your needs and hopefully you'll also find your needle in a haystack when the time comes!
 
 
PS: just saw this pop up in the Marketplace and it does something similar, but slightly different, from within Liferay: https://www.liferay.com/marketplace/-/mp/application/70121999
Hi all,

My Index Checker portlet is now available for both Liferay 6.2 and Liferay 7.0.

It checks all indexed Liferay objects as well as their permissions and categories/tags.

Download it from the Marketplace: https://web.liferay.com/es/marketplace/-/mp/application/70121999