Understanding Search from a Developer's Perspective

From a developer perspective, how do you properly handle indexing and searching?

If you’ve spent time rummaging around Liferay’s search and indexing documentation provided here, you’ll find a lot of details about document contributors, index writers, search registrars, etc.

The part that might be missing is what all of these things actually do, why they are important, and why you actually want to go down the road of supporting indexing and search for your custom entities.

In this blog entry, I'm going to break everything down and clarify why things are in the Liferay documentation samples. Hopefully, by the end of the post, you'll have the knowledge you need to get your index and search needs done right the first time.

But first let's understand why Liferay is even using an external search index in the first place.

What Search Solves

Liferay maintains a separate search index from the data store because some things are either really hard or outright impossible (at least in a practical sense) in a standard relational database like Oracle or MySQL.

The search index is used to match documents on keywords or phrases regardless of the “column” the data might be from.

Imagine a table in a database with 5 columns of large text blocks: maybe a product name, a description, installation instructions, recycling options and the sales brochure content.

If you wanted to search for a phrase such as “keyless entry” and you wanted to match on any of the 5 columns, you end up with something like:

SELECT * FROM mytable WHERE
  (prod_name LIKE '%keyless entry%') OR
  (description LIKE '%keyless entry%') OR
  (install_instr LIKE '%keyless entry%') OR
  (recycling LIKE '%keyless entry%') OR
  (brochure LIKE '%keyless entry%')

This is already pretty ugly, but what happens if you want to search for the keywords “keyless” or “entry”? Your query becomes less and less maintainable as you add relatively simple additional criteria into the mix.

And if you have multiple tables in your database and you want to join matching results across more than one table? Your query misery has just increased astronomically!

Certain types of queries simply become too unwieldy or inefficient, and in some cases impossible, in a regular SQL-92 database.
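To make the pain concrete, here's a small self-contained sketch (using the hypothetical table and column names from above) that generates the LIKE-per-column WHERE clause; the predicate count is columns times keywords, so every new column or keyword multiplies the mess:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LikeQueryBuilder {

  // Builds one LIKE predicate per (column, keyword) pair, OR'd together.
  static String buildWhere(List<String> columns, List<String> keywords) {
    return columns.stream()
      .flatMap(column -> keywords.stream()
        .map(keyword -> "(" + column + " LIKE '%" + keyword + "%')"))
      .collect(Collectors.joining(" OR "));
  }

  public static void main(String[] args) {
    List<String> columns = List.of(
      "prod_name", "description", "install_instr", "recycling",
      "brochure");

    String where = buildWhere(columns, List.of("keyless", "entry"));

    // 5 columns x 2 keywords is already 10 LIKE predicates.
    System.out.println(where.split(" OR ").length); // prints 10
  }
}
```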

Search Index to the Rescue

The search index solves this problem because search occurs over a Document, not over a table column.

Yes, I know that you can control the Fields included or excluded from the search; we're just focusing on theory at this point...

In search index parlance, each record from our table(s) will become a Document in the search index. The Document can have multiple Fields which may come straight from the table columns or they might be manufactured values (turning numerical codes into their string labels) or they may contain values from Parent/Child table relations to include necessary child data into the Document.

When a search for a phrase or for keywords is performed, the search index will search for Documents that match, regardless of the Fields the matches might come from. This way, as new keywords or Fields or Documents are added, the complexity of the query remains unchanged.
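That field-agnostic matching can be sketched with a toy model where a Document is nothing but a map of Field names to text; note that the search code never names a column, so adding Fields or Documents leaves it untouched:

```java
import java.util.List;
import java.util.Map;

public class DocumentSearchDemo {

  // In this toy model, a "Document" is just Field name -> text.
  static boolean matches(Map<String, String> document, String phrase) {
    // Every Field is checked; the caller never names a column.
    return document.values().stream()
      .anyMatch(value ->
        value.toLowerCase().contains(phrase.toLowerCase()));
  }

  public static void main(String[] args) {
    List<Map<String, String>> index = List.of(
      Map.of(
        "NAME", "Smart Lock",
        "DESCRIPTION", "Supports keyless entry via a phone app"),
      Map.of(
        "NAME", "Classic Deadbolt",
        "DESCRIPTION", "A traditional keyed lock"));

    long hits = index.stream()
      .filter(document -> matches(document, "keyless entry"))
      .count();

    System.out.println(hits); // prints 1
  }
}
```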

For the multiple table scenario, the records from the different tables are included into the same search index. Common Fields like NAME and DESCRIPTION would be reused across the different Document types so searching for “keyless entry” in a NAME Field would yield results from all tables that had a corresponding match.

Fields that are unique to one table can still be added for indexing, but the search query may need to be modified to include those additional Fields.

The Developer Perspective

To get back to the developer perspective, your goal in all of this is to get your entities into the search index such that when a user does a search in Liferay, your entities can be found and matched upon in the same way that a Liferay entity would be.

This is where the Liferay documentation will start to apply…

When Liferay is documenting how to contribute model entity fields into the index, they are describing what is necessary to get the fields from your entity into the index so they can be matched during a search.

When Liferay is documenting how to configure re-indexing and batch indexing behavior, they’re showing what you will be doing to ensure that your custom entities are also re-indexed when the Liferay Admin wants to reindex everything.

When Liferay is documenting how to add your model entity’s terms to the query, they’re showing how to add any additional Fields you might have defined for your custom entity to the search query so those custom Fields can be checked.

When Liferay is documenting how to pre-filter search results, they’re providing you a way to exclude matches from the results to prevent records from getting through that you don’t want included.

When Liferay is documenting how to create a results summary, they’re providing you a way to control the generated summary for your entity that the user will see in the search results.

And finally, when Liferay is documenting how to register your search services, they’re showing how all of these pieces you’ve generated will be made available to the search and indexing infrastructure to ensure they all get picked up.

Indexing and Search Customizations

Next we’ll get into each of the extension points and go into details to use for understanding how to build your own customizations.

Contribute Model Entity Fields into the Index

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/indexing-model-entities#contributing-model-entity-fields-to-the-index

When you have a custom Service Builder entity (or really any entity you want to search), one of the most important things you need to do is actually contribute Fields into the Document for your entity.

A ModelDocumentContributor is the class that will help get your entity’s columns mapped into Fields in the Document to be indexed.

Remember that the columns from your entity are not just going to be stored in the index on their own.

Every entity that requires indexing will need an implementation of this class; if your entity is not directly indexed, you won’t need one.

When adding Fields to the Document for your entity, keep these things in mind:

  • You don’t need to have all of your entity’s fields in the Document as Fields; you only need the ones that a keyword search should match on. The Document will always have your primary key value, so when your entity is a search result (AKA a Hit), you can always retrieve your entity.

  • Try to use constant values from the com.liferay.portal.kernel.search.Field class for your Field names, where they make sense. If your entity has a name, use Field.NAME. If your entity has a description, use Field.DESCRIPTION. Using the constants will reuse Fields in the Document that Liferay will already know how to include in a search query so your customization effort is reduced.

  • Don’t use the constant values for something they’re not. If your entity has an array of chemical names, for example, don’t concatenate them together and store as the Field.CAPTION type because they simply aren’t a caption. It is okay to come up with your own Field names.

  • Understand addText() vs addKeyword(). Both of these methods are overloaded and allow for text or keyword addition of many different types, but the search index will handle them very differently.

  • There are numerous other add methods for different data types such as addDate(), addNumber(), addGeolocation(), etc. Don’t coerce all of your data into Strings as this can throw off your search results (you wouldn’t want a search for “19” returning every record for the years 19xx and 2019, for example).

  • Fields do not have to be exact copies of the entity data; they are often better if they represent normalized data instead of the entity values. For example, you might have a clientId in your entity to point off to a different client record; for indexing, you will have better results if your Field is for the actual client name (from the client record) instead of (or in addition to) storing the clientId from the entity.

  • Fields can be created for data not part of the entity, so an entity with 5 fields could be represented in a Document with 20 Fields if it makes sense to have them as matching targets.

  • When adding support for filtering and/or sorting, the fields to filter or sort on must be added as Fields in the Document; you can’t sort on a field from the entity, for example, because the search includes only Hits from the index, not from an additional search of the database.

  • When handling localizable text, index the text in all languages. The Liferay documentation shows how to handle adding localized Fields by using specially crafted Field names. When a search is performed for a specific locale, the matching Fields can be used so the correct results will be returned and exclude false positives that could arise from an indirect match from another language.

  • Be sure to include all Fields that will later be used in a ModelSummaryContributor implementation (below). When building the Summary, you don’t want to have to fetch the Entity directly to get additional info to include in the Summary.
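Pulling these guidelines together, a minimal contributor might look like the sketch below. This is an illustrative sketch rather than Liferay's sample verbatim: FooEntry and its getters are hypothetical, and the component properties should be checked against the linked documentation.

```java
@Component(
  immediate = true,
  property = "indexer.class.name=com.example.model.FooEntry",
  service = ModelDocumentContributor.class
)
public class FooEntryModelDocumentContributor
  implements ModelDocumentContributor<FooEntry> {

  @Override
  public void contribute(Document document, FooEntry fooEntry) {
    // Reuse the standard Field constants where they fit the data...
    document.addText(Field.SUBTITLE, fooEntry.getSubtitle());
    document.addDate(Field.MODIFIED_DATE, fooEntry.getModifiedDate());

    // ...use the typed add methods instead of coercing to Strings...
    document.addNumber("fooWeight", fooEntry.getWeight());

    // ...and invent Field names for data with no standard constant.
    document.addKeyword("fooStatus", fooEntry.getStatus());
  }
}
```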

Text vs Keyword

It can be confusing to choose between adding a Field as text and adding it as a keyword. There is one significant difference that separates the two concepts: whether the full text is stored or only keywords are.

When storing as text, a phrase such as “The quick brown fox jumps over the lazy dog.” will be stored as-is in the Field. This is useful when you expect to face searches over phrases like “quick brown fox” or “lazy dog”. Because the full text is intact, only those records that include the matching phrase will be Hits.

With keyword storage, common words and duplicated words are removed from the text.

Common words, known in indexing as Stop Words, are words that occur frequently in a language but provide no value from an index perspective. This would include words like “the, this, a, that, those, he, him, her,” etc. From the phrase above, the removal of the stop words would index “quick brown fox jumps over lazy dog”.

Additionally, duplicated words are removed, but the occurrence count is retained. In this blog, for example, I must have used the word Field at least 50 times so far. In keyword storage, Field would be stored once with a count of 50, but all sentence structure and word placement within the text is lost.

Of course, the actual storage of the keyword-based Fields will be up to the search appliance, whether it is Solr or Elasticsearch. While they may handle things in a different way than what is described here, from a coding perspective it is easier just to imagine this is how they do it.

Storing as keywords helps to reduce storage size and is good for keyword matching, but it is not useful for phrase searches such as “keyless entry” since the phrase is not retained in this storage method.

Since the occurrence count is retained, a search for “keyless” would be able to rank Documents with a higher occurrence count above Documents that use the word only sporadically. Keyword storage also tends to be faster than raw text storage for keyword searches.
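As a mental model only (the real analysis happens inside Elasticsearch or Solr and is far more sophisticated), keyword storage can be imagined as stop-word removal plus occurrence counting:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class KeywordStorageDemo {

  static final Set<String> STOP_WORDS =
    Set.of("the", "this", "a", "that", "those", "he", "him", "her");

  // Toy model of keyword storage: drop stop words, keep occurrence
  // counts, lose all word order and sentence structure.
  static Map<String, Integer> toKeywords(String text) {
    Map<String, Integer> counts = new LinkedHashMap<>();

    for (String word :
        text.toLowerCase().replaceAll("[^a-z ]", "").split(" +")) {

      if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
        counts.merge(word, 1, Integer::sum);
      }
    }

    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts =
      toKeywords("The quick brown fox jumps over the lazy dog.");

    // prints [quick, brown, fox, jumps, over, lazy, dog]
    System.out.println(counts.keySet());
  }
}
```

Notice that a phrase search for “lazy dog” is unanswerable from this representation, which is exactly why the text form is still needed for phrase matching.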

You might choose to store your text as two Fields, one using text and the other using keyword. The first would favor phrase searches and the other keyword searches.

Configure Reindexing and Batch Indexing Behavior

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/indexing-model-entities#configure-re-indexing-and-batch-indexing-behavior

When an administrator uses the Reindex option in the control panel, a batch reindexing process is kicked off. For a given entity, this typically means each record in the table will be retrieved, the Document is populated using the registered ModelDocumentContributor and the Document is sent to the indexing service (typically Elasticsearch) for storage.

Since this is the normal flow, you might be asking why you would need to customize this process.

Many times you will actually want every record to be indexed, but it is also common to have records that should automatically be excluded from indexing.

For example, JournalArticles are obviously indexed, but only articles that are in the workflow status of APPROVED or IN_TRASH; any that are PENDING, DENIED, etc. are excluded from indexing. Depending upon your own perspective, you might think that indexing IN_TRASH entities doesn’t make sense, so you might want to exclude them. Or, for some users, you might want to include PENDING articles in the index so they can see in their search results just how pending articles would rank for typical searches once approved.

These decisions are not going to be the same for all environments or all developers. The context behind your needs and requirements will determine which documents should be indexed and which should be excluded.

Rather than trying to avoid creating a Document using the ModelDocumentContributor to prevent indexing of articles in these states, a ModelIndexerWriterContributor is created to exclude these records from being processed in the first place.

This ends up being a much better process as it saves network and database bandwidth (by not retrieving records that won’t be indexed) and processing time (time wasted trying to create a document for a record that shouldn’t be indexed).

Every entity that is going to be indexed needs an instance of this class.

At the very least, most of the code from the Liferay documentation can be used as-is. The only change is to the customize() method; the minimal implementation is going to be:

@Override
public void customize(
  BatchIndexingActionable batchIndexingActionable,
  ModelIndexerWriterDocumentHelper modelIndexerWriterDocumentHelper) {

  batchIndexingActionable.setPerformActionMethod(
    (FooEntry fooEntry) -> {
      Document document =
        modelIndexerWriterDocumentHelper.getDocument(fooEntry);

      batchIndexingActionable.addDocuments(document);
    });
}

This version does not filter any records from the table and would reindex every row.

To learn how to exclude entities/rows from being indexed, the Liferay documentation provides sample code demonstrating how to write the customize() method, but here’s some additional details that will help you decide how to implement yours:

  • The BatchIndexingActionable is a wrapper around a DynamicQuery. Anything you can do in a DynamicQuery, you can add to your BatchIndexingActionable instance.

  • The goal should be to exclude records you know should not be indexed. This might be determined by workflow status or even your own status codes. You might also want to exclude older records to prevent search Hits on them without actually deleting them from the system.

  • The content for the batchIndexingActionable.setPerformActionMethod() in the example code is what you’ll use 99% of the time (modifying for your own entity class).

  • The getIndexerWriterMode() method is normally going to return IndexerWriterMode.UPDATE. The other options are used to “clean up” a record that might have been left behind previously but might need to be removed.
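As an example of that exclusion, a criterion can be added to the wrapped DynamicQuery inside customize(). This sketch assumes your entity has a status column holding its workflow status; verify setAddCriteriaMethod() against the Liferay sample code before relying on it:

```java
batchIndexingActionable.setAddCriteriaMethod(
  dynamicQuery -> {

    // Only APPROVED rows are retrieved from the database, so excluded
    // records never reach the ModelDocumentContributor at all.
    Property statusProperty = PropertyFactoryUtil.forName("status");

    dynamicQuery.add(
      statusProperty.eq(WorkflowConstants.STATUS_APPROVED));
  });
```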

Adding Your Model Entity’s Terms to the Query

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/searching-the-index-for-model-entities#adding-your-model-entitys-terms-to-the-query

This is a sister class to your ModelDocumentContributor class. In this class you’re going to be adding Fields that you defined in your Document instance to the search context query helper to facilitate keyword searches on the fields.

Not all entities will need a KeywordQueryContributor implementation; only those that need Fields which were added by the entity’s ModelDocumentContributor to be included in an in-flight search query.

This is kind of an important aspect - there is already a search being started and your class needs to add Fields for the keyword search.

So you may not want to add every Field that you did in the ModelDocumentContributor, but you absolutely want to add those that should be included in the keyword search.

In Liferay’s example code, the FooEntryModelDocumentContributor added two date Fields, a simple text subtitle Field and localized Fields for the content and the title.

In the corresponding FooEntryKeywordQueryContributor, only the subtitle, title and content Fields were added to the query; the two date Fields were not because they are not really subject to a Keyword search.

Likewise you may have other Fields that you add in your own ModelDocumentContributor that you may or may not want to include in the KeywordQueryContributor; just note that those you include will be searched, while those you exclude will not be searched.
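Put together, a contributor along the lines of Liferay's FooEntry example would add only the searchable text Fields, using the localized variant for the localized ones. The annotations and helper methods below are my reading of the linked sample, so double-check them there:

```java
@Component(
  immediate = true,
  property = "indexer.class.name=com.example.model.FooEntry",
  service = KeywordQueryContributor.class
)
public class FooEntryKeywordQueryContributor
  implements KeywordQueryContributor {

  @Override
  public void contribute(
    String keywords, BooleanQuery booleanQuery,
    KeywordQueryContributorHelper keywordQueryContributorHelper) {

    SearchContext searchContext =
      keywordQueryContributorHelper.getSearchContext();

    // Localized Fields resolve to their locale-specific Field names.
    queryHelper.addSearchLocalizedTerm(
      booleanQuery, searchContext, Field.CONTENT, false);
    queryHelper.addSearchLocalizedTerm(
      booleanQuery, searchContext, Field.TITLE, false);

    // Plain text Field; the date Fields are deliberately left out.
    queryHelper.addSearchTerm(
      booleanQuery, searchContext, Field.SUBTITLE, false);
  }

  @Reference
  protected QueryHelper queryHelper;
}
```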

Pre-Filtering

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/searching-the-index-for-model-entities#pre-filtering

Pre-filtering is a method to exclude Hits from the returned search results. You may not have a need to do this kind of thing, but the option is there.

Remember the earlier example where it was suggested that PENDING JournalArticles might be indexed so content approvers could see pending articles in the search results? In that situation, you would not want everyone to see PENDING articles.

Through a custom ModelPreFilterContributor implementation, you could add a role-specific filter to exclude PENDING articles from normal users and only include them for content approvers.

Not all entities will need an implementation of ModelPreFilterContributor - only in cases where some instances of your entity should not be included as Hits under specific circumstances will this be necessary.

Creating a Results Summary

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/returning-results#creating-a-results-summary

Every Hit (search result) will be displayed in the search results portlet in Liferay. You have control over the summary content that is displayed in the search result using your ModelSummaryContributor.

The Liferay sample demonstrates setting a summary for the matched entity. Remember that both the Content and Title Fields are localized; the provided implementation exposes the Field naming used for localized Fields, but in the end the localized title and content are extracted from the Document and used to create the Summary instance.

If you’re not using localized fields, your Summary creation will be simpler than the provided sample.

Although it is not highlighted in the Liferay example, you should try to use the Document Fields when creating the Summary instance. If you have to do a DB query to fetch your entity for something to complete the Summary, you will be facing a performance hit. It is recommended that all values you need or want in the Summary should be added as Fields in the Document to avoid the DB query.

Every entity which can be returned as a Hit (search result) should implement a ModelSummaryContributor class.
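Ignoring localization for brevity, a minimal summary contributor built purely from Document Fields might look like this sketch (the Field choices are illustrative, and the Summary constructor should be verified against your Liferay version):

```java
@Component(
  immediate = true,
  property = "indexer.class.name=com.example.model.FooEntry",
  service = ModelSummaryContributor.class
)
public class FooEntryModelSummaryContributor
  implements ModelSummaryContributor {

  @Override
  public Summary getSummary(
    Document document, Locale locale, String snippet) {

    // Built entirely from Document Fields; no DB fetch required.
    Summary summary = new Summary(
      document.get(Field.TITLE), document.get(Field.DESCRIPTION));

    summary.setMaxContentLength(200);

    return summary;
  }
}
```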

Controlling the Visibility of Model Entities

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/returning-results#controlling-the-visibility-of-model-entities

This will likely be a rarely used extension point. In cases where one Entity can have Related Assets, the ModelVisibilityContributor determines whether the entity can be selected as a Related Asset or not.

For example, Web Contents can have a DlDocument as a Related Asset; when creating a Web Content, the user can do a search to find documents that can be added as a Related Asset.

The ModelVisibilityContributor can be used to prevent your entity/entities from being available for selection.

In the sample Liferay implementation in the documentation, it masks FooEntry instances that are not in the right workflow status.

Not all entities will require an implementation of the ModelVisibilityContributor; only those that can be related to another asset and that need control over whether an instance is available will implement one of these classes.

Search Service Registration

Liferay Documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/search-service-registration#search-service-registration

The last piece of the custom index/search implementation is your search service registrar.

Although all of the classes previously discussed are implementations of Liferay interfaces to support indexing and search, and although they are all registered as OSGi components, they will not automatically be registered into the Liferay Search Registry.

It is through the Search Registry that Liferay finds all of the necessary pieces when dealing with indexing or searching, so our last piece to implement is the SearchRegistrar to finish wiring everything up.

The Liferay implementation uses a regular @Component with an @Activate method to trigger the search registration process. This will get invoked as soon as the @Referenced services are wired in.

You’ll want to add @Reference dependencies for all of the indexing and search classes you created in the previous sections.

In the sample code, the registration sets the following in the modelSearchDefinition:

  • The default selected Field names are the default list of Fields that are selected; the list shown is mostly the standard set, but they included MODIFIED_DATE (the FooEntryModelDocumentContributor sets this Field value).

  • Default selected Localized Field names; since the FooEntry has localized TITLE and CONTENT, these are added as default selected localized fields (you may or may not have any of these).

  • Sets the ModelIndexerWriterContributor, the one for the FooEntry.

  • Sets the ModelSummaryContributor, the one for the FooEntry.

  • Sets the ModelVisibilityContributor, the one for the FooEntry (you may or may not have one of these).

Additionally, there is a @Deactivate method that unregisters everything from the Search Registry when the module is unloaded.

Every entity that is being indexed/searched will register its classes in this same way.

It is recommended that each Entity has its own SearchRegistrar implementation, but this is not a requirement. While you could have a single SearchRegistrar that took care of registering all of the classes for all of the entities, a single missing @Referenced component could then block the registration of every entity’s search classes. For this reason, it is recommended that each entity have a separate Registrar so a missing component only blocks the entity with the missing dependency.

Conclusion

Well, there it is. That's everything I know about building out custom index/search code. I needed all of this for a new blog project implementation, so I figured that by dumping it here, when it comes time to check out that bigger project, you'll know more about why I made certain decisions in the implementation.

In the meantime, I also think these details will make it easier to understand how to handle your own entity index/search needs.