Effective Liferay Batch

Liferay now has a batch mode to support batch inserts and updates.

Introduction

My friend and coworker Eric Chin recently posed a question on our internal Slack channels asking if anyone had experience with Liferay's Batch mode, since he wasn't finding much supporting documentation on it.

Although I hadn't yet worked with Liferay Batch, I kind of took it as a challenge to not only give it a try, but more importantly to document it in a new blog post, so here we are...

Liferay introduced Batch support in 7.3 to add bulk data export and import into Liferay. Batch can be invoked through the Headless API, through Java code, and even by dropping a special zip file into the Liferay deploy folder.

Batch leverages the Headless code, so the entities that can be processed through Batch are those entities that you can access individually through the Headless APIs. Although this may seem like a shortcoming, it actually ensures that imported batch data will go through the same layers of business logic that the headless entities go through.

Before we can start using batch mode, we need to understand the supported data formats.

Batch Data Formats

Batch data, whether exported or imported, must be in a specifically supported data format: CSV, XLS/XLSX, or JSON/JSONL. The format for the export/import will be provided as an argument when invoking the Batch Engine.

When invoking the Batch Engine, you get to specify the columns in the export/import, so the contents are not static at all: required columns must be provided, but the rest are optional. The columns also do not need to be in a specific order; they can appear in whatever order you specify.

CSV and XLS/XLSX

For both CSV and XLS/XLSX formats, you're basically getting a two-dimensional table of rows for each record and columns for each field.

Neither of these formats supports a deep hierarchy of data, which means they cannot extract values from child objects. For example, a StructuredContentFolder has a Creator object as a member field, but neither the Creator nor its own member fields can be exported as part of that batch.

Maps are supported though, and this helps with localized values. These are referenced by the key. The StructuredContent object has a title field which is a map of language keys to values, so you can reference title.en as the column to access the English version of the title.
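
For example, a purely illustrative CSV import for StructuredContent could reference the localized titles like this (the columns shown are not the full set a real import would need):

title.en,title.de
"Effective Liferay Batch","Effektives Liferay Batch"
"Another Article","Ein weiterer Artikel"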

JSON and JSONL

These two formats are more robust in that they support the hierarchical data for child objects, and they use the familiar JSON format.

JSONL is a specialized version of JSON that is specified by jsonlines.org and has the following constraints:

  • Filename extension must be .jsonl
  • One valid JSON object per line (no pretty printed JSON allowed)
  • File encoding must be UTF-8

For simple JSON, it must be a valid array of JSON objects, so the first and last characters are going to be the square brackets:

[
  {...},
  {...},
  {...}
]

For JSON, the file can be pretty printed if you'd like; the "no pretty printing" restriction only applies to JSONL, per the specification.

Batch Engine

The core of Batch is based around the Batch Engine (BE). The BE is an asynchronous system that handles two types of tasks, export tasks and import tasks.

For the export tasks, the BatchEngineExportTaskLocalService is used to add a new BatchEngineExportTask. Once added, the BatchEngineExportTaskExecutor is used to process the export task, and the exported data can be extracted from the task when it completes.

For the import tasks, the BatchEngineImportTaskLocalService is used to add a new BatchEngineImportTask. Once added, the BatchEngineImportTaskExecutor is used to process the import task.

Let's see how to import and export some blog posts using the BE...

Importing BlogPostings

In Liferay, the Blogs portlet works with BlogEntry entities, but the headless version of these is the BlogPosting. Because Batch is based off of the Headless endpoints, we need to work with the headless objects.

When using the BatchEngineImportTaskLocalService to handle the import, we can provide a Map<String,String> of field name mappings. To keep the code simple, we'll use a pretty short map:

Map<String, String> fieldMappings = new HashMap<>();

fieldMappings.put("altHeadline", "alternativeHeadline");
fieldMappings.put("body", "articleBody");
fieldMappings.put("pubDate", "datePublished");
fieldMappings.put("headline", "headline");
fieldMappings.put("site", "siteId");

The key is the column name in the import data and the value is the field name from the BlogPosting that the data should store to.

If your source data uses the same field names as the BlogPosting, you can skip the field mapping and just pass a null to the API.
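
For reference, a small JSON import file matching the field mappings above would use the source column names; the values here are just placeholders:

[
  {"headline": "First Post", "altHeadline": "Post one", "body": "<p>Hello world...</p>", "pubDate": "2021-06-01T00:00:00Z", "site": 20123},
  {"headline": "Second Post", "altHeadline": "Post two", "body": "<p>More content...</p>", "pubDate": "2021-06-02T00:00:00Z", "site": 20123}
]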

Submitting the Import Task

To submit the import task, you need to @Reference the BatchEngineImportTaskLocalService into your class. Then you can use the addBatchEngineImportTask() method to submit the task, such as follows:

BatchEngineImportTask importTask;

importTask = _batchEngineImportTaskLocalService.addBatchEngineImportTask(
  companyId, userId, numRecords, null, BlogPosting.class.getName(), importDataContent,
  "JSON", BatchEngineTaskExecutorStatus.INITIAL.name(), fieldMappings,
  BatchEngineTaskOperation.CREATE.name(), null, null);

There are two different constant classes here to review.

The BatchEngineTaskExecuteStatus is the status for the individual task. You should always use INITIAL when creating a task. As the task executor runs, it will change the status first to STARTED, and it will finish as COMPLETED or FAILED depending upon the outcome of the batch run.

The BatchEngineTaskOperation is the operation the engine is going to complete. The options are CREATE, READ, UPDATE or DELETE (your basic CRUD options). You won't be using the READ option for the import task, but the others come in handy.

The only things here indicating the type of data being imported are the BlogPosting class reference, the content data itself, and optionally the field mappings. Regardless of what data you want to import using Batch, as long as you have the corresponding RESTBuilder entity class available, you can import it using Batch.

Note that the importDataContent above is a byte[] array of the data for the import. It can be just the data alone, or you can pass the zipped data to the call (the Batch Engine will unzip the data when it is being processed).
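
If you're wondering where that byte[] comes from, it's just the raw bytes of your data file or JSON string; the file path and variable names below are only for illustration:

// read the import data from a file (uses java.nio.file.Files and Paths)...
byte[] importDataContent = Files.readAllBytes(Paths.get("/path/to/blog-postings.json"));

// ...or, alternatively, build it from an in-memory JSON string (java.nio.charset.StandardCharsets)
byte[] importDataContent = blogPostingsJson.getBytes(StandardCharsets.UTF_8);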

Executing the Import Task

Once your task has been created, you can queue it up for execution. You'll need to @Reference in a BatchEngineImportTaskExecutor to execute the task as follows:

_batchEngineImportTaskExecutor.execute(importTask);

Because the import is an asynchronous process, this may not execute the import right away, and it will not wait for the import to complete before returning to your code.
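
Putting the import pieces together, here's a minimal sketch of an OSGi component that submits and executes an import; the class and method names are my own, and the import package names are from memory, so double check them against your Liferay version:

import java.util.Map;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

import com.liferay.batch.engine.BatchEngineImportTaskExecutor;
import com.liferay.batch.engine.BatchEngineTaskExecuteStatus;
import com.liferay.batch.engine.BatchEngineTaskOperation;
import com.liferay.batch.engine.model.BatchEngineImportTask;
import com.liferay.batch.engine.service.BatchEngineImportTaskLocalService;
import com.liferay.headless.delivery.dto.v1_0.BlogPosting;

@Component(service = BlogPostingImporter.class)
public class BlogPostingImporter {

  public void importBlogPostings(
      long companyId, long userId, int numRecords, byte[] importDataContent,
      Map<String, String> fieldMappings) {

    // create the import task record...
    BatchEngineImportTask importTask =
      _batchEngineImportTaskLocalService.addBatchEngineImportTask(
        companyId, userId, numRecords, null, BlogPosting.class.getName(),
        importDataContent, "JSON", BatchEngineTaskExecuteStatus.INITIAL.name(),
        fieldMappings, BatchEngineTaskOperation.CREATE.name(), null, null);

    // ...then queue it for (asynchronous) execution
    _batchEngineImportTaskExecutor.execute(importTask);
  }

  @Reference
  private BatchEngineImportTaskExecutor _batchEngineImportTaskExecutor;

  @Reference
  private BatchEngineImportTaskLocalService _batchEngineImportTaskLocalService;
}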

Exporting BlogPostings

Exporting entities requires three key elements: the data format to use, an optional list of field names to include, and finally the parameters to use to complete the data query.

For the optional list of field names, you can pass an empty list (to include all fields from the entity) or you can specify the fields you want to use. We could create our list as:

List<String> fieldNames = new ArrayList<>();

fieldNames.add("id");
fieldNames.add("alternativeHeadline");
fieldNames.add("datePublished");
fieldNames.add("headline");
fieldNames.add("articleBody");

The parameters are those arguments that will be necessary to complete or include in the data query. For the blog postings, we may want to limit the export to blog postings from a single site rather than all sites. Parameters are passed using a Map<String,Serializable> created such as follows:

Map<String, Serializable> params = new HashMap<>();
long siteId = 1234L;

params.put("siteId", siteId);

One of the cool parts is that you can actually pass any of the parameters that you might use on a headless GET request to sort and filter the list. The search, sort, filter, fields, restrictFields and flatten parameters are all supported when it comes to exporting the data; just add the parameters to the map with the right corresponding values and you're good to go.
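
For example (the filter and sort values here are purely illustrative; they use the same syntax you would put on a headless GET request):

// narrow and order the export just like a headless GET request would
params.put("search", "liferay");
params.put("sort", "datePublished:desc");
params.put("filter", "headline eq 'Effective Liferay Batch'");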

When specifying fields, note that you not only control the fields that will be included in the export, you also control the order of those fields in the exported content.

Submitting the Export Task

Once we have the three key pieces of information, we can submit the export task. We need to have an @Reference on the BatchEngineExportTaskLocalService to add the new task:

BatchEngineExportTask exportTask;

exportTask = _batchEngineExportTaskLocalService.addBatchEngineExportTask(
  companyId, userId, null, BlogPosting.class.getName(), "JSON",
  BatchEngineTaskExecuteStatus.INITIAL.name(), fieldNames, params, null);

So the only things here that identify the type of data being exported are the BlogPosting class and the fieldNames list (params might also give it away depending upon what has been defined). This same call can be made to export any headless entity the system has, not just BlogPostings.

Executing the Export Task

To execute our new export task, we need an @Reference on the BatchEngineExportTaskExecutor:

_batchEngineExportTaskExecutor.execute(exportTask);

This will queue up the export task, but as it is an asynchronous process it may not start right away and may not complete before the method call returns.

Extracting the Exported Data

We need to take an extra step to get an InputStream to the exported data:

InputStream is = _batchEngineExportTaskLocalService.openContentInputStream(
  exportTask.getBatchEngineExportTaskId());

Once we have the input stream, we can write it to a file, send it out via a network connection, or do whatever else we need to with it; just be sure to close the stream when you're done with it.

Note that we jumped straight into getting the input stream without checking whether the export had reached the BatchEngineTaskExecuteStatus.COMPLETED or FAILED state, so the exported data might not be ready yet. You could block and wait for the status to be updated if you are concerned about accessing the exported data before it is ready. The headless batch export works the same way: it returns a stub object on each call until the status is COMPLETED, and only then is the exported data returned.
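
If you do want to wait, a simple polling loop will do; this is just a sketch, and it assumes the task model exposes its status via getExecuteStatus(), so verify that against your version (exception handling omitted):

BatchEngineExportTask task = _batchEngineExportTaskLocalService.getBatchEngineExportTask(
  exportTask.getBatchEngineExportTaskId());

// keep re-fetching the task until the executor marks it COMPLETED or FAILED
while (!BatchEngineTaskExecuteStatus.COMPLETED.name().equals(task.getExecuteStatus()) &&
    !BatchEngineTaskExecuteStatus.FAILED.name().equals(task.getExecuteStatus())) {

  Thread.sleep(1000);

  task = _batchEngineExportTaskLocalService.getBatchEngineExportTask(
    exportTask.getBatchEngineExportTaskId());
}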

Deleting BlogPostings

Batch also supports deletions, but as a special form of a BatchEngineImportTask.

The data formats are all supported, but you are only going to pass a single element, the ID. Here's what our JSONL might look like to delete a bunch of BlogPostings:

{"id":1234}
{"id":1235}
{"id":1236}

We'd also be invoking the BatchEngineImportTaskLocalService and BatchEngineImportTaskExecutor to do the real work:

BatchEngineImportTask deleteTask;

deleteTask = _batchEngineImportTaskLocalService.addBatchEngineImportTask(
  companyId, userId, 3, null, BlogPosting.class.getName(), deleteJsonContent,
  "JSONL", BatchEngineExecuteStatus.INITIAL.name(), null,
  BatchEngineTaskOperation.DELETE.name(), null, null);

_batchEngineImportTaskExecutor.execute(deleteTask);

Again this is an asynchronous task, so the deletion may not start when this call is issued and it may not complete when the method returns.

Batch REST Services

The Batch Engine can also be accessed via REST service calls.

A PUT, POST, or DELETE request can be sent to the REST endpoints with the "/batch" suffix, such as http://localhost:8080/o/headless-delivery/v1.0/sites/{siteId}/blog-postings/batch.

For this endpoint, the only URL parameter is the site id. The body of the request is going to be the JSON (the only supported format) for the batch operation (CREATE, UPDATE, or DELETE). The batch export operation would be handled by the regular GET request.
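
For example, a POST to the blog-postings /batch endpoint would carry a body along these lines (two made-up postings using the regular BlogPosting field names):

[
  {"headline": "First Post", "articleBody": "<p>Hello world...</p>"},
  {"headline": "Second Post", "articleBody": "<p>More content...</p>"}
]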

Direct Batch Engine Services

An additional set of Batch REST services is exposed by the Batch Engine itself, so you don't have to go through the individual entities' batch endpoints.

These entry points are pretty easy to use: they are direct methods to create ExportTasks and ImportTasks (kind of like what we did with the Java-based APIs above).

To do a batch create of blog posts using the direct Batch REST endpoints, I would be doing a POST to http://localhost:8080/o/headless-batch-engine/v1.0/import-task/com.liferay.headless.delivery.dto.v1_0.BlogPosting, and the body of the POST would be the data for the import (CSV, JSON, etc., just make sure the content type matches the data format).

The last argument on the URL is the fully qualified class name for the headless data type, so it is basically the value of BlogPosting.class.getName(). An easy way to find it is to navigate in SwaggerHub to the schema definition and, on the right hand side, look for the x-class-name attribute's default value; that's where I pulled the string for the BlogPosting.

There are corresponding APIs for the remaining UPDATE and DELETE operations, and another REST entry point for the /export-task operations for the batch export.

Auto-Deploy Batch Files

Yes, you can auto-deploy Batch files too!

To deploy a batch file, you'll be creating a special zip file and dropping that file into the $LIFERAY_HOME/deploy folder.

The zip file will contain two files: a batch-engine.json file (defining the batch job) and another file (with any name) that contains the data. Both files can be in a subdirectory in the zip file, but they have to be in the same directory. I would recommend just putting them at the root of the zip file to keep things as simple as possible.
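
For example, a zip laid out like this (the data file name is up to you) is all it takes:

blog-import.zip
  batch-engine.json
  blog-postings.json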

The batch-engine.json file is a JSON file that defines the batch job and basically captures all of the same details that we see in the BatchEngineImportTaskLocalService.addBatchEngineImportTask() method:

{
  "callbackURL": null,
  "className": "com.liferay.headless.delivery.dto.v1_0.BlogPosting",
  "companyId": 20095,
  "fieldNameMappingMap": {
    "altHeadline": "alternativeHeadline",
    "body": "articleBody",
    "pubDate": "datePublished",
    "headline": "headline",
    "site": "siteId"
  },
  "parameters": null,
  "userId": 20124,
  "version": "v1.0"
}

So many of the same parameters, and again the only part specifying that I'm working with a BlogPosting is the className value, so I could just as easily swap it out for another Liferay Headless type or one of my own custom types.

NOTE: CREATE is the only supported operation for the auto-deploy batch method. I did submit a feature request to have the operation added as an optional field to the batch-engine.json object to support other types of batch updates...

The data file that goes along with this can have any name, but it must have the correct extension identifying the format of the data (.json for JSON, .jsonl for JSONL, .csv for CSV, etc.), and it has to be in the same directory in the zip file as the batch-engine.json file.

In the example I provided above, you can see that the parameters argument is null. This will work to import blog postings as long as the data file includes the "site" column that is defined in the fieldNameMappingMap. If your blog postings data does not have a "site" column with the site id, you'd have to add it to the parameters stanza to support loading the blogs into a specific site. In this case the entry would look like "parameters": {"siteId": "20123"},. Like the fieldNameMappingMap, this is how you would declare necessary parameters for the batch data processing.

Custom Batch Handling

Liferay Batch also supports building your own Batch handling class. Just create a module and create a component that implements the BatchEngineTaskItemDelegate interface. It's a generic interface which defines the CRUD operations for the generic type that it handles.

So even though there is no Headless support for the Liferay Role entity, you could register an implementation of BatchEngineTaskItemDelegate<Role> and get the Batch Engine to support roles either via the API or the auto-deploy batch files (you won't get the REST endpoint without full RESTBuilder support).

Conclusion

Well that's all I know about Batch and how to use it in Liferay. I hope you find this useful for your own implementations!