Importing Large Amounts of Data into Liferay

Recently, I was faced with the task of importing a large number of products, along with their product images, from several warehouses into Liferay Commerce. The upsert code itself was fairly straightforward; the challenge was the sheer number of records inserted on the initial run. The import was bound to be long-running, no doubt, and was going to run in the middle of the night anyway, but I still wanted to find any way I could to "trim the fat." Let me share a few tips, learned from others and from my own experience, that helped me slim down the import.

A Little Bit of Background

Because importing each product into the Liferay Commerce catalog involves more than just one table, the import meant a large number of records per table. I couldn't really use a batch tool that updates records in bulk, or ActionableDynamicQuery (which, as far as I know, only does updates, not inserts), so I called the Liferay Commerce API, which took care of inserting and updating all the right entities for me. It was an upsert-only operation, with no deletions, to avoid the risk of the wrong data getting deleted.
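To make the flow concrete, here is a minimal sketch of that upsert pattern. The WarehouseProduct record and CatalogFacade interface are hypothetical stand-ins of my own; the real Liferay Commerce services take many more parameters than this.

    import java.util.List;

    // Hypothetical warehouse DTO and catalog facade; stand-ins for the real
    // Liferay Commerce services, which take many more parameters.
    record WarehouseProduct(String sku, String name, int stock) {}

    interface CatalogFacade {

        Long findProductIdBySku(String sku); // null if not yet imported

        void addProduct(WarehouseProduct product);

        void updateProduct(long productId, WarehouseProduct product);

    }

    public class ProductUpserter {

        private final CatalogFacade _catalogFacade;

        public ProductUpserter(CatalogFacade catalogFacade) {
            _catalogFacade = catalogFacade;
        }

        // Upsert only: add missing products, update existing ones, and never
        // delete, so bad input can't wipe out good catalog data.
        public void importAll(List<WarehouseProduct> products) {
            for (WarehouseProduct product : products) {
                Long productId = _catalogFacade.findProductIdBySku(
                    product.sku());

                if (productId == null) {
                    _catalogFacade.addProduct(product);
                }
                else {
                    _catalogFacade.updateProduct(productId, product);
                }
            }
        }

    }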

Importing the images from the warehouse was going to take the longest the first time they were downloaded, but after that they wouldn't need to be updated on every run. Hotlinking would avoid the import altogether, but sites often ban it, for their own good.

(In case you're wondering, Talend ETL jobs were an option, but we ended up deciding to write our own import.)

Shorter Subsequent Imports

The first run of the import should be the longest; subsequent runs should be shorter. (This tip doesn't apply if all of the data changes frequently and must be updated on every run.) See if you can find data that does not need to be updated every time. Some data rarely changes, like a product image, so it doesn't need to be refreshed every day, for instance. Even if it eventually does need updating, you can add a flag that forces an update while keeping the import shorter most of the time, as in the sketch below.
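Here's a minimal sketch of that check, assuming the warehouse exposes a checksum (or last-modified timestamp) per image; the forceUpdate flag is the hypothetical override mentioned above.

    import java.util.Objects;

    public class ImageImportFilter {

        // Skip the download unless the source image changed, or unless the
        // import was explicitly rigged to force an update.
        public boolean needsImport(
            String warehouseChecksum, String storedChecksum,
            boolean forceUpdate) {

            if (forceUpdate) {
                return true;
            }

            return !Objects.equals(warehouseChecksum, storedChecksum);
        }

    }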

Small Transactions

Running the entire import in one big transaction leads to timeout errors. Instead, use a separate transaction for each entity (in this case, each product) that you're upserting.
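In Liferay, one common way to get a transaction per product is TransactionInvokerUtil. The sketch below (reusing the hypothetical WarehouseProduct from the earlier sketch) is an outline of the idea, not the exact code from my import.

    import com.liferay.portal.kernel.transaction.Propagation;
    import com.liferay.portal.kernel.transaction.TransactionConfig;
    import com.liferay.portal.kernel.transaction.TransactionInvokerUtil;

    import java.util.List;

    public class PerProductTransactionImporter {

        public void importAll(List<WarehouseProduct> products) {
            for (WarehouseProduct product : products) {
                try {
                    TransactionInvokerUtil.invoke(
                        _transactionConfig,
                        () -> {
                            _upsertProduct(product);

                            return null;
                        });
                }
                catch (Throwable throwable) {

                    // Only this product's transaction rolls back; log the
                    // failure and move on to the next product.

                    throwable.printStackTrace();
                }
            }
        }

        private static TransactionConfig _buildTransactionConfig() {
            TransactionConfig.Builder builder = new TransactionConfig.Builder();

            // REQUIRES_NEW starts a fresh transaction for each product, so
            // one slow or failing product can't time out the whole run.
            builder.setPropagation(Propagation.REQUIRES_NEW);
            builder.setRollbackForClasses(Exception.class);

            return builder.build();
        }

        private void _upsertProduct(WarehouseProduct product) {
            // The actual Liferay Commerce API calls go here.
        }

        private static final TransactionConfig _transactionConfig =
            _buildTransactionConfig();

    }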

Small Memory Footprint in the Code

You can make the memory footprint smaller by doing the following:

  • Avoid local reference variables where you can: inline method calls instead of assigning their results to local variables, move object creation into helper methods, and reuse instances that are already available (like Collections.singletonMap).
  • Avoid creating the same objects over and over inside loops when a single instance can be reused across iterations.
  • Avoid String.format; simple concatenation with + or Liferay's StringBundler usually performs better. (See the sketch after this list.)
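As an illustration of the last two points, here's a before-and-after sketch of a logging loop (the class and method names are mine, not from the actual import), assuming Liferay's StringBundler:

    import com.liferay.petra.string.StringBundler;

    import java.util.List;

    public class ImportLogging {

        // Wasteful: String.format reparses the pattern and allocates a new
        // builder on every iteration.
        public void logSlow(List<String> skus) {
            for (String sku : skus) {
                System.out.println(String.format("Imported product %s", sku));
            }
        }

        // Leaner: one StringBundler instance is reused across iterations,
        // and there is no format-string parsing at all.
        public void logLean(List<String> skus) {
            StringBundler sb = new StringBundler(2);

            for (String sku : skus) {
                sb.append("Imported product ");
                sb.append(sku);

                System.out.println(sb.toString());

                sb.setIndex(0); // reset instead of allocating a new instance
            }
        }

    }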

Troubleshooting a Subset of Data

Let's face it: troubleshooting a long-running import is laborious. Sometimes, when chasing a particular issue, I wanted to run the import over only a small subset of the data. One way to do this is to have the import take a start and end index so you can run it over a range, say, only the first 50 records. I added a few System Settings for the start and end indexes to configure the range. The caveat to configuring a range like this is that it depends on the data coming back in a consistent order. In my case, the product data from the third-party warehouse APIs (one API for the catalog and another for the inventory of the products in the catalog) happened to come back in the same order every time, so the product that appeared first in the catalog API was also the first to appear in the inventory API.
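For reference, here's roughly what the System Settings entry looks like as an OSGi configuration interface; the id and defaults below are illustrative, not the ones from the actual project.

    import aQute.bnd.annotation.metatype.Meta;

    @Meta.OCD(
        id = "com.example.importer.ProductImportConfiguration",
        name = "Product Import Configuration"
    )
    public interface ProductImportConfiguration {

        // First record (inclusive) to import; 0 starts at the beginning.
        @Meta.AD(deflt = "0", required = false)
        int startIndex();

        // Last record (exclusive) to import; -1 means run to the end.
        @Meta.AD(deflt = "-1", required = false)
        int endIndex();

    }

The import then only has to slice its product list to that range, with something like products.subList(startIndex, endIndex), before looping.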

Before these changes, importing just the products (not including the images) took 30 minutes or more and ended up timing out. After these changes, the product import ran in around 10 minutes: not lightning fast, but much better. The change that helped most was using a single transaction per upsert.

These are just a few tips I learned, and I hope they help someone else facing the same issue. I'd be happy to hear any tips or experiences you have!