Updating PDFBox

Update your PDFBox to eliminate pesky errors during PDF document upload/processing.

Intro

In case you're not aware, Liferay uses Apache PDFBox to read your PDFs. To index a PDF, Liferay extracts its text content with PDFBox and uses that text for indexing and searching.

However, Liferay tends to fall behind a bit on updating third-party libraries. When you favor stability, you tend to be rather conservative about updates like this.

Recently, though, I was helping a client bulk upload documents using the Resources Importer, and wouldn't you know it, almost every document threw one exception after another. Some were just errors about fonts and some were actual PDF errors, but others came from bugs in PDFBox.

I checked my DXP version and found that it was using Apache PDFBox 2.0.3, which was released on 2016-09-17. Apache PDFBox 2.0.11 had been released on 2018-06-28, so I wondered whether updating my PDFBox version would make any difference.
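
If you want to check which version a given jar carries, here is a minimal sketch. It assumes the jar's manifest records an Implementation-Version or Bundle-Version header, which may not hold for every jar Liferay ships. The function name jar_version is made up for this example.

```python
import zipfile


def jar_version(jar_path):
    """Return the version recorded in a jar's manifest, or None if absent.

    Assumption: the manifest has an Implementation-Version or
    Bundle-Version header. Not every jar is guaranteed to have one.
    """
    with zipfile.ZipFile(jar_path) as jar:
        try:
            raw = jar.read("META-INF/MANIFEST.MF")
        except KeyError:
            return None  # the jar has no manifest at all
    headers = {}
    for line in raw.decode("utf-8", "replace").splitlines():
        # Continuation lines start with a space; version headers are
        # short enough that skipping continuations is fine here.
        if ": " in line and not line.startswith(" "):
            key, value = line.split(": ", 1)
            headers[key] = value.strip()
    return headers.get("Implementation-Version") or headers.get("Bundle-Version")
```

Calling jar_version on the PDFBox jar in webapps/ROOT/WEB-INF/lib should tell you what you are starting from, provided the manifest carries one of those headers.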

Replacing Liferay's Jars

So the first question was "What do I download?" When you check the download page, you see that they offer a full source zip, a number of Java apps and a number of library jars for versions 1.x and 2.x.

I knew I wanted better than 2.0.3, so I was happy to stay in the 2.x downloads. Since I wasn't using the command line apps, I skipped those and went straight to the "Libraries of each subproject" section.

I ended up grabbing the PDFBox, FontBox, Preflight, XMPBox and PDFBoxTools jars. I skipped the PDFBoxDebugger because I don't plan on debugging to that level of detail.

After stopping my Liferay appserver, I copied these jars to the webapps/ROOT/WEB-INF/lib directory. This was easy for Tomcat, but you may need to follow a slightly different path depending upon your app server choice. Worst case scenario, you could actually build an EXT plugin to deploy your libs, but I'd avoid this if at all possible.

I didn't bother with renaming the jars, so I had to get rid of the old 2.0.3 versions. Interestingly, Liferay doesn't include all of these jars, only PDFBox and FontBox, so after deleting those old jars I was ready to bring the environment up.

Results

After starting up the environment and retrying my PDF loads, I found that many of the errors I had seen before were gone. I still had some, but at this point I think they come from bad PDF files (they're marked as generated by some robo-pdf tool). I don't know whether Preflight or XMPBox had anything to do with the errors disappearing, or whether they are actual dependencies of the newer versions, but I don't believe they hurt anything. So I'm just going to keep them.

Anyway, since the errors I could pin on PDFBox were all gone, I'm declaring this an unqualified success.

Caveats

Well, it goes without saying that you may need to repeat part of this when you apply a new fixpack or service pack to DXP. If Liferay updates from 2.0.3 to, say, 2.0.4, the patching tool will have no problem adding its jars, but you'll have duplicates again. And if you had stripped the version number from the jar names, the patching tool would overwrite your newer 2.0.11 jars with the older ones, which is probably not what you want.

So keep an eye on your ROOT/WEB-INF/lib jars when you apply a fixpack or service pack.
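
Keeping an eye on the directory can be partly automated. The sketch below groups jar file names by artifact and reports any artifact that shows up more than once, for example an unversioned jar next to a versioned one. The function name and the naming pattern (an optional trailing -<digits and dots> before .jar) are assumptions for this example and will misread some jar names.

```python
import re
from collections import defaultdict

# Assumed naming pattern: "<artifact>.jar" or "<artifact>-<numeric version>.jar".
JAR_NAME = re.compile(r"^(?P<name>.+?)(?:-(?P<version>\d+(?:\.\d+)*))?\.jar$")


def find_duplicate_jars(filenames):
    """Return {artifact: [file names]} for artifacts present more than once."""
    by_artifact = defaultdict(list)
    for filename in filenames:
        match = JAR_NAME.match(filename)
        if match:  # ignore anything that is not a jar
            by_artifact[match.group("name")].append(filename)
    return {
        name: sorted(files)
        for name, files in by_artifact.items()
        if len(files) > 1
    }
```

You could feed it os.listdir("webapps/ROOT/WEB-INF/lib") after each fixpack or service pack and look at whatever it returns for pdfbox and fontbox.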

Also, I think it is okay to update as new 2.0 versions get released. But if you find a shiny new PDFBox 2.1 or 3.0 up there, I would resist the temptation to blindly push it in, as that kind of version bump usually points to an API change that may not be compatible with how Liferay uses PDFBox.
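
That rule of thumb is easy to write down. The sketch below treats a candidate version as a drop-in only when it keeps the same major.minor as the current one and is not older. The function names are made up for this example, and it assumes purely numeric versions such as 2.0.11 (a suffix like -RC1 would need extra handling).

```python
def parse_version(version):
    """Turn '2.0.11' into (2, 0, 11) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_same_release_line(current, candidate):
    """True when candidate keeps current's major.minor and is not older."""
    cur = parse_version(current)
    cand = parse_version(candidate)
    return cand[:2] == cur[:2] and cand >= cur
```

Comparing tuples rather than strings matters here: as plain text, "2.0.11" sorts before "2.0.3", which is the wrong answer.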

Hi David,

If you have enterprise support for Liferay DXP, swapping jars in the Liferay installation is not advisable: any bug caused by that change won't be covered by the Liferay support team.

If you are having problems with PDFBox and you have enterprise support, it is better to open a LESA ticket.

Jorge,

We've had similar problems in the past with older PDFBox versions. They were escalated to LESA but closed without a fix, because support for third-party tools and libraries is not within the scope of LESA support.

So unless this policy has changed, I do think it is useful to know how to upgrade PDFBox yourself, admittedly without expecting the new version to be supported by Liferay Inc.

Nice article David!

Hi Peter,

Here in Liferay Spain Support, we have updated PDFBox several times in order to solve problems.

For example, see Jira tickets LPE-12604, LPE-14732, LPE-15122, and LPE-15537.

Nevertheless, we only update the library to the latest available release within the same minor version (Liferay 6.2 => PDFBox 1.8.x and Liferay 7.x => PDFBox 2.0.x); we won't update to the latest major version because of API compatibility.