Better PDF previews in Liferay without ImageMagick

The problem

While doing some work for a client I ran into issues with the preview generation for some PDFs. Liferay did generate a preview image for each page of the document, but the text on it was sometimes strange, to say the least. In the table below you can see a screenshot of the first page of the problematic PDF, the preview that is generated by Liferay and the preview we're able to generate after our hack:

[Table of three screenshots: Actual PDF | Before hack | After hack]

The biggest problem is with the font, but the background is also a bit screwy. My initial thought was that the PDF might be using some special font(s) that it didn't (correctly) embed, so my first inclination was to see how to add fonts to the system that could be picked up by whatever Liferay was using to generate the preview images. By default Liferay generates previews of PDF files with a pure Java library called PDFBox. There's also the option of using an OS-native install of ImageMagick (and Ghostscript), but that would require at least an additional 2 GB of memory outside of the JVM allocation. As this wasn't an option in this case, I first looked into the font option. This does seem to be possible in PDFBox, by editing the PDFBox_External_Fonts.properties file that can be found inside the JAR and adding the additional fonts, but I couldn't quite get it to work: instead of strange/wrong characters I now got no characters at all.

After some more Googling it turned out that the PDFBox version that Liferay 6.2 uses, 1.8.2, is known for having a lot of font issues. Most of these seem to be improved or fixed in version 2.0.0. Sadly enough, that version hasn't been released yet. But in cases like this you sometimes need to take a page out of Ayrton Senna's book and push the limit a bit:

"On a given day, a given circumstance, you think you have a limit. And you then go for this limit and you touch this limit, and you think, 'Okay, this is the limit'. And so you touch this limit, something happens and you suddenly can go a little bit further. With your mind power, your determination, your instinct, and the experience as well, you can fly very high." -- Ayrton Senna

Here be dragons: what I'll be describing below is messing around with a SNAPSHOT version of an unreleased PDFBox version while simultaneously hacking Liferay a bit to get it all working together. For the versions I used this all seems to work nicely, but trying this out yourself, especially in a production environment, is completely at your own risk.

The solution

As you might expect, you can't just drop in a snapshot version of the PDFBox 2.0.0 JAR and expect everything to be solved. It doesn't quite work like that, and here are the reasons why:

  • There are actually 3 related JARs: pdfbox.jar, fontbox.jar & jempbox.jar
  • PDFBox isn't only used for preview generation by Liferay, but also for stripping the text out of PDFs for indexing
  • API incompatibilities

While there are 3 JARs in Liferay's lib directory that are part of PDFBox, only the first two I mentioned are actually used in the preview generation and will need to be switched out. The jempbox JAR is used for PDF XMP metadata extraction, but can be left in its original state (in the 2.0.0 version jempbox has been renamed to xmpbox). If you take a 2.0.0-SNAPSHOT build of these two JARs and use them to replace the ones in the Liferay WEB-INF/lib directory and restart you'll run into the other two problems I mentioned.
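Before restarting, it can help to confirm which PDFBox build actually ended up in WEB-INF/lib. The following is a minimal sketch, assuming the JAR manifest carries an Implementation-Version or Bundle-Version attribute (not every build does); the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper: reports the version a JAR advertises in its manifest.
class JarVersionProbe {

    // Returns Implementation-Version, else Bundle-Version, else "unknown".
    static String versionFromManifest(Manifest manifest) {
        if (manifest == null) {
            return "unknown";
        }

        Attributes main = manifest.getMainAttributes();

        String version = main.getValue(Attributes.Name.IMPLEMENTATION_VERSION);

        if (version == null) {
            // Assumption: OSGi-style JARs may only declare Bundle-Version.
            version = main.getValue("Bundle-Version");
        }

        return (version != null) ? version : "unknown";
    }

    // Opens the JAR on disk and reads the version from its manifest.
    static String versionOfJar(Path jar) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return versionFromManifest(jarFile.getManifest());
        }
        catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }
}
```

Calling versionOfJar on the pdfbox.jar and fontbox.jar in the portal's lib directory should then show whether the snapshot or the original 1.8.2 build is in place.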

From the stack trace you get you'll see that Liferay not only uses PDFBox for preview and thumbnail generation, but also, via Apache Tika, for text extraction. Tika uses PDFBox's PDFTextStripper class (and some auxiliary ones) for this, which were moved from the package org.apache.pdfbox.util to org.apache.pdfbox.text for PDFBox 2.0.0. Because we do not also want to patch Tika, we'll just move those classes back to their original package and call it a day.
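To check which package layout a given classpath exposes (for instance after rebuilding the snapshot with the classes moved back), a small probe along these lines might be useful. It is only a sketch: the helper names are hypothetical, and inside Liferay the result depends on which class loader runs the check.

```java
// Hypothetical helper: checks whether classes can be found on the classpath.
class ClassLocator {

    // True if the named class can be located without initializing it.
    static boolean isPresent(String className) {
        try {
            Class.forName(
                className, false, ClassLocator.class.getClassLoader());

            return true;
        }
        catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the first candidate that is present, or null if none is.
    static String firstPresent(String... candidates) {
        for (String candidate : candidates) {
            if (isPresent(candidate)) {
                return candidate;
            }
        }

        return null;
    }
}
```

Passing "org.apache.pdfbox.util.PDFTextStripper" and "org.apache.pdfbox.text.PDFTextStripper" to firstPresent tells you whether the old location, which Tika expects, is available.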

This brings us to our last, but also biggest, problem: there have been some significant code changes/refactorings in PDFBox between version 1.8.2 and the snapshot we'd like to use. The first change we run into is related to the previous problem. In version 1.8.2 the PDFTextStripper class had a method setForceParsing(boolean), which isn't present anymore in 2.0.0, but which we'll just add back with an empty implementation:

public void setForceParsing(boolean forceParsingValue) {
    // NO-OP to comply with old signature
}

While it is a bit strange to solve a NoSuchMethodException like this, the method apparently wasn't a critical part, because afterwards the text extraction worked again like before. This means we can finally get to the important part: fixing the API incompatibilities in the Liferay PDF preview generation code, which lives in the LiferayPDFBoxConverter class. Due to some refactorings in PDFBox this class won't find the getAllPages() method on PDDocumentCatalog anymore. To fix this you'll need to take the source of this class, modify it and then replace the original class, located in Liferay's portal-impl.jar, with your modified one. We did this using a fancy JAR/WAR overlay system that we use in our build/deploy system (which we'll cover in a blog post someday), but there are of course other ways to do this, such as manually patching the JAR or using an extlet.
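For the "manually patching the JAR" route, one possible approach is Java's built-in ZIP file system, which can overwrite a single entry in place. The sketch below is not the overlay system mentioned above. JarPatcher and its methods are hypothetical names, and the entry path you pass in has to match the real location of the class inside portal-impl.jar.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: reads and overwrites single entries inside a JAR.
class JarPatcher {

    // Opens the JAR as a ZIP file system, optionally creating it.
    private static FileSystem open(Path jar, boolean create)
        throws IOException {

        Map<String, String> env = new HashMap<>();

        env.put("create", String.valueOf(create));

        return FileSystems.newFileSystem(
            URI.create("jar:" + jar.toUri()), env);
    }

    // Writes (or overwrites) one entry, e.g. a recompiled .class file.
    static void putEntry(
        Path jar, String entryName, byte[] bytes, boolean createJar) {

        try (FileSystem fileSystem = open(jar, createJar)) {
            Path entry = fileSystem.getPath("/", entryName);

            if (entry.getParent() != null) {
                Files.createDirectories(entry.getParent());
            }

            Files.write(entry, bytes);
        }
        catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    // Reads one entry back, which is handy for verifying the patch.
    static byte[] readEntry(Path jar, String entryName) {
        try (FileSystem fileSystem = open(jar, false)) {
            return Files.readAllBytes(fileSystem.getPath("/", entryName));
        }
        catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }
}
```

Make a backup of the JAR and stop the portal before patching. If the modified class compiles to more than one .class file (inner classes), each of them needs to be replaced.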

When you add the source of the LiferayPDFBoxConverter class to a simple project you'll also see that some other stuff won't compile because of missing/changed methods. For this we'll need to make some changes to the generateImagesPB() methods so they look like the ones below:

public void generateImagesPB() throws Exception {
    PDDocument pdDocument = null;

    try {
        pdDocument = PDDocument.load(_inputFile);

        PDDocumentCatalog pdDocumentCatalog =
            pdDocument.getDocumentCatalog();

        PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);

        PDPageTree pdPages = pdDocumentCatalog.getPages();

        for (int i = 0; i < pdPages.getCount(); i++) {
            if (_generateThumbnail && (i == 0)) {
                _generateImagesPB(
                    pdfRenderer, i, _thumbnailFile, _thumbnailExtension);
            }

            if (!_generatePreview) {
                break;
            }

            _generateImagesPB(pdfRenderer, i, _previewFiles[i], _extension);
        }
    }
    finally {
        if (pdDocument != null) {
            pdDocument.close();
        }
    }
}

private void _generateImagesPB(
        PDFRenderer pdfRenderer, int index, File outputFile, String extension)
    throws Exception {

    RenderedImage renderedImage = pdfRenderer.renderImageWithDPI(
        index, _dpi, ImageType.RGB);

    ImageTool imageTool = ImageToolImpl.getInstance();

    if (_height != 0) {
        renderedImage = imageTool.scale(renderedImage, _width, _height);
    }
    else {
        renderedImage = imageTool.scale(renderedImage, _width);
    }

    outputFile.createNewFile();

    ImageIO.write(renderedImage, extension, outputFile);
}
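To get a feel for what the DPI value passed to renderImageWithDPI() means for image size and memory use, here is a small arithmetic sketch. It relies on PDF page sizes being expressed in points of 1/72 inch. The rounding is an assumption (PDFBox and Liferay's ImageTool may round slightly differently), and the class and method names are made up for illustration.

```java
// Hypothetical helper: estimates rendered and scaled preview dimensions.
class PreviewSizing {

    // PDF user space defaults to 72 units (points) per inch.
    static final float POINTS_PER_INCH = 72f;

    // Approximate pixel length of a page side rendered at the given DPI.
    static int pixelsAtDpi(float points, float dpi) {
        return Math.max(1, Math.round(points * dpi / POINTS_PER_INCH));
    }

    // Height that preserves the aspect ratio when scaling to a target width.
    static int heightForWidth(int width, int height, int targetWidth) {
        return Math.max(1, Math.round((float)height * targetWidth / width));
    }
}
```

For example, an A4 page of 595 x 842 points rendered at 144 DPI comes out at roughly 1190 x 1684 pixels, and scaling that image down to a width of 595 gives a height of 842 again.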

With this modified class in place, together with the updated and tweaked PDFBox JARs, the PDF preview generation (and text extraction) should work again and produce far better results than before. To make life easier for developers who, like me, like to live on the edge, here's some helpful code:

 

More blogs on Liferay and Java via http://blogs.aca-it.be.

Thank you for sharing your invaluable experience. I have also come across PDF previews for some documents and after reading your blog post I will look into it again to get it working and test it thoroughly.
Hi Patrick,

Thank you for the kind comment. I wish you luck with your new attempt and just let me know if you have any problems with it or my code and I'll try to help out as best as I can.
Hello Jan,

Thanks for your time writing this blog and sharing your investigations & solutions.

- Vishal