Configure indexed files' size

Jamie Sammons, modified 6 Years ago. Junior Member Posts: 31 Join Date: 3/23/17 Recent Posts
Hello,

I'm working with Liferay 6.2 ce ga6 and Tomcat 7.0.62

I need to configure the environment so that uploaded files larger than a given size are not indexed (I'm using 1 MB while testing).
I added this to portal-ext.properties:
dl.file.indexing.max.size=1048576
And I also have the following:
dl.file.indexing.ignore.extensions=

I restart the server and check the property at Server Administration > Properties > Portal Properties and it displays the right value.

I upload a file larger than 1 MB and it's indexed (I can see the file in the search results). Any value I set for dl.file.indexing.max.size gives me the same result (-1, 0, 1048576).

I've read the post https://community.liferay.com/forums/-/message_boards/message/39828556 and it looks like there's an issue with the property, but I'm not sure whether the fix that is marked as the solution (https://github.com/brianchandotcom/liferay-portal/pull/24283/files) will work for this.

But, if it does, how can I add only this patch to my deployment?

Thanks,
Jorge Díaz, modified 6 Years ago. Liferay Master Posts: 753 Join Date: 1/9/14 Recent Posts
Hi Marta,

The dl.file.indexing.max.size property only disables indexing of the document content. During DLFileEntry indexing, all DLFileEntry metadata from the database is indexed (title, name, description...). After that, the indexer calls FileImpl.extractText, which extracts the document content using Tika.

So if you set dl.file.indexing.max.size to zero, the DLFileEntry metadata will be indexed but the document content won't, because Liferay will call Tika with tika.setMaxStringLength(maxStringLength); where maxStringLength=0.

In that situation, you will be able to find the document if you search by title, but not if you search by its content.
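
To make the distinction concrete, here is a minimal sketch in plain Java. It is not Liferay code: the class ContentLimitSketch and its methods are hypothetical, a plain map stands in for the search document, and it assumes that a negative limit means "no limit". It only shows that metadata is always added while the content is capped by the configured length.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of "metadata is always indexed, content is capped".
// None of these names exist in Liferay or Tika.
class ContentLimitSketch {

    // Caps the extracted text at maxLength characters.
    // Assumption: a negative value (such as -1) means "no limit".
    static String capText(String extractedText, int maxLength) {
        if (maxLength < 0) {
            return extractedText;
        }
        int end = Math.min(maxLength, extractedText.length());
        return extractedText.substring(0, end);
    }

    // Builds a toy search document: the title is always present,
    // the content field depends on the limit.
    static Map<String, String> buildDocument(
            String title, String extractedText, int maxLength) {
        Map<String, String> document = new LinkedHashMap<String, String>();
        document.put("title", title);
        document.put("content", capText(extractedText, maxLength));
        return document;
    }

    public static void main(String[] args) {
        // With a limit of 0 the title is searchable but the content is empty.
        System.out.println(buildDocument("Map.pdf", "river valley", 0));
        // With -1 the whole text is kept.
        System.out.println(buildDocument("Map.pdf", "river valley", -1));
    }
}
```

So a search for the title still finds the document with the limit at 0, but a search for a word that appears only in the content does not.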

When you say:
I upload a file larger than 1 MB and it's indexed (I can see the file in the search results). Any value I set for dl.file.indexing.max.size gives me the same result (-1, 0, 1048576).
Are you able to search your file using a word from document content that is not in DLFileEntry database metadata?

If you want to apply that solution, you will have to patch the com.liferay.portal.util.FileImpl class. That class is used in the following Spring XML file:
So you will have to create a new FileImpl class with your code and replace the original one in the Spring XML.

Nevertheless, I don't think it is necessary to apply that code, as LPS-53776 only avoids calling Tika with maxStringLength=0, since that call is unnecessary.
Marta Figueras, modified 6 Years ago. Junior Member Posts: 31 Join Date: 3/23/17 Recent Posts
Hello Jorge,

I've done a lot of tests and the behaviour is the following:

  • dl.file.indexing.max.size=0
    • any size file: not indexed
  • dl.file.indexing.max.size=1048576
    • 9KB file: indexed
    • 2MB file: indexed
    • 44MB file: nobody really knows, the server crashes (but it probably dies while indexing)
The 9KB file is indexed and it might not be, am I wrong?
The system is reading the property, since it doesn't index any file when the value is 0, but it can't avoid indexing big files. I don't know what else to try...

Any suggestion?
Jorge Díaz, modified 6 Years ago. Liferay Master Posts: 753 Join Date: 1/9/14 Recent Posts
That is strange behavior: DLFileEntryIndexer shouldn't index the file content when the file size is greater than dl.file.indexing.max.size.
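
If the property worked as a file-size threshold, as described above, the decision would look roughly like the sketch below. This is not the actual DLFileEntryIndexer code: the class and method names are hypothetical, and it assumes that 0 means "never index content" and a negative value means "always index content".

```java
// Hypothetical size guard; not taken from Liferay's source.
class IndexingSizeGuardSketch {

    // Returns true when the file content should be handed to text extraction.
    // Assumptions: 0 disables content indexing, a negative value removes the limit.
    static boolean shouldIndexContent(long fileSizeBytes, long maxSizeBytes) {
        if (maxSizeBytes == 0) {
            return false;
        }
        if (maxSizeBytes < 0) {
            return true;
        }
        return fileSizeBytes <= maxSizeBytes;
    }

    public static void main(String[] args) {
        long limit = 1048576L; // 1 MB, the value used in this thread

        System.out.println(shouldIndexContent(9L * 1024, limit));        // 9 KB file
        System.out.println(shouldIndexContent(2L * 1024 * 1024, limit)); // 2 MB file
        System.out.println(shouldIndexContent(2L * 1024 * 1024, 0L));    // limit disabled content
    }
}
```

Under these assumptions the 9KB file would have its content indexed and the 2MB file would not, which is what a debugging session would need to confirm or refute.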

My only idea is to download the Liferay source code and debug your server in order to double-check the behavior of your installation.

If you want to debug the Liferay source code, the Java classes related to content extraction are:
  1. DLFileEntryIndexer.java => https://github.com/liferay/liferay-portal/blob/65d7a800f1a57b232a127bdcaefb876a9de469f9/portal-impl/src/com/liferay/portlet/documentlibrary/util/DLFileEntryIndexer.java#L360-L379
  2. DocumentImpl.java => https://github.com/liferay/liferay-portal/blob/65d7a800f1a57b232a127bdcaefb876a9de469f9/portal-service/src/com/liferay/portal/kernel/search/DocumentImpl.java#L155-L160
  3. FileImpl.java => https://github.com/liferay/liferay-portal/blob/65d7a800f1a57b232a127bdcaefb876a9de469f9/portal-impl/src/com/liferay/portal/util/FileImpl.java#L359-L448


Regards,
Jorge
Marta Figueras, modified 6 Years ago. Junior Member Posts: 31 Join Date: 3/23/17 Recent Posts
Thanks, Jorge, I'll try that!
Marta Figueras, modified 6 Years ago. Junior Member Posts: 31 Join Date: 3/23/17 Recent Posts
Hi,

I've tried to debug Liferay's code and also Tika's code (the document content and the parameter value both reach the Tika call), but the thread gets lost at some point, giving me this exception:
Exception in thread "http-bio-8080-exec-1" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Unknown Source)
    at java.io.ByteArrayOutputStream.grow(Unknown Source)
    at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source)
    at java.io.ByteArrayOutputStream.write(Unknown Source)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:172)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:295)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:237)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:172)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:72)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.Tika.parseToString(Tika.java:527)
    at org.apache.tika.Tika.parseToString(Tika.java:602)
    at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:399)
    at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:213)
    at com.liferay.portal.kernel.search.DocumentImpl.addFile(DocumentImpl.java:159)
    at com.liferay.portlet.documentlibrary.util.DLFileEntryIndexer.doGetDocument(DLFileEntryIndexer.java:366)
    at com.liferay.portal.kernel.search.BaseIndexer.getDocument(BaseIndexer.java:153)
    at com.liferay.portlet.documentlibrary.util.DLFileEntryIndexer.doReindex(DLFileEntryIndexer.java:493)
    at com.liferay.portal.kernel.search.BaseIndexer.reindex(BaseIndexer.java:446)
    at com.liferay.portlet.documentlibrary.service.impl.DLFileEntryLocalServiceImpl.reindex(DLFileEntryLocalServiceImpl.java:2413)
    at com.liferay.portlet.documentlibrary.service.impl.DLFileEntryLocalServiceImpl.updateStatus(DLFileEntryLocalServiceImpl.java:1873)

Anyway, I managed to get a bigger PDF file with a different kind of content, and it works fine!

So I assume the problem is with the first document's content: it's a PDF with maps.
I'm not very optimistic about my chances of solving the issue...
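
One possible reading of the trace, offered as a guess rather than a diagnosis: the heap runs out inside FlateFilter.decompress, while a compressed stream is being expanded into a ByteArrayOutputStream, which happens before any extracted text exists for a length limit to cut. The sketch below uses only java.util.zip from the JDK (not PDFBox or Tika) to show how a few kilobytes of deflated data can expand to several megabytes in memory; the class name and the 4 MB test payload are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustration only: shows the expansion ratio of DEFLATE-compressed data,
// the compression scheme used by the PDF FlateDecode filter.
class FlateExpansionSketch {

    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Expands the whole stream into memory, as the trace above suggests
    // the PDF filter does; the output buffer grows with the uncompressed size.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input: stop instead of looping
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[4 * 1024 * 1024]; // 4 MB of zeros, highly compressible
        byte[] compressed = deflate(raw);
        byte[] restored = inflate(compressed);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("inflated bytes:   " + restored.length);
    }
}
```

If that reading is right, the size of the PDF on disk says little about how much heap its streams need once decoded, which would fit a larger PDF with different content indexing without trouble.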