Ask Questions and Find Answers
Configure indexed files' size
Hello,
I'm working with Liferay 6.2 CE GA6 and Tomcat 7.0.62.
I need to configure the environment so that uploaded files above a given size are not indexed (I'm using 1 MB while testing).
I added this to portal-ext.properties:
dl.file.indexing.max.size=1048576
And I also have the following:
dl.file.indexing.ignore.extensions=
I restarted the server and checked the property under Server Administration > Properties > Portal Properties, and it displays the right value.
When I upload a file larger than 1 MB, it is still indexed (it shows up in search results). Every value I have tried for dl.file.indexing.max.size (-1, 0, 1048576) gives the same result.
I've read this post: https://community.liferay.com/forums/-/message_boards/message/39828556 and it looks like there's an issue with the property, but I'm not sure whether the fix marked as the solution (https://github.com/brianchandotcom/liferay-portal/pull/24283/files) will work for this.
If it does, how can I apply only this patch to my deployment?
Thanks,
Hi Marta,
The dl.file.indexing.max.size property only disables indexing of the document content. During DLFileEntry indexing, all DLFileEntry metadata from the database is indexed (title, name, description...). After that, the indexer calls FileImpl.extractText, which extracts the document content using Tika.
So if you set dl.file.indexing.max.size to zero, the DLFileEntry metadata will be indexed but the document content won't, because Liferay will call Tika with tika.setMaxStringLength(maxStringLength); where maxStringLength=0.
In that case you will be able to find the document by searching for its title, but not by searching for its content.
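To make the metadata/content distinction above concrete, here is a minimal Java sketch. It is not Liferay or Tika code: IndexingModel, buildDocument, extractText and matches are hypothetical names, the search document is modelled as a plain map, and treating a negative limit as "no limit" is an assumption of this model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexingModel {

    // Hypothetical stand-in for text extraction capped at maxStringLength.
    // Assumption: a negative value means "no limit" in this model.
    static String extractText(String content, int maxStringLength) {
        if (maxStringLength < 0 || content.length() <= maxStringLength) {
            return content;
        }
        return content.substring(0, maxStringLength);
    }

    // Metadata fields are always added; the content field is added only
    // when extraction returns some text.
    static Map<String, String> buildDocument(
            String title, String description, String content,
            int maxStringLength) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("title", title);
        doc.put("description", description);
        String text = extractText(content, maxStringLength);
        if (!text.isEmpty()) {
            doc.put("content", text);
        }
        return doc;
    }

    // Very rough stand-in for a search: does any indexed field contain the word?
    static boolean matches(Map<String, String> doc, String word) {
        for (String value : doc.values()) {
            if (value.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> doc =
            buildDocument("Report", "Annual figures", "budget forecast", 0);
        System.out.println("found by title:   " + matches(doc, "Report"));
        System.out.println("found by content: " + matches(doc, "forecast"));
    }
}
```

With the limit at 0 the file still shows up when searching for a word from its title, which is why "it appears in search results" alone does not prove that the content was indexed.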
When you say:
"I upload a file larger than 1Mb and it's indexed (I can view the file as a result of a search) Any value I put at dl.file.indexing.max.size gives me the same result (-1, 0, 1048576)"
Are you able to find your file by searching for a word from the document content that is not in the DLFileEntry database metadata?
If you want to apply that solution, you will have to patch the com.liferay.portal.util.FileImpl class. That class is used in the following Spring XML file:
So you will have to create a new FileImpl class with your code and replace it in the Spring XML.
Nevertheless, I don't think it is necessary to apply that code, as LPS-53776 only avoids calling Tika with maxStringLength=0, since that call is unnecessary.
Hello Jorge,
I've done a lot of tests and the behaviour is the following:
- dl.file.indexing.max.size=0
  - any size file: not indexed
- dl.file.indexing.max.size=1048576
  - 9KB file: indexed
  - 2MB file: indexed
  - 44MB file: nobody really knows, the server crashes (it probably dies while indexing)
The system is reading the property, since it doesn't index any file when the value is 0, but it doesn't stop big files from being indexed. I don't know what else to try...
Any suggestion?
That is strange behavior: DLFileEntryIndexer doesn't index file content when the file size is greater than dl.file.indexing.max.size.
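The size check described above can be sketched as follows. This is a simplified model of the expected behavior, not the actual DLFileEntryIndexer source; ContentIndexingGuard and shouldIndexContent are hypothetical names, and treating -1 as "no limit" is an assumption.

```java
public class ContentIndexingGuard {

    // Returns true when the file content should be handed to text extraction.
    // maxSize models dl.file.indexing.max.size (in bytes).
    static boolean shouldIndexContent(long fileSize, long maxSize) {
        if (maxSize == 0) {
            return false; // content indexing switched off
        }
        if (maxSize == -1) {
            return true;  // assumption: -1 means no size limit
        }
        return fileSize <= maxSize;
    }

    public static void main(String[] args) {
        long oneMb = 1048576L;
        System.out.println("9KB:  " + shouldIndexContent(9L * 1024, oneMb));
        System.out.println("2MB:  " + shouldIndexContent(2L * 1024 * 1024, oneMb));
        System.out.println("44MB: " + shouldIndexContent(44L * 1024 * 1024, oneMb));
    }
}
```

Under this model, with a 1048576 limit only the 9KB file from the tests above would have its content indexed, so a 2MB file whose content is searchable suggests that the check is not being applied in that installation.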
My only idea is to download the Liferay source code and debug your server to double-check the behavior of your installation.
If you want to debug the Liferay source code, the Java classes related to content extraction are:
- DLFileEntryIndexer.java => https://github.com/liferay/liferay-portal/blob/65d7a800f1a57b232a127bdcaefb876a9de469f9/portal-impl/src/com/liferay/portlet/documentlibrary/util/DLFileEntryIndexer.java#L360-L379
- DocumentImpl.java => https://github.com/liferay/liferay-portal/blob/65d7a800f1a57b232a127bdcaefb876a9de469f9/portal-service/src/com/liferay/portal/kernel/search/DocumentImpl.java#L155-L160
- FileImpl.java => https://github.com/liferay/liferay-portal/blob/65d7a800f1a57b232a127bdcaefb876a9de469f9/portal-impl/src/com/liferay/portal/util/FileImpl.java#L359-L448
Regards,
Jorge
Thanks, Jorge, I'll try that!
Hi,
I've tried to debug Liferay's code and also Tika's code (the document's content and the parameter's value both reach the Tika call), but at some point the thread gets lost and I get this exception:
Exception in thread "http-bio-8080-exec-1" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source)
at java.io.ByteArrayOutputStream.grow(Unknown Source)
at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source)
at java.io.ByteArrayOutputStream.write(Unknown Source)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:172)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:295)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:237)
at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:172)
at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:72)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.Tika.parseToString(Tika.java:527)
at org.apache.tika.Tika.parseToString(Tika.java:602)
at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:399)
at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:213)
at com.liferay.portal.kernel.search.DocumentImpl.addFile(DocumentImpl.java:159)
at com.liferay.portlet.documentlibrary.util.DLFileEntryIndexer.doGetDocument(DLFileEntryIndexer.java:366)
at com.liferay.portal.kernel.search.BaseIndexer.getDocument(BaseIndexer.java:153)
at com.liferay.portlet.documentlibrary.util.DLFileEntryIndexer.doReindex(DLFileEntryIndexer.java:493)
at com.liferay.portal.kernel.search.BaseIndexer.reindex(BaseIndexer.java:446)
at com.liferay.portlet.documentlibrary.service.impl.DLFileEntryLocalServiceImpl.reindex(DLFileEntryLocalServiceImpl.java:2413)
at com.liferay.portlet.documentlibrary.service.impl.DLFileEntryLocalServiceImpl.updateStatus(DLFileEntryLocalServiceImpl.java:1873)
Anyway, I was able to get a bigger PDF file with a different kind of content, and it works fine!
So I assume the problem is with the first document's content: it's a PDF with maps.
I'm not very optimistic about my chances of solving the issue...
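The trace above runs out of heap inside FlateFilter.decompress, while a compressed stream is being written into a ByteArrayOutputStream. The sketch below only illustrates the general effect that a Flate (deflate) stream can expand to far more bytes in memory than it takes on disk; it uses the JDK's java.util.zip classes rather than PDFBox, the FlateExpansionDemo class is hypothetical, and it is not a diagnosis of that particular PDF.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateExpansionDemo {

    // Compress a byte array with deflate.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress fully into memory, the way a growing byte buffer would.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of looping
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt deflate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[8 * 1024 * 1024]; // highly repetitive data (all zeros)
        byte[] compressed = deflate(raw);
        byte[] restored = inflate(compressed);
        System.out.println("compressed bytes:   " + compressed.length);
        System.out.println("decompressed bytes: " + restored.length);
    }
}
```

If something similar happens with large embedded streams in a PDF, a limit based on the file's size on disk would not bound the memory needed during extraction; raising the JVM heap or excluding such files from content indexing are the kinds of workarounds one might test.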