RE: Arabic text indexing
I have an Arabic text in two formats: PDF and txt.
The content of the txt document is indexed well, so a search for a single word or phrase finds it.
The content of the PDF document, however, is not indexed well. Sometimes the word or phrase I am looking for is found, other times it is not. Can anyone help me?
For PDFs it's hard to say, because internally a PDF is not necessarily text; it can be quite graphical, depending on the program that created it in the first place. PDF's page description model derives from PostScript, so there may be all kinds of escape sequences and positioning information between single letters/glyphs.
Try to extract text with some of the available tools or services and see what it reveals. It might show that your PDF indeed is not text, but an image.
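Once a tool has produced extracted text, a quick check can show whether it contains real Arabic letters at all. The following is a minimal Python sketch, assuming the extracted text is already available as a string; the function name `summarize_extraction` is hypothetical and not part of Liferay or any extraction tool.

```python
def summarize_extraction(text: str) -> dict:
    """Count the kinds of non-whitespace characters in extracted text."""
    counts = {"arabic_base": 0, "arabic_presentation": 0, "other": 0}
    for ch in text:
        if ch.isspace():
            continue
        cp = ord(ch)
        if 0x0600 <= cp <= 0x06FF:
            # Main Arabic block: the letters a plain txt file normally contains
            counts["arabic_base"] += 1
        elif 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            # Arabic Presentation Forms-A and -B: positional glyphs and ligatures
            counts["arabic_presentation"] += 1
        else:
            counts["other"] += 1
    return counts
```

If every count is zero for a page that visibly contains text, the PDF probably holds an image. A high `arabic_presentation` count suggests the text is stored as shaped glyphs rather than base letters.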
I tried extracting the text from the PDF, and it is extracted. However, some extracted words differ from the words in the PDF. Consequently, a search for such a word fails, because the extracted word has been "transformed" into another word.
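One possible cause of such differences, and only an assumption here since the thread does not confirm it, is that the PDF stores Arabic letters as Unicode presentation forms (positional glyph variants and ligatures) instead of the base letters found in the txt file. The Python sketch below compares a word from each source code point by code point and checks whether NFKC normalization makes them equal. The helper names and the sample words are hypothetical.

```python
import unicodedata

def describe(word: str) -> list:
    """List each character's code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in word]

def same_after_nfkc(a: str, b: str) -> bool:
    """True if both strings are equal after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# Hypothetical sample: the same word as base letters and as presentation forms
txt_word = "\u0643\u062A\u0627\u0628"  # kaf, teh, alef, beh (base letters)
pdf_word = "\uFEDB\uFE98\uFE8E\uFE8F"  # kaf initial, teh medial, alef final, beh isolated

if __name__ == "__main__":
    for line in describe(pdf_word):
        print(line)
    print("equal as-is:", txt_word == pdf_word)
    print("equal after NFKC:", same_after_nfkc(txt_word, pdf_word))
```

If the two words match only after normalization, applying the same normalization to the extracted text before indexing, and to the search query, may help. NFKC does not remove diacritics or the tatweel (U+0640), so differences of that kind would need separate handling.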