Arabic text indexing

Jamie Sammons, modified 2 Years ago.

New Member Posts: 5 Join Date: 2/29/24

I have an Arabic text in two formats: PDF and TXT.
The content of the TXT document is indexed well: even a search for a single word or phrase finds it.
The content of the PDF document, however, is not indexed well. Sometimes the word or phrase I am looking for is found, other times it is not. Can anyone help me?

Olaf Kock, modified 2 Years ago.

RE: Arabic text indexing

Liferay Legend Posts: 6441 Join Date: 9/23/08

For PDFs it's hard to say: internally a PDF is not necessarily text, but can be quite graphical (depending on the program that created it in the first place). PDF, at its core, is derived from PostScript, so there might be all kinds of escape sequences and positioning information in between single letters/glyphs.

Try to extract text with some of the available tools or services and see what it reveals. It might show that your PDF indeed is not text, but an image.
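As a rough way to judge what an extraction tool returned, the sketch below counts which kinds of characters the extracted text contains. It assumes the text has already been extracted by some tool and is available as a Python string; the function names `classify` and `summarize` are made up for this illustration. An empty result would suggest the PDF holds only images, and a large share of "arabic_presentation_form" characters would suggest the PDF stores shaped glyphs rather than plain Arabic letters.

```python
from collections import Counter


def classify(ch: str) -> str:
    """Put a single character into a coarse bucket (hypothetical helper)."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        return "arabic"  # base Arabic block
    if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
        return "arabic_presentation_form"  # Presentation Forms-A and -B
    if ch.isspace():
        return "whitespace"
    return "other"


def summarize(text: str) -> Counter:
    """Count how many characters of the extracted text fall into each bucket."""
    return Counter(classify(ch) for ch in text)


if __name__ == "__main__":
    # Assumption: this string stands in for text extracted from the PDF.
    extracted = "\ufeb3\ufefc\ufee1 \u0633\u0644\u0627\u0645"
    for bucket, count in summarize(extracted).items():
        print(bucket, count)
```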

Jamie Sammons, modified 2 Years ago.

RE: RE: Arabic text indexing

New Member Posts: 5 Join Date: 2/29/24

I tried to extract the text from the PDF, and the extraction works. However, some of the extracted words differ from the words shown in the PDF, so a search for the word as it appears in the PDF finds nothing (because the extracted word has been "transformed" into another word).
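One possible cause, stated here as an assumption and not as a confirmed diagnosis, is that the PDF stores Arabic presentation forms (shaped glyph code points) instead of the base letters a user types. The sketch below lists the code points of a word and applies Unicode NFKC normalization, which maps presentation forms back to base letters. The helper names `describe` and `same_after_nfkc` are hypothetical.

```python
import unicodedata


def describe(word: str) -> list[str]:
    """List each character of a word as 'U+XXXX NAME' for side-by-side comparison."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in word]


def same_after_nfkc(typed: str, extracted: str) -> bool:
    """Check whether two words are equal once both are NFKC-normalized."""
    return unicodedata.normalize("NFKC", typed) == unicodedata.normalize("NFKC", extracted)


if __name__ == "__main__":
    typed = "\u0633\u0644\u0627\u0645"  # the word as typed into the search box
    extracted = "\ufeb3\ufefc\ufee1"    # assumption: the same word as presentation forms
    print(typed == extracted)           # raw strings differ
    print(same_after_nfkc(typed, extracted))
    for line in describe(extracted):
        print(line)
```

If the two words only match after normalization, normalizing the extracted text before it is indexed may be worth trying. If they still differ, the cause is something else, for example glyphs stored in visual order or a font with a broken character mapping.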
