Arabic text indexing

Jamie Sammons, modified 1 Year ago. New Member Posts: 5 Join Date: 2/29/24

I have the same Arabic text in two formats: PDF and TXT.
The content of the TXT document is indexed well: even a search for a single word or phrase finds it.
The content of the PDF document, however, is not indexed well. Sometimes the word or phrase I am looking for is found, other times it is not. Can anyone help me?

Olaf Kock, modified 1 Year ago. Liferay Legend Posts: 6441 Join Date: 9/23/08

For PDFs it's hard to say, because the content is not necessarily stored as text internally; it can be quite graphical, depending on the program that created the PDF in the first place. PDF is, at its core, derived from PostScript, so there might be all kinds of escape sequences and positioning information between single letters/glyphs.

Try to extract the text with some of the available tools or services and see what it reveals. It might show that your PDF is in fact not text, but an image.
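
Once some tool has produced extracted text, a quick way to "see what it reveals" is to count which kinds of characters it contains. The following is only a minimal sketch under assumptions: the function names `classify_char` and `summarize` are made up for illustration, the extraction itself is left to whatever tool you use, and the code point ranges are the standard Unicode Arabic block and Arabic Presentation Forms-A/B blocks. An empty or nearly empty result would suggest an image-only PDF, while many presentation forms would suggest the PDF stores shaped glyphs rather than plain letters.

```python
import unicodedata
from collections import Counter


def classify_char(ch):
    """Put one character into a coarse bucket (hypothetical helper)."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        # Basic Arabic block: what a plain .txt file normally contains.
        return "arabic"
    if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
        # Arabic Presentation Forms-A/B: positional glyph variants and ligatures.
        return "arabic_presentation_form"
    if unicodedata.category(ch) == "Cf":
        # Invisible format controls such as direction marks.
        return "format_control"
    if ch.isspace():
        return "whitespace"
    return "other"


def summarize(text):
    """Count how many characters of the extracted text fall into each bucket."""
    return Counter(classify_char(ch) for ch in text)


if __name__ == "__main__":
    # Stand-in for text extracted from the PDF; replace with your own string.
    sample = "\u0633\u0644\u0627\u0645 \uFEB3\uFEFC\uFEE1\u200F"
    for bucket, count in summarize(sample).items():
        print(bucket, count)
```

Running the same summary over the TXT version and over the text extracted from the PDF should make any difference in character makeup visible at a glance.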

Jamie Sammons, modified 1 Year ago. New Member Posts: 5 Join Date: 2/29/24

I tried to extract the text from the PDF, and it is extracted. However, the extracted words sometimes differ from the words in the PDF, and consequently a search for such a word does not find it (because the extracted word has been "transformed" into another word).
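
One possible cause of such "transformed" words, though the thread does not confirm it for this file, is that the PDF yields Arabic presentation-form code points (positional glyphs and ligatures) instead of the base letters found in the TXT file. The sketch below assumes that cause and shows how Unicode NFKC normalization from Python's standard `unicodedata` module folds those forms back to base letters. The function name `normalize_arabic` is hypothetical, and dropping tatweel and format controls is an extra assumption about what should be ignored when comparing.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, purely typographic


def normalize_arabic(text):
    """Fold extracted Arabic text toward the form a plain .txt file would have.

    Hypothetical helper: NFKC replaces presentation forms and ligatures with
    their base letters; tatweel and invisible format controls are dropped.
    """
    folded = unicodedata.normalize("NFKC", text)
    return "".join(
        ch
        for ch in folded
        if ch != TATWEEL and unicodedata.category(ch) != "Cf"
    )


if __name__ == "__main__":
    from_txt = "\u0633\u0644\u0627\u0645"          # base letters
    from_pdf = "\uFEB3\uFEFC\uFEE1\u200F"          # shaped glyphs plus a direction mark
    print(from_txt == from_pdf)                    # raw strings differ
    print(from_txt == normalize_arabic(from_pdf))  # equal after folding
```

If the differences persist after this kind of folding, the cause is something else, for example glyphs mapped to wrong characters by the PDF's fonts or text stored in visual (reversed) order, neither of which this sketch addresses.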