eZ Find doesn't index PDFs with special characters like "ç"

Author Message

eric figo

Wednesday 19 November 2008 6:12:14 am

Hi,

I'm using eZ Find and pstotext to index PDF files.
Some are indexed and some are not.

I tried many files. For example, a PDF containing only this text is indexed: "Website un test d’indexation pour voir si ca marche ….. hsdhhdhhd"

But if I change the "c" to a "ç", as in "Website un test d’indexation pour voir si ça marche ….. hsdhhdhhd", the PDF is not indexed.

Any ideas? My database is in UTF-8, and I haven't changed the charset configuration in eZ Publish.

Thanks for your responses

Paul Borgermans

Wednesday 26 November 2008 10:59:20 am

pstotext is not the best solution for converting PDFs to raw text. I guess that it fails to convert the PDF file in question (try it on the command line to see what happens).

It is better to use pdftotext from the xpdf project. Then create a new script, for example called ezpdftotext, with the following content (change the path to pdftotext to match your installation):

#!/bin/sh
# Convert the PDF given as $1 to UTF-8 text on standard output.
<path to >/pdftotext -enc "UTF-8" "$1" -

Then configure this script in binaryfile.ini.
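As a rough sketch of that configuration: in the eZ Publish 4.x releases I am familiar with, the PDF text extractor is set in the [PDFHandlerSettings] block of binaryfile.ini, and the tool is called with the file path as its argument while its standard output is captured as the text to index. The override location and the script path below are assumptions; check the names against the binaryfile.ini shipped with your version.

```ini
# settings/override/binaryfile.ini.append.php (assumed override location)
[PDFHandlerSettings]
# Hypothetical path; point this at wherever you saved the ezpdftotext script.
TextExtractionTool=/usr/local/bin/ezpdftotext
```

PDFs that were already uploaded would need to be reindexed before the new extractor has any effect on them.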

Note that the default installation will "normalize" Latin1 characters, so eZ Find/Solr will transform "reçu" to "recu" and so on, which means searching for either form will produce the hit.
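To illustrate why both spellings match: the same folding is applied to the text at index time and to the query at search time, so both end up as the same term. The snippet below is only a toy stand-in for that idea, using sed with a hand-picked handful of characters; fold_accents is a hypothetical helper, not the filter Solr actually uses.

```sh
#!/bin/sh
# Toy accent folding: NOT Solr's real filter, just a few hand-picked mappings.
fold_accents() {
    sed -e 's/ç/c/g' -e 's/é/e/g' -e 's/è/e/g' -e 's/à/a/g'
}

# The term as it would be stored in the index.
indexed_term=$(printf '%s\n' 'reçu' | fold_accents)

# Two queries, one typed with the cedilla and one without.
query_accented=$(printf '%s\n' 'reçu' | fold_accents)
query_plain=$(printf '%s\n' 'recu' | fold_accents)

# Both queries fold to the same string as the indexed term.
echo "indexed: $indexed_term"
echo "queries: $query_accented $query_plain"
```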

Best regards

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

eric figo

Monday 01 December 2008 2:27:53 am

Hi,

Thanks for the response.

To be more precise: when I run pstotext on the command line with my PDF, I get the plain text without trouble.

The problem is that when I use the script to index, the files with special characters are not indexed.
I can't find them, even if I search for another word in the PDF that has no special characters.

I tried your solution with pdftotext, but I have the same problem.
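One thing that could be checked at this point, as a guess rather than a confirmed diagnosis: whether the text the extraction tool writes is valid UTF-8. If a tool emits "ç" as a single Latin-1 byte, the output may look fine in a terminal yet be rejected further down the indexing chain. The sketch below uses iconv for the check; is_valid_utf8 and the file name in the usage comment are hypothetical.

```sh
#!/bin/sh
# Succeeds (exit status 0) only if standard input is valid UTF-8.
# iconv stops with a non-zero status on the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Usage (hypothetical file name):
#   pstotext test.pdf | is_valid_utf8 && echo "valid UTF-8" || echo "not UTF-8"
#   pdftotext -enc UTF-8 test.pdf - | is_valid_utf8 && echo "valid UTF-8" || echo "not UTF-8"
```

If one tool fails this check while the other passes, the encoding of the extracted text is a likely suspect; if both pass, the problem is probably elsewhere.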

Best regards

Paul Borgermans

Friday 05 December 2008 1:17:21 pm

Which versions are you using (eZ Find, eZ Publish)?

eric figo

Tuesday 09 December 2008 1:09:54 am

I'm using eZ Publish 4.0.1 with eZ Find 1.0.0beta2.
