pdf search

Author Message

Carlos Revillo

Thursday 04 May 2006 4:02:51 am

Hi. I'm trying to index PDF documents I've uploaded through the admin interface, but I cannot make it work.

I've installed pdftotext on my server and it's working.

Next, I created a file called ezpdftotext with this content:

#!/bin/sh
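# ezpdftotext script: print the text of the PDF passed as $1 to stdout
# (the quotes keep file names containing spaces intact)
pdftotext "$1" -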

I also created a file (binaryfile.ini.append.php) in my /settings/override folder with this content:

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf

# The path to the text extraction tool to use to 
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=ezpdftotext

I've also made the file attribute searchable in my class.

If I upload a plain text file, all is well: eZ reads the words in my file and updates ezsearch_word and the related tables as needed. But nothing happens when I upload a PDF file.

Any help would be much appreciated. Thanks.

Nicolas Frey

Thursday 02 April 2009 3:18:15 am

Hi,

I have the same problem. Does anyone have a solution?

I'm using eZ Publish 4.1.

My binaryfile.ini:

[HandlerSettings]
MetaDataExtractor[text/plain]=ezplaintext
MetaDataExtractor[application/pdf]=ezpdf
MetaDataExtractor[application/msword]=ezword

[PDFHandlerSettings]
TextExtractionTool=pdftotext.bat

My pdftotext.bat:

pdftotext -enc UTF-8 %1

In my class, I checked "searchable" for the file attribute.

If I look in \var\site\storage\original\application after running updatesearchindex.php or uploading a file, a text file has been created with the correct content:

6fb8fdcc583cf155d3aeb82e289c0b31.pdf
6fb8fdcc583cf155d3aeb82e289c0b31.txt

In the search results, only text files are shown.

Any ideas?

Thanks.

Nicolas Frey (2ST)

Damien Pobel

Thursday 02 April 2009 3:56:17 am

Hi,

You should try enabling DebugOutput and DebugRedirection to see if something went wrong.
In addition, I think you should put the full path to your script in binaryfile.ini.append.php. You could also run your script on the original PDF file in a shell to check that it is able to extract the text; with some weird PDF files, extraction sometimes fails.
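
A shell check along these lines might look like the sketch below. It is only an illustration, not part of eZ Publish: `check_extractor` is a hypothetical helper name, and the sketch assumes the indexer only sees what the extraction tool prints on standard output.

```shell
#!/bin/sh
# Sketch only: check_extractor is a hypothetical helper, not an eZ Publish tool.
# It runs "<tool> <file>" and reports whether any text arrived on stdout.
check_extractor() {
    tool=$1   # extraction command, e.g. the value of TextExtractionTool
    pdf=$2    # path to a PDF file to test with

    # $tool is left unquoted on purpose so that a value such as
    # "pdftotext -enc UTF-8" is split into command and options.
    out=$($tool "$pdf" 2>/dev/null)

    if [ -n "$out" ]; then
        echo "OK: text received on stdout"
    else
        echo "EMPTY: nothing on stdout"
        return 1
    fi
}

# Example call (both paths are placeholders):
# check_extractor /full/path/to/ezpdftotext /path/to/original.pdf
```

If the check reports EMPTY while a .txt file appears next to the PDF, the tool is probably writing to a file rather than to standard output.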

Damien
Planet eZ Publish.fr : http://www.planet-ezpublish.fr
Certification : http://auth.ez.no/certification/verify/372448
Publications about eZ Publish : http://pwet.fr/tags/keywords/weblog/ez_publish

Nicolas Frey

Thursday 02 April 2009 6:38:55 am

I found the problem.

class eZPDFParser
{
    function parseFile( $fileName )
    {
        $binaryINI = eZINI::instance( 'binaryfile.ini' );

        $textExtractionTool = $binaryINI->variable( 'PDFHandlerSettings', 'TextExtractionTool' );

        // save the buffer contents
        $buffer = ob_get_contents();
        ob_end_clean();

        // fetch the module printout
        ob_start();
        passthru( "$textExtractionTool $fileName" );
        $metaData = ob_get_contents();
        ob_end_clean();

        // fill the buffer with the old values
        ob_start();
        print( $buffer );

        return $metaData;
    }
}

This class runs the command configured in binaryfile.ini and captures its standard output for search indexing. My batch file wrote the text to a .txt file instead, so nothing was captured.
The pdftotext help does not explain how to send the result directly to standard output. After some research, I found this command:

PDFtoText.exe filename.pdf -
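
On a Unix-like server, the same idea could be wrapped in a small script, sketched below. This is an assumption-laden illustration rather than an official eZ Publish script: `extract_pdf_text` is a hypothetical function name, and it assumes `pdftotext` is on the PATH and that the parser calls the tool with the PDF path as its only argument. TextExtractionTool would then point at the full path of the saved script.

```shell
#!/bin/sh
# Sketch only: extract_pdf_text is a hypothetical name.
# Assumes pdftotext is installed and reachable through PATH.
extract_pdf_text() {
    # "-" as the output file tells pdftotext to write the text to stdout
    # instead of creating a .txt file next to the PDF.
    pdftotext -enc UTF-8 "$1" -
}

# When saved as a script, forward the first argument (the PDF path
# appended by the parser) to the function.
if [ "$#" -gt 0 ]; then
    extract_pdf_text "$1"
fi
```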

Nicolas Frey (2ST)

Johann Lemaitre

Thursday 09 April 2009 7:57:09 am

Hi,

I followed all the points in this topic.
I changed my binaryfile.ini file, and it now looks like this:

[PDFHandlerSettings]
TextExtractionTool=/var/www/ez/xpdf-3.02/xpdf/pdftotext -enc UTF-8

As a test, I added a dash "-" after the file name in the file "ezpdfparser.php":

passthru( "$textExtractionTool $fileName -" );

This dash changes the PDF conversion: no text file is generated any more (everything is sent to stdout).

However, my search results with eZ Find 2.0.0 are still always empty.

Could you help me?
Thanks.

Johann
