WebDav - PDF - Indexing


laurent le cadet

Monday 06 November 2006 12:02:44 am

Hi,

I have a client who wants to upload about 2000 PDF files. Each file will be around 250 KB.
I think the most efficient way will be to use WebDAV.
This client wants the file contents to be searchable. I read things about indexing content with pdftotext. Is it compatible with the WebDAV process?

Regards.

Laurent

Łukasz Serwatka

Monday 06 November 2006 12:27:05 am

Yes it works fine and works with WebDAV. See this page for more info.
http://ez.no/content/view/full/34091

Personal website -> http://serwatka.net
Blog (about eZ Publish) -> http://serwatka.net/blog

laurent le cadet

Monday 06 November 2006 12:41:59 am

Hi Lukasz,

I'm already on this page.
I downloaded pdftotext, but my knowledge about installing such a thing is close to zero, and the explanations (which seem to be simple) are not really clear to me.

Actually, I'm on a Windows system.

I downloaded xpdf-3.01 but I don't know where to put it. Same for the small script "ezpdftotext".

Do you have any hints?

Laurent

laurent le cadet

Monday 06 November 2006 2:22:41 am

Sorry but I just can't figure out what to do with it... :((

Łukasz Serwatka

Monday 06 November 2006 2:49:39 am

Copy pdftotext.exe to C:\ and create a new file under C:\ called "ezpdftotext.bat".

Put this code inside ezpdftotext.bat:

pdftotext.exe %1 -

Test your file from the command line:

C:\ezpdftotext.bat mypdffile.pdf

You should see text on the output.

Next, override binaryfile.ini and set the full path to C:\ezpdftotext.bat in:

[PDFHandlerSettings]
TextExtractionTool=C:\ezpdftotext.bat

Make sure that the "File" content class contains a file attribute and that this attribute is marked is_searchable.

Optionally, you can edit your system variables and add the directory that contains the ezpdftotext.bat script to the PATH variable.
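
The command-line test described above can also be scripted. The following is only a minimal sketch in Python, not part of eZ publish: the function name `run_extraction_tool` is hypothetical, and it assumes the tool takes the PDF path as its last argument and writes the extracted text to standard output, as the .bat wrapper does.

```python
import subprocess


def run_extraction_tool(tool_command, pdf_path):
    """Run the extraction tool on one PDF and return what it printed.

    tool_command is a list such as [r"C:\ezpdftotext.bat"] (an assumed
    location, taken from the steps above). The PDF path is appended as
    the last argument. A non-zero exit code raises CalledProcessError,
    which is what happens when the wrapper cannot find pdftotext.exe.
    """
    result = subprocess.run(
        tool_command + [pdf_path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # fail loudly on a non-zero exit code
    )
    return result.stdout


# Hypothetical usage on the test file from the steps above:
# print(run_extraction_tool([r"C:\ezpdftotext.bat"], "mypdffile.pdf")[:200])
```

If the call raises an error or returns no PDF text, the problem is in the wrapper or in the location of pdftotext.exe rather than in the search settings.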


laurent le cadet

Monday 06 November 2006 3:07:53 am

I did what you said, and ezpdftotext.bat calls pdftotext.exe, but the output says that it is not recognized as an executable program.

In your example, should "mypdffile.pdf" also be placed in C:\ for testing?

Laurent

laurent le cadet

Monday 06 November 2006 3:17:13 am

Whoops!

Sorry, wrong placement on C:.
The files should be in Documents and Settings/user...

This first test works: the output is OK.

Łukasz Serwatka

Monday 06 November 2006 3:17:56 am

Yes it is, but in eZ publish that will be the path to the PDF file in the var directory. The most important thing is that eZ publish must find ezpdftotext.bat and ezpdftotext.bat must be able to execute pdftotext.exe. When testing from the command line, try providing the full path to the PDF as the parameter.


Łukasz Serwatka

Monday 06 November 2006 3:19:23 am

Mark the file attribute as "is_searchable", then upload one PDF file and search for words that appear inside that PDF.


laurent le cadet

Monday 06 November 2006 3:44:38 am

Lukasz,

#The most important is that eZ publish must find ezpdftotext.bat and ezpdftotext must execute pdftotext.exe.#

If I understand correctly, this is declared in binaryfile.ini.append.php:

[PDFHandlerSettings]
TextExtractionTool=C:\ezpdftotext.bat

But how can I provide the path for files in the var dir?

In our first test it was rather clear to me that we tested the application on one file with a path to it, but how does it work with several files?

Should I place ezpdftotext.bat in the var dir?

Laurent

Łukasz Serwatka

Monday 06 November 2006 4:52:00 am

The path to the uploaded PDF file is passed to the script by eZ publish on upload, so you don't need to care about it.

Example in PHP: passthru( 'C:\ezpdftotext.bat var/path/to/file.pdf' );
var is an eZ publish subdirectory, so the path is relative to the eZ publish root directory.
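
Because the path handed to the script is relative to the eZ publish root, a manual test from another directory needs the absolute path instead. The sketch below (Python, with hypothetical helper names) shows how the two fit together. The root `C:\eZpublish\ezpublish` used in the usage comment is an assumption for illustration, not something eZ publish reports.

```python
from pathlib import PureWindowsPath


def absolute_pdf_path(ez_root, stored_path):
    """Join the eZ publish root directory with a path relative to it.

    stored_path is the kind of relative path described above, for
    example "var/path/to/file.pdf". PureWindowsPath only manipulates
    strings, so this runs on any operating system.
    """
    return str(PureWindowsPath(ez_root) / stored_path)


def manual_test_command(tool, ez_root, stored_path):
    """Build a command line for testing the tool by hand from any directory."""
    return f'{tool} "{absolute_pdf_path(ez_root, stored_path)}"'


# Hypothetical usage (the root directory is an assumed example):
# manual_test_command(r"C:\ezpdftotext.bat", r"C:\eZpublish\ezpublish",
#                     "var/path/to/file.pdf")
```

The relative form only works when the current directory is the eZ publish root, which is the situation described above when eZ publish itself calls the tool.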

Did you test if indexing works?


laurent le cadet

Monday 06 November 2006 5:07:38 am

I tried with the full path to a file in the var dir of my eZ publish installation, with success:

C:\Documents and Settings\Laurent le Cadet>ezpdftotext.bat \eZpublish\ezpublish\var\plain\storage\original\application\079f4b0a1368e5185fa2bef20ce1cf42.pdf

which is to be expected.

In binaryfile.ini.append.php I added these lines:

[PDFHandlerSettings]
TextExtractionTool=C:\ezpdftotext.bat

ezpdftotext.bat contains :

pdftotext.exe %1 -

and is located at the root of C:

The file attribute of the File class is searchable...

But no success when searching for a word from a PDF with the eZ search engine...

I feel frustrated because I'm sure I'm close to a correct result, but I don't know what to do.

Björn

Monday 06 November 2006 5:24:16 am

http://pubsvn.ez.no/community/trunk/extension/indexing/

Hmm, this should work out of the box with 3.8; just enable the extension.

Looking for a new job? http://www.xrow.com/xrow-GmbH/Jobs
Looking for hosting? http://hostingezpublish.com
-----------------------------------------------------------------------------
GMT +01:00 Hannover, Germany
Web: http://www.xrow.com/

Łukasz Serwatka

Monday 06 November 2006 5:27:54 am

Laurent, I did a quick test and it works just fine. I have pdftotext.exe in my $PATH variable. In your case, change

pdftotext.exe %1 -

to

C:\pdftotext.exe %1 -

Of course both files (.bat and .exe) are under C:\

Let me know if it works.


laurent le cadet

Monday 06 November 2006 5:48:49 am

@Björn: this is a 3.6.4 install.

@Lukasz: where can I check/set this $PATH variable?

Łukasz Serwatka

Monday 06 November 2006 6:00:52 am

Go to Control Panel -> System -> Advanced, then click the "Environment Variables" button. Find the PATH variable and click Edit.
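
For background on why PATH matters here: when a .bat file names a program without a directory (as in "pdftotext.exe %1 -"), the Windows command interpreter looks in the current directory first and then in each directory listed in PATH. PATH entries are directories, so the entry to add would be the folder that holds the program (for example C:\), not the .exe file itself. The lookup order can be sketched in Python as follows; `find_command` is a hypothetical helper written for illustration, not a Windows or eZ publish function.

```python
import os


def find_command(name, path_value, cwd, sep=";"):
    """Return the first existing file called `name`, or None.

    Looks in `cwd` first, then in each directory listed in `path_value`
    (a PATH-style string using `sep` as separator). This approximates
    the search order for a bare command name and ignores details such
    as PATHEXT.
    """
    directories = [cwd] + [d for d in path_value.split(sep) if d]
    for directory in directories:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Hypothetical usage on Windows:
# find_command("pdftotext.exe", os.environ["PATH"], os.getcwd())
```

A result of None means a bare "pdftotext.exe" cannot be resolved from that directory, which is the case where a full path in the .bat file is needed.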

Did you test with the full path C:\pdftotext.exe in the .bat file?


laurent le cadet

Monday 06 November 2006 6:09:25 am

Yes, I tested it with:

C:\pdftotext.exe %1 -

No result.

I don't have pdftotext.exe in the PATH variable either. Should I add C:\pdftotext.exe at the end of the line?

laurent le cadet

Monday 06 November 2006 7:06:09 am

On my local install, it appears that both the .bat and the .exe must be placed under C:\Documents and Settings\My User Name\ to be executed via the command line.
I get an error message if the files are placed directly at the root of C:.

Is that just a path problem? Which path should I use?

[PDFHandlerSettings]
TextExtractionTool=

Björn

Monday 06 November 2006 10:11:15 am

http://pubsvn.ez.no/community/trunk/extension/indexing/patch/3.6/

I hope you saw this 3.6 patch.


laurent le cadet

Monday 06 November 2006 11:48:08 pm

Hi Björn,

Is that extension available somewhere?

Laurent
