Indexing content of files using Solr (ezfind)

Laurence Bonhomme

Wednesday 01 October 2008 1:36:23 am

Hi there,

I'm trying to index the content of files (txt, pdf) using eZFind + Solr and am having trouble with it.

eZPublish 4.0.1
eZFind 1.0.0 beta2
Linux Debian 4

First of all, I've installed the eZPublish and eZFind packages as recommended.
When I create a new media/file and upload a file (txt or pdf), indexing works perfectly and I can run searches (I can also find my words in the database table ezkeyword).

Because I found the built-in search a bit "light", I decided to test with Solr.
... And now everything goes wrong.

I'm pretty sure it is installed correctly, because I can search for article contents or file summaries in the admin Solr search.

But nothing about the content of the uploaded file itself.

What am I doing wrong?
Is there a trick?

Having a look at the Solr guide (http://wiki.apache.org/solr/), I found this:
"Solr has an extensible DocumentHandler architecture that allows you to feed it XML and CSV documents. There is now a patch file available as part of SOLR-284 that adds support for parsing rich binary formats."

Do we have to patch the provided Solr?

Would anyone be so kind as to help?

Thanks a lot
Laurence

Christian Rößler

Wednesday 22 October 2008 11:01:54 am

Laurence,

perhaps a bit late but better late than never :-)

I have had pretty much the same problems. Perhaps this link will help you: http://ez.no/developer/articles/indexing_multiple_binary_file_types

I was able to set up a generic binaryfilehandler that was called by ezfind for every physical file. This binaryfilehandler called several external programs (pdftotxt, doc2txt); the parsed contents of each file were printed to stdout, captured by the binaryfilehandler, and then returned to ezflow, which saved them into the ezsolr index via an HTTP request.
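To illustrate the flow described above, here is a minimal sketch in Python (the real handler lives inside eZ Publish and is written in PHP, so this is not that code). It picks an external converter by file extension, runs it, and captures whatever the tool prints to stdout. The function name `extract_text`, the `CONVERTERS` table, and the choice of tools are assumptions for illustration; adjust them to whatever is installed on your server.

```python
import subprocess
from pathlib import Path

# Hypothetical converter table: extension -> function building the command line.
# Each command is expected to print the extracted text to stdout.
# The tools listed here are assumptions, not what the linked article ships.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],  # "-" sends output to stdout
    ".txt": lambda path: ["cat", path],
}


def extract_text(path, converters=CONVERTERS, timeout=60):
    """Run the external converter for `path` and return its stdout as text.

    Returns an empty string when no converter is registered or the tool
    fails, so that one bad file does not abort the whole indexing run.
    """
    build_cmd = converters.get(Path(path).suffix.lower())
    if build_cmd is None:
        return ""
    try:
        result = subprocess.run(
            build_cmd(str(path)),
            capture_output=True,  # collect stdout instead of printing it
            timeout=timeout,
            check=True,           # treat a non-zero exit code as a failure
        )
    except (OSError, subprocess.SubprocessError):
        # Tool missing, crashed, or timed out.
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

Returning an empty string on failure is a design choice for this sketch: the object still gets indexed with its other attributes, just without the file body.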

The tricky part is getting ezfind to use the custom file handler to parse the binary file's content into something that ezflow/ezsolr can work with.
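As a rough idea of what "something Solr can work with" means: Solr's update handler accepts an XML message of the form `<add><doc><field name="...">...</field></doc></add>`. The Python sketch below builds such a message from extracted text. The field names `id` and `text` and the helper name `build_solr_add` are placeholders I made up; the schema that eZ Find actually uses has its own field names. Control characters such as the form feeds that PDF converters emit between pages are not allowed in XML 1.0, so the sketch replaces them with spaces first.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are illegal in XML 1.0 (tab, LF and CR are allowed).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def build_solr_add(doc_id, text, id_field="id", text_field="text"):
    """Build a Solr XML <add> message for one document.

    `id_field` and `text_field` are hypothetical field names; use the
    names defined in your own Solr schema.
    """
    clean_text = _INVALID_XML_CHARS.sub(" ", text)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in ((id_field, doc_id), (text_field, clean_text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


print(build_solr_add("42", "a & b\x0cc"))
```

The resulting string would then be sent in an HTTP POST to the Solr update URL, followed by a commit; that part is left out here because the URL depends on your installation.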

The link above contains a full-featured how-to plus downloads to get it working.
If you need any further help, feel free to reply to this post.

chris.

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613

Paul Borgermans

Wednesday 22 October 2008 2:21:15 pm

Something to point out here: the configuration for indexing files like pdf, word, ... depends on the configuration of eZ publish to convert these to plain text. It has nothing to do with the search plugin used (default, Solr/eZ Find, ...).

We'll improve the conversion mechanism options for the next iteration of eZ Publish (4.1). I'm investigating a few more options to also handle more file formats.

You'll learn more about that very soon (< 3 weeks)

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Geoff Bentley

Wednesday 25 February 2009 3:21:12 pm

Check out Paul's ezTika extension ( http://projects.ez.no/eztika ) which draws on the Apache Tika toolkit ( http://lucene.apache.org/tika/ ) - this works seamlessly (so far) with eZ Find.