
Searching Office Documents: ez's search vs. integration of "commercial" search engine


Marco Zinn

Saturday 04 October 2003 7:35:29 am

We are currently setting up an ez 3.2 system that focuses heavily on document management. We have lots of Office documents and PDFs, which will be "file" objects.
Using the new binary file indexing, we can index PDFs quite nicely with pdftotext (thanks to Paul).
We have some problems indexing Word (97) documents, as the output of the wvware converter is unusable at the moment. I don't know why this happens, but even an otherwise blank Word document with just a few test lines is not indexed (converted) correctly.

Furthermore, we need indexing of PowerPoint (97) and Excel (97) documents. We also have some zipped files that contain these kinds of Office documents and PDFs.
These issues are not yet solved by ez's search engine, so we are thinking about using a third-party solution for searching / indexing / crawling the content. This may be a commercial solution as well.

I'd like to know what experience other users (you?) have with this issue.
Can anyone recommend a (non-ez) search engine?
Or can anyone at least give us some hints on where to find tools for indexing PowerPoint files? Our server runs Solaris, btw.
As this is for an intranet site, we cannot use hosted indexing services; we need a solution/software that we can install ourselves.
Oh, one more thing: if we use an external search engine, it should also respect ez's permissions, as we have quite a lot of content that requires a login.

Marco Zinn

Marco
http://www.hyperroad-design.com

Paul Borgermans

Saturday 04 October 2003 8:25:48 am

Hi Marco

The search engine indexing is an issue that keeps ez publish from working well as a DMS or for any large web site. In particular, the ranking (relevance) is not at the level actually required. And users DO rely on global search. The ez crew is certainly aware of this, but I don't know what the future will bring (apparently only a few are asking for binary file indexing, for instance). I haven't played with openfts yet.

Below are my hints (I have not done this myself yet; PowerPoint is the most urgent for me).

---------powerpoint and excel--------
For PowerPoint and Excel files, you may try

http://chicago.sourceforge.net/xlhtml/

The PowerPoint conversion is included in the xlhtml archive.

You will also need lynx to do the HTML-to-text conversion, and you will need to wrap everything in a shell script to be called by the binary file handler. The same goes for zipped versions.
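
As a rough illustration of such a wrapper, here is a minimal sketch in Python rather than shell. It assumes that the xlhtml package installs the binaries `xlhtml` and `ppthtml`, that both write HTML to standard output when given a file name, and that the binary file handler calls the script with the file path as its only argument and indexes whatever appears on standard output. The names `CONVERTERS`, `build_pipeline` and `to_text` are made up for this example.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: converter binaries as shipped in the xlhtml archive.
CONVERTERS = {".xls": "xlhtml", ".ppt": "ppthtml"}

# lynx reads HTML from stdin and dumps plain text (no link list) to stdout.
LYNX = ["lynx", "-stdin", "-force_html", "-dump", "-nolist"]


def build_pipeline(path):
    """Return the two commands (converter, then lynx) for the given file."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter configured for {suffix!r}")
    return [[CONVERTERS[suffix], str(path)], LYNX]


def to_text(path):
    """Run 'converter file | lynx' and return the resulting plain text."""
    convert, dump = build_pipeline(path)
    html = subprocess.run(convert, check=True, capture_output=True).stdout
    text = subprocess.run(dump, input=html, check=True, capture_output=True).stdout
    return text.decode(errors="replace")


if __name__ == "__main__" and len(sys.argv) == 2:
    # Print the extracted text so the caller can index it.
    sys.stdout.write(to_text(sys.argv[1]))
```

Keeping the command construction in `build_pipeline` separate from the execution makes it easy to check the extension mapping without having the converters installed.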

--------msword--------
I'm surprised wvware does not work for you, since I got it running nicely. Does it work on the command line? I first had to make sure the right XML config was actually there (on SuSE Linux). Do you have the most recent version?

--------zipped files--------
Add a CPU or two and wrap unzip and gunzip in a shell script.
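
A minimal sketch of the unpacking step, again in Python instead of shell and using the standard `zipfile` module in place of the `unzip` binary. It only handles zip archives, not gzip. The extension list `INDEXABLE` and the function name `extract_indexable` are assumptions for this example, and what happens to each extracted file afterwards (which converter is called) is left as a placeholder.

```python
import sys
import tempfile
import zipfile
from pathlib import Path

# Assumption: extensions for which some text converter is available.
INDEXABLE = {".doc", ".xls", ".ppt", ".pdf"}


def extract_indexable(zip_path, dest):
    """Unpack only the indexable members of zip_path into dest; return their paths."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() in INDEXABLE:
                # extract() returns the path of the file it created.
                extracted.append(Path(archive.extract(info, dest)))
    return extracted


if __name__ == "__main__" and len(sys.argv) == 2:
    with tempfile.TemporaryDirectory() as tmp:
        for member in extract_indexable(sys.argv[1], tmp):
            # Placeholder: hand each file to the matching converter here
            # and print its text instead of just the file name.
            print(member.name)
```

Extracting into a temporary directory keeps the server clean after each indexing run, since the directory is removed once the `with` block ends.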

--------openoffice--------
There are some XSLT filters included which should work after unzipping.

--------wordperfect--------
See the OpenOffice filter, it's standalone!

I hope OpenOffice will provide more command-line options, so we can use it as a vehicle for all kinds of office formats.

Have a nice weekend

-paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans