Indexing Binary Files - Excel and Powerpoint

Mindshare Interactive Campaigns

Thursday 26 May 2005 9:45:16 am

Hi All,

I've been through these forums time and time again but have been unable to come across an answer to my question.

We are creating a media library for a client and are providing a search on binary file contents (i.e., the text within a Microsoft Word or Excel document). I've been able to get this working for Adobe PDF and Microsoft Word files, but not for Excel and PowerPoint files. We are using third-party utilities to extract the text, and they are working just fine. It just seems that eZ isn't correctly indexing the contents of those two file types.

I'm assuming the problem is somewhere in my binaryfile.ini file, because if I add a handler for Excel, the rest of the files stop getting indexed. Here is my ini file:

<b>This works</b>

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf
MetaDataExtractor[application/msword]=word

# The path to the text extraction tool to use to 
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=extractTxt

# The path to the text extraction tool to use to 
# fetch the information in Word files
[WordHandlerSettings]
TextExtractionTool=extractTxt

<b>This doesn't</b>

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf
MetaDataExtractor[application/msword]=word
#-----Additional Line Here-----#
MetaDataExtractor[application/vnd.ms-excel]=excel

# The path to the text extraction tool to use to 
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=extractTxt

# The path to the text extraction tool to use to 
# fetch the information in Word files
[WordHandlerSettings]
TextExtractionTool=extractTxt

#-----Additional Lines Here-----#
# The path to the text extraction tool to use to 
# fetch the information in Excel files
[ExcelHandlerSettings]
TextExtractionTool=extractTxt

Any ideas? I'm completely at a loss.

Any help would be highly appreciated. Thanks in advance.

OS: FreeBSD
eZ: Version 3.5.1

http://www.mindshare.net

Mindshare Interactive Campaigns

Friday 27 May 2005 10:52:27 am

Anyone???

http://www.mindshare.net

kracker (the)

Friday 27 May 2005 1:40:27 pm

<b>Patience</b> is a virtue, learned <i>only</i> through the trials and tribulations provided by the <i>passage of time</i>.

Of course, the impatient can always <i>do the work and look</i> for the applicable sections of code, and if it's not quite right for you, you have the freedom to improve the code to meet your needs.

Search the code for the same information you could bully out of the forums.

cd /path/to/ezpublish/
grep -R "MetaDataExtractor" .

<i>kernel/classes/datatypes/ezbinaryfile/ezbinaryfile.php</i>

//kracker

<i>eminem : bully</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Friday 27 May 2005 1:52:27 pm

<i>Why do I do it, why?</i>

So it's not working because you need to either make static modifications to <i>kernel/classes/datatypes/ezbinaryfile/ezbinaryfile.php</i> or replace it with an extension.

<b>eZ Binary File : Plugin Directory</b>
<i>kernel/classes/datatypes/ezbinaryfile/plugins/</i>

<b>eZ Binary File : Methods:</b>
<i>ezpdfparser.php, ezplaintextparser.php, ezwordparser.php</i>

These files contain the ezbinaryfile parsing methods for each specific file type (<i>pdf, word, etc</i>). You'll need to add one for Excel: ezexcelparser.php ....

check mate,
//kracker

<i>eminem : monkey see monkey do</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Friday 27 May 2005 2:16:59 pm

Watch Dees,

I did this blind, no thought, no real effort,
so hence the standard free software, no warranty.
Copyleft/Copyright given back up to eZ systems for a reason.

Run with it as far as you can, don't turn or you'll turn to stone.

<b>Settings:</b>

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf
MetaDataExtractor[application/msword]=word

# rfc says this is too simple: MetaDataExtractor[application/excel]=excel
MetaDataExtractor[application/vnd.ms-excel]=excel

# The path to the text extraction tool to use to
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=pstotext

# The path to the text extraction tool to use to
# fetch the information in Word files
[WordHandlerSettings]
TextExtractionTool=wvWare -x /usr/local/wv/wvText.xml

# The path to the text extraction tool to use to
# fetch the information in Excel files
[ExcelHandlerSettings]
TextExtractionTool=extractTxt

<b>eZ Binary File : Excel Plugin</b>

<?php
//
// Definition of eZExcelParser class
//
// Created on: <27-May-2005 16:08:42 kracker>
//
// Copyright (C) 1999-2005 eZ systems as. All rights reserved.
//
// This source file is part of the eZ publish (tm) Open Source Content
// Management System.
//
// This file may be distributed and/or modified under the terms of the
// "GNU General Public License" version 2 as published by the Free
// Software Foundation and appearing in the file LICENSE included in
// the packaging of this file.
//
// Licencees holding a valid "eZ publish professional licence" version 2
// may use this file in accordance with the "eZ publish professional licence"
// version 2 Agreement provided with the Software.
//
// This file is provided AS IS with NO WARRANTY OF ANY KIND, INCLUDING
// THE WARRANTY OF DESIGN, MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE.
//
// The "eZ publish professional licence" version 2 is available at
// http://ez.no/ez_publish/licences/professional/ and in the file
// PROFESSIONAL_LICENCE included in the packaging of this file.
// For pricing of this licence please contact us via e-mail to licence@ez.no.
// Further contact information is available at http://ez.no/company/contact/.
//
// The "GNU General Public License" (GPL) is available at
// http://www.gnu.org/copyleft/gpl.html.
//
// Contact licence@ez.no if any conditions of this licencing isn't clear to
// you.
//

/*!
  \class eZExcelParser ezexcelparser.php
  \ingroup eZKernel
  \brief The class eZExcelParser handles parsing of Excel files and returns the metadata

*/

class eZExcelParser
{
    function &parseFile( $fileName )
    {
        $binaryINI =& eZINI::instance( 'binaryfile.ini' );

        $textExtractionTool = $binaryINI->variable( 'ExcelHandlerSettings', 'TextExtractionTool' );

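        // save the buffer contents (plain assignment: ob_get_contents()
        // does not return a reference)
        $buffer = ob_get_contents();
        ob_end_clean();

        // fetch the module printout; quote the file name for the shell
        ob_start();
        passthru( "$textExtractionTool " . escapeshellarg( $fileName ) );
        $metaData = ob_get_contents();
        ob_end_clean();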

        // fill the buffer with the old values
        ob_start();
        print( $buffer );

        return $metaData;
    }
}

?>

//kracker

<i>eminem : watch dees</i>
<i>eminem : the watcher freestyle</i>
<i>jay z : the watcher 2</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Sunday 29 May 2005 3:26:16 am

I've posted the above solution as a feature enhancement:
http://www.ez.no/bugs/view/6697

The report also includes a zip of the above code.

cheers,
//kracker

<i>Eminem : (Fat Joe feat. Mase, , Lil' Jon) : Lean Back (remix)</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

kracker (the)

Sunday 29 May 2005 1:55:17 pm

Hey Hipsters!

Can anyone else suggest a larger set of binary file types that should be searchable within eZ publish by default?

While I'm at it I might as well write a few, perhaps more useful wrappers for the rest of you :)

My first thought was to add the ability to search OpenOffice files.

<b>Example OpenOffice Extensions(v1.1.3):</b>
sxw, stw, sxc, stc, sxi, sti, sxd, sxm, sxg ... rtf, pps, ppt.

<b>MS Extensions(v2003):</b>
doc, xls, pps, ppt, mdb, rtf

<b>Binary Archives:</b>
zip, tar, gzip, tar.gz, bz2, rar

Anyone else care to chime in with a list of binary files which may contain text they wish to search for?

Also, what kind of program is required to search within an xls file? Mindshare seemed to know of one but didn't mention which one he was using, hrm...

<i>http://www.webopedia.com/quick_ref/fileextensionsm.asp
http://www.openoffice.org/dev_docs/source/file_extensions.html</i>

//kracker
<i>sole : salt on everything</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

Mindshare Interactive Campaigns

Tuesday 31 May 2005 7:31:12 am

That's awesome Kracker. I appreciate your help with that. Oh, and of course I apologize for my impatience :-)

I didn't realize that eZPublish didn't have that capability built into it. It seems like it was an easy code addition though. I will take a look through the code and try to implement it today. I'll let you know how I make out.

In the meantime, this is the response that I received from eZ customer support. Perhaps this will help someone else out. It seems to be pretty similar to the code from Kracker.

In <ezpublish_install_dir>/kernel/classes/datatypes/ezbinaryfile/plugins there are existing plugins for handling different file types.
Unfortunately, there is no plugin for either Excel or PowerPoint. The problem is that every third-party text extractor usually takes different command-line parameters, so this approach cannot be completely generic.

To solve your problem:

In <ezpublish_install_dir>/kernel/classes/datatypes/ezbinaryfile/plugins, create these files:

ezexcelparser.php
ezpowerpointparser.php

Both files should have content very similar to this:

<?php

class eZExcelParser
{
    function &parseFile( $fileName )
    {
        $binaryINI =& eZINI::instance( 'binaryfile.ini' );

        $textExtractionTool = $binaryINI->variable( 'ExcelHandlerSettings', 'TextExtractionTool' );

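        // Create a world-writable temporary file for the extractor output.
        $tmpName = "var/cache/" . md5( mktime() ) . '.txt';
        $handle = fopen( $tmpName, "w" );
        fclose( $handle );
        chmod( $tmpName, 0777 );

        // Quote both paths for the shell; make sure to pass the parameters
        // your extraction tool expects.
        exec( "$textExtractionTool " . escapeshellarg( $fileName ) . " > " . escapeshellarg( $tmpName ), $ret );

        $metaData = "";
        clearstatcache();
        if ( file_exists( $tmpName ) )
        {
            // fread() rejects a zero length, so only read non-empty output.
            $size = filesize( $tmpName );
            if ( $size > 0 )
            {
                $fp = fopen( $tmpName, "r" );
                $metaData = strip_tags( fread( $fp, $size ) );
                fclose( $fp );
            }
            unlink( $tmpName );
        }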

        return $metaData;
    }
}

?>

In the exec() call, you will have to take care of the command-line parameters.
The solution for PowerPoint is similar: just replace eZExcelParser with eZPowerPointParser, and ExcelHandlerSettings with PowerPointHandlerSettings.
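
Because each extractor has its own command line, one way to keep the plugins simple is to point TextExtractionTool at a small wrapper that picks a tool per file type. The sketch below only illustrates that idea; it is not the extractTxt used in this thread, whose contents were never posted. The tool names (pdftotext, catdoc, xls2csv, catppt) are assumptions about what is installed, the function names are hypothetical, and it assumes the file name passed in still carries its original extension, which you should verify on your install.

```shell
#!/bin/sh
# Hypothetical wrapper sketch: choose an extraction tool from the file extension.
# Assumes pdftotext, catdoc, xls2csv and catppt are installed; swap in your own tools.

pick_extractor() {
    # Print the name of the tool to use for the file named in $1.
    case "$1" in
        *.pdf) echo "pdftotext" ;;
        *.doc) echo "catdoc" ;;
        *.xls) echo "xls2csv" ;;
        *.ppt|*.pps) echo "catppt" ;;
        *) echo "cat" ;;   # fall back to treating the file as plain text
    esac
}

extract_text() {
    # Write the text of the file $1 to stdout, where the parser plugin reads it.
    tool=$(pick_extractor "$1")
    case "$tool" in
        pdftotext) pdftotext "$1" - ;;   # "-" sends pdftotext output to stdout
        *) "$tool" "$1" ;;
    esac
}

# Run only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    extract_text "$1"
fi
```

With a wrapper like this, every [...HandlerSettings] block could name the same script, and only the wrapper needs to know each tool's parameters.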

I hope this will help to solve your problem...

Thanks again.

http://www.mindshare.net

kracker (the)

Wednesday 08 June 2005 1:57:14 pm

Nate,

What kind of program do you use to search inside these binary file types (powerpoint and excel)?

Which OS platform are you using, GNU/Linux or Windows?

I've been looking for programs which can search inside of binary file types especially office / Open Office binary file types.

So far I have not been able to get a clear picture from my searches of the net.

Respectfully Wondering,
//kracker

<i>sole : theme</i>

Member since: 2001.07.13 || http://ezpedia.se7enx.com/

Mindshare Interactive Campaigns

Thursday 09 June 2005 1:18:34 pm

Our old solution didn't work, but we have a new one and will be posting it as an article shortly.

http://www.mindshare.net

Xavier Dutoit

Friday 10 June 2005 12:28:44 am

Hi mindshare team,

Just bookmarked your post, that's a fine piece of work. Thanks for contributing!

X+

http://www.sydesy.com

Mindshare Interactive Campaigns

Friday 10 June 2005 10:59:56 am

Awesome. I'm glad you found it helpful.

Let us know how you make out with it.

http://www.mindshare.net

eric grandais

Wednesday 21 December 2005 3:48:51 pm

Could you tell us where the extractTxt file should be stored?

thanks

eric

Siniša Šehović

Thursday 29 June 2006 10:34:12 am

Hi Mindshare :-)

I'm having trouble indexing your way.

I did everything as you described, but it looks like my PDFs don't get indexed?!?!

How can I debug it?

I don't see any errors in my Apache logs or eZ logs.

By the way, in which MySQL table is the data from the PDF stored?

Please help!

Best regards,
S.

---
If at first you don't succeed, look in the trash for the instructions.

Siniša Šehović

Thursday 29 June 2006 10:36:05 am

Hi Eric

The script extractTxt must be somewhere where apache/php can find it!
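
A quick way to check this is to ask the shell whether the command resolves on the current PATH. This is a minimal sketch; tool_status is a hypothetical helper name, and sh stands in for the real tool so the example runs anywhere.

```shell
#!/bin/sh
# Sketch: report whether a command can be resolved on the current PATH.

tool_status() {
    # $1 is the command name, e.g. extractTxt
    if path=$(command -v "$1" 2>/dev/null); then
        echo "found: $path"
    else
        echo "missing: $1"
    fi
}

tool_status sh
```

The web server user's PATH may differ from your login shell's, so run the check as that user. Since the comments in binaryfile.ini describe TextExtractionTool as "the path to the text extraction tool", giving the full path there should avoid the lookup altogether.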

S.

---
If at first you don't succeed, look in the trash for the instructions.

Mindshare Interactive Campaigns

Thursday 29 June 2006 11:09:59 am

I'm clearing out our old posts on this subject so people don't use out of date information - I'll be putting our real working solution in an article shortly.

http://www.mindshare.net

Siniša Šehović

Friday 30 June 2006 12:08:07 am

Hi Mindshare crew :-)

I still can't index my PDF files.

I get these warnings:

domxml_open_mem(): Input is not proper UTF-8, indicate encoding !
 in /srv/www/htdocs/intranet/lib/ezxml/classes/ezxml.php on line 87
domxml_open_mem(): Bytes: 0xE8 0x61 0x6B 0x20
 in /srv/www/htdocs/intranet/lib/ezxml/classes/ezxml.php on line 87

My site is encoded with iso-8859-2.
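
For reference, the bytes in that warning (0xE8 0x61 0x6B 0x20) decode as "čak " in ISO-8859-2, so the XML parser appears to be receiving ISO-8859-2 text where it expects UTF-8. Below is a minimal sketch of converting such text with iconv. It assumes iconv is installed and that the non-UTF-8 bytes come from the extracted text, which the thread does not confirm; to_utf8 is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: convert ISO-8859-2 text on stdin to UTF-8 on stdout.

to_utf8() {
    iconv -f ISO-8859-2 -t UTF-8
}

# The first three bytes from the warning: 0xE8 (octal 350), 'a', 'k'.
printf '\350ak\n' | to_utf8
```

If the extraction tool's output turns out to be the source, piping it through a conversion like this before eZ reads it would be one thing to try.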

Best regards,
S.

---
If at first you don't succeed, look in the trash for the instructions.

Mindshare Interactive Campaigns

Friday 30 June 2006 8:50:46 am

Sinisa,

We're putting our heads together over here to see if we can figure out your problem. We'll post again as soon as we think of something.

-Mindshare

http://www.mindshare.net

Siniša Šehović

Saturday 01 July 2006 4:23:35 am

Hi Mindshare

Thanx for help!

How can I debug the indexing process?
How can I see whether the "extractTxt" script is executed?

Executing the script from the shell works, but it looks like eZ does not fetch the output from the script.
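
Both parser plugins posted earlier in this thread capture only what the tool writes to standard output (one via passthru() inside an output buffer, the other via a ">" redirect). A tool that prints through stderr, or writes its own output file, could therefore look fine in a shell and still index nothing. Here is a rough way to check, with probe_stdout as a hypothetical helper name and printf standing in for the real tool:

```shell
#!/bin/sh
# Sketch: count how many bytes a command writes to stdout, ignoring stderr.

probe_stdout() {
    # "$@" is the full command line, e.g. extractTxt /path/to/file.pdf
    bytes=$("$@" 2>/dev/null | wc -c | tr -d ' ')
    echo "stdout bytes: $bytes"
}

probe_stdout printf 'abc'
```

Substitute your own command line, for example probe_stdout extractTxt /path/to/sample.pdf. A count of zero would mean the parser receives empty metadata even though the tool appears to run.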

Best regards,
S.

---
If at first you don't succeed, look in the trash for the instructions.

Mindshare Interactive Campaigns

Wednesday 05 July 2006 7:05:25 am

Sinisa,

The only way I know to debug the indexing is to run it from the command-line with very verbose debugging turned on, like so:

# php -d -C update/common/scripts/updatesearchindex.php --db-user=[user] \
--db-database=[database] -s [siteaccess] --clean --db-password='[password]' --sql -d -v -v -v -c 

The 3 "-v"s above are for extra verbose output. You should be able to see the index script go through all your content when you run this. If this doesn't give you searchable results, try deleting the index tables and re-creating them. Details here:
http://ez.no/products/ez_publish_cms/documentation/configuration/troubleshooting/i_need_to_reindex_my_site_for_search

Also, make sure you are using the updated code we published to this thread a couple weeks ago for the new PHP class.

I don't know about that UTF error, we've never encountered that one. Good luck!

http://www.mindshare.net
