Indexing Multiple Binary File Types

Our client needed to index PDFs, PowerPoint presentations, Excel spreadsheets, and Word documents. They also indicated they might want to add other file types in the future. The native eZ publish indexing functionality (using pstotext and wvware) handled PDFs and Word documents, but at the time (May 2005) we did not know of a good way to handle Excel or PowerPoint files, so we had to do some discovery work.

We also didn't like the idea of having a separate parsing script (e.g. ezpdfparser.php, ezwordparser.php) for each file type. It seemed more extensible to keep all the parsing code in one file, where we could add a new condition to our case statement as we added new file types.

In addition, we found that both pstotext and xpdf caused the same problem with very large PDF files - the SQL INSERT statement got too large (this is related to the eZ publish indexer, not the parsing tools) and would crash the indexer, resulting in no further content being indexed. Addressing that, in addition to handling multiple binary types in one file, led us to write our own custom parsing plugin.

Parsers

eZ publish ships with the ability to index PDF files and Word documents (assuming you have installed the pstotext and wvware utilities). However, we found that this functionality didn't meet our needs, so we did an extensive search for other parsing tools. Our solution is based on the tools listed below.

pdftotext (for parsing PDFs): part of xpdf, a full-blown PDF reader that also provides numerous PDF and PS utilities.
catdoc (for parsing Word documents): part of a set of parsers and utilities that also includes:
- catppt (for parsing PowerPoint documents)
- xls2csv (for parsing Excel documents): by default, this parses XLS files into comma-delimited format, but it also provides options to specify other output formats.

These parsers handle PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets. Our solution is customizable, allowing you to add other parsers as needed, but this set of parsers covers the most common file formats.

Install these parsers in a location where they can be executed by your web server user / group.
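
One way to confirm that the parsers are reachable is a small script like the sketch below. It is only an illustration: findMissingParsers() is a hypothetical helper (not part of eZ publish), and it assumes a Unix-like server where the POSIX shell built-in "command -v" is available.

```php
<?php
// Sketch: report which parsing utilities cannot be found on the PATH.
// findMissingParsers() is a hypothetical helper, not part of eZ publish.
function findMissingParsers($aTools)
{
    $aMissing = array();
    foreach ($aTools as $sTool)
    {
        // "command -v" exits with a non-zero status when the tool is not found
        $aOutput = array();
        $iExitCode = 1;
        exec("command -v " . escapeshellarg($sTool) . " 2>/dev/null", $aOutput, $iExitCode);
        if ($iExitCode !== 0)
        {
            $aMissing[] = $sTool;
        }
    }
    return $aMissing;
}

$aMissing = findMissingParsers(array("pdftotext", "catdoc", "catppt", "xls2csv"));
if (count($aMissing) > 0)
{
    print("Missing: " . implode(", ", $aMissing) . "\n");
}
else
{
    print("All parsers found\n");
}
```

Run it as the web server user, since that user's PATH and permissions are the ones that matter when the indexer calls the parsers.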

Configuration

Place the following code in your settings/override/binaryfile.ini.append.php file (in the siteaccess folder of choice):

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=ezbinaryfile
MetaDataExtractor[application/msword]=ezbinaryfile
MetaDataExtractor[application/vnd.ms-excel]=ezbinaryfile
MetaDataExtractor[application/vnd.ms-powerpoint]=ezbinaryfile

# The full path to your log file (used for debugging/testing)
[BinaryFileHandlerSettings]
LogFile=var/log/index.log

Note that this configuration example is for eZ publish version 3.8. If you are using an earlier version of eZ publish (we tried it on 3.6), remove "ez" from the "ezbinaryfile" strings.

Save this file and clear the cache. Next, touch the file named in the LogFile setting to create an empty log file in the specified location. (Make sure that this file is writeable by your web server user / group.)

Rather than call each parsing utility individually, we specified in the configuration file that our custom plugin gets called for every file type. The plugin will then determine which file type is being indexed and call the appropriate parsing utility.

Read this code carefully before you implement it. Note that we are doing things like limiting the number of characters indexed from each file, and also stripping out irregular characters. (We did this to track down a problem we were having with very large files. We think the character limit fixed the issue, but we left the character stripping in there just in case. You may want to remove it and see what kind of results you get.)
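
The two safeguards mentioned above can be tried in isolation before you read the full plugin. The sketch below is a minimal illustration that assumes the same pattern and limit as the plugin; cleanForIndex() is a hypothetical name, since the plugin performs these steps inline.

```php
<?php
// Sketch: reduce parser output to plain words and cap its length.
// cleanForIndex() is a hypothetical name; the plugin does this inline.
function cleanForIndex($sText, $iCharacterLimit)
{
    // Replace everything except ASCII letters, digits and newlines with a space
    $sText = preg_replace('/[^A-Za-z\d\n]/', ' ', $sText);

    // Keep only the first $iCharacterLimit characters
    return substr($sText, 0, $iCharacterLimit);
}

$sSample = "Q3 results: up 12%!\nSee page 4.";
print(cleanForIndex($sSample, 250000) . "\n");
```

Note that this pattern also discards accented and non-Latin letters, so a site with non-English content may want a wider character class or no stripping at all.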

Create the file ezbinaryfileparser.php in the directory
/kernel/classes/datatypes/ezbinaryfile/plugins/.
Place the following code in the PHP file:

<?php
/*!
\class eZBinaryFileParser ezbinaryfileparser.php
\ingroup eZKernel
\brief The class eZBinaryFileParser handles parsing of Word, Excel, Powerpoint, and PDF files and returns the metadata
*/
class eZBinaryFileParser
{
     function &parseFile( $sFileName )
     {
 
          //The number below is the maximum number of characters that we will
          //allow ezpublish to attempt to index per document
          $iCharacterLimit = 250000;
 
          // save the buffer contents
          $sBuffer = ob_get_contents();
 
          ob_end_clean();
          ob_start();
          $sExtension = strtolower(substr($sFileName,-3,3));
 
          if(file_exists($sFileName))
          {
 
               $this->customLog("filename: " . $sFileName . "\n");
 
               // Quote the file name so spaces and shell metacharacters cannot break the command
               $sFileArg = escapeshellarg($sFileName);

               switch($sExtension):
                    case "pdf":
                         $sCommand = "pdftotext -nopgbrk -enc UTF-8 " . $sFileArg . " -";
                    break;
                    case "doc":
                         $sCommand = "catdoc " . $sFileArg;
                    break;
                    case "xls":
                         $sCommand = "xls2csv -c -q0 " . $sFileArg;
                    break;
                    case "ppt":
                         $sCommand = "catppt " . $sFileArg;
                    break;
                    default:
                         $this->customLog("Invalid File Type\n\n");
                         // put the saved output back before bailing out
                         print($sBuffer);
                         return false;
               endswitch;
 
               $aSpec = array(
                    0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
                    1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
                    2 => array("pipe", "w")   // stderr is a pipe that the child will write to.
               );
 
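               $pHandle = proc_open($sCommand, $aSpec, $aPipes);

               // The parsers read the file themselves, so close their stdin right away
               fclose($aPipes[0]);

               // Start from empty strings so that .= does not act on undefined variables
               $sData = "";
               $sError = "";

               while (!feof($aPipes[1]) )
               {
                    $sData .= fread($aPipes[1], 8192);
               }
               while (!feof($aPipes[2]) )
               {
                    $sError .= fread($aPipes[2], 8192);
               }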
 
               if($sError)
               {
                    $this->customLog( $sError );
               }
 
               $bReturn = fclose($aPipes[1]);
               $bReturn = fclose($aPipes[2]);
 
               $iExitCode = proc_close($pHandle);
 
               $sData = preg_replace("([^A-Za-z\d\n])", " ", $sData);
 
               if($sExtension != "pdf")
               {
                    $sData = utf8_encode($sData);
               } 
 
               //Trim Data down to acceptable size.
               $sData = substr($sData, 0, $iCharacterLimit);
 
          } //if file exists
          else
          {
               $this->customLog("$sFileName was missing...\n");
               $sData = "";
          }
 
          ob_end_clean();
 
          // fill the buffer with the old values
          ob_start();
          print($sBuffer);
          return $sData;
 
     } //end method parseFile()
 
     function customLog($sData)
     {
          $oBinaryINI =& eZINI::instance( 'binaryfile.ini' );
          $sLogFile = $oBinaryINI->variable( 'BinaryFileHandlerSettings', 'LogFile' );
 
          $sData = date("m/d/Y [H:i] ") . " " . $sData;
 
          // Let's make sure the file exists and is writable first.
          if (is_writable($sLogFile))
          {
 
               // Open the log file in append mode. The file pointer is at
               // the end of the file, so that is where $sData will go
               // when we fwrite() it.
               if (!$pHandle = fopen($sLogFile, 'a'))
               {
                    fwrite(STDERR,"Cannot open file ($sLogFile)");
                    return false;
               }
 
               // Write data to our opened file.
               if (fwrite($pHandle, $sData) === FALSE)
               {
                    fwrite(STDERR,"Cannot write to file ($sLogFile)");
                    return false;
               }
 
               fclose($pHandle);
               return true;
 
          }
          else
          {
               fwrite(STDERR,"The file $sLogFile is not writable");
               return false;
          } //end is_writable
     } //end method customLog()
} //end class eZBinaryFileParser
?>
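
If you expect to add file types regularly, the switch statement in parseFile() could also be written as a lookup table, so that a new type becomes a single array entry. The following is only a sketch of that idea under our own assumptions, not the code we deployed: buildParserCommand() is a hypothetical helper, and the command strings are copied from the plugin above.

```php
<?php
// Sketch: map file extensions to parser commands in one array.
// buildParserCommand() is a hypothetical helper, not part of the plugin above.
function buildParserCommand($sFileName)
{
    // "%s" is replaced by the shell-quoted file name
    $aParsers = array(
        "pdf" => "pdftotext -nopgbrk -enc UTF-8 %s -",
        "doc" => "catdoc %s",
        "xls" => "xls2csv -c -q0 %s",
        "ppt" => "catppt %s"
    );

    // pathinfo() copes with extensions of any length, unlike substr($sFileName, -3, 3)
    $aInfo = pathinfo($sFileName);
    $sExtension = isset($aInfo["extension"]) ? strtolower($aInfo["extension"]) : "";

    if (!isset($aParsers[$sExtension]))
    {
        // unknown file type: the caller should log it and skip the file
        return false;
    }

    return sprintf($aParsers[$sExtension], escapeshellarg($sFileName));
}

print(buildParserCommand("var/storage/original/application/Annual Report.PDF") . "\n");
```

The function returns false for unsupported extensions, which mirrors the default branch of the switch statement.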

Identifying content as searchable

Remember that for a file to be indexed, its content class must be configured as "Searchable". The following steps show how to make the default eZ publish content class "File" searchable. In the Administration Interface:

Click the Setup button in the top navigation bar.
Click the Classes link in the left navigation panel.
Click the Media link in the Class groups section of the page.
Click the File link under Classes Inside [Media].
Click Edit to modify the content class.
In the File attribute section, make sure that Searchable is enabled. This will tell eZ publish to index the contents of objects that belong to the "File" class.
Click OK to save.
Clear the cache.

Manually indexing the site

Depending on the size of your site and the debug flags you pass to the indexing command, the indexing can take anywhere from a few minutes to several hours. We recommend performing the manual re-indexing during off-hours and warning your clients / users that site search (and any template {fetch/search} calls) will not be fully functional during the re-indexing process.

To manually index your site, first clear out your old index so that eZ publish knows it has to start indexing all your existing content. To delete old indexes, run these SQL commands:

DELETE FROM ezsearch_word;

DELETE FROM ezsearch_object_word_link;

Next, on the command line in your site's root folder, run the index command as shown below. This example assumes that your PHP CLI binary is located in /usr/local/bin/php - adjust as necessary. You may also need to adjust the memory limit, depending on your server. (For more information about site reindexing, see the forum topic "I need to reindex my site for search".)

# /usr/local/bin/php -d memory_limit=256M -C \
update/common/scripts/updatesearchindex.php --db-user=[your_db_user] \
--db-database=[your_database] -s [your_site_access] --clean \
--db-password=[your_db_password] -c

That's it. Once the indexing finishes, your site should now properly index binary files whether you are using a cron-based index or the "index on upload" method.

Please feel free to add your own tips and experiences as comments to this article.