Solr Indexing Error

Sylvain Gogel

Tuesday 07 April 2009 5:04:14 am

Hi there. While running the eZ Find 2 indexing, I noticed that some data is not indexed >_<

After some digging, I found that Solr::addDocs() has some serious issues:

    function addDocs( $docs = array(), $commit = true, $optimize = false )
    {
        if ( !is_array( $docs ) )
        {
            echo( "docs is not an array\n" );
            return false;
        }
        if ( count( $docs ) == 0 )
        {
            echo( "docs is empty\n" );
            return false;
        }
        else
        {
            $postString = '<add>';
            foreach ( $docs as $doc )
            {
                $postString .= $doc->docToXML();
            }
            $postString .= '</add>';

            //echo( $postString . "\n" );

            $updateResult = $this->postQuery( '/update', $postString, 'text/xml' );
            echo $updateResult;

That last echo outputs some Java errors:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>ParseError at [row,col]:[25,1]
Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[25,1]
Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
  at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
  at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
  at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
  at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
  at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</pre>
<p>RequestURI=/solr/update</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

Obviously the generated XML is not parsable, and the resulting content is not indexed!
The content objects contain binary PDF files and images.
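
For reference, the 0xc in the error above is the form feed character (U+000C). XML 1.0 allows only tab, line feed and carriage return below 0x20, and text extracted from PDFs often contains form feeds as page separators. Below is a minimal shell sketch for checking whether extracted text contains such bytes. The function name count_invalid_xml_bytes is made up for illustration and is not part of eZ Find or Solr.

```shell
#!/bin/sh
# Hypothetical helper: count the control bytes on stdin that XML 1.0 forbids.
# Tab (\011), line feed (\012) and carriage return (\015) are allowed and are
# therefore not counted.
count_invalid_xml_bytes() {
    # -c complements the set and -d deletes, so only the forbidden bytes
    # survive; wc -c then counts them. The last tr strips any padding
    # that some wc implementations add around the number.
    tr -cd '\000-\010\013\014\016-\037' | wc -c | tr -d ' '
}
```

For example, piping the output of a text extraction tool such as pdftotext into count_invalid_xml_bytes prints how many offending bytes the extraction produced. Anything above 0 would be expected to trigger the same ParseError in Solr.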

Does anyone have a fix for eZ Find stable?

--
http://www.ecedi.fr
Agence Web, Créa/Conseils, Accessibilité
eZPublish, Drupal, Zend, Symfony

Sylvain Gogel

Tuesday 07 April 2009 5:12:41 am

I use both

[PDFHandlerSettings]
TextExtractionTool=pstotext

and

[PDFHandlerSettings]
TextExtractionTool=mypdftotext

The latter is a shell script based on the xpdf tool pdftotext:

#!/bin/sh
/usr/bin/pdftotext -enc "UTF-8" "$1" -
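
One possible workaround, sketched here under assumptions: pdftotext separates pages with form feed characters (0x0c), which XML 1.0 does not allow, so the wrapper could filter them out along with the other forbidden control bytes before eZ Find sees the text. The function name strip_invalid_xml_bytes is hypothetical, and this variant has not been tested against eZ Find itself.

```shell
#!/bin/sh
# Hypothetical variant of the mypdftotext wrapper.
# Deletes the control bytes XML 1.0 forbids, but keeps tab (\011),
# line feed (\012) and carriage return (\015).
strip_invalid_xml_bytes() {
    tr -d '\000-\010\013\014\016-\037'
}

# Only run the extraction when a file argument was passed in.
if [ "$#" -ge 1 ]; then
    /usr/bin/pdftotext -enc "UTF-8" "$1" - | strip_invalid_xml_bytes
fi
```

pdftotext also has a -nopgbrk option that suppresses the form feeds between pages. The tr filter is more general because it also catches other stray control bytes, whatever tool produced them.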


Geoff Bentley

Wednesday 08 April 2009 10:15:04 pm

Another option is to use the eZ Tika extension, which allows indexing of a large variety of binary file types such as MS Word, other MS Office formats, PDF, Excel and ODF:

* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/

Christian Rößler

Friday 08 May 2009 4:40:23 pm

It seems that a UTF-8 double-byte character is being interpreted as two 8-bit characters, which is not correct.

I think I have the same issue here (German umlauts and the like), and I found a promising page that explains the Tomcat/Solr charset settings:

http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

Perhaps this is the issue: XML cannot contain those characters, so the Solr XML parser fails. Can you try it out? I currently don't have access to an eZ installation.

cheers,
christian

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613
