Solr Indexing Error

Author	Message
Sylvain Gogel	Tuesday 07 April 2009 5:04:14 am
Hi there,

While running ezfind2 indexing I noticed that some data is not indexed >_<

After some digging I found out that Solr::addDocs() has a serious problem:

    function addDocs ( $docs = array(), $commit = true, $optimize = false )
    {
        //
        if (! is_array( $docs ) )
        {
            echo("docs is not an array\n");
            return false;
        }
        if ( count ( $docs ) == 0)
        {
            echo("docs is empty\n");
            return false;
        }
        else
        {
            $postString = '<add>';
            foreach ( $docs as $doc )
            {
                $postString .= $doc->docToXML();
            }
            $postString .= '</add>';
            //echo($postString."\n");
            $updateResult = $this->postQuery ( '/update', $postString, 'text/xml' );
            echo $updateResult;

This last echo outputs some Java errors:

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <title>Error 500 </title>
    </head>
    <body><h2>HTTP ERROR: 500</h2><pre>ParseError at [row,col]:[25,1]
    Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.

    javax.xml.stream.XMLStreamException: ParseError at [row,col]:[25,1]
    Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
        at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
    </pre>
    <p>RequestURI=/solr/update</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

Obviously the generated XML cannot be parsed, so the resulting content is not indexed! The content object contains binary PDF files and images.

Does anyone have a fix for eZ Find stable?

--
http://www.ecedi.fr
Web agency, design/consulting, accessibility
eZPublish, Drupal, Zend, Symfony
Sylvain Gogel	Tuesday 07 April 2009 5:12:41 am
I use both

    [PDFHandlerSettings]
    TextExtractionTool=pstotext

and

    [PDFHandlerSettings]
    TextExtractionTool=mypdftotext

The latter is a shell script based on the xpdf tool pdftotext:

    #!/bin/sh
    /usr/bin/pdftotext -enc "UTF-8" $1 -

--
http://www.ecedi.fr
Web agency, design/consulting, accessibility
eZPublish, Drupal, Zend, Symfony
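
A note on the error reported above: Unicode 0xc is the form feed character, which XML 1.0 does not allow in element content and which pdftotext writes at page breaks by default. One possible workaround is to filter the extracted text before it reaches Solr. The following is only a minimal sketch of such a wrapper, assuming the same /usr/bin/pdftotext path as in the script above; the function name strip_xml_invalid is hypothetical, and none of this has been tested against eZ Find.

```shell
#!/bin/sh
# Hypothetical variant of the mypdftotext wrapper (an assumption, not eZ Find code).

# strip_xml_invalid: read text on stdin and write it to stdout with the
# control characters that XML 1.0 forbids taken out.
# Form feeds (0x0C) become newlines so the words around a page break stay
# separated; the other forbidden bytes (0x00-0x08, 0x0B, 0x0E-0x1F) are
# deleted. Tab, line feed and carriage return are legal and are kept.
# Bytes below 0x20 never occur inside a UTF-8 multi-byte sequence, so a
# byte-wise filter is safe for UTF-8 text.
strip_xml_invalid() {
    LC_ALL=C tr '\014' '\n' | LC_ALL=C tr -d '\000-\010\013\016-\037'
}

# When called with a PDF path, extract the text and filter it.
if [ "$#" -gt 0 ]; then
    /usr/bin/pdftotext -enc "UTF-8" "$1" - | strip_xml_invalid
fi
```

pdftotext also has a -nopgbrk option that suppresses the page-break form feeds at the source; the filter additionally catches other stray control characters an extraction tool might produce.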
Geoff Bentley	Wednesday 08 April 2009 10:15:04 pm
Another option is to use the eZ Tika extension, which allows indexing of a large variety of binary file types like MsWord, MsOffice, PDF, Excel, ODF:
* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/
Christian Rößler	Friday 08 May 2009 4:40:23 pm
It seems that a UTF-8 double-byte character is being interpreted as two 8-bit characters, which is not correct. I think I have the same issue here (German umlauts and the like), and I found a promising page that explains the Tomcat/Solr charset settings:
http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

Perhaps this is the issue: XML cannot contain those characters, so the Solr XML parser crashes. Can you try it out? I currently do not have access to an eZ installation.

Cheers,
Christian

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613
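
One way to tell a charset problem apart from stray control characters is to inspect the extracted text directly. A misread multi-byte UTF-8 character only ever produces bytes of 0x80 and above, whereas the error in the first post names 0xc, a control character. The sketch below assumes iconv is installed; the function names count_xml_invalid and is_valid_utf8 are hypothetical helpers, not part of eZ Find or Solr.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers (assumptions for illustration only).

# count_xml_invalid: print how many bytes on stdin are control characters
# that XML 1.0 forbids (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F).
# tr -cd deletes everything that is NOT in the set, so only the
# forbidden bytes are left for wc to count.
count_xml_invalid() {
    LC_ALL=C tr -cd '\000-\010\013\014\016-\037' | wc -c | tr -d ' '
}

# is_valid_utf8: succeed (exit status 0) if stdin is well-formed UTF-8.
# Converting UTF-8 to UTF-8 makes iconv validate every byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

For example, piping the output of the extraction script for a problem file (say, a hypothetical file.pdf) into count_xml_invalid shows whether forbidden control characters are present, and piping it into is_valid_utf8 shows whether the encoding itself is broken. If the count is non-zero while the UTF-8 check passes, the Tomcat charset settings are unlikely to be the cause of this particular error.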