Accented characters are not working in solr search

Author

Message

Praveen Kumar

Tuesday 16 August 2011 5:00:23 pm

Hi,
This is Praveen. I am using Apache Solr in our project to support search on cities. I am having a problem with accented characters while searching.
For example:
My city name is 'vrély'.
If I search for 'vr*', it returns the result.
But if I search for 'vrél*', it does not return any results.
If I search without the accented character, like 'vre*', it returns results again.
My city field type is "text", and its definition in schema.xml is as follows:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
Any suggestions or solutions to resolve my problem would be appreciated.
Thanks in advance...
Regards,
Praveen Kumar

Ivo Lukac

Wednesday 17 August 2011 12:59:15 am

There could be two things:

- either your index and query analyzers are not the same (there is a small difference: the query analyzer has catenateWords="0" catenateNumbers="0"), so the tokens are not the same in both situations, or

- the "é" character is somehow badly encoded when it is sent to Solr as a query.

I had a similar problem before when I used Jetty; it didn't support UTF-8 queries very well, so I switched to Tomcat. It could be that Jetty has resolved those issues in a newer version; I didn't check.
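
To check the encoding theory, it can help to look at what the query looks like on the wire. The following is a minimal Python sketch, not anything taken from Solr itself: it percent-encodes the example query 'vrél*' as UTF-8 and as Latin-1, and then shows what a server would see if it decoded UTF-8 bytes with the wrong charset. The variable names are only illustrative.

```python
from urllib.parse import quote, unquote

query = "vrél*"

# Percent-encode the query the way a client would put it into a URL.
# "*" is kept literal so the wildcard stays readable.
utf8_encoded = quote(query, safe="*", encoding="utf-8")
latin1_encoded = quote(query, safe="*", encoding="latin-1")

print(utf8_encoded)    # vr%C3%A9l*
print(latin1_encoded)  # vr%E9l*

# If the client sends UTF-8 but the servlet container decodes the URL
# as Latin-1, the single "é" turns into two unrelated characters.
misdecoded = unquote(utf8_encoded, encoding="latin-1")
print(misdecoded)      # vrÃ©l*
```

If the request log shows something like the last line instead of 'vrél*', the container is decoding the URL with the wrong charset.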

Anyway, you need to be aware that "vrély" is always tokenized as "vrely"; that is why you are finding it with vr* and vre*.
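
The effect of the folding can be sketched in a few lines of Python. This is only a rough model under stated assumptions: `fold` is a hypothetical stand-in for LowerCaseFilterFactory plus ASCIIFoldingFilterFactory, and the stemmer and the other filters are left out. The `analyze_prefix` switch models the assumption that Solr releases of this era do not run wildcard terms through the query analyzer, which, if it applies to your version, would also explain why 'vrél*' finds nothing even when the encoding is fine.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip accents (rough model of lowercase + ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The term stored in the index for the city name (stemming ignored here).
indexed_term = fold("vrély")


def prefix_matches(prefix: str, analyze_prefix: bool) -> bool:
    """Check a prefix query against the indexed term.

    analyze_prefix=False models a wildcard term that is used as typed;
    analyze_prefix=True models folding the prefix before matching.
    """
    term = fold(prefix) if analyze_prefix else prefix
    return indexed_term.startswith(term)


for prefix in ("vr", "vre", "vrél"):
    print(prefix, prefix_matches(prefix, False), prefix_matches(prefix, True))
```

In this model 'vr' and 'vre' match either way, while 'vrél' matches only when the prefix is folded first. A practical workaround under the same assumption is to fold accents in the application before building the wildcard query.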

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Philippe VINCENT-ROYOL

Wednesday 17 August 2011 1:24:20 am

Just a question: which version of Solr do you use?

Certified Developer (4.1): http://auth.ez.no/certification/verify/272607
Certified Developer (4.4): http://auth.ez.no/certification/verify/377321

G+ : http://plus.tl/dspe
Twitter : http://twitter.com/dspe