Accented characters are not working in Solr search
Praveen Kumar
Tuesday 16 August 2011 5:00:23 pm
Hi, this is Praveen. I am using Apache Solr in our project to support search on cities, and I am having a problem with accented characters while searching.
For example, my city name is 'vrély'. If I search for 'vr*', I get the result. But if I search for 'vrél*', I get no results. If I search without the accented character, e.g. 'vre*', I get results again.
My city field type is "text", and its definition in schema.xml is as follows:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
Any suggestions or solutions to resolve my problem would be appreciated. Thanks in advance.
Regards, Praveen Kumar
Ivo Lukac
Wednesday 17 August 2011 12:59:15 am
There could be two causes:
- either your index and query analyzers are not the same (there is a small difference: the query analyzer uses catenateWords="0" catenateNumbers="0"), so the tokens are not the same in both situations, or
- the "é" character is somehow badly encoded when it is sent to Solr as a query
I had a similar problem before when I used Jetty; it didn't support UTF-8 queries very well, so I switched to Tomcat. Jetty may have resolved those issues in a newer version; I didn't check.
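To illustrate the encoding possibility, here is a minimal Python sketch (not Solr or Jetty code) of what happens when a client percent-encodes the query as UTF-8 but the server side decodes it with a different charset. The choice of ISO-8859-1 as the wrong charset is an assumption made for illustration only; the actual behaviour depends on the servlet container and its configuration.

```python
from urllib.parse import quote, unquote

query = "vrél*"

# The client percent-encodes the query as UTF-8 ("é" becomes two bytes: C3 A9).
encoded = quote(query, safe="*", encoding="utf-8")

# A server that decodes the request as UTF-8 recovers the original text.
decoded_ok = unquote(encoded, encoding="utf-8")

# A server that assumes ISO-8859-1 (an assumption for this sketch) reads the
# two UTF-8 bytes of "é" as two separate characters.
decoded_bad = unquote(encoded, encoding="latin-1")

print(encoded)
print(decoded_ok)
print(decoded_bad)
```

If the query that reaches Solr looks like the last line rather than the second, the search term no longer contains "é" at all, so it cannot match the indexed city name.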
Anyway, you need to be aware that "vrély" is always tokenized as "vrely" (the ASCIIFoldingFilterFactory strips the accent at index time); that is why you find it with vr* and vre*.
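The effect of accent folding on prefix searches can be sketched in Python. This is only a rough stand-in for the lowercase and ASCII-folding filters, not Solr's implementation: the stemmer, stop words and word-delimiter filter are left out, and the helper names `fold` and `prefix_matches` are hypothetical. It also assumes that a wildcard term may reach the index without passing through the query analyzer; whether that happens depends on the Solr version and query parser, so treat it as an assumption to verify.

```python
import unicodedata


def fold(text):
    """Lowercase the text and drop combining accents (rough ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Roughly what the index analyzer stores for the city name (stemming omitted).
indexed_token = fold("vrély")


def prefix_matches(prefix, analyze_query):
    """Check a prefix against the indexed token, optionally folding it first."""
    term = fold(prefix) if analyze_query else prefix
    return indexed_token.startswith(term)


for prefix in ("vr", "vre", "vrél"):
    print(prefix, prefix_matches(prefix, analyze_query=False),
          prefix_matches(prefix, analyze_query=True))
```

Under these assumptions, 'vr' and 'vre' match either way, while 'vrél' matches only when the prefix itself is folded before the lookup. If that is what is happening, stripping the accents from the query on the client side before sending it would be one possible workaround.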
Philippe VINCENT-ROYOL
Wednesday 17 August 2011 1:24:20 am
Just a question: which version of Solr do you use?