Accented characters are not working in solr search

Praveen Kumar

Tuesday 16 August 2011 5:00:23 pm

Hi,
This is Praveen. I am using Apache Solr in our project to support searching on cities. I am having a problem with accented characters in searches.
For example:
My city name is 'vrély'.
If I search for 'vr*', it returns the result.
But if I search for 'vrél*', it returns no results.
If I search without the accented character, e.g. 'vre*', it returns results again.
My city field type is "text", and its definition in schema.xml is as follows:
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                
                
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
            </analyzer>
        </fieldType>
Any suggestions or solutions to resolve my problem would be appreciated.
Thanks in advance.
Regards, 
Praveen Kumar 

Ivo Lukac

Wednesday 17 August 2011 12:59:15 am

There could be two causes:

- either your index and query analyzers are not the same (there is a small difference: the query analyzer has catenateWords="0" catenateNumbers="0"), so the tokens differ between the two, or

- the "é" character is somehow badly encoded when it is sent to Solr in the query.

I had a similar problem before when I used Jetty; it didn't support UTF-8 queries very well, so I switched to Tomcat. Jetty may have resolved those issues in a newer version; I didn't check.
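
To check the encoding possibility, it can help to look at what the query string looks like on the wire. The following is a minimal Python sketch, not something taken from the poster's setup: the field name "city" is an assumption made for illustration. It shows how the same query percent-encodes under UTF-8 and under Latin-1; if the client encodes one way and the servlet container decodes the other way, the "é" would arrive garbled.

```python
from urllib.parse import urlencode

# Hypothetical query: the field name "city" is assumed for illustration only.
params = {"q": "city:vrél*"}

# Percent-encode the same query under two different character encodings.
utf8_qs = urlencode(params, encoding="utf-8")
latin1_qs = urlencode(params, encoding="latin-1")

print(utf8_qs)    # q=city%3Avr%C3%A9l%2A
print(latin1_qs)  # q=city%3Avr%E9l%2A
```

If the request log of the servlet container shows %E9 rather than %C3%A9 (or the reverse of what the container expects), the encoding mismatch is a plausible culprit.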

Anyway, be aware that "vrély" is always folded to "vrely" at index time (the ASCIIFoldingFilterFactory strips the accent), which is why you find it with vr* and vre*.
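
One likely explanation for 'vrél*' failing is that, in Solr releases of that period, wildcard terms were generally not run through the query analyzer, so the "é" is compared unfolded against the folded index term. A possible client-side workaround is to fold the prefix before building the wildcard query. The sketch below is only an approximation under stated assumptions: the helper names and the field name "city" are hypothetical, and Unicode decomposition covers common accents but not everything ASCIIFoldingFilterFactory handles (for example "ø" or "ß").

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate ASCII folding: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_prefix_query(field: str, prefix: str) -> str:
    """Build a wildcard query whose prefix is lowercased and accent-folded.

    Hypothetical helper: it does not escape Solr query syntax characters.
    """
    return f"{field}:{fold_accents(prefix).lower()}*"


print(build_prefix_query("city", "vrél"))  # city:vrel*
```

The folded query 'vrel*' then has the same form as the 'vre*' search that already returns results.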

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Philippe VINCENT-ROYOL

Wednesday 17 August 2011 1:24:20 am

Just a question: which version of Solr are you using?

Certified Developer (4.1): http://auth.ez.no/certification/verify/272607
Certified Developer (4.4): http://auth.ez.no/certification/verify/377321

G+ : http://plus.tl/dspe
Twitter : http://twitter.com/dspe