This tutorial describes how eZ Find transforms and adapts eZ Publish content, and its datatypes, in order to index them in Solr. Along the way, it presents a few of the novelties of version 2.2. Understanding these low-level mechanisms is an essential prerequisite for development and debugging, if only to know where to look for key information, or to be able to read the code snippets that reveal the exact role of a given configuration directive, parameter or filter.
This tutorial assumes you know how to set up eZ Find. The online documentation describes the required steps in detail: http://ez.no/doc/extensions/ez_find/2_2.
The following tips make development on eZ Find more comfortable, and significantly increase both your speed as a developer and the quality of your work.
Re-indexing all content merely to test the impact of a single, minor modification can quickly become a drawn-out process. The indexing script, /bin/php/updatesearchindexsolr.php, accepts a few little-known, life-saving arguments that restrict indexing to a subtree (--topNodeID) and to a slice of its content (--offset and --limit).
The three parameters must be used together:
php extension/ezfind/bin/php/updatesearchindexsolr.php --siteaccess=mysiteaccess --topNodeID=2546 --offset=0 --limit=10
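When a subtree is too large for a single run, the same three parameters can be used to walk through it slice by slice. The following is a minimal Python sketch that only builds the command lines; the helper name `reindex_commands` and the `total` argument (the number of objects to cover, which you have to know or estimate yourself) are assumptions made for illustration, not part of eZ Find.

```python
from typing import List

# Script path as used in the command above.
SCRIPT = "extension/ezfind/bin/php/updatesearchindexsolr.php"


def reindex_commands(siteaccess: str, top_node_id: int,
                     total: int, batch_size: int) -> List[List[str]]:
    """Build one command per slice of `batch_size` objects under `top_node_id`.

    `total` is a hypothetical, caller-supplied upper bound on the number
    of objects to index; nothing here queries eZ Publish for it.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    commands = []
    for offset in range(0, total, batch_size):
        commands.append([
            "php", SCRIPT,
            f"--siteaccess={siteaccess}",
            f"--topNodeID={top_node_id}",
            f"--offset={offset}",
            f"--limit={batch_size}",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in reindex_commands("mysiteaccess", 2546, total=30, batch_size=10):
        print(" ".join(cmd))
```

Each printed line can then be run by hand, or passed to `subprocess.run` from the eZ Publish root directory.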
To check whether the content and its attributes were properly indexed, simply search for them in Solr's web administration interface: http://localhost:8983/solr/admin/. This interface also helps you find out how the Solr fields are named. For instance, with version 2.2 of eZ Find, two seemingly duplicate fields appear for article titles:
The rest of the tutorial explains this behaviour.
By keeping an active console open, you can view all the messages sent to Solr, which take the following form:
INFO: [] webapp=/solr path=/select params={ ... MESSAGE ... } status=400 QTime=5
This message can be copied from the console and pasted at the end of the following URL, in Solr's administration interface: http://localhost:8983/solr/select/?MESSAGE. The result is the exact output Solr sends to eZ Find, before the results are transformed and displayed. This trick is quite useful when debugging: you can, for example, manipulate the message directly until it returns the expected result.
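The copy-and-paste step can be automated. The sketch below, in Python, extracts the content of `params={...}` from a console line and appends it to the select URL. It assumes the logged parameters are already in query-string form (`name=value` pairs joined by `&`); the function name `log_line_to_url` is purely illustrative.

```python
import re
from typing import Optional

# Base URL taken from the tutorial; adjust host and port to your setup.
SELECT_URL = "http://localhost:8983/solr/select/?"

# Capture everything between "params={" and the "}" that precedes "status=".
_PARAMS = re.compile(r"params=\{(.*)\}\s+status=")


def log_line_to_url(line: str, base: str = SELECT_URL) -> Optional[str]:
    """Return a replayable select URL for a Solr log line, or None if
    the line does not contain a params={...} block."""
    match = _PARAMS.search(line)
    if match is None:
        return None
    return base + match.group(1)


if __name__ == "__main__":
    sample = ("INFO: [] webapp=/solr path=/select "
              "params={q=title:test&start=0&rows=10} status=0 QTime=5")
    print(log_line_to_url(sample))
```

The sample log line is made up for the demonstration; real lines will carry the parameters eZ Find actually sent.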
Here is the code execution flow when an eZ Publish content object is added/updated:
eZ Find was built as an eZ Publish search plug-in: the /content/search view was natively designed so that its engine can be externalized as a search plug-in. Thus, in the eZ Find extension, the /ezfind/settings/site.ini.append.php file declares eZ Find as the search plug-in:
[SearchSettings]
SearchEngine=ezsolr
The getEngine() method (kernel/classes/ezsearch.php) then takes care of retrieving the PHP class that implements the search plug-in, namely /search/plugins/ezsolr/ezsolr.php in our case.
For every content update operation, the addObject method of the eZSolr class is invoked. This plug-in-based design lets eZ Find inherit other native features of eZ Publish's search, such as DelayedIndexing, which makes asynchronous content indexing possible (globally or per content class). This mechanism relies on a scheduled task (cronjob), indexcontent.php:
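To make the lookup concrete, here is a small Python sketch of the idea: read the SearchEngine setting and derive the plug-in file from it. This is an illustration only, not the code of getEngine(); in particular, the path convention `search/plugins/<engine>/<engine>.php` is an assumption generalized from the single ezsolr example above, and `search_plugin_path` is a hypothetical name.

```python
import configparser


def search_plugin_path(ini_text: str) -> str:
    """Derive the plug-in file path from the [SearchSettings] block.

    The search/plugins/<engine>/<engine>.php layout is an assumption
    inferred from the ezsolr example, not a documented rule.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(ini_text)
    engine = parser["SearchSettings"]["SearchEngine"]
    return f"search/plugins/{engine}/{engine}.php"


if __name__ == "__main__":
    print(search_plugin_path("[SearchSettings]\nSearchEngine=ezsolr\n"))
```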
[SearchSettings]
DelayedIndexing=disabled|enabled|classbased
DelayedIndexingClassList[]
DelayedIndexingClassList[]=mycontentclass
Note :
This technique is particularly effective for improving back office response time (and consequently the user experience), or for regular content imports. However, DelayedIndexing suffers from a few limitations: it is generic (read: not specifically optimized for eZ Find), and merely loops over the entries of the 'ezpending_actions' table to index the objects one by one, without using Solr refinements such as batch indexing of content followed by a single 'commit'.
The addObject method (/search/plugins/ezsolr/ezsolr.php) takes the following parameters:
In brief, this method acts as follows:
Note: since eZ Find 2.2, it is possible to use language-specific 'cores' (/extension/ezfind/java/solr.multicore), allowing for per-language indexes and configuration files (spellings.txt, synonyms.txt, stopwords.txt, etc.).
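To illustrate what batch indexing followed by a single commit looks like on the wire, the Python sketch below builds Solr XML update messages (`<add>` with several `<doc>` elements, then one `<commit/>`). It only builds strings and does not talk to Solr; the helper names and the sample field values are hypothetical, and this is not how eZ Find or the DelayedIndexing cronjob is implemented.

```python
from typing import Dict, Iterable, List
from xml.sax.saxutils import escape, quoteattr


def add_message(docs: Iterable[Dict[str, str]]) -> str:
    """Build one Solr <add> message containing every document in `docs`."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(f"<field name={quoteattr(name)}>{escape(value)}</field>")
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batch_update_messages(docs: List[Dict[str, str]], batch_size: int) -> List[str]:
    """Group documents into <add> batches and finish with a single <commit/>."""
    messages = [add_message(docs[i:i + batch_size])
                for i in range(0, len(docs), batch_size)]
    messages.append("<commit/>")
    return messages


if __name__ == "__main__":
    sample = [{"attr_title_s": "A & B"}, {"attr_title_s": "C"}, {"attr_title_s": "D"}]
    for message in batch_update_messages(sample, batch_size=2):
        print(message)
```

Compared with one add and one commit per object, this sends fewer requests and makes Solr commit only once.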
For this, we need to brush up on the way eZ Find names the fields it sends to Solr. Advanced usages and techniques for eZ Find require an in-depth understanding of the field semantics and naming mechanism.
The /ezfind/classes/ezfsolrdocumentfieldbase.php class, or classes inheriting from it when complex datatypes are in use (see my contribution, ezfsolrdocumentfieldobjectrelation, for instance), takes care of creating the Solr field names, following the precise semantics detailed here:
Solr name :
attr_[contentattributename]_[contentattributetype],
example: 'attr_title_s'. Note the absence of the content class identifier, which opens up nice possibilities such as filtering on several content classes whose attributes have identical names (this will be covered in another post).
Mapping in eZ Find's fetch function (in 'filter' for example) :
[contentclassname]/[contentattributename]/[contentsubattributename],
example: 'article/title'. The content class name, 'article', is used as an additional filter when building the query. 'title' is transformed into 'attr_title_s', the _s being inferred from eZ Find's settings (see below).
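The mapping just described can be sketched as follows, in Python. The sketch splits a 'class/attribute' path into a class restriction and a Solr field name. Two things are assumptions for the sake of the example: the suffix table `ATTRIBUTE_SUFFIX` (in eZ Find the type is inferred from settings) and the use of `meta_class_identifier_s` as the field carrying the class restriction.

```python
from typing import Dict, Tuple

# Hypothetical attribute -> type suffix table; eZ Find infers this from its settings.
ATTRIBUTE_SUFFIX: Dict[str, str] = {"title": "s"}


def map_filter_path(path: str) -> Tuple[str, str]:
    """Turn 'class/attribute' into (class filter, Solr field name).

    The class filter syntax is an assumption used for illustration.
    """
    parts = path.split("/")
    if len(parts) != 2:
        raise ValueError("expected 'contentclassname/contentattributename'")
    class_name, attribute = parts
    suffix = ATTRIBUTE_SUFFIX[attribute]
    class_filter = f"meta_class_identifier_s:{class_name}"
    return class_filter, f"attr_{attribute}_{suffix}"


if __name__ == "__main__":
    print(map_filter_path("article/title"))
```

The class name never ends up in the attribute field name itself; it only contributes the extra filter.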
Solr name :
meta_[metadataname]_[metadatatype], example : meta_class_identifier_s
Mapping in eZ Find's fetch function (for sorting for example) :
Internal usage, to sort on 'class_identifier' for instance.
Solr name :
subattr_[contentattributename]-[contentsubattributename]_[contentsubattributetype], example : subattr_relatedimage-alttext_s
Natively, the subattribute concept is hardly used, because the standard features of eZ Publish do not really require it. It is nevertheless there as an opening for advanced usages and is, for instance, a great tool for extending eZ Find and indexing additional fields.
As an example, check my contribution, ezfsolrdocumentfieldobjectrelation, which indexes all attributes of an object's related objects and stores them as subattributes. This makes it possible to apply all sorts of operations to these subattributes (search, filtering, faceting), using the 'myclass/myattribute/mysubattribute' syntax.
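The three naming patterns seen so far (attributes, metadata, subattributes) can be summed up in a few lines of Python. This is a sketch of the naming scheme only, with hypothetical function names; the type suffix is passed in explicitly because its derivation from settings is a separate matter.

```python
def attribute_field(attribute: str, type_suffix: str) -> str:
    """attr_[contentattributename]_[contentattributetype]"""
    return f"attr_{attribute}_{type_suffix}"


def meta_field(name: str, type_suffix: str) -> str:
    """meta_[metadataname]_[metadatatype]"""
    return f"meta_{name}_{type_suffix}"


def subattribute_field(attribute: str, subattribute: str, type_suffix: str) -> str:
    """subattr_[contentattributename]-[contentsubattributename]_[contentsubattributetype]"""
    return f"subattr_{attribute}-{subattribute}_{type_suffix}"


def field_for_path(path: str, type_suffix: str) -> str:
    """Map 'class/attribute' or 'class/attribute/subattribute' to a field name.

    The class part is not used in the field name.
    """
    parts = path.split("/")
    if len(parts) == 2:
        return attribute_field(parts[1], type_suffix)
    if len(parts) == 3:
        return subattribute_field(parts[1], parts[2], type_suffix)
    raise ValueError("expected two or three '/'-separated parts")


if __name__ == "__main__":
    print(attribute_field("title", "s"))
    print(meta_field("class_identifier", "s"))
    print(field_for_path("myclass/myattribute/mysubattribute", "s"))
```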
The next posts in this series about eZ Find will explain how to leverage this mechanism, presenting the full portfolio of possible applications.
In Solr field names, the last part indicates the field type, and consequently how Solr will process the information (string, text, date, array, etc.). On the eZ Find side, this definition is made the standard way, through settings. Since eZ Find 2.2, it is possible to define a field type per usage context:
For example, by default in eZ Find 2.2, “Text line” attributes (ezstring) are assigned a different Solr field type for sorting:
The consequence on the Solr side is that “Text line” attributes are systematically mapped to two different fields:
Note :
It is also recommended to use the 'string' type for faceting, so that Solr treats the whitespace character as a “normal” character and returns complete facets like 'my facet' instead of 'my' and 'facet'.
Solr relies on several configuration files. One of them tells it that a '_s' at the end of a field name means string, '_t' means text, etc. This file is /ezfind/java/solr/conf/schema.xml.
This configuration file contains the hard-coded definition of a number of fields (metadata fields for instance), but also defines the so-called dynamic fields, which associate dynamic field names (those of content attributes coming from eZ Publish) with Solr types:
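The per-context type selection can be pictured with the Python sketch below. The table contents are assumptions chosen to match the behaviour described above (a 'text' type for searching and a 'string' type for sorting and faceting on ezstring); they are not copied from eZ Find's settings files, and the names `DEFAULT_MAP`, `CONTEXT_MAPS` and `field_name` are hypothetical.

```python
from typing import Dict

# Solr type -> field name suffix ('_t' for text, '_s' for string).
TYPE_SUFFIX: Dict[str, str] = {"text": "t", "string": "s"}

# Assumed default mapping, used when no context-specific entry exists.
DEFAULT_MAP: Dict[str, str] = {"ezstring": "text"}

# Assumed context-specific overrides.
CONTEXT_MAPS: Dict[str, Dict[str, str]] = {
    "sort": {"ezstring": "string"},
    "facet": {"ezstring": "string"},
}


def solr_type(datatype: str, context: str = "search") -> str:
    """Pick the Solr type for a datatype, preferring the context override."""
    return CONTEXT_MAPS.get(context, {}).get(datatype, DEFAULT_MAP[datatype])


def field_name(attribute: str, datatype: str, context: str = "search") -> str:
    """Build the attribute field name used in the given context."""
    return f"attr_{attribute}_{TYPE_SUFFIX[solr_type(datatype, context)]}"


if __name__ == "__main__":
    for ctx in ("search", "sort", "facet"):
        print(ctx, field_name("title", "ezstring", ctx))
```

Under these assumptions, a single “Text line” attribute yields two distinct Solr fields, which is why seemingly duplicate fields show up in the index.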
<dynamicField name="*_i" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_si" type="sint" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_sf" type="sfloat" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_sd" type="sdouble" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<dynamicField name="*_sl" type="slong" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_random" type="random" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_k" type="keyword" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_lk" type="lckeyword" indexed="true" stored="true" multiValued="true"/>
<!-- some trie-coded dynamic fields for faster range queries -->
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<!-- geopoint for geospatial/location searches, boosting, ... -->
<dynamicField name="*_gpt" type="geopoint" indexed="true" stored="true"/>
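To see how a field name coming from eZ Find ends up with a Solr type, the Python sketch below resolves a name against a subset of the dynamic field patterns listed above. It is a simplified illustration, not Solr's implementation: it only handles patterns with a leading `*`, and trying the longest pattern first is a choice made for this sketch.

```python
from typing import Dict, Optional

# Subset of the dynamicField declarations from schema.xml: pattern -> Solr type.
DYNAMIC_FIELDS: Dict[str, str] = {
    "*_i": "int",
    "*_s": "string",
    "*_t": "text",
    "*_dt": "date",
    "*_tdt": "tdate",
    "*_k": "keyword",
    "*_lk": "lckeyword",
}


def resolve_type(field_name: str,
                 dynamic_fields: Dict[str, str] = DYNAMIC_FIELDS) -> Optional[str]:
    """Return the Solr type whose pattern matches `field_name`, or None.

    Longer patterns are tried first (a simplification made for this sketch).
    """
    for pattern in sorted(dynamic_fields, key=len, reverse=True):
        suffix = pattern[1:]  # drop the leading "*"
        if field_name.endswith(suffix):
            return dynamic_fields[pattern]
    return None


if __name__ == "__main__":
    for name in ("attr_title_s", "attr_title_t", "attr_tags_lk", "attr_title_x"):
        print(name, "->", resolve_type(name))
```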
This file can also be used to define more complex behaviours for given eZ Publish datatypes, like the keywords datatype (ezkeyword). It contains two field type definitions related to the keywords datatype ('keyword' for case-sensitive matching, and 'lckeyword' for lower-cased matching), either of which can be used at the eZ Find level (DatatypeMap):
<fieldtype name="lckeyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=", *" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=", *" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
This example, the handling of keyword fields, teaches a lot about Solr configuration. Note how Solr filters are invoked, how comma-based word separation is handled (PatternTokenizerFactory), how case sensitivity is managed (LowerCaseFilterFactory), how duplicates are removed (RemoveDuplicatesTokenFilterFactory), etc.
This first part of the series describes elements that are fundamental to understanding eZ Find, namely:
These points are essential prerequisites for any advanced development on eZ Find. I will continue covering this subject in the two forthcoming parts of this series. Stay tuned!
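To get a feel for what the index-time chain of 'lckeyword' does to a keyword string, here is a rough emulation in Python. It mimics the tokenizer and the trim, stop and lower-case filters in the order they are declared; it is not Solr code, the stopword list is hypothetical (Solr reads it from stopwords.txt), and the duplicate-removal filter is not emulated.

```python
import re
from typing import FrozenSet, List

# Hypothetical stopword list standing in for stopwords.txt.
STOPWORDS: FrozenSet[str] = frozenset({"the", "a"})


def lckeyword_index_tokens(value: str,
                           stopwords: FrozenSet[str] = STOPWORDS) -> List[str]:
    """Approximate the 'lckeyword' index analyzer on a comma-separated value."""
    tokens = re.split(r", *", value)           # PatternTokenizerFactory, pattern ", *"
    tokens = [t.strip() for t in tokens]       # TrimFilterFactory
    tokens = [t for t in tokens if t]          # ignore empty tokens in this sketch
    tokens = [t for t in tokens
              if t.lower() not in stopwords]   # StopFilterFactory, ignoreCase="true"
    return [t.lower() for t in tokens]         # LowerCaseFilterFactory


if __name__ == "__main__":
    print(lckeyword_index_tokens("Open Source,  CMS ,The, eZ Publish"))
```

Because the tokenizer splits on commas only, a multi-word keyword such as 'Open Source' stays a single token, which is what one expects from a keywords field.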
I would like to thank Nicolas Pastorino for translating this tutorial into English, and Paul Borgermans for his availability.
This tutorial is the English version of the original French one, published on the author's blog:
http://www.gandbox.fr/Blogs/Technologies-Web/Developpement-avance-avec-eZ-Find-partie-1-La-gestion-des-datatypes-entre-eZ-Find-Solr
The following resources were referred to during this tutorial :
This tutorial is available in PDF for offline reading :
Gilles Guirand - Advanced development with eZ Find - part 1 - Datatype management in both Solr and eZ Find - PDF Version
Gilles Guirand is a certified eZ Publish Developer. He is widely acknowledged by the community to be one of the national experts on highly technical and complex eZ Publish issues. With over 12 years of experience in designing complex web architectures, he has been the driving force behind some of the most ambitious eZ Publish projects: Web Site Generators, High Availability, Widgets, SOA, eZ Find, SSO, Web Accessibility and IT systems integrations.
This work is licensed under the Creative Commons – Share Alike license ( http://creativecommons.org/licenses/by-sa/3.0 ).