
Creating a Search Engine

The existing search engine implementation in eZ Publish has some limitations. The biggest is the lack of relevance-based ranking: you can only sort the results by criteria such as class name, time of publishing or URL, that is, the sorting options available when fetching content (as described in the documentation). Defining search conditions is also limited. For example, it is not possible to define search conditions that match only in headings of an XML field, or to check whether a specific custom XML tag exists in an XML field.

First, I'll discuss the main differences between searching in a local context (such as on a website) compared to a normal web search. The differences are based mainly on the different datasets and also on the users' information requirements.

The second part of the article introduces a general model of how search engines work. After that we will go into some detail, looking at the popular and easy-to-understand vector space model (the basis of many commercial search engines).

The last step describes the query language, our test implementation and the evaluation process. We encourage you to evaluate our new ranking approach. Because it is not possible to define relevance in a mathematical formula, your feedback is very important for us, as it allows us to test whether our approaches are successful or not. (Feel free to jump straight to the end of the article for information about trying out our test implementation.)

Searching the web

The widely used web search engines enable users to search for content on billions of different websites. To do this, web search engines index the content on websites. What kinds of content do web search engines index? In other words: What kind of dataset is the whole internet?

These are some of the major constraints:

  • Websites deal with a wide range of content topics.
  • For most topics there are many websites, often with similar content.
  • The quality of sites ranges from sites that are well-organized and contain deep information to sites that are not useful at all (for example due to outdated, incomplete or inaccurate information).
  • There is no common semantic markup of website content. Mostly, search engines only see (almost) plain text. The only useful meta information is logical (like text in headings) or visual markup (such as bold or italic text).

In addition to these constraints, there is another problem: having the top position in search results is very important for many websites (especially commercial sites). Therefore, much effort is put into tweaking sites to optimize their ranking position in search engines. But - especially for popular search queries - some people use dirty tricks to try to achieve a top-ranking position. (You may have seen a search engine query where one or more of the top results were not related to your search terms.) Using dirty tricks to improve ranking position is called search engine spamming.

The main task of web search engines is to separate good and bad websites, and then to provide relevant query results that point to the good websites. For example, in a normal web search, the existence of links between websites is an important measure of the quality of a site. The linking relationship is interpreted as a popularity vote: if many websites link to one site, the site probably has valuable content.

Searching in a local context

Searching in a local context has different challenges than searching the web. You do not have the problems of spam or differing content quality. However, there are other challenges regarding the existing data and the users' information needs. Fortunately - as we will see later - the data situation fits perfectly with the users' information needs.

In the next section we will look at the special situations when searching in a local context.

General conditions

In contrast to the web, content in a local context is usually focused on one topic. Often the content is of a higher quality than that on an average website.

For example, let's look at ez.no. The website contains content about eZ Systems and its products. While there is a range of content subtopics, all the information is related to eZ. Information generated by a company generally adheres to an underlying quality assurance process where each page has a specific purpose and content is not duplicated. The ez.no site also has content created by users, mainly in the forums. But in contrast to the global web, there is no serious spam problem in the forums (as there is no incentive).

Structured content

Besides the general differences between indexing all the sites on the web and indexing a single site, there are some special considerations if you run a site on eZ Publish. For one thing, valuable metadata like language, author, publishing date, etc. is available. But there is something much more valuable: content objects based on content classes that consist of attributes, which organize the site content into a well-defined schema.

With eZ Publish, we know two things about content object attributes:

  • Attribute names provide a semantic meaning. For example, a text line attribute called "street" contains location information.
  • The datatypes provide special information about the stored data. You can have special search operators for specific datatypes. For example, if you have an attribute of type "integer", you could search for a range from "72" to "83". Alternatively, imagine an operator based on location information that searches for towns around "Oslo".

Of course, datatype information is not restricted to the attribute level. For example, if you want to search all email addresses, you can find relevant information in attributes of datatype "email". But email addresses are stored in other attributes as well, for example in attributes of type "authors" (because each author can have an email address).

Additionally, custom XML tags can have a semantic meaning. eZ Publish could be extended so that XML elements are mapped to datatypes as well. In that case, the search for email addresses would also search in relevant XML elements.

The ability to combine the semantic and datatype information is very powerful. For example, perhaps you want to search for maximum temperature values of some test series. In this case you would search in all elements (attributes and XML elements) called "maximum temperature", and, as you know they are all defined as floats, you could use the "range" or "less than" operator. In some cases it could even be useful to create your own "temperature" datatype. This datatype would be a subset of the float datatype and could perform the conversion between Celsius, Fahrenheit and Kelvin when searching.
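To make this concrete, here is a minimal sketch of such a "temperature" datatype in PHP. The class and method names (TemperatureValue, inRange()) are hypothetical illustrations, not part of eZ Publish:

<?php
// Hypothetical "temperature" datatype: values are normalized to Kelvin,
// so a range search works across Celsius, Fahrenheit and Kelvin.
class TemperatureValue
{
    private $kelvin;

    public function __construct( $value, $unit )
    {
        switch ( strtoupper( $unit ) )
        {
            case 'C': $this->kelvin = $value + 273.15; break;
            case 'F': $this->kelvin = ( $value - 32 ) * 5 / 9 + 273.15; break;
            case 'K': $this->kelvin = $value; break;
            default: throw new InvalidArgumentException( "Unknown unit: $unit" );
        }
    }

    // "range" operator: true if this value lies between $min and $max.
    public function inRange( TemperatureValue $min, TemperatureValue $max )
    {
        return $this->kelvin >= $min->kelvin && $this->kelvin <= $max->kelvin;
    }
}

// 25 degrees Celsius matches a range given in Fahrenheit:
$value = new TemperatureValue( 25, 'C' );
var_dump( $value->inRange( new TemperatureValue( 70, 'F' ),
                           new TemperatureValue( 85, 'F' ) ) ); // bool(true)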

Any website could benefit from a specialized local search engine that uses datatype and semantic information. However this is especially important in an enterprise environment, because information requirements and data structure tend to be more interdependent. First we will look at the special data situation in enterprises. While some or all of these characteristics may apply to other websites as well, they are almost always present in an enterprise context:

  • The data is mostly well structured (for example invoices, product information, technical documentation, etc). In general, structured data is the most valuable for users.
  • On an enterprise website, the link structure between the documents is much less manual and ad hoc than on the web. Often links are primarily implemented through a navigation menu based on the document hierarchy. Thus the link relationships that indicate site popularity on the web do not apply in a local context.

Beyond the analysis of data, a more important question is: What is the difference in the information needs of enterprise users compared to users performing normal web searches? On the web there are many similar relevant documents - users are searching for the best documents. In an enterprise environment, users are most often searching for the right answer. [1] In that case, only a few documents are relevant. The popularity of a document does not help here - therefore, fortunately, the lack of link relationships is not a problem in the local context.

Often users are looking for a particular document that they have already seen and for which they remember some characteristics. For example, a user might remember the author, or when the document was published, or some other special information. Because the content in an enterprise environment is more structured, these data characteristics are available for searching.

A good search interface enables users to search using both structure-related and content-related conditions.

How search engines work

The following graphic [2] shows a simplified model of a search engine:

Search engine model

  • The left side shows the content / documents that should be searchable via the search engine.
  • The search engine does not index all parts of each document. For example, in eZ Publish you can specify the attributes that should be indexed by the search engine. In addition, some metadata of the document (like publishing date or owner) can be part of the record representing a document.
  • In the next step the document content is normalized. Each word (called a term in the context of information retrieval) is reduced to its root. For example, "cars" and "car" should be indexed in the same way, as should "going" and "go". When users search for a term, the different declensions and grammatical cases should be found as well. The knowledge about how to reduce terms is called morphological knowledge. If the reduction is done by cutting (or sometimes replacing) suffixes defined by a set of rules, it is called stemming. Other approaches use a dictionary to find the grammatical base form. (A toy version of these steps is sketched after this list.)
  • As a result of the indexing and normalization processes, you get a representation of each document - including some statistical data for each term. In our case the representation of each document is saved in a so-called document vector. Note that the representations of all documents are saved in a search index. This index is similar to the index at the end of a book, used to quickly find the location of a term. (The index is not drawn in the graphic above because it is not part of the model but rather a performance requirement.)
  • On the right side of the model the user inputs a query. This input is split into individual terms that are reduced to their root (to enable the query terms to match the terms in the search index). In our situation the query is represented in the same way the documents are represented - called query vector.
  • When a user inputs a query, a similarity function calculates a similarity or ranking value for each document (vector) related to the query (vector). The documents are retrieved in the order of their ranking value.
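To make the indexing side of the model concrete, here is a minimal sketch in PHP. The crude suffix stripping only stands in for a real stemmer, and all function names are ours, not part of eZ Publish:

<?php
// Reduce a term to a crude root form. This toy rule set is only a
// stand-in for a real stemming algorithm such as the Porter stemmer.
function normalizeTerm( $term )
{
    $term = strtolower( $term );
    foreach ( array( 'ing', 'es', 's' ) as $suffix )
    {
        if ( strlen( $term ) > strlen( $suffix ) + 1 &&
             substr( $term, -strlen( $suffix ) ) === $suffix )
            return substr( $term, 0, -strlen( $suffix ) );
    }
    return $term;
}

// Build a document vector: term => term frequency. A query vector is
// built from the query string in exactly the same way.
function buildDocumentVector( $text )
{
    $vector = array();
    foreach ( preg_split( '/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY ) as $word )
    {
        $term = normalizeTerm( $word );
        $vector[$term] = isset( $vector[$term] ) ? $vector[$term] + 1 : 1;
    }
    return $vector;
}

print_r( buildDocumentVector( 'Cars go where other cars are going' ) );
// => car: 2, go: 2, where: 1, other: 1, are: 1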

The user is most interested in good results. Therefore, in the next section, we will look at the similarity function, which calculates the ranking value.

The vector space model

The most frequent approach in commercial search engines is the vector space model. This is the basis of almost all web search engines (of course in different variations and combined with other strategies like evaluating the popularity of a site based on link relationships).

Interestingly, the vector space model has no strict theoretical foundation. But other approaches (based, for example, on probability theory) do not yield better results in general, that is, when they are applied across different document sets. Those approaches are also much more complex, both to understand and to calculate. Some of them need to be trained on the dataset (and sometimes even on each query) to provide good results. Even after training they are sometimes still less accurate than the simple vector space model. Compared to those alternatives, the vector space model is easy to understand and calculate.

Visualizing the vector space model

The vector space model can be visualized as shown in the following graphic. The graphic shows some (blue) document vectors and one (green) query vector in a coordinate system (black).

Vector space model

A vector is simply an arrow from the origin (where the axes of the coordinate system intersect) to a specific point in the coordinate system. The vector is represented by a value for each axis, for example (2, 3) (that is, "2 units right and 3 units up").

In the vector space model there is one axis for each term that exists in the whole document set. Of course there are many more than two axes - but this doesn't matter when we want to understand the idea of the model.

In a document vector, the value on each term axis represents the importance of the term in the document. Therefore the document vector is merely a set of all terms with a value representing the importance of each term in the document. Note that all terms that do not occur in the document have a value of "0". You could write the information of a document or query vector like a row in a table:

document \ term       ez   publish   cms   admin   user   and   ...
document 1             3      2       4      0      1     10
document 2             0      0       0      3      5      3
document 3             4      6       2      1      3      4
Query "ez publish"     1      1       0      0      0      0

Calculating the similarity of a document with respect to a specific query means calculating the similarity of the document vector and the query vector. As a simple visual model, this could be the angle between the document vector and the query vector: the smaller the angle, the more similar the document is to the query.
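In practice, the angle comparison is implemented as the cosine similarity of the two vectors. A minimal sketch (the function name and data layout are ours):

<?php
// Cosine similarity: the cosine of the angle between a document vector
// and a query vector. 1.0 means the same direction (most similar),
// 0.0 means the vectors share no terms.
function cosineSimilarity( array $doc, array $query )
{
    $dot = 0.0;
    foreach ( $query as $term => $weight )
    {
        if ( isset( $doc[$term] ) )
            $dot += $doc[$term] * $weight;
    }
    $docNorm = 0.0;
    foreach ( $doc as $weight )
        $docNorm += $weight * $weight;
    $queryNorm = 0.0;
    foreach ( $query as $weight )
        $queryNorm += $weight * $weight;
    if ( !$docNorm || !$queryNorm )
        return 0.0;
    return $dot / ( sqrt( $docNorm ) * sqrt( $queryNorm ) );
}

// Document 1 and the query "ez publish" from the table above:
$doc1  = array( 'ez' => 3, 'publish' => 2, 'cms' => 4, 'user' => 1, 'and' => 10 );
$query = array( 'ez' => 1, 'publish' => 1 );
echo cosineSimilarity( $doc1, $query ); // ~0.31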

Now that we understand what similarity means, we need to know how the importance of a term in a document is calculated.

The importance (or weight) of each term in a document is individually calculated. There is no magic or intelligence but only the simple counting of words. The weight is based on two factors: the local weighting factor and the global weighting factor.

Local weighting factor

The local weighting factor is based on each individual document. The idea is to measure how important a term is in the document itself. Terms that occur often should be more important than terms that occur only a few times. This is based on simply counting how often the term occurs in the document - the so-called term frequency.

Terms occur more often in longer texts than in short texts. Because of that, the term frequency can be normalized - that is, setting the term frequency in relation to the length or the maximum term frequency of the document. Without normalization longer documents rank higher simply because they contain more words.

But there is still another problem. Terms like "and", "a", "of" and "the" occur profusely in many documents. Those terms are not important, but unfortunately they have a very high term frequency because they occur so often. So while their term weight is high, they neither discriminate between documents nor indicate the content of a document.

For example, if your search query is "a website", the term "a" does not discriminate between the documents while the term "website" does. Therefore, there is a second weighting factor - the global weighting factor.

Global weighting factor

The global weighting factor is based on the whole dataset as opposed to each individual document. The idea is to measure how discriminating a term is. A term is discriminating if it only occurs in a few documents, and is not discriminating at all if it occurs in every document.

For example, on ez.no terms like "is" or "a" are not discriminating at all. The terms "ez", "systems" and "publish" are a little bit more discriminating. Terms like "cluster" and "rss" are much more discriminating.

Like the local weighting factor, the global weighting factor is also based on simple counting. It is based on the number of documents in which a term occurs, the so-called document frequency. Normally the document frequency is set in relation to the total number of all documents.

If a term occurs in only a few documents, it is a discriminating term. Terms that occur in almost every document are not discriminating, regardless of whether they have a high or low term frequency in a document.

Combining term and document frequency

The importance of a term in a document is based on how often it occurs in the document and how discriminating the term is related to the whole dataset.

In the simplest scheme, the weight is calculated by taking the local weighting factor and dividing it by the global weighting factor. Thus the importance of terms that occur in many documents becomes really small, as they are divided by a big number. On the other hand, terms that occur in only a few documents have a high weight even if they do not occur very often in those documents.
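A minimal sketch of this simplest scheme, using the definitions from the two previous sections (the function and parameter names are ours):

<?php
// Simplest combined weight: the length-normalized term frequency
// divided by the relative document frequency.
// $termFrequency     - occurrences of the term in this document
// $documentLength    - total number of terms in this document
// $documentFrequency - number of documents containing the term
// $totalDocuments    - number of documents in the (context) set
function termWeight( $termFrequency, $documentLength,
                     $documentFrequency, $totalDocuments )
{
    if ( !$termFrequency || !$documentFrequency )
        return 0.0;

    $localWeight  = $termFrequency / $documentLength;       // normalized tf
    $globalWeight = $documentFrequency / $totalDocuments;   // relative df

    return $localWeight / $globalWeight;
}

// "and": frequent in the document, but in nearly every document => low.
echo termWeight( 10, 200, 950, 1000 ), "\n"; // ~0.053
// "cluster": rare in the document, but in few documents => high.
echo termWeight( 2, 200, 5, 1000 ), "\n";    // 2

Classical tf-idf systems use a logarithmic variant (term frequency multiplied by the logarithm of the inverse document frequency), but the idea is the same: frequent in the document raises the weight, frequent in the collection lowers it.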

Vector space on-the-fly

Normally the term frequency and document frequency are pre-calculated when indexing the documents. But this has one disadvantage. Imagine a query like "partner bug". Assume those two terms discriminate the documents on ez.no to a similar degree (that is, in the whole ez.no dataset both terms have a similar document frequency).

But what if you limit your search to the Partner section on ez.no? The proportion of documents that contain the term "partner" is much higher than in the whole dataset. So "partner" is less discriminating in that context than it is in the whole ez.no dataset. Similarly, "bug" is much more discriminating. If you search in the bug system for the same terms, it is the other way around.

Thus it could be interesting to calculate the global weighting factor - the document frequency - on-the-fly, depending on the context to which you limit your search.
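A sketch of that on-the-fly calculation; the context set is assumed to be the array of document vectors that passed the structural filter conditions (a data layout we chose for illustration):

<?php
// On-the-fly document frequency: instead of using the pre-calculated,
// site-wide value, count in how many documents of the current context
// set (e.g. only the Partner section) the term occurs.
function documentFrequency( $term, array $contextSet )
{
    $count = 0;
    foreach ( $contextSet as $vector )
    {
        if ( isset( $vector[$term] ) )
            $count++;
    }
    return $count;
}

// Inside the Partner section, "partner" is frequent (thus weak) while
// "bug" stays rare (thus discriminating):
$partnerSection = array(
    array( 'partner' => 4, 'program' => 2 ),
    array( 'partner' => 1, 'bug' => 1 ),
    array( 'partner' => 2, 'certified' => 1 ),
);
echo documentFrequency( 'partner', $partnerSection ), "\n"; // 3
echo documentFrequency( 'bug', $partnerSection ), "\n";     // 1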

Now that we understand how search engines work, the next question is how to write powerful queries. The next section introduces a powerful query language that supports structural and content-related query conditions.

A powerful query language

Normal users want a simple form where they just input some words. More advanced users want additional options. Then there are developers who want maximum power to create customized and advanced search forms. In all these cases, the system uses a powerful query language in the background: users enter the query conditions on a simple form, and the conditions are transformed into the query language on the back end.

Let's think about what our query language should be able to do.

Fine-grained search queries

We talked about the advantages of structured information. Attributes and XML elements have a semantic meaning and are based on a datatype with special comparison operators. We looked at the example of searching for temperature values that could be stored in attributes or (custom) XML tags. Additionally, the temperature values of different measurements could be converted for comparison.

To have this kind of power, you need functionality similar to the following:

  • limit the search to one or a few content classes
  • write constraints that are based on attributes and custom XML elements with the same (or even similar) names in different classes
  • search in all attributes and XML tags of a specific datatype
  • limit the search to one or a few sub-trees
  • nested definitions of logical operators like and, or and not

In a limited way these options are already available with the fetch() and search() operators in eZ Publish. However, this functionality could be enhanced.

eZ Publish and XML

All content objects in eZ Publish are stored as XML (or can easily be mapped to XML). But not only the content objects are XML: you can even imagine the whole content tree of an eZ Publish installation as one big XML tree, based on the content objects' XML representations and the node tree hierarchy.

There is also additional information that web search engines can't use. For example, although this article is split over multiple web pages, we know that it is one unit. You might want to combine the article pages when calculating the ranking value of the whole article. The same may be true if you embed an object in another object or if objects are related.

The power and limits of XPath

In the world of XML, XPath is a very general and powerful query language for structured XML documents (like SQL is for relational databases). Here are some XPath examples:

//article
  • retrieves all articles
//article[./author[contains( 'fred' )]]
  • retrieves all articles where the author element contains "fred"
//article[./author[contains( 'fred' )]]//heading
  • retrieves all headings of articles where the author element contains "fred". Note that in this example the headings are retrieved, not the articles. First, all articles are searched for the author condition. Next, for those articles that matched the author condition, the headings are retrieved.
//article[./author[contains( 'fred' )]]/heading[contains( 'license' )]
  • retrieves all headings that contain "license" in articles where the author element contains "fred"

These are only a few examples of the power of XPath with a really informal explanation of how they work. More information can be found on Wikipedia or on the W3C site.

But XPath can only be used to retrieve XML elements that fulfil some conditions - there is no relevance-based ranking of the retrieved XML sub-trees (in the same way that there is no such ranking in SQL). Therefore we need an approach that extends XPath.

XIRQL

XIRQL (pronounced like "circle") enhances XPath by adding some important concepts and functionality for information retrieval. [3] For example, it introduces:

  • vagueness related to structural and content conditions (for example if you search for an XML element called "list" it could also match tags called "list-item")
  • a concept of datatypes with vague operators (for example if you want to search for locations around "Oslo")
  • structural conditions based on hyperlinks
  • a concept for calculating the similarity of combined objects (useful for including the content of child objects and related or embedded objects with reduced weight in the actual object)

We extended XIRQL with the concepts of filter() and rank() functions:

  • filter() is responsible for building a context set. Based on this context set the document frequency can be calculated for the vector space on-the-fly
  • rank() contains all constraints that should be used for relevance-based ranking

A XIRQL query could, for example, look like this:

//article[filter(./author[contains( 'fred' )]), rank(./heading[contains( 'license' )])]

All constraints that are possible in XPath and XIRQL can be defined as filter or ranking conditions.

The test implementation

Ok, enough theory. Let's have a look at our current status.

Based on the existing search engine infrastructure in eZ Publish 3, different variants of the concepts described above were implemented. Because the main focus was research and evaluation, the implementation is a naive approach intended as a proof of concept rather than a production-ready implementation.

All terms in English texts were stemmed using the PHP PECL stem extension.
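For illustration, this is roughly how the extension is used. The per-language function name is taken from the PECL stem documentation; verify it against your installed version:

<?php
// The PECL stem extension provides Snowball-based stemmers, one
// function per language (requires the extension to be installed).
foreach ( array( 'cars', 'going', 'publishing' ) as $word )
{
    echo $word, ' => ', stem_english( $word ), "\n";
}
// cars => car
// going => go
// publishing => publish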

Implementation challenges

Scientific papers in the area of information retrieval often only contain the basics of their concepts. Information for special or more complex cases is sometimes missing. Often you cannot find successful parameters for the formulas used by the algorithms. Sometimes you cannot even find the formulas.

Sometimes this is because of the brevity of the papers. But it also appears that people working for commercial search engines want to show great results without disclosing details that would benefit their competitors.

Therefore, there is some trial and error when putting the theory into practice. An evaluation with many users shows which algorithms are good and which are not.

Configuration

In order to get valuable results, we weighted different parts of our content on a class level. For example, we increased the weight of attributes like the title and the abstract / intro in the article content class (relative to the other attributes). The same could be done for different XML tags in an XML attribute, for example headings.

Additionally, we wanted to experiment with the index. In an XML document, you can store statistical data for each XML element. But this is probably too fine-grained, because users tend to rate really small pieces of information as less relevant, as they do not contain much information. So instead we combined different XML elements into one index node.

As a short example, the content of an XML field can be visualized as a tree structure as shown in the following graphic. The ellipses (for example "section") are the XML tags, while the rectangles represent the content. If the elements of each section are included in one index node, the blue boxes show which XML elements are combined. [4]

Combining index nodes

The configuration of index nodes is flexible, and if a retrieved object contains different index nodes, the statistical data is "summarized" (see the sketch below). This would also be useful if, for example, you wanted to calculate a ranking value for this article including its sub-pages while also calculating a ranking value for each sub-page separately (as a sub-page might be more relevant for a query than the whole article). In the same way, when calculating the ranking value of a content object, the statistics of related, embedded or linked content objects can be included.
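A sketch of this "summarizing", assuming each index node is a simple term => frequency array (our illustrative structure, not the actual implementation):

<?php
// Merge the per-node term statistics into a single vector, so that a
// ranking value can be calculated both for each sub-page (one index
// node each) and for the whole article (the combined vector).
function combineIndexNodes( array $indexNodes )
{
    $combined = array();
    foreach ( $indexNodes as $vector )
    {
        foreach ( $vector as $term => $frequency )
        {
            $combined[$term] = isset( $combined[$term] )
                             ? $combined[$term] + $frequency
                             : $frequency;
        }
    }
    return $combined;
}

$subPages = array(
    array( 'search' => 3, 'engine' => 2 ),
    array( 'search' => 1, 'vector' => 4 ),
);
print_r( combineIndexNodes( $subPages ) );
// => search: 4, engine: 2, vector: 4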

In the current test implementation, the search is configured in a simple way that treats documents as almost plain text with up-weighting only applied to a few important tags and attributes. This configuration is comparable to normal web search. In the last step of the evaluation phase the configuration of the search index will be changed and the existing queries will be run automatically and compared to the user ratings. In this way we can evaluate many different configurations and variants of the algorithms.

You are invited to try this new search and evaluate the returned results (see the end of the article for information).

Limitations

At the moment only a subset of XIRQL is implemented. But it is enough to answer the most important question for now: What are good configurations and algorithms for relevance-based ranking of content-related queries?

The hardest thing when implementing a search engine is that you don't know what the result will be. This is mainly because:

  • Computers and algorithms are not intelligent. They do not understand what a document is about. They just have some statistical data about occurrences of terms.
  • It is not possible to define objectively which documents are relevant for a user. Even with the same query, different documents might be relevant as users have different background knowledge.

Because of this, each evaluation depends on user feedback about the relevance of the results. In general, we ask that users try the search and rate whether each result is relevant or not. Based on this relevance feedback we can evaluate the quality of search algorithms and configurations. (Instructions for helping us with this evaluation are at the end of the article.)

Comparing algorithms

A classical approach is based on measuring precision and recall:

  • Precision measures the quality of the returned results. It is the ratio of retrieved relevant documents to all retrieved documents. For example, if you get 60 results but only 15 of them are relevant, you have a precision of 15 / 60 = 0.25.
  • Recall measures how many of all relevant results in the whole dataset are found. It is the ratio of retrieved relevant documents to all relevant documents. If there are 25 relevant documents in total but only 15 of them are returned (of 60 returned results), the recall is 15 / 25 = 0.6. (Both measures are computed in the sketch after this list.)
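Both measures are straightforward to compute once the relevance judgments exist. A minimal sketch using the numbers from the list above (the function name and document IDs are ours):

<?php
// Precision and recall for one query, as defined above.
// $retrieved - IDs of the documents the engine returned
// $relevant  - IDs of all relevant documents in the whole dataset
function precisionAndRecall( array $retrieved, array $relevant )
{
    $hits = count( array_intersect( $retrieved, $relevant ) );
    return array(
        'precision' => count( $retrieved ) ? $hits / count( $retrieved ) : 0.0,
        'recall'    => count( $relevant )  ? $hits / count( $relevant )  : 0.0,
    );
}

// 60 returned documents, 15 of them relevant, 25 relevant in total:
$retrieved = range( 1, 60 );
$relevant  = array_merge( range( 46, 60 ), range( 100, 109 ) );
print_r( precisionAndRecall( $retrieved, $relevant ) );
// => precision: 0.25, recall: 0.6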

In general the precision is better if there are only a few documents returned. But the recall is not that good in that situation. The reverse is also true: if your search engine returns many results, the recall will get better with a worse precision value.

Assessing these values requires a large human effort to decide which documents are relevant. Especially for recall, you must look at the whole dataset - not only the retrieved documents - to find all relevant documents.

A more pragmatic approach is called pairwise accuracy. In this approach users only rate the top results and decide how relevant each document is (for example "relevant", "partially relevant" or "not relevant"). You then look at each pair of documents where the user rated the first document better than the second one. (Pairs of documents that are rated in the same category are not counted.) Finally, you calculate the fraction of those pairs for which the algorithm also ranks the first document higher than the second.
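A sketch of the calculation; the three-level rating scale and the data structures are our assumptions:

<?php
// Pairwise accuracy: of all document pairs where the user rated one
// document strictly better than the other, the fraction for which the
// algorithm also ranked the better-rated document higher.
// $userRatings   - document ID => rating (2 relevant, 1 partially, 0 not)
// $algorithmRank - document ID => rank position (1 = best)
function pairwiseAccuracy( array $userRatings, array $algorithmRank )
{
    $agreeing = 0;
    $total = 0;
    $ids = array_keys( $userRatings );
    for ( $i = 0; $i < count( $ids ); $i++ )
    {
        for ( $j = $i + 1; $j < count( $ids ); $j++ )
        {
            $a = $ids[$i];
            $b = $ids[$j];
            if ( $userRatings[$a] == $userRatings[$b] )
                continue; // equally rated pairs are not counted
            $better = $userRatings[$a] > $userRatings[$b] ? $a : $b;
            $worse  = ( $better === $a ) ? $b : $a;
            $total++;
            if ( $algorithmRank[$better] < $algorithmRank[$worse] )
                $agreeing++;
        }
    }
    return $total ? $agreeing / $total : 1.0;
}

$ratings = array( 'doc1' => 2, 'doc2' => 0, 'doc3' => 1 );
$ranking = array( 'doc1' => 1, 'doc2' => 2, 'doc3' => 3 );
echo pairwiseAccuracy( $ratings, $ranking ); // 0.666... (2 of 3 pairs agree)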

Of course this approach assumes that you have retrieved the most relevant documents in the top results - a highly relevant result in a bottom-ranking position will probably never be rated by the user. But as we mix the results of many algorithms we assume that the most relevant documents are found in the top results of at least one algorithm.

Trying the test implementation

We have built a search interface to the new implementation. This installation indexes the dataset of ez.no, including the documentation but excluding the new bug system. The search interface is located at: http://eval.ez.no/content/searchevaluation

Searching

First you can limit the search to classes and / or sub-trees in the content hierarchy. This will define the context set for the search (important for calculating the document frequency on-the-fly).

Next, input the search query. These conditions are used for relevance-based ranking.

Unfortunately, in the current implementation, query performance is slow. It is not unusual for a search to take more than one minute. The poor performance has two main causes: first, several different ranking algorithms need to be processed; second, the implementation was intended as a "proof of concept" and not a final implementation.

Rating of results

The list of search results combines the top 20 results of all algorithms. After the results are returned, you can provide relevance feedback by dragging and dropping the results to one of the boxes ("relevant", "partially relevant" and "not relevant").

Important: Please do not pick only the relevant results and move them to the "relevant" box. This is useless for our comparison of different algorithms. Instead, start with the first result, rate it by dragging and dropping it to the correct box, and then go on with the next result.

When you are done (or when you don't want to rate any more results), you save the data by clicking save ratings. You will be redirected to the search result that shows your ratings so you can continue rating results if you want.

Thank you very much for your help. We welcome your comments and feedback.

[1] Rajat Mukherjee and Jianchang Mao, Verity, Enterprise Search: Tough Stuff, p. 40. In: ACM Queue, May 2004.

[2] The graphic of the search engine model is inspired by graphics from Reginald Ferber, Information Retrieval: Suchmodelle und Data-Mining-Verfahren für Textsammlungen und das Web. dpunkt.verlag, 2003, pp. 42 and 62.

[3] Norbert Fuhr, Kai Großjohann, XIRQL: An XML Query Language Based on Information Retrieval Concepts. In: ACM Transactions on Information Systems, Vol. 22, No. 2, April 2004, Pages 313 - 356

[4] The graphic about index nodes is inspired by the XIRQL article cited above (p. 316).

Additional reading

Ricardo Baeza-Yates and Berthier Ribeiro-Neto: Modern Information Retrieval. Addison Wesley Publishing Company, 1999.
