Modern Search Engines — Informatics4ai

This is part two of a three-part blog post. (See part one here.)

As we discussed in the previous post, index engines were developed to make searching across large textual repositories fast. But once high-speed retrieval was achieved, a new problem emerged: users could not find the most relevant documents within a large set of search results. The obvious solution was to rank the documents by relevance and present the most relevant results first.

As we saw in the previous post, indexing products often struggle with relevance. That weakness gave rise to modern search engines, which improve relevance using two key techniques. Let’s take a look.

Enhancing Relevance with Context

Google is the best-known example of using context to improve relevance. Early internet search engines largely ranked results by counting how many times the search terms appeared on the page. Google took a new approach by factoring the page’s importance (called PageRank) into the ranking of search results. (PageRank is loosely based on how many other websites link to that page.) Adding this context to the relevance ranking had a dramatic effect: it enabled Google to leapfrog competitors such as Yahoo and Lycos, largely because users found Google’s results so much better.
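
To make the idea concrete, here is a minimal sketch of a PageRank-style calculation. It is a textbook simplification, not Google’s production algorithm: the `pagerank` function, the `damping` value of 0.85, and the four-page `toy_web` are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of inbound links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "home", one links to "blog".
toy_web = {
    "home": ["blog"],
    "blog": ["home"],
    "about": ["home"],
    "contact": ["home"],
}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy web, "home" ends up with the highest score because it has the most inbound links, and "blog" comes second because its single inbound link comes from an important page.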

While PageRank is an awesome piece of context on the web, it does not work inside an organization, where documents rarely link to one another the way web pages do. So enterprise search engines developed other techniques to improve relevance.

Enhancing Relevance with Tuning

Modern enterprise search engines address the problem of poor relevance by making the relevance calculation tunable for an organization’s specific circumstances. For example, Elasticsearch (a widely used open-source search engine based on Lucene) has many options for tuning relevancy. A few examples (all of which were critical to the organization described in the previous post) include:

Commonly Used Adjustments/Techniques

Field Boosting – used to boost the relevance of documents when the search term is in a field such as “Title” as opposed to buried on page 24.
Time Boosting – used to make more recent items more relevant and therefore is very helpful in applications like news or research.
Term Frequency Saturation – used to ensure that a large document does not dominate all others just because it repeats the search terms more often.
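
The three adjustments above can be sketched in a toy scorer. This is not how Elasticsearch computes scores internally; it only illustrates how the ideas combine. The saturation formula is the term-frequency part of BM25 with length normalization left out, and the function names, the `k1` value of 1.2, the 30-day half-life, and the sample documents are assumptions made for the example.

```python
def saturated_tf(tf, k1=1.2):
    """BM25-style saturation: each extra occurrence adds less; the value stays below k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)


def recency_weight(age_days, half_life_days=30.0):
    """Exponential decay: a document half_life_days old gets half the weight of a new one."""
    return 0.5 ** (age_days / half_life_days)


def toy_score(doc, term, field_boosts):
    """Combine saturated per-field term counts, field boosts, and a recency weight."""
    text_score = 0.0
    for field, boost in field_boosts.items():
        # Count whole-word occurrences of the term in this field.
        tf = doc.get(field, "").lower().split().count(term.lower())
        text_score += boost * saturated_tf(tf)
    return text_score * recency_weight(doc["age_days"])


# Hypothetical boosts: a match in the title counts three times as much as one in the body.
boosts = {"title": 3.0, "body": 1.0}

title_hit = {"title": "search tuning guide", "body": "an overview", "age_days": 0}
body_spam = {"title": "misc notes", "body": "search " * 50, "age_days": 0}
old_title_hit = {"title": "search tuning guide", "body": "an overview", "age_days": 30}

score_title = toy_score(title_hit, "search", boosts)
score_spam = toy_score(body_spam, "search", boosts)
score_old = toy_score(old_title_hit, "search", boosts)
```

With these assumed settings, the single title match outscores the document that repeats the term 50 times in its body, and the 30-day-old copy of the title match scores half as much as the fresh one.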

Specialized Adjustments/Techniques

Location Boosting – used to increase the relevance of items that are geographically close (for example, to the user’s location).
Price Boosting – used to increase or decrease relevancy based on price (often critical for e-commerce applications).
Boosting by Popularity – used to increase relevance based on data from another field, such as a popularity rating (as with Google, context is being used to increase relevance).
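
In Elasticsearch, boosts like these are commonly expressed with a `function_score` query. The sketch below only builds the request body as a Python dictionary and does not send it anywhere. The field names (`description`, `location`, `price`, `popularity`), the scales, and the `build_boosted_query` helper are hypothetical, and the exact options should be checked against the documentation for your Elasticsearch version.

```python
import json


def build_boosted_query(text, user_lat, user_lon):
    """Build a function_score request body that boosts nearby, cheaper, popular items."""
    return {
        "query": {
            "function_score": {
                # Base text relevance comes from an ordinary match query.
                "query": {"match": {"description": text}},
                "functions": [
                    # Location boosting: score decays with distance from the user.
                    {"gauss": {"location": {
                        "origin": {"lat": user_lat, "lon": user_lon},
                        "scale": "5km",
                    }}},
                    # Price boosting: score decays as the price rises above zero.
                    {"gauss": {"price": {"origin": 0, "scale": 50}}},
                    # Popularity boosting: use a numeric field, dampened with log1p.
                    {"field_value_factor": {
                        "field": "popularity",
                        "modifier": "log1p",
                        "missing": 0,
                    }},
                ],
                # Add the function results together, then multiply into the text score.
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }


request_body = build_boosted_query("espresso machine", 40.7128, -74.0060)
request_json = json.dumps(request_body, indent=2)
```

Whether a lower price should raise or lower relevance depends on the application, so the price function here is only one possible choice.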

Modern search engines are much more tunable than indexing engines, and therefore often produce a much better search experience for the user. We recommend that when organizations adopt a search engine, they include relevancy tuning as part of the project to ensure a custom fit for their needs. However, once the low-hanging fruit has been picked (e.g. boosting the Title field), we strongly recommend against some organizations’ desire to continue tweaking relevance daily. As Elastic notes, “relevancy tuning is a rabbit hole that you can easily fall into and never emerge.”

We advise organizations to revisit tuning regularly but infrequently, and only when they have the proper instrumentation and monitoring in place to know whether they are increasing or decreasing relevance. Again, according to Elastic, you should monitor relevance by keeping track of items such as “how often your users click the top result, the top 10, and the first page; how often they execute a secondary query without selecting a result first; how often they click a result and immediately go back to the search results, and so forth.” With these objective measures in hand, you can clearly understand how relevance tuning is affecting users’ search experience.
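
As a rough illustration of that kind of monitoring, the sketch below computes the measures Elastic lists from a search log. The log format is an assumption: each entry is a dictionary with hypothetical keys `clicked_rank` (1-based position of the clicked result, or `None`), `requeried` (whether the user ran another query next), and `dwell_seconds` (time spent on the clicked result, or `None`). The page size of 20 and the 5-second "quick back" threshold are also assumptions.

```python
def relevance_metrics(searches, page_size=20, quick_back_seconds=5.0):
    """Summarize click behavior as rates over all logged searches."""
    total = len(searches)
    if total == 0:
        return {}

    def rate(predicate):
        # Fraction of searches for which the predicate holds.
        return sum(1 for s in searches if predicate(s)) / total

    def clicked_within(limit):
        return lambda s: s["clicked_rank"] is not None and s["clicked_rank"] <= limit

    return {
        "top_1_click_rate": rate(clicked_within(1)),
        "top_10_click_rate": rate(clicked_within(10)),
        "first_page_click_rate": rate(clicked_within(page_size)),
        # The user searched again without selecting any result.
        "requery_without_click_rate": rate(
            lambda s: s["clicked_rank"] is None and s["requeried"]
        ),
        # The user clicked a result and came back almost immediately.
        "quick_back_rate": rate(
            lambda s: s["dwell_seconds"] is not None
            and s["dwell_seconds"] < quick_back_seconds
        ),
    }


# Hypothetical log of four searches.
sample_log = [
    {"clicked_rank": 1, "requeried": False, "dwell_seconds": 40.0},
    {"clicked_rank": 3, "requeried": False, "dwell_seconds": 2.0},
    {"clicked_rank": None, "requeried": True, "dwell_seconds": None},
    {"clicked_rank": 14, "requeried": False, "dwell_seconds": 60.0},
]
metrics = relevance_metrics(sample_log)
```

Comparing these rates before and after a tuning change gives a simple way to tell whether the change helped or hurt.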

The final post in this three-part series will discuss next-generation search and why some organizations are already reaping big benefits from insight engines.