Smart suggestions with Django, Elasticsearch, and Haystack

One of the primary issues when gathering information from users is suggesting the right options. At HackerEarth, we gather information from all our developers, which helps us provide them with a better experience. When a humongous amount of data has to be indexed and returned as smart suggestions, an inverted index is perhaps one of the most efficient approaches. An inverted index maps every word that appears in the documents to the list of documents in which that word is found. Popular Lucene-based search servers like Elasticsearch and Solr are tools that maintain large inverted indexes and provide an efficient means to look up documents.

Here is an example from the Profiles page on HackerEarth.

[Image: suggestion drop-down on the Profiles page]

We use Elasticsearch to index millions of documents with various fields. While solving this problem, we need to clear two hurdles: latency and relevance. We need to suggest relevant documents to the user while keeping the time taken to retrieve them (i.e., latency) as low as possible. Elasticsearch uses analyzers that help achieve good relevance, but only if they are used appropriately. It also allows us to build our own custom analyzers, so by assaying the user input, astute analyzers can be built to increase relevance. A simple example of a document can be something like this:
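As a rough sketch (the index, type, and field names here are illustrative, not HackerEarth's actual schema), a profile document could look like this:

    # A hypothetical profile document to be indexed in Elasticsearch
    doc = {
        "name": "John Doe",
        "institute": "Indian Institute of Technology, Delhi",
        "city": "Bangalore",
        "skills": ["Python", "Django", "Machine Learning"],
    }

    # With the elasticsearch-py client it could be indexed as, for example:
    # es.index(index="suggestions", doc_type="profile", id=1, body=doc)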

So what are analyzers?

An analyzer converts the text to be indexed and creates the lookups needed to find that text later using appropriate search terms. An analyzer is composed of a tokenizer, which splits your text into multiple tokens, followed by any number of token filters, which modify, delete, or add new tokens. The tokenizer can be preceded by character filters, which modify the text before it is passed to the tokenizer.

Every field in a document has an index analyzer and a search analyzer. The index analyzer is used while the text for that field is being indexed for a particular document, and the search analyzer is used when a search is made against that field. The analyzers for all the fields are specified in the mapping for the particular document type in the index, and various combinations of tokenizers, token filters, and character filters can be used to build custom analyzers in the index settings. Here's an example of a mapping and a setting:
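The snippet below is a minimal sketch of such a pair of settings and mapping, written as Python dicts (the index, type, and field names, the gram sizes, and the analyzer names are all assumptions made for illustration, using the older index_analyzer/search_analyzer mapping syntax). The settings define a custom Ngram analyzer built from an Ngram filter, plus an Edge Ngram filter that uses the "side" parameter; the mapping attaches the analyzers to a field:

    # Illustrative index settings with custom analyzers
    SUGGESTION_SETTINGS = {
        "settings": {
            "analysis": {
                "filter": {
                    "ngram_filter": {
                        "type": "nGram",
                        "min_gram": 3,
                        "max_gram": 10,
                    },
                    "edge_ngram_filter": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 10,
                        "side": "front",  # generate ngrams from the front edge of the token
                    },
                },
                "analyzer": {
                    "ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "ngram_filter"],
                    },
                    "edge_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "edge_ngram_filter"],
                    },
                },
            }
        }
    }

    # Illustrative mapping: index with the ngram analyzer, search with the standard analyzer
    PROFILE_MAPPING = {
        "profile": {
            "properties": {
                "institute": {
                    "type": "string",
                    "index_analyzer": "ngram_analyzer",
                    "search_analyzer": "standard",
                },
            }
        }
    }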

By default, Elasticsearch uses the Standard analyzer for indexing and searching. The Standard analyzer comprises the Standard Tokenizer with the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter. It splits the text on word boundaries and converts all tokens to lowercase.

An example of how it is used:
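As a small sketch (assuming a local Elasticsearch node and the older query-parameter form of the _analyze API), analyzing the text "This is HackerEarth" with the Standard analyzer:

    import requests

    # Ask Elasticsearch to analyze a sample string with the standard analyzer
    resp = requests.get(
        "http://localhost:9200/_analyze",
        params={"analyzer": "standard", "text": "This is HackerEarth"},
    )
    print([t["token"] for t in resp.json()["tokens"]])
    # ['this', 'is', 'hackerearth']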

The tokens generated are "this", "is", and "hackerearth", but unless the user queries with exactly these words, Elasticsearch will not retrieve the document. So, to increase the discoverability and relevance of the search, Ngrams and Edge Ngrams are used. The next section explains them in more detail.

Elasticsearch provides many filters, tokenizers, and analyzers. So go ahead and read about them as Elasticsearch gives complete freedom to mash them up to build your own analyzers.

The secret sauce!

Ngrams and Edge Ngrams are the secret ingredients when it comes to suggesting the right document based on a user query. So what are they? Wikipedia defines an n-gram as a contiguous sequence of n items from a given sequence of text or speech. In the case of Elasticsearch, they are basically sets of co-occurring characters in a piece of text. For example,
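take the word "hacker" (chosen purely for illustration); its character ngrams of length 3 are:

    # Character trigrams (n = 3) of the word "hacker"
    text, n = "hacker", 3
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    print(ngrams)  # ['hac', 'ack', 'cke', 'ker']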

Elasticsearch provides both Ngram tokenizer and Ngram token filter, which basically split the token into various ngrams for looking up.

In the settings example above, a custom Ngram analyzer is created with an Ngram filter. Notice the two parameters, min_gram and max_gram: these are the minimum and maximum lengths of the ngrams generated as lookup tokens. For example,
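with min_gram set to 3 and max_gram set to 4, the token "django" (an illustrative value) produces these lookup ngrams:

    # Ngrams of "django" for min_gram=3, max_gram=4
    token, min_gram, max_gram = "django", 3, 4
    ngrams = [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]
    print(ngrams)
    # ['dja', 'jan', 'ang', 'ngo', 'djan', 'jang', 'ango']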

The only difference between Edge Ngrams and Ngrams is that Edge Ngrams are generated from one of the two edges of the text that will be used for the lookup. Elasticsearch provides both an Edge Ngram filter and an Edge Ngram tokenizer, which again do the same thing and can be used depending on how you design your custom analyzer. Edge Ngrams take an extra parameter, "side", which denotes the edge of the text from which the ngrams have to be generated; an example is provided in the settings above. Here is an edge ngram example:
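This is a worked illustration rather than output from our actual index: with min_gram=2, max_gram=5, and side set to "front", the token "python" yields ngrams anchored to the front edge.

    # Front-edge ngrams of "python" for min_gram=2, max_gram=5
    token, min_gram, max_gram = "python", 2, 5
    edge_ngrams = [token[:n] for n in range(min_gram, max_gram + 1)]
    print(edge_ngrams)  # ['py', 'pyt', 'pyth', 'pytho']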

To suggest documents to the user intelligently, use Ngrams or Edge Ngrams to build custom analyzers for indexing and querying the fields of the document type.
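As a sketch of the query side (the index, type, and field names are the illustrative ones used above), a plain match query against a field indexed with an ngram analyzer and searched with the standard analyzer lets partial input from the user match the stored ngrams:

    import requests

    # Match query against the hypothetical "institute" field of the "suggestions" index
    query = {"query": {"match": {"institute": "indian inst"}}}
    resp = requests.post(
        "http://localhost:9200/suggestions/profile/_search", json=query
    )
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"]["institute"])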

Deployment

For deployment, we have used Haystack to index the models and query the index. Haystack provides an easy way of creating, updating, building, and rebuilding indexes. As some of the fields require their own analyzers for indexing and searching, we have created custom fields for the search indexes.
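A minimal sketch of what such a custom field and search index might look like (the class, model, and field names are assumptions, not HackerEarth's actual code), assuming django-haystack with the Elasticsearch backend:

    from haystack import indexes

    class CustomNgramField(indexes.CharField):
        """A CharField tagged with a custom field type so the backend can
        attach our ngram analyzers to it in build_schema."""
        field_type = "custom_ngram"

    class ProfileIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True)
        institute = CustomNgramField(model_attr="institute")

        def get_model(self):
            # Hypothetical Django model that stores the profile data
            from profiles.models import Profile
            return Profile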

To apply our custom analyzers, we created a custom backend for Elasticsearch and overrode its build_schema function. The backend inherits from ElasticsearchSearchBackend, and its DEFAULT_SETTINGS attribute is set to our custom Elasticsearch settings. With that, the custom analyzers are ready for use.
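Here is a rough sketch of that backend (the settings dict and the field-type handling refer to the illustrative examples above; the real code depends on the analyzers you define):

    from haystack.backends.elasticsearch_backend import (
        ElasticsearchSearchBackend,
        ElasticsearchSearchEngine,
    )

    class ConfigurableElasticBackend(ElasticsearchSearchBackend):
        # Replace the stock index settings with the ones carrying our custom analyzers
        DEFAULT_SETTINGS = SUGGESTION_SETTINGS  # the illustrative settings dict shown earlier

        def build_schema(self, fields):
            content_field_name, mapping = super(
                ConfigurableElasticBackend, self
            ).build_schema(fields)
            # Attach the custom analyzers to the fields that ask for them
            for field_name, field_class in fields.items():
                field_mapping = mapping[field_class.index_fieldname]
                if field_class.field_type == "custom_ngram":
                    field_mapping["index_analyzer"] = "ngram_analyzer"
                    field_mapping["search_analyzer"] = "standard"
            return (content_field_name, mapping)

    class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
        backend = ConfigurableElasticBackend

Pointing the ENGINE entry of HAYSTACK_CONNECTIONS at this engine class makes Haystack use the custom backend.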

Now that all of this is set up, index your data and make smart suggestions!
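For instance (the field name follows the hypothetical index above), rebuilding the Haystack index and querying it for suggestions could look like:

    # Build (or rebuild) the Elasticsearch index from the Django models:
    #   python manage.py rebuild_index
    from haystack.query import SearchQuerySet

    # Fetch up to ten suggestions for a partial user input
    results = SearchQuerySet().filter(institute="indian inst")[:10]
    suggestions = [result.institute for result in results]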

Going forward, we plan to deploy this site-wide and make suggestions better by analyzing the user input to create new options in the drop-downs.

Send an email to support@hackerearth.com for any bugs or suggestions.
This post was originally written for the HackerEarth Engineering blog by Karthik Srivatsa.
