Tweaking the search

Setting up a search engine is one thing, but making it actually work, in the human sense, is another story. A search that basically greps through all your content is the bare minimum; what makes Google awesome is that when you search for something, you quickly find good content. A search should return good results, and the best results first. Here are some of the things we did to make the Scoop.it search look good.

Boosts

The simplest way of tweaking the search is manipulating boosts, at index time and at query time.

There are boosts per document on the indexing side. In Scoop.it we have a notion of "topic score", which tries to give a better score to the better topics in terms of activity and quality. So it is a perfect fit for a document boost.

And there are the boosts on the query. The search query is a composition of subqueries, one on each field of your document, and each subquery is boosted according to the importance of its field. For instance in Scoop.it, the topic title has a higher boost than the topic description.

A generally good rule is to require the search to match at least one of the fields displayed in the canonical view of the document. From the end user's point of view, if the match is on something that is not visible, they won't understand why the document matched. Then there are some fields which are optional in terms of matching. For instance, when searching for posts in Scoop.it, we want to find the query term in the post title, the post content or the post insight. But a post can be tagged too. Since tags are not visible, we don't want them to be part of the main query, which requires at least one match among the visible fields. But if a tag matches the query, the post should get a better score. Lucene has a very nice way of doing this; this is where it really is a search tool, unlike the SQL LIKE operator. Boolean queries in Lucene are composed of subqueries, each with an occurrence which is one of MUST, SHOULD or MUST_NOT. Here is what a basic query against the posts in Scoop.it looks like:

{ "query": {
    "bool": {
        "must": {
            "bool": {
                "should": [{ "match": {
                    "title": {
                        "query": "shark",
                        "type": "phrase_prefix",
                        "boost": 5
                    }
                }},
                { "match": {
                    "content": {
                        "query": "shark",
                        "type": "phrase_prefix",
                        "boost": 2
                    }
                }},
                { "match": {
                    "insight": {
                        "query": "shark",
                        "type": "phrase_prefix",
                        "boost": 1
                    }
                }}]
            }
        },
        "should": { "match": {
            "tag": {
                "query": "shark",
                "type": "phrase_prefix",
                "boost": 1
            }
        }}
    }
} }

As you can see, the top-level boolean query is composed of a mandatory subquery (must) and an optional one (should). The mandatory subquery is itself composed of a series of should clauses, which is equivalent in classic boolean algebra to a disjunction (an OR query): at least one of them has to match. This way, matching the optional tag subquery doesn't affect whether a document qualifies for the result set. If it matches, it only contributes some bonus points to the final score of the document.
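To make the must/should split concrete, here is a toy model in plain Java. It is only a sketch of the matching logic, not Lucene's actual scoring: the class BoolQuerySketch, its substring matching and the way boosts are simply summed are all assumptions made for the illustration.

```java
import java.util.Map;
import java.util.OptionalDouble;

// Toy model (hypothetical, not Lucene code): visible fields decide whether
// a post qualifies, the tag field can only add a bonus.
public class BoolQuerySketch {

    // Boosts of the visible fields, taken from the query shown earlier.
    static final Map<String, Double> VISIBLE_BOOSTS =
            Map.of("title", 5.0, "content", 2.0, "insight", 1.0);
    static final double TAG_BOOST = 1.0;

    // Returns empty when the post does not qualify, otherwise a toy score
    // (sum of the boosts of the matching fields).
    static OptionalDouble score(Map<String, String> post, String term) {
        double visible = 0;
        for (Map.Entry<String, Double> e : VISIBLE_BOOSTS.entrySet()) {
            if (contains(post.get(e.getKey()), term)) {
                visible += e.getValue();
            }
        }
        if (visible == 0) {
            // The "must" clause failed: a tag match alone is not enough.
            return OptionalDouble.empty();
        }
        double bonus = contains(post.get("tag"), term) ? TAG_BOOST : 0;
        return OptionalDouble.of(visible + bonus);
    }

    // Naive case-insensitive substring match, standing in for a real analyzer.
    static boolean contains(String text, String term) {
        return text != null && text.toLowerCase().contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        Map<String, String> titled = Map.of("title", "Shark attack");
        Map<String, String> titledAndTagged = Map.of("title", "Shark attack", "tag", "shark");
        Map<String, String> taggedOnly = Map.of("title", "Ocean life", "tag", "shark");

        System.out.println(score(titled, "shark"));
        System.out.println(score(titledAndTagged, "shark"));
        System.out.println(score(taggedOnly, "shark"));
    }
}
```

In this model the post that is only tagged "shark" is rejected, while the same tag on a post whose title matches raises its score above the untagged one.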

Not there yet

But even after manipulating these basic potentiometers, we were not satisfied with the quality of the results. The only inputs so far are the content itself and our score, which is an indication of the quality of the content. We were missing freshness and popularity (number of visitors and views). The issue with these two new inputs is that they vary over time.

In Scoop.it we have a good number of topics, but reindexing all of them every day is manageable. So for topics the fix is easy: cron a reindex. On the other hand we have millions of posts, and reindexing them would take days. We had to find another solution, and that solution is a custom scorer.

Writing a scorer

As in the example in the documentation, you could try the default scripting language, which is mvel. It is nice enough for prototyping. But if you have a large number of documents, use "native" scripts in production. Native scripts are actual Java code that you add to elasticsearch as a plugin. The mvel documentation says it is very fast, and it may well be fast for a scripting engine, but don't forget that the scorer is called for each matching document, so there will be a lot of calls. Here we want to leverage the JVM and its JIT, since a scorer is "just" some computation over numbers in RAM.

Writing a scorer is about implementing org.elasticsearch.script.SearchScript. The final score can build on the score Lucene has computed for the initial query, with all the boosts and boolean scoring applied as explained in the first paragraphs. The fields of the document being scored are very easy to access: elasticsearch takes care of retrieving them and caches them aggressively. Here is a version of a "post scorer" which implements the requirements explained earlier.

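// Imports for the ScriptDocValues-based API; package names may differ
// in your elasticsearch version.
import org.elasticsearch.index.fielddata.ScriptDocValues.Doubles;
import org.elasticsearch.index.fielddata.ScriptDocValues.Longs;
import org.elasticsearch.script.AbstractSearchScript;

public class ScorePostScript extends AbstractSearchScript {

    // Reference "now", captured once when the script instance is created
    private final long time;

    public ScorePostScript() {
        time = System.currentTimeMillis();
    }

    @Override
    public Object run() {
        // Freshness: the unbounded age is squeezed into a factor in (1, 1.1]
        long curationDate = ((Longs) doc().get("curation_date")).getValue();
        float s = 1 + 0.1f / (1e-10f * (time - curationDate) + 1);
        // Topic quality: the value is bounded, a linear factor is enough
        float topicScore = (float) ((Doubles) doc().get("topic_score")).getValue();
        s = s * (1 + 0.1f * topicScore);
        // Relevance: the unbounded Lucene score is squeezed into [1, 1.1)
        float luceneScore = score();
        s = s * (0.1f * (1 - 1 / (luceneScore + 1)) + 1);
        return s;
    }
}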

Implementing such a scorer requires a little bit of mathematical skill: it is quite important that the scorer reacts correctly to the range of data it is expected to score. For instance, the age derived from curation_date is unbounded, so I chose the y=a/(b*x+1) function to fit the result into a reasonable fixed range. The topic_score data, on the contrary, is bounded, so I just need to apply a factor. The final score is about adjusting the ranges of the different fields so they are comparable (you may boost the more important fields more than others), and mixing them (here I simply chose the product of the values). And to play with these functions and their factors, I just used the stupidest tool, Google.
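To get a feel for these ranges without Google, the three factors can be reproduced as plain Java functions. This is a sketch that mirrors the formulas of the scorer above outside of elasticsearch; the class and method names are made up for the illustration, and the sample ages in main are arbitrary.

```java
// Hypothetical helper, not part of elasticsearch: the three factors of the
// post scorer as pure functions, to explore their ranges.
public class ScoreShapeSketch {

    // y = a / (b * x + 1) with a = 0.1 and b = 1e-10, shifted by 1:
    // stays in (1, 1.1] for any age >= 0, however old the post is.
    static double freshness(long ageMillis) {
        return 1 + 0.1 / (1e-10 * ageMillis + 1);
    }

    // topic_score is assumed to be bounded, so a linear factor is enough.
    static double topic(double topicScore) {
        return 1 + 0.1 * topicScore;
    }

    // The Lucene score is unbounded: 1 - 1/(x + 1) maps [0, +inf) to [0, 1),
    // so this factor stays in [1, 1.1).
    static double relevance(double luceneScore) {
        return 1 + 0.1 * (1 - 1 / (luceneScore + 1));
    }

    // The final score is simply the product of the three factors.
    static double combined(long ageMillis, double topicScore, double luceneScore) {
        return freshness(ageMillis) * topic(topicScore) * relevance(luceneScore);
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        for (long days : new long[] {0, 1, 30, 365}) {
            System.out.printf("age=%d days -> freshness=%.4f%n", days, freshness(days * day));
        }
    }
}
```

With b = 1e-10 and ages in milliseconds, the freshness bonus is halved when b*x = 1, that is after 1e10 ms, roughly 116 days. Changing b moves that point, changing a changes how much freshness can weigh against the other factors.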

It is recommended to test the computation on real data, so that the tweaking is driven by all the weird data you actually have, not the "qsdf" data of your integration environment. But that often means testing on a large number of Lucene documents, so you cannot debug the usual way, with a breakpoint or with logging: remember that the scorer is applied to every document that matches the query. So I recommend that you implement org.elasticsearch.script.ExplainableSearchScript (my little contribution to Elasticsearch), so that your scorer explains how a document has been scored, smoothly integrated into the explain API of Lucene. Then launch a search on elasticsearch with the extra parameter explain set to true, and you'll get all the gory details of the scoring, but only for the documents that matter: the ones displayed in the first pages.

It works, so what's next?

This tweaking has been in production for more than a year, and it is working like a charm. Still, having our custom scorer use these data fields has added a good deal of pressure on the JVM memory. It was recently awesomely reduced, as shown in my first post. Search latencies increased a little compared to classical query boosting, but we have not yet explored the tuning of elasticsearch itself: we are still running with the default config, so we probably have some room there (not having to care much about config optimization also shows that elasticsearch is a nice and easy piece of software!).

The quality of the search obviously still has room for improvement. We are seeing some really nice features coming in elasticsearch. For instance there is the common terms query, which turns its conjunction query into a disjunction one as needed. We also did some semantic analysis on the content in Scoop.it for another purpose (teasing!), and I bet it could boost the quality of our search engine even more. So keep reading, there'll be more awesome stuff in the next episodes!