Elasticsearch vs Azure Search vs DPR

December 15, 20206 Min Read

How to search your documents and retrieve information?

How about opening each document and pressing Ctrl + F to find exact matches? If the document has the desired word, great. Done! Easy!

But what if the document does not contain the word? Or has it in too many places?

What if you search for the term "dog" but the document contains "dogs". You are out of luck 😁


In this blog post we will have a look at the so-called problem of Text Ranking. The goal of text ranking is to generate an ordered list of texts retrieved in response to a query. Although the most common formulation of text ranking is search, it applies to many natural language processing applications such as:

  • Question-answering: In question answering, a system aims to identify a span of text that directly answers the user's question instead of returning a list of documents that the user must then manually scan
  • Community Question Answering: Users sometimes search for answers not by attempting to find relevant information directly, but by locating another user who has asked the same or similar question, for example, in a frequently-asked questions (FAQ) list or an online forum such as Quora or Stack Overflow. Answers to those questions usually address the user's information need.
  • Information Filtering: Filtering considers the opposite scenario where, for the most part, a static query is posed against a stream of texts. For example, the google scholar service sends email digests to users whenever a paper that matches the user's interest is published.

We will explore two different commercial systems and a novel approach of retrieving information:

  • Elasticsearch: A highly scalable open-source and commercial search engine used by many companies in production. It allows you to store, search, and analyze big volumes of data quickly and in near real time.
  • Azure Cognitive Search: A managed search service from Azure. It has build in AI capabilities but we won't explore those in this post.
  • Dense Passage Retriever: Information retrieval using dense vector representation. Utilizes neural networks and learns embeddings from a small number of queries and passages. For more informaton refer to this paper.

These systems retrieve documents thus, we will refer to them as retrievers. All three retrievers represent the query and documents as vectors. In Elasticsearch and Azure, these vectors are sparse, while DPR obtains dense vectors.

TF-IDF and BM25 algorithm

BM25 is the default similarity algorithm used by elasticsearch and azure search. TF-IDF is a commonly used baseline for information retrieval that exploits two key intuitions:

  • documents that have more lexical overlap with the query are more likely to be relevant
  • words that occur in fewer documents are more significant than words that occur in many documents

Given a query, a tf-idf score is computed for each document as follows:

score = tf * idf


  • tf is how many times words in the query occur in that document.
  • idf is the inverse of the fraction of documents containing the word.

In practice, both terms are usually log normalised.

BM25 is a variant of TF-IDF. Not only that, but it improves upon TF-IDF in two main aspects:

  • It saturates tf after a set number of occurrences of the given term in the document
  • It normalises by document length so that short documents are favoured over long documents if they have the same amount of word overlap with the query

BM25 has also two free variables we can tweak, namely k1 and b

The k1 variable controls the scaling between the query term frequency of each matching terms to the final relevance score of a document-query pair. Trust us and read that sentence again 😉

Example: k1 = 0

In this setting the contribution of a single matching term is the same for all matching documents, regardless of how many times that term appears in the text.

Example: k1 = 3

In this setting the larger k1 value allows the score to increase as more instances of the same term are found in the documents.

The b variable controls how the length of a document affects the score. A value of 0 means the document length will have no influence. A value of 1 means the score will be normalized by the documents length.

For our tests we choose the common values of k1=1.2 and b=0.75


Exact match techniques such as TF-IDF and BM25 are powerless in cases where terms from the query and the texts don't match at all.

Consider the case where a user searches for "black bears attacks". If the document talks about "brown grizzly bears", the search will not find any matches.

DPR tries to solve this problem by using two BERT language models. These language models have attention and can understand semantics.

DPR encodes all documents and the incoming query into dense vector representation. Ranking happens by taking the dot product similarity between query and document vectors. The more similar they are, the smaller the dot product output.

DPR is comparatively expensive in required computation since all the database documents need processing through a language model. In contrast to BM25, it is both more time consuming and resources intensive (CPU + GPU)

Options for storing the outputs (dense vectors) and calculating the dot product at scale include elasticsearch, FAISS, and SPTAG.

In a future post we might write a follow up on our experience with FAISS and SPTAG.


To be able to evaluate and compare the systems, we need labeled data. That is, for a set of queries the most relevant documents as judged by a human.

We use a subset of the Natural Questions development set containing 50 documents. In particular, the datasets consist of two subsets. The first one includes 50 documents that hold different texts. The second subset includes 54 questions, the corresponding answers, and the document where the answer is retrieved. You can check out the data here.

We will compare the retrievers' Recall. For each query, we let the systems retrieve the top 3 most relevant documents. Knowing the results to all queries, and most importantly, the documents, we can count how many times the right document was among the top 3.

We define Recall as the number of corret retrieved documents / the number of questions.



For 51 out of 54 questions (94.44%), the answer was in the top-3 candidate passages selected by the retriever.

Retriever Recall: 0.9444

For 52 out of 54 questions (96.30%), the answer was in the top-3 candidate passages selected by the retriever.

Retriever Recall: 0.9630


For 52 out of 54 questions (96.30%), the answer was in the top-3 candidate passages selected by the retriever.

Retriever Recall: 0.9630


Retriever Recall
Elastic 0.9444
DPR 0.9630
Azure Search 0.9630

We had higher hopes for DPR but given the dataset we used to evaluate, we see that the BM25 algorithm is comparable to DPR. We see that the azure search service, even though it uses the same? BM25 algorithm as elasticsearch it obtains a slightly higher recall score.

To be able to better judge these results we want to evaluate the systems on a more demanding dataset for example Microsofts MARCO and Microsofts MIMICS.

Stay tunned for our follow up post!

Spread the

Thirsty for more?