Making ElasticSearch Perform Well with Large Text Fields

We're continuing our story about creating Ambar, and this is the second article about ElasticSearch. The first one is Highlighting Large Documents in ElasticSearch.
This article tells the story of making ElasticSearch perform well with documents containing a text field more than 100 MB in size.
The result we achieved is a performance improvement of more than 1100 times compared with the default out-of-the-box setup.

Ok, let's start!

The Problem

First, here's an example of a document Ambar stores in ElasticSearch:

    {
        sha256: "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c",
        meta: [
            {
                id: "21264f64460498d2d3a7ab4e1d8550e4b58c0469744005cd226d431d7a5828d0",
                short_name: "quarter.pdf",
                full_name: "//winserver/store/reports/quarter.pdf",
                source_id: "crReports",
                extension: ".pdf",
                created_datetime: "2017-01-14 14:49:36.788",
                updated_datetime: "2017-01-14 14:49:37.140",
                extra: [],
                indexed_datetime: "2017-01-16 18:32:03.712"
            }
        ],
        content: {
            size: 112387192,
            indexed_datetime: "2017-01-16 18:32:33.321",
            author: "John Smith",
            processed_datetime: "2017-01-16 18:32:33.321",
            length: "",
            language: "",
            state: "processed",
            title: "Quarter Report (Q4Y2016)",
            type: "application/pdf",
            text: ".... looooong text here ...."
        }
    }

By default, the atomic, indivisible unit of Lucene's index is _source: it contains all the fields of the indexed document. The index itself, on the other hand, is a sequence of tokens and terms extracted from all the indexed documents.

So, let's dive into the problem itself.

Let's create a new index and index a few thousand documents like the one shown above. Each document contains about twenty short fields and one extra-long text field, content.text, roughly 100 MB in length.

Now let's try to query documents from the index by some range of dates:

curl -X POST -H "Content-Type: application/json" -d '{
    "query": {
        "range": {
            "meta.created_datetime": { "gt": "2017-01-14 00:00:00.000" }
        }
    }
}' "http://ambar:9200/ambar_file_data/_search"

You'll see the result only after a few minutes, and here's why:

  1. First of all, ES performs the search through all the terms from all the documents and their fields, including the gigantic content.text
  2. During the merge phase, ES collects all the matching documents into memory (again, including the huge content.text)
  3. After building the results in memory, ES tries to send these gigantic documents as a single JSON response.

We can stop ES from sending the whole documents in the response body with source filtering, but how can we exclude the content.text field from the first two steps?
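For reference, source filtering alone looks like this (a sketch against the same index as above): the excludes list keeps content.text out of the returned hits, but ES still searches through it and merges it internally.

```
curl -X POST -H "Content-Type: application/json" -d '{
    "_source": {
        "excludes": ["content.text"]
    },
    "query": {
        "range": { "meta.created_datetime": { "gt": "2017-01-14 00:00:00.000" } }
    }
}' "http://ambar:9200/ambar_file_data/_search"
```

So source filtering only helps with the third step (the giant JSON response); the first two steps are what the index tuning below addresses.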

Tuning The Index

It's quite obvious that merging and serializing results containing the huge content.text field is extremely expensive in terms of performance. To make ES omit this field while searching and composing results, we have to make ES handle it separately from the other fields.

First of all, in the index mapping you should set the parameter store: true for the content.text field. This tells ES to store the field separately from the document's other fields. But that alone won't exclude the field from _source, so ES will still touch this field during search.
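As a sketch, the relevant part of the mapping might look like this (the type name ambar_file is an assumption; on ES 2.x the field type would be string instead of text):

```
"mappings": {
    "ambar_file": {
        "properties": {
            "content": {
                "properties": {
                    "text": { "type": "text", "store": true }
                }
            }
        }
    }
}
```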

That's why we should also exclude content.text from _source by adding these parameters to the index settings:

"_source": {
    "excludes": [
        "content.text"
    ]
}

So, now we've finally separated content.text from _source. From this point on, ES indexes _source and content.text separately, and never touches content.text during search unless you explicitly ask for it. Every query over the 'small' fields now runs against dramatically smaller _source documents, without searching through or merging the huge content.text fields. Querying content.text itself is also much more efficient, since it's stored separately and there's no need to merge it with _source.
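Putting both settings together, a minimal index-creation request might look like this (a sketch: the type name ambar_file and the field type are assumptions; adjust them to your ES version and schema):

```
curl -X PUT -H "Content-Type: application/json" -d '{
    "mappings": {
        "ambar_file": {
            "_source": {
                "excludes": ["content.text"]
            },
            "properties": {
                "content": {
                    "properties": {
                        "text": { "type": "text", "store": true }
                    }
                }
            }
        }
    }
}' "http://ambar:9200/ambar_file_data"
```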

With the new index settings, let's query our index for documents larger than 100 MB that contain the phrase 'John Smith' in the content.text field:

curl -X POST -H "Content-Type: application/json" -d '{
    "from": 0,
    "size": 10,
    "query": {
        "bool": {
            "must": [
                { "range": { "content.size": { "gte": 100000000 } } },
                { "match_phrase": { "content.text": "John Smith" } }
            ]
        }
    }
}' "http://ambar:9200/ambar_file_data/_search"

Here are the average search timings for an index containing about 3.5 million documents with an average size of 100 MB:

  • Auto mappings - took 6.8 seconds
  • Modified mappings - took 6 milliseconds

An 1100-fold search performance improvement is quite a result, and definitely worth the time spent setting up ElasticSearch properly! But there's a nasty side effect...

Side Effects

Let's get straight to it: you can't partially update a document with an update script without losing the separately stored field's value! When you update any field of a document, ES has to re-index the whole document, which is fine. But it won't re-read and re-index the separately stored field. For reference, in our index, if you add a new object to the meta array of any document, the updated document's content.text value becomes undefined and is lost. So, if you're going to partially update documents like these, you'll have to restore the stored field manually.
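One possible workaround (a sketch, not necessarily how Ambar does it; the document id placeholder and the stored_fields parameter follow ES 5.x conventions) is to fetch the stored field before the update and send it back as part of the partial document, so the re-indexed document keeps its text:

```
# 1. Fetch the separately stored field first (hypothetical document id)
curl -X GET "http://ambar:9200/ambar_file_data/ambar_file/<doc_id>?stored_fields=content.text"

# 2. Include the fetched text in the partial update, so it isn't lost on re-index
curl -X POST -H "Content-Type: application/json" -d '{
    "doc": {
        "meta": [ "<updated meta array>" ],
        "content": { "text": "<text fetched in step 1>" }
    }
}' "http://ambar:9200/ambar_file_data/ambar_file/<doc_id>/_update"
```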

The End

We've been using ES for a long time now, and we've never regretted choosing it. You just need to be patient and set it up right!

Ilya P
