Tuning Elasticsearch Index Settings for Logs

Indexing logs in Elasticsearch is resource intensive, and a poorly configured index can make for brutally slow log queries. There are surprisingly few guides that address the common question of how best to index logs. Here I will walk through the start of a sane index template that should work for most use cases.

The factors we want to optimize for are indexing throughput, query speed, and index size.

The settings here are tested on Elasticsearch 6.x and 7.x. This guide also assumes that you have parsed and transformed your logs into structured JSON using something like fluentd or logstash.

Index Template

An index template is a set of configurations for an index that are applied every time a matching index is created. Usually, index templates are used with the Rollover API to create size-constrained indexes for logs.

An index template contains three top-level fields: index_patterns, settings, and mappings. index_patterns is a list of glob expressions that define which index names match the template.

PUT /_template/logstash_template
{
  "index_patterns": ["logstash*"],
  "settings": {...},
  "mappings": {...}
}

The sane default for index_patterns is ["logstash*"], since most log collection tools will automatically create indexes of the format logstash-yyyy.MM.dd (e.g. logstash-2019.08.24).

Settings

settings contains index-level settings, as well as settings that apply to all fields.

"settings": {
  "index.mapping.ignore_malformed": true,
  "index.query.default_field": "message",
  "index.refresh_interval": "30s",
  "index.search.slowlog.threshold.query.debug": "0ms",
  "index.search.slowlog.threshold.query.info": "1s",
  "index.search.slowlog.threshold.fetch.debug": "0ms",
  "index.search.slowlog.threshold.fetch.info": "1s",
  "index.translog.sync_interval": "1m",
  "index.number_of_shards": 4
  "index.number_of_replicas": 1
}

A critical field to define here is "index.query.default_field". This selects the field that will be used to match queries that don't specify a field (like when someone types just hello into Kibana). The default behavior is to search over all fields, which is wasteful and slow.
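As a quick illustration, a query_string query with no field (roughly what a bare Kibana search translates to) will now run against only the message field:

GET /logstash-*/_search
{
  "query": {
    "query_string": {
      "query": "hello"
    }
  }
}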

"index.mapping.ignore_malformed": true is important. Otherwise, if a field was first encountered as a number, and then appears as a different type, like string , Elasticsearch will reject the subsequent documents. With it set to true , Elasticsearch will not index the field, but will accept the document.

"index.translog.sync_interval" makes Elasticsearch flush to disk less often. "index.refresh_interval" controls the amount of time between when a document gets indexed and when it becomes visible. Increasing these values can increase indexing throughput. More performance settings can be found at Tune for indexing speed.

"index.search.slowlog" settings enable slow query logs, so that you can tune the index as you learn about common query patterns for your logs.

"index.number_of_shards" control the number of primary shards for your index. To fully utilize your cluster, this should usually be set to 1/2 the number of data nodes in your cluster. "index.number_of_replicas": 1 will create 1 replica shard for every primary shard to saturate the other half. Conventional wisdom recommends setting the shard count to 2-3x the number of nodes to account for adding nodes later. This doesn’t apply in a logging cluster, since you can just roll over the index and add more shards to the new index. Consequently, these settings are actually better used in index rollover, which I’ll talk about in a subsequent post.

settings is also the place to set analyzers and normalizers, which control how Elasticsearch processes the values it indexes.

"settings": {
  "index.analysis": {
    "analyzer": {
      "log_analyzer": {
        "type": "pattern",
        "pattern": "\\W+",
        "lowercase": true
      }
    },
    "normalizer": {
      "lowercase_normalizer": {
        "type": "custom",
        "filter": ["lowercase"]
      }
    }
  }
}

The log_analyzer breaks tokens on non-word characters for full text indexes, instead of the default of breaking on whitespace. This allows URLs and package paths to get tokenized.

The lowercase_normalizer can be used on keyword (exact string match) indexes to lowercase the field when it is indexed and to lowercase any search queries on this field. Humans don’t usually expect queries to be case sensitive!
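You can sanity-check both against an index created from the template using the _analyze API (a sketch; the index name and sample text are just examples):

POST /logstash-2019.08.24/_analyze
{
  "analyzer": "log_analyzer",
  "text": "GET /api/v1/users?id=42"
}
# tokens: get, api, v1, users, id, 42

POST /logstash-2019.08.24/_analyze
{
  "normalizer": "lowercase_normalizer",
  "text": "Consumer-Thread-7"
}
# single token: consumer-thread-7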

Note that analyzers and normalizers are not applied automatically. They’ll be referenced from the mappings below.

Mappings

Mappings define which fields in your log get indexed. By default, Elasticsearch will automatically create a mapping for every field it sees. This can lead to an explosion in the number of indexed fields and in index size, not to mention slow indexing throughput. Here's a reasonable set of mapping settings.

"mappings": {
  "_size": {
    "enabled": true
  },
  "dynamic": true,
  "dynamic_templates": [
    {
      "no_index_fields_past_depth_2": {
        "path_match": "*.*",
        "match_mapping_type": "object",
        "mapping": {
          "type": "object",
          "enabled": false
        }
      }
    },
    {
      "create_keyword_index_for_all_string_fields": {
        "match": "*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer",
          "ignore_above": 1000
        }
      }
    }
  ]
}

"_size”: true enables the mapper-size plugin, available in most installations. This creates a _size field that can later be used to figure out where the largest logs are coming from.

"dynamic": true enables dynamic field mappings, allowing Elasticsearch to automatically create indexes for new fields it encounters. This is essential for teams that do structured logging, as it’ll allow devs to create arbitrary fields that can be searched later, without requiring a change to index settings.

However, setting **"dynamic": true can be dangerous.** By default it creates indexes for every field, regardless of nesting level. Also by default, string fields are analyzed and indexed for full text search, which can be expensive.

Instead, we create custom dynamic mappings to constrain what indexes can be created automatically. The no_index_fields_past_depth_2 dynamic mapping prevents fields nested more than two levels deep (foo.bar.baz) from being indexed.
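A quick sketch of the effect: with the template above, a document with a deeply nested payload is still stored in _source, but nothing below the second level gets its own index.

# foo.bar matches path_match "*.*" as an object, so foo.bar.baz is
# stored in _source but not indexed (and not searchable)
PUT /logstash-2019.08.24/_doc/3
{
  "foo": {
    "bar": {
      "baz": "not searchable"
    }
  }
}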

The create_keyword_index_for_all_string_fields dynamic mapping applies the lowercase normalizer from settings, and defaults all string fields to keyword indexes. This allows for quick searches on a potential field like customer_id:"cust_12345". "ignore_above": 1000 is an arbitrary limit on the length of values that get indexed, to prevent giant strings (think stacktraces and response payloads) from blowing up the size of the index, since they aren't tokenized.
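Because the normalizer is applied at both index time and query time, a search doesn't have to match the original casing. A sketch (customer_id is just the hypothetical field from above):

GET /logstash-*/_search
{
  "query": {
    "match": {
      "customer_id": "CUST_12345"
    }
  }
}
# matches documents that logged customer_id as cust_12345, CUST_12345, etc.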

Finally, we can explicitly define mappings for some special fields:

"mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "thread": {
        "type": "text",
        "analyzer": "log_analyzer",
        "norms": false,
        "similarity": "boolean",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      },
      "message": {
        "type": "text",
        "analyzer": "log_analyzer",
        "norms": false,
        "similarity": "boolean"
      }
    }
  }
}

Most log collectors will create a @timestamp field for the timestamp of the log entry. We explicitly create a date index for it.

thread here is a common field that usually contains the name of the executing thread. We want to support full text search on it, so we create a full text index and apply the log_analyzer from before. We set "similarity": "boolean" to disable scoring (since we almost always want to order logs by time) for a minor performance gain. Likewise, "norms": false removes some metadata used only for scoring. Finally, we add a thread.keyword index, which allows for regular expression queries, like thread.keyword:/consumer-thread-[0-9]+/.
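In query DSL form, that regular expression search would look roughly like this (a sketch; the thread name pattern is just an example):

GET /logstash-*/_search
{
  "query": {
    "regexp": {
      "thread.keyword": "consumer-thread-[0-9]+"
    }
  }
}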

The message field is where the log message goes and can be quite long, so we create only a text index to enable full text queries. If you want to support regular expressions on the message field, just add a message.keyword index like the one for thread. It's important to note that regexp searches in Elasticsearch are extremely inefficient; Elasticsearch will more or less scan the entire index to match the regexp.
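Day to day, most log searches end up being a full text match on message combined with a time filter and sorted by @timestamp, along these lines (a sketch; the error text is arbitrary):

GET /logstash-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "connection timed out" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ]
}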

Conclusion

Elasticsearch makes it really easy to index large amounts of logs, making it a popular choice for logging backends. However, as is hopefully clear from this post, it's actually quite tricky to craft an efficient configuration that supports scalable indexing and search. The settings in this post should provide a sane, solid foundation for any ELK logging backend. Hopefully this post saves a few people from having to crawl the internet to do the same thing.