Tuning Elasticsearch Index Settings for Logs
Indexing logs in Elasticsearch is resource intensive, and a poorly configured index can make for brutally slow log queries. There are surprisingly few guides that address the common question of how best to index logs. Here I will walk through the start of a sane index template that will work for most uses.
The factors we want to optimize for are:
- Time to ingest logs
- Time to return query results
- User experience
- Cost
The settings here are tested on Elasticsearch 6.x and 7.x. This guide also assumes that you have parsed and transformed your logs into structured JSON using something like fluentd or logstash.
Index Template
An index template is a set of configurations for an index that are applied every time a matching index is created. Usually, index templates are used with the Rollover API to create size-constrained indexes for logs.
An index template contains 3 top-level fields: index_patterns, settings, and mappings. index_patterns is a list of glob expressions that define the index names that match the template.
PUT /_template/logstash_template
{
"index_patterns": ["logstash*"],
"settings": {...}
"mappings": {...}
}
The sane default for index_patterns is ["logstash*"], since most log collection tools will automatically create indexes of the format logstash-yyyy.MM.dd (e.g. logstash-2019.08.24).
Settings
settings contains index-level settings, as well as settings that apply to all fields.
"settings": {
"index.mapping.ignore_malformed": true,
"index.query.default_field": "message",
"index.refresh_interval": "30s",
"index.search.slowlog.threshold.query.debug": "0ms",
"index.search.slowlog.threshold.query.info": "1s",
"index.search.slowlog.threshold.fetch.debug": "0ms",
"index.search.slowlog.threshold.fetch.info": "1s",
"index.translog.sync_interval": "1m",
"index.number_of_shards": 4
"index.number_of_replicas": 1
}
A critical field to define here is "index.query.default_field". This selects the field that will be searched by queries that don't specify a field (like when someone types just hello into Kibana). The default behavior is to search over all fields, which is wasteful and slow.
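For example, a bare search for hello in Kibana becomes a query_string query with no explicit field, which now only has to look at message. A minimal sketch in Kibana Dev Tools syntax (the index name is just an example):

GET /logstash-2019.08.24/_search
{
  "query": {
    "query_string": {
      "query": "hello"
    }
  }
}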
"index.mapping.ignore_malformed": true
is important. Otherwise,
if a field was first encountered as a number
, and then appears as a different
type, like string
, Elasticsearch will reject the subsequent documents. With
it set to true
, Elasticsearch will not index the field, but will accept the
document.
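As an illustration (the index, field names, and values here are made up), if status_code was first seen as a number, a later log where it shows up as a non-numeric string is still accepted; only the malformed value goes unindexed:

POST /logstash-2019.08.24/_doc
{ "status_code": 500, "message": "upstream error" }

# Without ignore_malformed, this second document would be rejected with a mapper_parsing_exception
POST /logstash-2019.08.24/_doc
{ "status_code": "five hundred", "message": "upstream error" }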
"index.translog.sync_interval"
makes Elasticsearch flush to disk less often.
"index.refresh_interval"
controls the amount of time between when a document
gets indexed and when it becomes visible. Increasing these values can increase
indexing throughput. More performance settings can be found at Tune for indexing
speed.
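Both can also be adjusted on a live index through the _settings API, for example to pause refreshes entirely during a large backfill and restore them afterwards (a sketch; the index name is illustrative):

PUT /logstash-2019.08.24/_settings
{ "index.refresh_interval": "-1" }

# ... run the backfill, then restore the value from the template
PUT /logstash-2019.08.24/_settings
{ "index.refresh_interval": "30s" }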
"index.search.slowlog"
settings enable slow query logs, so that you can tune
the index as you learn about common query patterns for your logs.
"index.number_of_shards"
control the number of primary shards for your index.
To fully utilize your cluster, this should usually be set to 1/2 the number of
data nodes in your cluster. "index.number_of_replicas": 1
will create 1
replica shard for every primary shard to saturate the other half. Conventional
wisdom recommends setting the shard count to 2-3x the number of nodes to account
for adding nodes later. This doesn’t apply in a logging cluster, since you
can just roll over the index and add more shards to the new index. Consequently,
these settings are actually better used in index rollover, which I’ll talk about
in a subsequent post.
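With 4 primaries and 1 replica each on an 8-data-node cluster, every node ends up hosting one shard of the index. Once an index exists, you can sanity-check how the shards spread out with the _cat API (purely a diagnostic, not part of the template):

GET /_cat/shards/logstash*?v&h=index,shard,prirep,state,node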
settings is also the place to set analyzers and normalizers, which control how Elasticsearch processes values it indexes.
"settings": {
"index.analysis": {
"analyzer": {
"log_analyzer": {
"type": "pattern",
"pattern": "\\W+",
"lowercase": true
}
},
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"filter": ["lowercase"]
}
}
}
}
The log_analyzer breaks tokens on non-word characters for full text indexes, instead of the default of breaking on whitespace. This allows URLs and package paths to get tokenized.
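You can check the tokenization with the _analyze API against any index created from the template (the index name and sample text are illustrative):

POST /logstash-2019.08.24/_analyze
{
  "analyzer": "log_analyzer",
  "text": "com.example.api.UserController handled GET /v1/users/42"
}

This should return the lowercase tokens com, example, api, usercontroller, handled, get, v1, users, and 42, so a search for usercontroller matches.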
The lowercase_normalizer can be used on keyword (exact string match) indexes to lowercase the field when it is indexed and to lowercase any search queries on this field. Humans don't usually expect queries to be case sensitive!
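For instance, assuming a keyword field named level that uses the normalizer (a hypothetical field), a term query for ERROR matches documents that logged error, because both the stored value and the query term get lowercased:

GET /logstash-2019.08.24/_search
{
  "query": {
    "term": { "level": "ERROR" }
  }
}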
Note that analyzers and normalizers are not applied automatically. They'll be referenced from the mappings below.
Mappings
Mappings define what fields in your log get indexed. By default, Elasticsearch will automatically create a mapping for every field it sees. This can lead to an explosion in the number of indexed fields and index size, not to mention slow indexing throughput. Here's a reasonable set of mapping settings.
"mappings": {
"_size": {
"enabled": true
},
"dynamic": true,
"dynamic_templates": [
{
"no_index_fields_past_depth_2": {
"path_match": "*.*",
"match_mapping_type": "object",
"mapping": {
"type": "object",
"enabled": false
}
}
},
{
"create_keyword_index_for_all_string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "keyword",
"normalizer": "lowercase_normalizer",
"ignore_above": 1000
}
}
}
]
}
"_size”: true
enables the mapper-size
plugin, available in most
installations. This creates a _size
field that can later be used to figure out
where the largest logs are coming from.
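For example, a quick aggregation can show which log producer is responsible for the most bytes. A sketch, where service is a hypothetical field identifying the emitting application:

GET /logstash*/_search
{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "total_bytes": { "sum": { "field": "_size" } }
      }
    }
  }
}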
"dynamic": true
enables dynamic field
mappings,
allowing Elasticsearch to automatically create indexes for new fields it
encounters. This is essential for teams that do structured logging, as it’ll
allow devs to create arbitrary fields that can be searched later, without
requiring a change to index settings.
However, setting **"dynamic": true
can be dangerous.** By default it
creates indexes for every field, regardless of nesting level. Also by default,
string fields are analyzed and indexed for full text search, which can be
expensive.
Instead, we create custom dynamic mappings to constrain what indexes can be created automatically. The no_index_fields_past_depth_2 dynamic mapping prevents nested fields past depth 2 (foo.bar.baz) from being indexed.
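For instance, given a (made-up) log document like the one below, http.status gets indexed normally, but http.request is an object at depth 2, so it matches the template and everything beneath it stays in _source without being indexed:

POST /logstash-2019.08.24/_doc
{
  "http": {
    "status": 500,
    "request": {
      "headers": { "user_agent": "curl/7.64.1" }
    }
  }
}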
The create_keyword_index_for_all_string_fields dynamic mapping applies the lowercase_normalizer from settings, and defaults all string fields to keyword indexes. This allows for quick searches of a potential field like customer_id:"cust_12345". "ignore_above": 1000 is an arbitrary limit on the length of the field that can be indexed, to prevent giant strings (think stacktraces and response payloads) from blowing up the size of the index, since they aren't tokenized.
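Once documents containing such a field have been indexed, you can check what the dynamic template actually produced (customer_id is just an example field name):

GET /logstash-2019.08.24/_mapping

The entry for customer_id should show a keyword type with the lowercase_normalizer and "ignore_above": 1000 applied.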
Finally, we can explicitly define mappings for some special fields:
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"thread": {
"type": "text",
"analyzer": "log_analyzer",
"norms": false,
"similarity": "boolean",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
}
},
"message": {
"type": "text",
"analyzer": "log_analyzer",
"norms": false,
"similarity": "boolean"
}
}
}
Most log collectors will create a @timestamp field for the timestamp of the log entry. We explicitly create a date index for it.
thread here is a common field that usually contains the name of the executing thread. We want to support full text search on it, so we create a full text index and apply the log_analyzer from before. We set "similarity": "boolean" to disable scoring (since we almost always want to order logs by time), for a minor performance gain. Likewise, "norms": false removes some metadata that is only used for scoring. Finally, we add a thread.keyword index, which allows for regular expression queries, like thread.keyword:/consumer-thread-\d+/.
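In query DSL terms, that Kibana query is roughly the following sketch (I use an explicit character class, since Lucene's regexp syntax is more limited than PCRE):

GET /logstash*/_search
{
  "query": {
    "regexp": { "thread.keyword": "consumer-thread-[0-9]+" }
  }
}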
The message field is where the log message goes, and can be quite long, so we create only a text index to enable full text queries. If you want to support regular expressions on the message field, just add a message.keyword index like for thread. It's important to note that regexp searches in Elasticsearch are extremely inefficient: Elasticsearch will more or less scan the entire index to match the regexp.
Conclusion
Elasticsearch makes it really easy to index large amounts of logs, which makes it a popular choice for logging backends. However, as is hopefully clear from this post, it's actually quite tricky to craft an efficient configuration that supports scalable indexing and search. The settings in this post should provide a sane, solid foundation for any ELK logging backend. Hopefully this post saves a few people from having to crawl the internet to do the same things.