InfluxDB vs Elasticsearch for time series and metrics data

When it comes to storing time series data, a multitude of time series databases (TSDB) are available. In this blog post, we will focus on Elasticsearch and InfluxDB. Which of these solutions best suits our needs?


Disclaimer: Every benchmark is different; the times and performance measured here depend strongly on our dataset. It is wise to benchmark with your own dataset to get a more accurate picture of how each solution behaves in your specific case. InfluxData has already published an excellent benchmark on their blog, which led to different results.

The comparison between these solutions will be done in 3 steps:

  • Data ingest performance (in points/second);
  • On-disk storage requirements (in Bytes);
  • Mean query response time (in milliseconds).

Overview 🎬

At redirection.io, we handle each incoming HTTP request from all our clients’ websites in order to decide whether to redirect the user, according to a previously defined ruleset. Then, we log the request in Elasticsearch. This allows our clients to see the evolution of their web traffic, detect recurring HTTP errors and fix them by creating new rules. But logging each request for each project is expensive. Therefore, we have set a log retention of 1 to 31 days, depending on the selected plan.

However, a one-month retention isn’t enough to have beautiful graphs 😢. That’s why we have introduced statistics, which aggregate the logs per hour to keep track of them in the long run… and which are stored as time series. Oh wait, it’s today’s topic! 🎉📍

Dataset ✨

What kind?

The idea was to be able to retrieve a set of points, indexed on time, within a date range.
We wanted to be able to make requests such as:

Get me the list of statistics per hour over the last 7 days for project n, aggregated by status code type and only for Chrome Mobile.

Let’s quickly look at the structure of a statistic. It contains at least:

  • time: timestamp (well, obvious 😉?);
  • value: stored value.

Plus, in our case, a set of indexed fields/tags for the search:

  • project: project uuid;
  • statusCode: HTTP status code;
  • statusCodeType: HTTP status code type (1xx, 2xx, 3xx, 4xx, 5xx);
  • userAgent: user agent;
  • userAgentType: user agent type (mobile, desktop, tool, and bot).
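Put together, a single statistic could look like the following document (all values below are made up for illustration):

```json
{
    "time": "2019-04-01T10:00:00Z",
    "value": 42,
    "project": "7f9c2ba4-e88f-4aea-8393-000000000000",
    "statusCode": 404,
    "statusCodeType": "4xx",
    "userAgent": "Chrome Mobile",
    "userAgentType": "mobile"
}
```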

How many?

Big projects on redirection.io log millions of requests per hour. On average, once we aggregate these logs, a single project can produce, every hour, up to:

  • 100 different user agents;
  • 50 different status codes.

For a single project:

One hour: 100 × 50 = 5,000 statistics.
One year: 5,000 × 24h × 31d × 12m ≈ 45 million statistics.

For 1,000 projects:

One hour: 100 × 50 × 1,000 = 5 million statistics.
One year: 5,000 × 24h × 31d × 12m × 1,000 ≈ 45 billion statistics.

For the sake of simplicity, we will benchmark a single project over one year, so “only” 45 million entries.
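A quick shell check of the arithmetic above (24h × 31d × 12m rounds a year up to 372 days, as in the figures above):

```shell
# statistics generated by a single project
one_hour=$((100 * 50))                  # 100 user agents × 50 status codes
one_year=$((one_hour * 24 * 31 * 12))
echo "$one_hour per hour, $one_year per year"   # 5000 per hour, 44640000 per year
```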

Environment 👨🏻‍💻

The entire benchmark was performed under the following conditions:

  • Mac OS X (Mojave 10.14) – 16 GB RAM;
  • Docker (with Dinghy);
  • PHP 7.3;
  • InfluxDB 1.7.5 (Chronograf 1.7.10);
  • Elasticsearch 7.0.0 (Kibana 7.0.0).

InfluxDB

Dockerization 🐳

Let’s create the network, volume and containers needed for this benchmark.
Chronograf will be accessible on: localhost:18888.

docker network create bench_influxdb && \
docker run -p 18086:8086 -v $PWD/influxdb:/var/lib/influxdb --name bench_influxdb -d -e INFLUXDB_DB=rio --net=bench_influxdb influxdb && \
docker run -p 18888:8888 --name bench_chronograf -d --net=bench_influxdb chronograf --influxdb-url=http://bench_influxdb:8086

Data ingest performance

Using influxdb-php, we add the equivalent of one year of statistics for a project. The insertion is done in batches of 25,000 points, the optimal size found after several tests.

$database = InfluxDB\Client::fromDSN('influxdb://localhost:18086/rio');

$points = [];
foreach (generateFixtures() as $fixtures) {
    foreach ($fixtures as $point) {
        $points[] = new InfluxDB\Point(
            'statistic',
            $point['value'],
            $point['tags'],
            [],
            $point['date']->getTimestamp()
        );
    }

    if (count($points) > 25000) {
        $database->writePoints($points, InfluxDB\Database::PRECISION_SECONDS);
        $points = [];
    }
}

if ($points) {
    $database->writePoints($points, InfluxDB\Database::PRECISION_SECONDS);
}

Result: 15m 20s for 45 million entries.
Performance: 50,000 pts/s.
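For reference, points like these can also be written directly through InfluxDB’s 1.x HTTP API using the line protocol; the tag and field values below are made up for illustration:

```shell
# write one point to the "rio" database, with a second-precision timestamp
curl -i -XPOST 'http://localhost:18086/write?db=rio&precision=s' --data-binary \
  'statistic,project=7f9c2ba4-e88f-4aea-8393-000000000000,statusCodeType=4xx,userAgentType=mobile value=42 1554112800'
```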

On-disk storage

du -sh ./influxdb/data/rio

On-disk storage requirement: 356 MB.

So we need 356 MB disk space for a single project, over a year, storing 45 million points (without data replication).
For 1,000 projects: ~356 GB.

Mean query response time

To measure the mean query response time, we execute a “complex” query several times, without cache, such as:

Get me the list of statistics per hour over the last 7 days, aggregated by status code type.

$database = InfluxDB\Client::fromDSN('influxdb://localhost:18086/rio');

$hours = generateHours(100);
$start = microtime(true);

foreach ($hours as $hour) {
    $from = $hour->modify("-7 day")->format(DateTime::RFC3339);
    $to = $hour->format(DateTime::RFC3339);

    $result = $database->query("SELECT count(*) FROM statistic WHERE time > '$from' AND time <= '$to' GROUP BY time(1h),statusCodeType");
}

$end = microtime(true);
$executionTime = getTime($end - $start);

echo "Execution time: $executionTime\r\n";

Result: 2m 35s for 100 queries.
Mean query response time (without cache): ~1.5s.
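The same query can also be issued directly against the /query endpoint of the HTTP API, for instance:

```shell
curl -G 'http://localhost:18086/query?db=rio' --data-urlencode \
  "q=SELECT count(*) FROM statistic WHERE time > now() - 7d GROUP BY time(1h),statusCodeType"
```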

Elasticsearch

Dockerization 🐳

Let’s create the network, volume and containers needed for this benchmark.
Kibana will be accessible on: localhost:15601.

docker network create bench_elasticsearch && \
docker run -p 19200:9200 -v $PWD/elasticsearch:/usr/share/elasticsearch/data --name bench_elasticsearch -d -e "discovery.type=single-node" --net bench_elasticsearch elasticsearch:7.0.0 && \
docker run -p 15601:5601 --name bench_kibana -d -e ELASTICSEARCH_HOSTS=http://bench_elasticsearch:9200 --net bench_elasticsearch kibana:7.0.0

Fine tuning for indexing speed as a time series data store

Because Elasticsearch isn’t a TSDB by default, we will make some optimizations by following the recommendations found in two articles from the Elasticsearch documentation, “Tune for indexing speed” and “Tune for disk usage”.

Let’s begin with the settings.

$settings = [
    'number_of_shards' => 5, // after several tests, 5 shards is in our case a good compromise between data ingest performance and disk usage
    'number_of_replicas' => 0, // we won’t need replication in this benchmark
    'refresh_interval' => -1, // we unset the refresh interval, it will speed up indexing but will require to do the refresh manually
    'codec' => 'best_compression', // we switch the default compression codec for a better (but slower) compression
];

Then, the most important part: let’s define the mapping.

$mappings = [
    '_source' => ['enabled' => false], // as we only use aggregations and want to save disk space, we can safely disable _source. But be careful, it also prevents seeing JSON documents and reindexing to a new mapping.
    'dynamic' => false, // we disable dynamic mapping
    'properties' => [
        '@timestamp' => [
            'type' => 'date',
        ],
        'count' => [
            'type' => 'integer',
            'index' => false, // we disable indexation on this field
            'doc_values' => false, // we disable doc_values (we don't sort or aggregate on this field)
        ],
        'project' => [
            'type' => 'keyword', // as we only do exact matching, we prefer keyword type rather than text type
            'norms' => false, // we disable norms to save disk space (we don't use scoring)
        ],
        'statusCode' => [
            'type' => 'short', // we prefer the use of small data types to save disk space
        ],
        'statusCodeType' => [
            'type' => 'keyword',
            'norms' => false,
        ],
        'userAgent' => [
            'type' => 'keyword',
            'norms' => false,
        ],
        'userAgentType' => [
            'type' => 'short',
        ],
    ],
];
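For reference, the equivalent index creation without Elastica is a single REST call; this sketch simply mirrors the settings and mappings above:

```shell
curl -XPUT 'http://localhost:19200/statistic' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 0,
    "refresh_interval": -1,
    "codec": "best_compression"
  },
  "mappings": {
    "_source": {"enabled": false},
    "dynamic": false,
    "properties": {
      "@timestamp": {"type": "date"},
      "count": {"type": "integer", "index": false, "doc_values": false},
      "project": {"type": "keyword", "norms": false},
      "statusCode": {"type": "short"},
      "statusCodeType": {"type": "keyword", "norms": false},
      "userAgent": {"type": "keyword", "norms": false},
      "userAgentType": {"type": "short"}
    }
  }
}'
```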

Data ingest performance

Using elastica, we add the equivalent of one year of statistics for a project. The insertion is done in bulk operations of 15,000 documents, the optimal size found after several tests.

$elasticaClient = new Elastica\Client(['host' => 'localhost', 'port' => 19200]);

$index = $elasticaClient->getIndex('statistic');
$index->create(['settings' => $settings, 'mappings' => $mappings], true);

$docs = [];
foreach (generateFixtures() as $fixtures) {
    foreach ($fixtures as $doc) {
        $docs[] = new Elastica\Document('', [
            '@timestamp' => $doc['date']->format(DateTime::RFC3339),
            'count' => $doc['value'],
            'project' => $doc['tags']['project'],
            'statusCode' => $doc['tags']['statusCode'],
            'statusCodeType' => $doc['tags']['statusCodeType'],
            'userAgent' => $doc['tags']['userAgent'],
            'userAgentType' => $doc['tags']['userAgentType'],
        ]);
    }

    if (count($docs) > 15000) {
        $index->addDocuments($docs);
        $docs = [];
    }
}

if ($docs) {
    $index->addDocuments($docs);
}

$index->refresh(); // don't forget to manually refresh the index

Result: 1h 24m 8s for 45 million entries.
Performance: 8,800 docs/s.

On-disk storage

curl -s "http://dev.test:19200/_cat/indices?v" | grep statistic

Response: green open statistic s9o3maE-QNKkmuolF14fQA 5 0 44640000 0 1.6gb 1.6gb.

On-disk storage requirement: 1.6 GB.

So we need 1.6 GB disk space for a single project, over a year, storing 45 million documents (without data replication).
For 1,000 projects: ~1.6 TB.

Mean query response time

To measure the mean query response time, we execute a “complex” query several times, without cache, such as:

Get me the list of statistics per hour over the last 7 days, aggregated by status code type.

First, clear the cache: curl -s -XPOST "http://dev.test:19200/statistic/_cache/clear".

$elasticaClient = new Elastica\Client(['host' => 'localhost', 'port' => 19200]);

$search = new Elastica\Search($elasticaClient);

$index = $elasticaClient->getIndex('statistic');
$search->addIndex($index);

$hours = generateHours(100);
$queryTime = 0;

foreach ($hours as $hour) {
    $from = $hour->modify("-7 day")->format(DateTime::RFC3339);
    $to = $hour->format(DateTime::RFC3339);

    $query = new Elastica\Query([
        'size' => 0,
        'track_total_hits' => true,
        'query' => [
            'bool' => [
                'filter' => [
                    [
                        'range' => [
                            '@timestamp' => [
                                'gte' => $from,
                            ],
                        ],
                    ],
                    [
                        'range' => [
                            '@timestamp' => [
                                'lt' => $to,
                            ],
                        ],
                    ],
                ],
            ],
        ],
        'aggs' => [
            'data' => [
                'date_histogram' => [
                    'field' => '@timestamp',
                    'interval' => 'hour',
                    'order' => ['_key' => 'asc'],
                ],
                'aggs' => [
                    'data' => [
                        'terms' => [
                            'field' => 'statusCodeType',
                            'order' => ['_key' => 'asc'],
                        ],
                    ],
                ],
            ],
        ],
    ]);

    $search->setQuery($query);
    $result = $search->search();

    $queryTime += $result->getResponse()->getQueryTime();
}

echo "Execution time: $queryTime\r\n";

Result (without cache): 12.6s for 100 queries.
Mean query response time (without cache): ~130ms.
Result (with cache, which is enabled by default): 1.5s for 100 queries.
Mean query response time (with cache): ~15ms.

Final words 🙌

InfluxDB:

  • Ingest performance: 50,000 pts/s;
  • Mean query response time: ~1.5s without cache; caching has to be handled by the application;
  • On-disk storage requirements: 356 MB/project.

Elasticsearch:

  • Ingest performance: 8,800 docs/s;
  • Mean query response time: ~130ms without cache, ~15ms with cache;
  • On-disk storage requirements: 1.6 GB/project.

InfluxDB offers much better data ingestion performance (5.6x faster than Elasticsearch) for less disk space. That being said, with our dataset containing several indexed text tags, we quickly end up with a (very) large number of unique series (also known as high cardinality), which cannot be handled easily without clustering. Therefore, complex queries take time.

Elasticsearch, on the other hand, has a much better response time for complex queries, but requires more disk space and is slower at indexing. Note that before the compression is complete and the Lucene engine does its merge job on segments of each shard, the index can reach 5 to 6 GB.

Both solutions scale, but Elasticsearch does it for free, while the open-source, single-node edition of InfluxDB is limited.

So we will stick with Elasticsearch:

  • the “slowness” of data ingestion is not a problem in our use case;
  • InfluxDB clustering is only available in the commercial edition;
  • the on-disk storage requirement is reasonable;
  • we already have Elasticsearch in our stack for logs.

Still, I found InfluxDB really interesting, and combined with Grafana it remains a serious alternative to the ELK stack for storing metrics and time series data.


The code used for this benchmark is available on GitHub.
