Kelson Martins Blog

Summary

Anyone who manages Elasticsearch clusters must be aware of the cluster's health at all times, whether through monitoring tools such as ElasticsearchHQ or X-Pack, or through simple REST API call scripts.
Monitoring is extremely important, but there is one thing that may be just as crucial: knowing the cluster's capacity and limits.
Knowing these metrics allows the cluster administrator to evaluate capacity against usage, reducing the chances of slow indexing periods or even incidents.
That said, this post will provide some basic steps for using Rally to perform Elasticsearch benchmark tests against an existing cluster, so we can extract key metrics such as indexing time, latency, throughput, and error rate.
These metrics will be key when deciding whether to scale nodes, reduce logging, and so on.
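For reference, the simple REST API call scripts mentioned above can be very small. Here is a minimal sketch; `summarize_health` is a hypothetical helper, and the sample response body is hardcoded (a real script would fetch it first, e.g. with `curl -s http://localhost:9200/_cluster/health`):

```python
import json

def summarize_health(health: dict) -> str:
    """One-line summary of the _cluster/health response body."""
    return (f"status={health['status']} "
            f"nodes={health['number_of_nodes']} "
            f"unassigned_shards={health['unassigned_shards']}")

# Hardcoded sample response; a real script would fetch it first, e.g.
#   curl -s http://localhost:9200/_cluster/health
sample = json.loads('{"cluster_name": "demo", "status": "yellow",'
                    ' "number_of_nodes": 1, "unassigned_shards": 5}')
print(summarize_health(sample))  # status=yellow nodes=1 unassigned_shards=5
```

A script like this, run from cron and alerting on anything other than a green status, is often the first monitoring a small cluster gets.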

What is Rally and how to Install it?

According to the official documentation, Rally is an Elasticsearch Benchmark tool that can help us by performing the following tasks:
  • Setup and teardown of an Elasticsearch cluster for benchmarking
  • Management of benchmark data and specifications even across Elasticsearch versions
  • Running benchmarks and recording results
  • Finding performance problems by attaching so-called telemetry devices
  • Comparing performance results
For now, the feature that we will focus on is to perform benchmarking on a remote cluster, so prior to performing the steps that will follow, it is expected that you have access to an Elasticsearch cluster.
To install Rally, you can perform the following on a Debian/Ubuntu based system:
Install Python 3.4 or higher, along with pip and the development headers.
sudo apt-get install gcc python3-pip python3-dev

Install Git (version 1.9 or higher is required).

sudo apt-get install git # install git

Finally, install Rally.

pip3 install esrally
If you are using a system based on a distribution other than Debian/Ubuntu, refer to the official installation guide here.

Configuring Rally

With Rally installed, a one-time configuration is required.
To configure Rally, perform the following on your terminal:
esrally
Once executed, you will see an output similar to:
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

Running simple configuration. Run the advanced configuration with:

  esrally configure --advanced-config


WARNING: Will overwrite existing config file at [/home/kelson/.rally/rally.ini]

* Autodetecting available third-party software
  git    : [OK]
  JDK    : [MISSING] (You cannot benchmark Elasticsearch on this machine without a JDK.)

* Setting up benchmark data directory in /home/kelson/.rally/benchmarks
Enter the JDK 10 root directory (Press Enter to skip):
Note that Rally auto-detected our git installation but failed to detect a JDK. This happened because we did not, in fact, install one.
Is that a problem? Not for us. Our goal is to benchmark a remote cluster, so Rally does not have to build any Elasticsearch node from source.
You can then press “Enter” a couple of times to accept the default Rally configuration until you are presented with the following:
Configuration successfully written to /home/kelson/.rally/rally.ini. Happy benchmarking!

More info about Rally:

* Type esrally --help
* Read the documentation at https://esrally.readthedocs.io/en/0.11.0/
* Ask a question on the forum at https://discuss.elastic.co/c/elasticsearch/rally

We are now ready to race!

Choose your Destiny (track)

With Rally configured, we are now able to start our benchmarking race (the execution of a benchmarking experiment), but first, we must choose a track.
Tracks in Rally are simply the different benchmarking scenarios you can choose from, and you can list the default tracks with the following:
esrally list tracks
The expected output is something similar to:
Name        Description                                                                                                                                                                          Documents  Compressed Size    Uncompressed Size    Default Challenge        All Challenges
----------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  -----------  -----------------  -------------------  -----------------------  --------------------------------------------------------------------------------------------------------
so          Indexing benchmark using up to questions and answers from StackOverflow                                                                                                               36062278  8.9 GB             33.1 GB              append-no-conflicts      append-no-conflicts
geopoint    Point coordinates from PlanetOSM                                                                                                                                                      60844404  481.9 MB           2.3 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
eventdata   This benchmark indexes HTTP access logs generated based sample logs from the elastic.co website using the generator available in https://github.com/elastic/rally-eventdata-track     20000000  755.1 MB           15.3 GB              append-no-conflicts      append-no-conflicts
geonames    POIs from Geonames                                                                                                                                                                    11396505  252.4 MB           3.3 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
noaa        Global daily weather measurements from NOAA                                                                                                                                           33659481  947.3 MB           9.0 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only
nested      StackOverflow Q&A stored as nested docs                                                                                                                                               11203029  663.1 MB           3.4 GB               nested-search-challenge  nested-search-challenge,index-only
pmc         Full text benchmark with academic papers from PMC                                                                                                                                       574199  5.5 GB             21.7 GB              append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
http_logs   HTTP server log data                                                                                                                                                                 247249096  1.2 GB             31.1 GB              append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts
percolator  Percolator benchmark based on AOL queries                                                                                                                                              2000000  102.7 kB           104.9 MB             append-no-conflicts      append-no-conflicts
nyc_taxis   Taxi rides in New York in 2015                                                                                                                                                       165346692  4.5 GB             74.3 GB              append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts-index-only
Note that each track has different properties (Documents, Size, …) and each provides a specific set of documents to be used in your benchmark.
For our scenario, we want to benchmark with the largest possible number of documents, and the http_logs track seems ideal, as it will index 247249096 HTTP server log events into the cluster during the benchmark.
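Picking the track with the most documents is easy to verify with a quick script; the counts below are copied from the "esrally list tracks" table above:

```python
# Document counts taken from the "esrally list tracks" output above.
tracks = {
    "so": 36062278, "geopoint": 60844404, "eventdata": 20000000,
    "geonames": 11396505, "noaa": 33659481, "nested": 11203029,
    "pmc": 574199, "http_logs": 247249096, "percolator": 2000000,
    "nyc_taxis": 165346692,
}
# Pick the track with the largest document count.
largest = max(tracks, key=tracks.get)
print(largest)  # http_logs
```

Of course, document count is not the only criterion: the nyc_taxis track has fewer documents but a much larger uncompressed size, so "heaviest" depends on what you want to stress.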

Let’s Race

With our track chosen, it is time to start our benchmarking. This can be done through the following command:
esrally --track=http_logs --target-hosts=ELASTICSEARCH_IP:ELASTICSEARCH_PORT --pipeline=benchmark-only
Breaking down the command, these are the key parameters:
--track -> the chosen track from the "esrally list tracks" command
--target-hosts -> the remote Elasticsearch cluster that we will benchmark, given as host:port (so no separate port flag is needed)
--pipeline -> benchmark-only means that we are benchmarking an existing Elasticsearch cluster rather than letting Rally provision one. List the other options with "esrally list pipelines"
Once running, it may take a good while for the benchmark to complete, but once it does, your results will be displayed on stdout.
|   Lap |                          Metric |         Task |       Value |    Unit |
|------:|--------------------------------:|-------------:|------------:|--------:|
|   All |                   Indexing time |              |     250.609 |     min |
|   All |          Indexing throttle time |              |           0 |     min |
|   All |                      Merge time |              |     294.864 |     min |
|   All |                    Refresh time |              |     53.2047 |     min |
|   All |                      Flush time |              |      1.5581 |     min |
|   All |             Merge throttle time |              |     170.567 |     min |
|   All |              Total Young Gen GC |              |      57.939 |       s |
|   All |                Total Old Gen GC |              |       5.145 |       s |
|   All |                      Store size |              |     46.5778 |      GB |
|   All |                   Translog size |              |  0.00478806 |      GB |
|   All |          Heap used for segments |              |     104.422 |      MB |
|   All |        Heap used for doc values |              |    0.991104 |      MB |
|   All |             Heap used for terms |              |     85.7431 |      MB |
|   All |             Heap used for norms |              |    0.791382 |      MB |
|   All |            Heap used for points |              |     4.73582 |      MB |
|   All |     Heap used for stored fields |              |     12.1608 |      MB |
|   All |                   Segment count |              |        2294 |         |
|   All |                  Min Throughput | index-append |     5946.26 |  docs/s |
|   All |               Median Throughput | index-append |     6121.22 |  docs/s |
|   All |                  Max Throughput | index-append |     7214.93 |  docs/s |
|   All |         50th percentile latency | index-append |     5700.31 |      ms |
|   All |         90th percentile latency | index-append |     9638.46 |      ms |
|   All |         99th percentile latency | index-append |     15103.5 |      ms |
|   All |       99.9th percentile latency | index-append |     63454.2 |      ms |
|   All |      99.99th percentile latency | index-append |      133296 |      ms |
|   All |        100th percentile latency | index-append |      189127 |      ms |
|   All |    50th percentile service time | index-append |     5700.31 |      ms |
|   All |    90th percentile service time | index-append |     9638.46 |      ms |
|   All |    99th percentile service time | index-append |     15103.5 |      ms |
|   All |  99.9th percentile service time | index-append |     63454.2 |      ms |
|   All | 99.99th percentile service time | index-append |      133296 |      ms |
|   All |   100th percentile service time | index-append |      189127 |      ms |
|   All |                      error rate | index-append |        0.07 |       % |
|   All |                  Min Throughput |      default |        1.44 |   ops/s |
|   All |               Median Throughput |      default |        1.45 |   ops/s |
|   All |                  Max Throughput |      default |        1.46 |   ops/s |
|   All |         50th percentile latency |      default |      310299 |      ms |
|   All |         90th percentile latency |      default |      331423 |      ms |
|   All |         99th percentile latency |      default |      336346 |      ms |
|   All |        100th percentile latency |      default |      336869 |      ms |
|   All |    50th percentile service time |      default |     646.843 |      ms |
|   All |    90th percentile service time |      default |     658.976 |      ms |
|   All |    99th percentile service time |      default |     939.048 |      ms |
|   All |   100th percentile service time |      default |     940.766 |      ms |
|   All |                      error rate |      default |           0 |       % |
|   All |                  Min Throughput |         term |        1.68 |   ops/s |
|   All |               Median Throughput |         term |        1.68 |   ops/s |
|   All |                  Max Throughput |         term |        1.69 |   ops/s |
|   All |         50th percentile latency |         term |      316159 |      ms |
|   All |         90th percentile latency |         term |      338408 |      ms |
|   All |         99th percentile latency |         term |      343543 |      ms |
|   All |        100th percentile latency |         term |      344095 |      ms |
|   All |    50th percentile service time |         term |     569.796 |      ms |
|   All |    90th percentile service time |         term |     576.895 |      ms |
|   All |    99th percentile service time |         term |     800.579 |      ms |
|   All |   100th percentile service time |         term |     1010.78 |      ms |
|   All |                      error rate |         term |           0 |       % |
|   All |                  Min Throughput |        range |        1.02 |   ops/s |
|   All |               Median Throughput |        range |        1.03 |   ops/s |
|   All |                  Max Throughput |        range |        1.03 |   ops/s |
|   All |         50th percentile latency |        range |     46187.2 |      ms |
|   All |         90th percentile latency |        range |     58905.8 |      ms |
|   All |         99th percentile latency |        range |     61606.7 |      ms |
|   All |        100th percentile latency |        range |     61869.4 |      ms |
|   All |    50th percentile service time |        range |     938.223 |      ms |
|   All |    90th percentile service time |        range |     975.216 |      ms |
|   All |    99th percentile service time |        range |     1681.98 |      ms |
|   All |   100th percentile service time |        range |     1878.68 |      ms |
|   All |                      error rate |        range |           0 |       % |
|   All |                  Min Throughput |   hourly_agg |         0.2 |   ops/s |
|   All |               Median Throughput |   hourly_agg |         0.2 |   ops/s |
|   All |                  Max Throughput |   hourly_agg |         0.2 |   ops/s |
|   All |         50th percentile latency |   hourly_agg |     4860.77 |      ms |
|   All |         90th percentile latency |   hourly_agg |     9422.83 |      ms |
|   All |         99th percentile latency |   hourly_agg |     10390.7 |      ms |
|   All |        100th percentile latency |   hourly_agg |     10609.1 |      ms |
|   All |    50th percentile service time |   hourly_agg |        4662 |      ms |
|   All |    90th percentile service time |   hourly_agg |     5086.03 |      ms |
|   All |    99th percentile service time |   hourly_agg |     7019.26 |      ms |
|   All |   100th percentile service time |   hourly_agg |     9548.57 |      ms |
|   All |                      error rate |   hourly_agg |           0 |       % |
|   All |                  Min Throughput |       scroll |        0.39 | pages/s |
|   All |               Median Throughput |       scroll |         0.4 | pages/s |
|   All |                  Max Throughput |       scroll |        0.41 | pages/s |
|   All |         50th percentile latency |       scroll | 1.21201e+07 |      ms |
|   All |         90th percentile latency |       scroll | 1.67727e+07 |      ms |
|   All |         99th percentile latency |       scroll | 1.77756e+07 |      ms |
|   All |        100th percentile latency |       scroll | 1.78884e+07 |      ms |
|   All |    50th percentile service time |       scroll |     58627.6 |      ms |
|   All |    90th percentile service time |       scroll |     73354.6 |      ms |
|   All |    99th percentile service time |       scroll |     86941.1 |      ms |
|   All |   100th percentile service time |       scroll |     87455.5 |      ms |
|   All |                      error rate |       scroll |           0 |       % |


-----------------------------------
[INFO] SUCCESS (took 62851 seconds)
-----------------------------------
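If you want to post-process these results (for example, to track them over time), the pipe-separated summary table on stdout is easy to parse. A minimal sketch; `parse_rally_row` is a hypothetical helper, not part of Rally, and the row format is taken from the output above:

```python
def parse_rally_row(line: str) -> dict:
    """Split one pipe-separated row of Rally's summary table into fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    lap, metric, task, value, unit = cells
    return {"lap": lap, "metric": metric, "task": task,
            "value": value, "unit": unit}

# One row copied verbatim from the report above.
row = parse_rally_row(
    "|   All |               Median Throughput | index-append |     6121.22 |  docs/s |")
print(row["metric"], row["value"], row["unit"])  # Median Throughput 6121.22 docs/s
```

Note that Rally can also write reports to a file for you (see `esrally --help`), so parsing stdout is only one option.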

Understanding the Race Results

Based on our chosen track, http_logs, containing 247249096 documents, Rally's summary report displays how the cluster behaved under that load.
Let’s now take a look at some of the key metrics and what they mean:
— Indexing Time: how long the cluster took to index all 247249096 documents.
|   Lap |                          Metric |         Task |       Value |    Unit |
|   All |                   Indexing time |              |     250.609 |     min |
— Indexing throttle time: total time that indexing has been throttled, as reported by the indices stats API (the lower, the better).
|   Lap |                          Metric |         Task |       Value |    Unit |
|   All |          Indexing throttle time |              |           0 |     min |
— Throughput: a measure of how many documents per second (docs/s) the cluster was able to index.
|   Lap |                          Metric |         Task |       Value |    Unit |
|   All |                  Min Throughput | index-append |     5946.26 |  docs/s |
|   All |               Median Throughput | index-append |     6121.22 |  docs/s |
|   All |                  Max Throughput | index-append |     7214.93 |  docs/s |
— Error rate: error response codes or exceptions thrown by the Elasticsearch client. Ideally this is 0%; if it is higher, inspect the Elasticsearch logs to understand why Elasticsearch failed to index some documents.
|   Lap |                          Metric |         Task |       Value |    Unit |
|   All |                      error rate | index-append |        0.07 |       % |
|   All |                      error rate |      default |           0 |       % |
|   All |                      error rate |         term |           0 |       % |
|   All |                      error rate |        range |           0 |       % |
|   All |                      error rate |   hourly_agg |           0 |       % |
|   All |                      error rate |       scroll |           0 |       % |
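As a quick sanity check, we can relate the median index-append throughput to the track size: at roughly 6121 docs/s, pure indexing of the 247249096 http_logs documents takes on the order of 11 hours, which fits within the 62851-second (about 17.5-hour) total race time reported above, since the race also runs the query challenges. A back-of-the-envelope sketch:

```python
docs = 247_249_096           # documents in the http_logs track
median_docs_per_s = 6121.22  # median index-append throughput from the report
seconds = docs / median_docs_per_s
hours = seconds / 3600
print(f"{hours:.1f} h")      # ≈ 11.2 h of indexing at the median rate
```

Rough arithmetic like this is a useful cross-check that the reported throughput and total run time are telling a consistent story.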
This breakdown covers just some of the metrics reported by Rally, but it should be enough to give you a better understanding of how well your cluster behaves under heavy indexing.
Getting familiar with the other metrics is strongly recommended, and you can find more details on them here.

Conclusion

This post presented a quick introduction to Rally, moving from installation to benchmarking a remote Elasticsearch cluster in just a few steps.
The results the tool provides will greatly improve your understanding of your cluster, which is crucial for an Elasticsearch administrator to make efficient decisions based on cluster usage.
Many other features were not covered in this introductory article, but this was hopefully enough to get you going and allow further experimentation.
Apart from the default tracks, Rally also provides features such as custom tracks and tournaments (comparisons of benchmark results between races), but these will be topics for future posts. Stay tuned!

Software engineer, geek, traveler, wannabe athlete and a lifelong learner. Works at @IBM
