Metrics Aggregations

The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output a single numeric metric (e.g. avg) and are called single-value numeric metrics aggregations; others generate multiple metrics (e.g. stats) and are called multi-value numeric metrics aggregations. The distinction between single-value and multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket).
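For example (a hypothetical sketch, assuming documents with a numeric grade field and a keyword class field), a terms bucket aggregation can order its buckets by the result of a single-value metric sub-aggregation:

{
    "aggs" : {
        "classes" : {
            "terms" : {
                "field" : "class",
                "order" : { "avg_grade" : "desc" }
            },
            "aggs" : {
                "avg_grade" : { "avg" : { "field" : "grade" } }
            }
        }
    }
}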

Avg Aggregation

A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Assuming the data consists of documents representing exam grades (between 0 and 100) of students:

{
    "aggs" : {
        "avg_grade" : { "avg" : { "field" : "grade" } }
    }
}

The above aggregation computes the average grade over all documents. The aggregation type is avg and the field setting defines the numeric field of the documents the average will be computed on. The above will return the following:

{
    ...

    "aggregations": {
        "avg_grade": {
            "value": 75
        }
    }
}

The name of the aggregation (avg_grade above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

Computing the average grade based on a script:

{
    ...,

    "aggs" : {
        "avg_grade" : {
            "avg" : {
                "script" : {
                    "inline" : "doc['grade'].value",
                    "lang" : "painless"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    ...,

    "aggs" : {
        "avg_grade" : {
            "avg" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "field": "grade"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Value Script

It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new average:

{
    "aggs" : {
        ...

        "aggs" : {
            "avg_corrected_grade" : {
                "avg" : {
                    "field" : "grade",
                    "script" : {
                        "lang": "painless",
                        "inline": "_value * params.correction",
                        "params" : {
                            "correction" : 1.2
                        }
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grade_avg" : {
            "avg" : {
                "field" : "grade",
                "missing": 10 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Cardinality Aggregation

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

Assume you are indexing books and would like to count the unique authors that match a query:

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}

Precision control

This aggregation also supports the precision_threshold option:

Warning

The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future.

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100 
            }
        }
    }
}

The precision_threshold option lets you trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.

Counts are approximate

Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.

This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

  • configurable precision, which decides on how to trade memory for accuracy,
  • excellent accuracy on low-cardinality sets,
  • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
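For example, the default threshold of 3000 translates to roughly 24KB of memory per cardinality computation, and the maximum threshold of 40000 to roughly 320KB.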

The following chart shows how the error varies before and after the threshold:

[Figure images/cardinality_error.png: relative error vs. number of unique values, for three precision thresholds]

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains very low, even when counting millions of items.

Pre-computed hashes

On string fields that have a high cardinality, it might be faster to store the hash of your field values in your index and then run the cardinality aggregation on this field. This can either be done by providing hash values from the client side or by letting Elasticsearch compute hash values for you by using the mapper-murmur3 plugin.
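For example (a minimal sketch, assuming the mapper-murmur3 plugin is installed; the books index and author field are hypothetical), the hash can be stored as a multi-field and the aggregation run against it:

PUT books
{
    "mappings": {
        "book": {
            "properties": {
                "author": {
                    "type": "keyword",
                    "fields": {
                        "hash": {
                            "type": "murmur3"
                        }
                    }
                }
            }
        }
    }
}

With this mapping, the cardinality aggregation would target author.hash instead of author.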

Note

Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory. However, on numeric fields, hashing is very fast and storing the original values requires as much or less memory than storing the hashes. This is also true on low-cardinality string fields, especially given that those have an optimization in order to make sure that hashes are computed at most once per unique value per segment.

Script

The cardinality metric supports scripting, with a noticeable performance hit however since hashes need to be computed on the fly.

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script": {
                    "lang": "painless",
                    "inline": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "first_name_field": "author.first_name",
                        "last_name_field": "author.last_name"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "tag_cardinality" : {
            "cardinality" : {
                "field" : "tag",
                "missing": "N/A" 
            }
        }
    }
}

Documents without a value in the tag field will fall into the same bucket as documents that have the value N/A.

Extended Stats Aggregation

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

The extended_stats aggregation is an extended version of the stats aggregation, where additional metrics are added such as sum_of_squares, variance, std_deviation and std_deviation_bounds.

Assuming the data consists of documents representing exam grades (between 0 and 100) of students:

{
    "aggs" : {
        "grades_stats" : { "extended_stats" : { "field" : "grade" } }
    }
}

The above aggregation computes the grades statistics over all documents. The aggregation type is extended_stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:

{
    ...

    "aggregations": {
        "grade_stats": {
           "count": 9,
           "min": 72,
           "max": 99,
           "avg": 86,
           "sum": 774,
           "sum_of_squares": 67028,
           "variance": 51.55555555555556,
           "std_deviation": 7.180219742846005,
           "std_deviation_bounds": {
            "upper": 100.36043948569201,
            "lower": 71.63956051430799
           }
        }
    }
}

The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Standard Deviation Bounds

By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request:

{
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "sigma" : 3 
            }
        }
    }
}

sigma controls how many standard deviations +/- from the mean should be displayed

sigma can be any non-negative double, meaning you can request non-integer values such as 1.5. A value of 0 is valid, but will simply return the average for both upper and lower bounds.
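As a worked example using the response above: with an avg of 86 and a std_deviation of about 7.18, the default sigma of 2 gives an upper bound of 86 + 2 * 7.18 ≈ 100.36 and a lower bound of 86 - 2 * 7.18 ≈ 71.64, which matches the std_deviation_bounds shown earlier.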

Note

Standard Deviation and Bounds require normality

The standard deviation and its bounds are displayed by default, but they are not always applicable to all datasets. Your data must be normally distributed for these metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is heavily skewed left or right, the values returned will be misleading.

Script

Computing the grades stats based on a script:

{
    ...,

    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "script" : {
                    "inline" : "doc['grade'].value",
                    "lang" : "painless"
                 }
             }
         }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    ...,

    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "field": "grade"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Value Script

It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:

{
    "aggs" : {
        ...

        "aggs" : {
            "grades_stats" : {
                "extended_stats" : {
                    "field" : "grade",
                    "script" : {
                        "lang" : "painless",
                        "inline": "_value * params.correction",
                        "params" : {
                            "correction" : 1.2
                        }
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "missing": 0 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.

Geo Bounds Aggregation

A metric aggregation that computes the bounding box containing all geo_point values for a field.

Example:

{
    "query" : {
        "match" : { "business_type" : "shop" }
    },
    "aggs" : {
        "viewport" : {
            "geo_bounds" : {
                "field" : "location", 
                "wrap_longitude" : true 
            }
        }
    }
}

The geo_bounds aggregation specifies the field to use to obtain the bounds

wrap_longitude is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true
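Note that when the computed box crosses the international date line (with wrap_longitude set to true), the returned top_left longitude will be greater than the bottom_right longitude, indicating a box that wraps around the date line.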

The above aggregation demonstrates how one would compute the bounding box of the location field for all documents with a business type of shop

The response for the above aggregation:

{
    ...

    "aggregations": {
        "viewport": {
            "bounds": {
                "top_left": {
                    "lat": 80.45,
                    "lon": -160.22
                },
                "bottom_right": {
                    "lat": 40.65,
                    "lon": 42.57
                }
            }
        }
    }
}

Geo Centroid Aggregation

A metric aggregation that computes the weighted centroid from all coordinate values for a Geo-point datatype field.

Example:

{
    "query" : {
        "match" : { "crime" : "burglary" }
    },
    "aggs" : {
        "centroid" : {
            "geo_centroid" : {
                "field" : "location" 
            }
        }
    }
}

The geo_centroid aggregation specifies the field to use for computing the centroid. (NOTE: the field must be of the geo_point datatype.)

The above aggregation demonstrates how one would compute the centroid of the location field for all documents with a crime type of burglary

The response for the above aggregation:

{
    ...

    "aggregations": {
        "centroid": {
            "location": {
                "lat": 80.45,
                "lon": -160.22
            }
        }
    }
}

The geo_centroid aggregation is more interesting when combined as a sub-aggregation to other bucket aggregations.

Example:

{
    "query" : {
        "match" : { "crime" : "burglary" }
    },
    "aggs" : {
        "towns" : {
            "terms" : { "field" : "town" },
            "aggs" : {
                "centroid" : {
                    "geo_centroid" : { "field" : "location" }
                }
            }
        }
    }
}

The above example uses geo_centroid as a sub-aggregation to a terms bucket aggregation for finding the central location for all crimes of type burglary in each town.

The response for the above aggregation:

{
    ...

    "aggregations": {
        "towns": {
            "buckets": [
               {
                   "key": "Los Altos",
                   "doc_count": 113,
                   "centroid": {
                      "location": {
                         "lat": 37.3924582824111,
                         "lon": -122.12104808539152
                      }
                   }
               },
               {
                   "key": "Mountain View",
                   "doc_count": 92,
                   "centroid": {
                      "location": {
                         "lat": 37.382152481004596,
                         "lon": -122.08116559311748
                      }
                   }
               }
            ]
        }
    }
}

Max Aggregation

A single-value metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Note

The min and max aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on longs whose absolute value is greater than 2^53.
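For example, the long value 2^53 + 1 (9007199254740993) has no exact double representation, so a max over it would be reported as 9007199254740992.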

Computing the max price value across all documents:

{
    "aggs" : {
        "max_price" : { "max" : { "field" : "price" } }
    }
}

Response:

{
    ...

    "aggregations": {
        "max_price": {
            "value": 35
        }
    }
}

As can be seen, the name of the aggregation (max_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

Computing the max price value across all documents, this time using a script:

{
    "aggs" : {
        "max_price" : {
            "max" : {
                "script" : {
                    "inline" : "doc['price'].value",
                    "lang" : "painless"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    "aggs" : {
        "max_price" : {
            "max" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "field": "price"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Value Script

Let’s say that the prices of the documents in our index are in USD, but we would like to compute the max in EUR (and for the sake of this example, let’s say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:

{
    "aggs" : {
        "max_price_in_euros" : {
            "max" : {
                "field" : "price",
                "script" : {
                    "lang": "painless",
                    "inline": "_value * params.conversion_rate",
                    "params" : {
                        "conversion_rate" : 1.2
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grade_max" : {
            "max" : {
                "field" : "grade",
                "missing": 10 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Min Aggregation

A single-value metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Note

The min and max aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on longs whose absolute value is greater than 2^53.

Computing the min price value across all documents:

{
    "aggs" : {
        "min_price" : { "min" : { "field" : "price" } }
    }
}

Response:

{
    ...

    "aggregations": {
        "min_price": {
            "value": 10
        }
    }
}

As can be seen, the name of the aggregation (min_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

Computing the min price value across all documents, this time using a script:

{
    "aggs" : {
        "min_price" : {
            "min" : {
                "script" : {
                    "inline" : "doc['price'].value",
                    "lang" : "painless"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    "aggs" : {
        "min_price" : {
            "min" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "field": "price"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Value Script

Let’s say that the prices of the documents in our index are in USD, but we would like to compute the min in EUR (and for the sake of this example, let’s say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:

{
    "aggs" : {
        "min_price_in_euros" : {
            "min" : {
                "field" : "price",
                "script" :
                    "lang" : "painless",
                    "inline": "_value * params.conversion_rate",
                    "params" : {
                        "conversion_rate" : 1.2
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grade_min" : {
            "min" : {
                "field" : "grade",
                "missing": 10 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Percentiles Aggregation

A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.

Percentiles are often used to find outliers. In normal distributions, the 0.13th and 99.87th percentiles represent three standard deviations from the mean. Any data which falls outside three standard deviations is often considered an anomaly.

When a range of percentiles are retrieved, they can be used to estimate the data distribution and determine if the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median load times are not overly useful to an administrator. The max may be interesting, but it can be easily skewed by a single slow response.

Let’s look at a range of percentiles representing load time:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" 
            }
        }
    }
}

The field load_time must be a numeric field

By default, the percentile metric will generate a range of percentiles: [ 1, 5, 25, 50, 75, 95, 99 ]. The response will look like this:

{
    ...

   "aggregations": {
      "load_time_outlier": {
         "values" : {
            "1.0": 15,
            "5.0": 20,
            "25.0": 23,
            "50.0": 25,
            "75.0": 29,
            "95.0": 60,
            "99.0": 150
         }
      }
   }
}

As you can see, the aggregation will return a calculated value for each percentile in the default range. If we assume response times are in milliseconds, it is immediately obvious that the webpage normally loads in 15-30ms, but occasionally spikes to 60-150ms.

Often, administrators are only interested in outliers: the extreme percentiles. We can specify just the percents we are interested in (requested percentiles must be values between 0 and 100 inclusive):

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] 
            }
        }
    }
}

Use the percents parameter to specify particular percentiles to calculate
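The response then contains only the requested percentiles (the values below are illustrative):

{
    ...

   "aggregations": {
      "load_time_outlier": {
         "values" : {
            "95.0": 60,
            "99.0": 150,
            "99.9": 153
         }
      }
   }
}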

Script

The percentile metric supports scripting. For example, if our load times are in milliseconds but we want percentiles calculated in seconds, we could use a script to convert them on-the-fly:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : {
                    "lang": "painless",
                    "inline": "doc['load_time'].value / params.timeUnit", 
                    "params" : {
                        "timeUnit" : 1000   
                    }
                }
            }
        }
    }
}

The field parameter is replaced with a script parameter, which uses the script to generate values which percentiles are calculated on

Scripting supports parameterized input just like any other script

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : {
                    "file": "my_script",
                    "params" : {
                        "timeUnit" : 1000
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at my_array[count(my_array) * 0.5].

Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, approximate percentiles are calculated.

The algorithm used by the percentile metric is called TDigest (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests).

When using this metric, there are a few guidelines to keep in mind:

  • Accuracy is proportional to q(1-q). This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median
  • For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
  • As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated

The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

[Figure images/percentiles_error.png: relative error of approximate percentiles on a uniform distribution, by number of collected values and requested percentile]

It shows how precision is better for extreme percentiles. The reason why the error diminishes for a large number of values is that the law of large numbers makes the distribution of values more and more uniform, so the t-digest tree can do a better job at summarizing it. This would not be the case for more skewed distributions.

Compression

Warning

The compression parameter is specific to the current internal implementation of percentiles, and may change in the future.

Approximate algorithms must balance memory utilization with estimation accuracy. This balance can be controlled using a compression parameter:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "tdigest": {
                  "compression" : 200 
                }
            }
        }
    }
}

Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles: the more nodes available, the higher the accuracy (and the larger the memory footprint), proportional to the volume of data. The compression parameter limits the maximum number of nodes to 20 * compression.

Therefore, by increasing the compression value, you can increase the accuracy of your percentiles at the cost of more memory. Larger compression values also make the algorithm slower since the underlying tree data structure grows in size, resulting in more expensive operations. The default compression value is 100.

A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount of data which arrives sorted and in-order) the default settings will produce a TDigest roughly 64KB in size. In practice data tends to be more random and the TDigest will use less memory.

HDR Histogram

Warning

This functionality is experimental and may be changed or removed completely in a future release.

HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentiles for latency measurements, as it can be faster than the t-digest implementation, with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).

The HDR Histogram can be used by specifying the hdr object in the request:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9],
                "hdr": { 
                  "number_of_significant_value_digits" : 3 
                }
            }
        }
    }
}

hdr object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object

number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits

The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grade_percentiles" : {
            "percentiles" : {
                "field" : "grade",
                "missing": 10 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Percentile Ranks Aggregation

A multi-value metrics aggregation that calculates one or more percentile ranks over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Note

Please see Percentiles are (usually) approximate and Compression for advice regarding approximation and memory use of the percentile ranks aggregation

Percentile ranks show the percentage of observed values which are below a certain value. For example, if a value is greater than or equal to 95% of the observed values, it is said to be at the 95th percentile rank.

Assume your data consists of website load times. You may have a service agreement that 95% of page loads complete within 15ms and 99% of page loads complete within 30ms.

Let’s look at the percentile ranks of those two thresholds:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentile_ranks" : {
                "field" : "load_time", 
                "values" : [15, 30]
            }
        }
    }
}

The field load_time must be a numeric field

The response will look like this:

{
    ...

   "aggregations": {
      "load_time_outlier": {
         "values" : {
            "15": 92,
            "30": 100
         }
      }
   }
}

From this information you can determine that you are hitting the 99% load time target but not quite hitting the 95% load time target.

Script

The percentile rank metric supports scripting. For example, if our load times are in milliseconds but we want to specify values in seconds, we could use a script to convert them on-the-fly:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentile_ranks" : {
                "values" : [3, 5],
                "script" : {
                    "lang": "painless",
                    "inline": "doc['load_time'].value / params.timeUnit", 
                    "params" : {
                        "timeUnit" : 1000   
                    }
                }
            }
        }
    }
}

The field parameter is replaced with a script parameter, which uses the script to generate values which percentile ranks are calculated on

Scripting supports parameterized input just like any other script

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentile_ranks" : {
                "values" : [3, 5],
                "script" : {
                    "file": "my_script",
                    "params" : {
                        "timeUnit" : 1000
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

HDR Histogram

Warning

This functionality is experimental and may be changed or removed completely in a future release.

HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentile ranks for latency measurements, as it can be faster than the t-digest implementation, with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).

The HDR Histogram can be used by specifying the hdr object in the request:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentile_ranks" : {
                "field" : "load_time",
                "values" : [15, 30],
                "hdr": { 
                  "number_of_significant_value_digits" : 3 
                }
            }
        }
    }
}

hdr object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object

number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits

The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grade_ranks" : {
            "percentile_ranks" : {
                "field" : "grade",
                "missing": 10 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Scripted Metric Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A metric aggregation that executes using scripts to provide a metric output.

Example:

POST ledger/_search?size=0
{
    "query" : {
        "match_all" : {}
    },
    "aggs": {
        "profit": {
            "scripted_metric": {
                "init_script" : "params._agg.transactions = []",
                "map_script" : "params._agg.transactions.add(doc.type.value == 'sale' ? doc.amount.value : -1 * doc.amount.value)", 
                "combine_script" : "double profit = 0; for (t in params._agg.transactions) { profit += t } return profit",
                "reduce_script" : "double profit = 0; for (a in params._aggs) { profit += a } return profit"
            }
        }
    }
}

map_script is the only required parameter

The above aggregation demonstrates how one would use the scripted_metric aggregation to compute the total profit from sale and cost transactions.

The response for the above aggregation:

{
    "took": 218,
    ...
    "aggregations": {
        "profit": {
            "value": 240.0
        }
   }
}

The above example can also be specified using file scripts as follows:

POST ledger/_search?size=0
{
    "aggs": {
        "profit": {
            "scripted_metric": {
                "init_script" : {
                    "file": "my_init_script"
                },
                "map_script" : {
                    "file": "my_map_script"
                },
                "combine_script" : {
                    "file": "my_combine_script"
                },
                "params": {
                    "field": "amount", 
                    "_agg": {}        
                },
                "reduce_script" : {
                    "file": "my_reduce_script"
                }
            }
        }
    }
}

Script parameters for the init, map and combine scripts must be specified in a global params object so that they can be shared between the scripts.

If you specify script parameters then you must specify "_agg": {}.

For more details on specifying scripts see script documentation.

Allowed return types

Whilst any valid script object can be used within a single script, the scripts must return or store in the _agg object only the following types:

  • primitive types
  • String
  • Map (containing only keys and values of the types listed here)
  • Array (containing elements of only the types listed here)

Scope of scripts

The scripted metric aggregation uses scripts at 4 stages of its execution:

init_script

Executed prior to any collection of documents. Allows the aggregation to set up any initial state.

In the above example, the init_script creates an array transactions in the _agg object.

map_script

Executed once per document collected. This is the only required script. If no combine_script is specified, the resulting state needs to be stored in an object named _agg.

In the above example, the map_script checks the value of the type field. If the value is sale the value of the amount field is added to the transactions array. If the value of the type field is not sale the negated value of the amount field is added to transactions.

combine_script

Executed once on each shard after document collection is complete. Allows the aggregation to consolidate the state returned from each shard. If a combine_script is not provided the combine phase will return the aggregation variable.

In the above example, the combine_script iterates through all the stored transactions, summing the values in the profit variable and finally returns profit.

reduce_script

Executed once on the coordinating node after all shards have returned their results. The script is provided with access to a variable _aggs which is an array of the result of the combine_script on each shard. If a reduce_script is not provided the reduce phase will return the _aggs variable.

In the above example, the reduce_script iterates through the profit returned by each shard summing the values before returning the final combined profit which will be returned in the response of the aggregation.

Worked Example

Imagine a situation where you index the following documents into an index with 2 shards:

PUT /transactions/stock/_bulk?refresh
{"index":{"_id":1}}
{"type": "sale","amount": 80}
{"index":{"_id":2}}
{"type": "cost","amount": 10}
{"index":{"_id":2}}
{"type": "cost","amount": 30}
{"index":{"_id":2}}
{"type": "sale","amount": 130}

Let's say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard B. The following is a breakdown of what the aggregation result is at each stage of the example above.

Before init_script

No params object was specified so the default params object is used:

"params" : {
    "_agg" : {}
}

After init_script

This is run once on each shard before any document collection is performed, and so we will have a copy on each shard:

Shard A
"params" : {
    "_agg" : {
        "transactions" : []
    }
}
Shard B
"params" : {
    "_agg" : {
        "transactions" : []
    }
}

After map_script

Each shard collects its documents and runs the map_script on each document that is collected:

Shard A
"params" : {
    "_agg" : {
        "transactions" : [ 80, -30 ]
    }
}
Shard B
"params" : {
    "_agg" : {
        "transactions" : [ -10, 130 ]
    }
}

After combine_script

The combine_script is executed on each shard after document collection is complete and reduces all the transactions down to a single profit figure for each shard (by summing the values in the transactions array) which is passed back to the coordinating node:

Shard A
50
Shard B
120

After reduce_script

The reduce_script receives an _aggs array containing the result of the combine script for each shard:

"_aggs" : [
    50,
    120
]

It reduces the responses for the shards down to a final overall profit figure (by summing the values) and returns this as the result of the aggregation to produce the response:

{
    ...

    "aggregations": {
        "profit": {
            "value": 170
        }
   }
}

Other Parameters

params

Optional. An object whose contents will be passed as variables to the init_script, map_script and combine_script. This can be useful to allow the user to control the behavior of the aggregation and for storing state between the scripts. If this is not specified, the default is the equivalent of providing:

"params" : {
    "_agg" : {}
}
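For example (a hypothetical sketch reusing the profit aggregation from above), a user-defined rate parameter can sit alongside the required _agg object and be read in the scripts as params.rate:

POST ledger/_search?size=0
{
    "aggs": {
        "profit": {
            "scripted_metric": {
                "params": {
                    "rate": 0.9,
                    "_agg": {}
                },
                "init_script" : "params._agg.transactions = []",
                "map_script" : "params._agg.transactions.add(doc.type.value == 'sale' ? doc.amount.value * params.rate : -1 * doc.amount.value)",
                "combine_script" : "double profit = 0; for (t in params._agg.transactions) { profit += t } return profit",
                "reduce_script" : "double profit = 0; for (a in params._aggs) { profit += a } return profit"
            }
        }
    }
}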

Stats Aggregation

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

The stats that are returned consist of: min, max, sum, count and avg.

Assuming the data consists of documents representing exam grades (between 0 and 100) of students:

{
    "aggs" : {
        "grades_stats" : { "stats" : { "field" : "grade" } }
    }
}

The above aggregation computes the grades statistics over all documents. The aggregation type is stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:

{
    ...

    "aggregations": {
        "grades_stats": {
            "count": 6,
            "min": 60,
            "max": 98,
            "avg": 78.5,
            "sum": 471
        }
    }
}

The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

Computing the grades stats based on a script:

{
    ...,

    "aggs" : {
        "grades_stats" : {
             "stats" : {
                 "script" : {
                     "lang": "painless",
                     "inline": "doc['grade'].value"
                 }
             }
         }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    ...,

    "aggs" : {
        "grades_stats" : {
            "stats" : {
                "script" : {
                    "file": "my_script",
                    "params" : {
                        "field" : "grade"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Value Script

It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:

{
    "aggs" : {
        ...

        "aggs" : {
            "grades_stats" : {
                "stats" : {
                    "field" : "grade",
                    "script" :
                        "lang": "painless",
                        "inline": "_value * params.correction",
                        "params" : {
                            "correction" : 1.2
                        }
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "grades_stats" : {
            "stats" : {
                "field" : "grade",
                "missing": 0 
            }
        }
    }
}

Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.

Sum Aggregation

A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Assuming the data consists of documents representing stock ticks, where each tick holds the change in the stock price from the previous tick.

{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : { "timestamp" : { "from" : "now/1d+9.5h", "to" : "now/1d+16h" }}
            }
        }
    },
    "aggs" : {
        "intraday_return" : { "sum" : { "field" : "change" } }
    }
}

The above aggregation sums up all changes in today's trading stock ticks, which accounts for the intraday return. The aggregation type is sum and the field setting defines the numeric field of the documents whose values will be summed up. The above will return the following:

{
    ...

    "aggregations": {
        "intraday_return": {
           "value": 2.18
        }
    }
}

The name of the aggregation (intraday_return above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

Computing the intraday return based on a script:

{
    ...,

    "aggs" : {
        "intraday_return" : {
            "sum" : {
                "script" : {
                   "lang": "painless",
                   "inline": "doc['change'].value"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    ...,

    "aggs" : {
        "intraday_return" : {
            "sum" : {
                "script" : {
                    "file": "my_script",
                    "params" : {
                        "field" : "change"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.

Value Script

Computing the sum of squares over all stock tick changes:

{
    "aggs" : {
        ...

        "aggs" : {
            "daytime_return" : {
                "sum" : {
                    "field" : "change",
                    "script" : {
                        "lang": "painless",
                        "inline": "_value * _value"
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

{
    "aggs" : {
        "total_time" : {
            "sum" : {
                "field" : "took",
                "missing": 100 
            }
        }
    }
}

Documents without a value in the took field will fall into the same bucket as documents that have the value 100.

Top hits Aggregation

A top_hits metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket.

The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. One or more bucket aggregators determine the properties by which the result set is sliced.

Options

  • from - The offset from the first result you want to fetch.
  • size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
  • sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.

Supported per hit features

The top_hits aggregation returns regular search hits, so many per-hit features are supported, such as highlighting, explain, and source filtering.
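For instance (a hypothetical sketch, assuming a title field that the main query matches on), highlighting can be requested inside the top_hits body just as in a regular search:

{
    "query": {
        "match": { "title": "windows" }
    },
    "aggs": {
        "top-tags": {
            "terms": { "field": "tags" },
            "aggs": {
                "top_tags_hits": {
                    "top_hits": {
                        "highlight": {
                            "fields": { "title": {} }
                        },
                        "size": 1
                    }
                }
            }
        }
    }
}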

Example

In the following example we group the questions by tag, and per tag we show the last active question. For each question only the title field is included in the source.

{
    "aggs": {
        "top-tags": {
            "terms": {
                "field": "tags",
                "size": 3
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "last_activity_date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": [
                                "title"
                            ]
                        },
                        "size" : 1
                    }
                }
            }
        }
    }
}

Possible response snippet:

"aggregations": {
  "top-tags": {
     "buckets": [
        {
           "key": "windows-7",
           "doc_count": 25365,
           "top_tags_hits": {
              "hits": {
                 "total": 25365,
                 "max_score": 1,
                 "hits": [
                    {
                       "_index": "stack",
                       "_type": "question",
                       "_id": "602679",
                       "_score": 1,
                       "_source": {
                          "title": "Windows port opening"
                       },
                       "sort": [
                          1370143231177
                       ]
                    }
                 ]
              }
           }
        },
        {
           "key": "linux",
           "doc_count": 18342,
           "top_tags_hits": {
              "hits": {
                 "total": 18342,
                 "max_score": 1,
                 "hits": [
                    {
                       "_index": "stack",
                       "_type": "question",
                       "_id": "602672",
                       "_score": 1,
                       "_source": {
                          "title": "Ubuntu RFID Screensaver lock-unlock"
                       },
                       "sort": [
                          1370143379747
                       ]
                    }
                 ]
              }
           }
        },
        {
           "key": "windows",
           "doc_count": 18119,
           "top_tags_hits": {
              "hits": {
                 "total": 18119,
                 "max_score": 1,
                 "hits": [
                    {
                       "_index": "stack",
                       "_type": "question",
                       "_id": "602678",
                       "_score": 1,
                       "_source": {
                          "title": "If I change my computers date / time, what could be affected?"
                       },
                       "sort": [
                          1370142868283
                       ]
                    }
                 ]
              }
           }
        }
     ]
  }
}

Field collapse example

Field collapsing or result grouping is a feature that logically groups a result set into groups and per group returns top documents. The ordering of the groups is determined by the relevancy of the first document in a group. In Elasticsearch this can be implemented via a bucket aggregator that wraps a top_hits aggregator as sub-aggregator.

In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage belongs to. By defining a terms aggregator on the domain field we group the result set of webpages by domain. The top_hits aggregator is then defined as a sub-aggregator, so that the top matching hits are collected per bucket.

A max aggregator is also defined, which is used by the terms aggregator’s order feature to return the buckets by relevancy order of the most relevant document in each bucket.

{
  "query": {
    "match": {
      "body": "elections"
    }
  },
  "aggs": {
    "top-sites": {
      "terms": {
        "field": "domain",
        "order": {
          "top_hit": "desc"
        }
      },
      "aggs": {
        "top_tags_hits": {
          "top_hits": {}
        },
        "top_hit" : {
          "max": {
            "script": {
              "lang": "painless",
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}

At the moment the max (or min) aggregator is needed to make sure the buckets from the terms aggregator are ordered according to the score of the most relevant webpage per domain. Unfortunately the top_hits aggregator can’t be used in the order option of the terms aggregator yet.

top_hits support in a nested or reverse_nested aggregator

If the top_hits aggregator is wrapped in a nested or reverse_nested aggregator then nested hits are being returned. Nested hits are in a sense hidden mini documents that are part of a regular document on which a nested field type has been configured in the mapping. The top_hits aggregator has the ability to un-hide these documents if it is wrapped in a nested or reverse_nested aggregator. Read more about nested in the nested type mapping.

If the nested type has been configured, a single document is actually indexed as multiple Lucene documents which share the same id. In order to determine the identity of a nested hit, more is needed than just the id, so nested hits also include their nested identity. The nested identity is kept under the _nested field in the search hit and includes the array field and the offset in the array field that the nested hit belongs to. The offset is zero based.

Top hits response snippet with a nested hit, which resides in the third slot of array field nested_field1 in document with id 1:

...
"hits": {
 "total": 25365,
 "max_score": 1,
 "hits": [
   {
     "_index": "a",
     "_type": "b",
     "_id": "1",
     "_score": 1,
     "_nested" : {
       "field" : "nested_field1",
       "offset" : 2
      },
     "_source": ...
   },
   ...
 ]
}
...

If _source is requested then just the part of the source of the nested object is returned, not the entire source of the document. Also, stored fields on the nested inner object level are accessible via the top_hits aggregator residing in a nested or reverse_nested aggregator.

Only nested hits will have a _nested field in the hit; non-nested (regular) hits will not have a _nested field.

The information in _nested can also be used to parse the original source somewhere else if _source isn’t enabled.

If there are multiple levels of nested object types defined in mappings then the _nested information can also be hierarchical in order to express the identity of nested hits that are two layers deep or more.

In the example below a nested hit resides in the first slot of the field nested_grand_child_field, which in turn resides in the second slot of the nested_child_field field:

...
"hits": {
 "total": 2565,
 "max_score": 1,
 "hits": [
   {
     "_index": "a",
     "_type": "b",
     "_id": "1",
     "_score": 1,
     "_nested" : {
       "field" : "nested_child_field",
       "offset" : 1,
       "_nested" : {
         "field" : "nested_grand_child_field",
         "offset" : 0
       }
      },
     "_source": ...
   },
   ...
 ]
}
...

Value Count Aggregation

A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg one might be interested in the number of values the average is computed over.

{
    "aggs" : {
        "grades_count" : { "value_count" : { "field" : "grade" } }
    }
}

Response:

{
    ...

    "aggregations": {
        "grades_count": {
            "value": 10
        }
    }
}

The name of the aggregation (grades_count above) also serves as the key by which the aggregation result can be retrieved from the returned response.
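Since sibling aggregations are computed independently, the typical pairing mentioned above can be requested in a single call; a minimal sketch:

{
    "aggs" : {
        "grades_avg" : { "avg" : { "field" : "grade" } },
        "grades_count" : { "value_count" : { "field" : "grade" } }
    }
}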

Script

Counting the values generated by a script:

{
    ...,

    "aggs" : {
        "grades_count" : {
            "value_count" : {
                "script" : {
                    "inline" : "doc['grade'].value",
                    "lang" : "painless"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a file script use the following syntax:

{
    ...,

    "aggs" : {
        "grades_count" : {
            "value_count" : {
                "script" : {
                    "file": "my_script",
                    "params" : {
                        "field" : "grade"
                    }
                }
            }
        }
    }
}
Tip

For indexed scripts, replace the file parameter with an id parameter.