This functionality is experimental and may be changed or removed completely in a future release.
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree. There are many different types of pipeline aggregation, each computing different information from other aggregations, but these types can be broken down into two families:
 Parent
 A family of pipeline aggregations that is provided with the output of its parent aggregation and is able to compute new buckets or new aggregations to add to existing buckets.
 Sibling
 Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a new aggregation which will be at the same level as the sibling aggregation.
Pipeline aggregations can reference the aggregations they need to perform their computation by using the buckets_path parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the buckets_path Syntax section below.
Pipeline aggregations cannot have sub-aggregations, but depending on the type, a pipeline aggregation can reference another pipeline aggregation in its buckets_path, allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative (i.e. a derivative of a derivative).
Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation will be included in the final output.
buckets_path Syntax
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the buckets_path parameter, which follows a specific format:
AGG_SEPARATOR    = '>' ;
METRIC_SEPARATOR = '.' ;
AGG_NAME         = <the name of the aggregation> ;
METRIC           = <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH             = <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;
For example, the path "my_bucket>my_stats.avg" points to the avg value in the "my_stats" metric, which is contained in the "my_bucket" bucket aggregation.
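The path grammar above can be illustrated with a small parser. This is a minimal Python sketch, not the parser Elasticsearch itself uses: parse_buckets_path is a hypothetical helper name, and it deliberately ignores the bracket syntax used for names that contain dots.

```python
def parse_buckets_path(path):
    """Split a buckets_path into (aggregation names, metric name or None)."""
    # AGG_SEPARATOR '>' separates the aggregations along the path.
    *parents, last = path.split(">")
    # METRIC_SEPARATOR '.' optionally selects one value of a multi-value metric.
    name, sep, metric = last.partition(".")
    return parents + [name], (metric if sep else None)


print(parse_buckets_path("my_bucket>my_stats.avg"))
print(parse_buckets_path("the_sum"))
```

The first call yields the aggregation names my_bucket and my_stats plus the metric avg; the second yields a single aggregation name and no metric.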
Paths are relative to the position of the pipeline aggregation; they are not absolute paths, and a path cannot go back "up" the aggregation tree. For example, this moving average is embedded inside a date_histogram and refers to a "sibling" metric "the_sum":
POST /_search { "aggs": { "my_date_histo":{ "date_histogram":{ "field":"timestamp", "interval":"day" }, "aggs":{ "the_sum":{ "sum":{ "field": "lemmings" } }, "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum" } } } } } }
buckets_path is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets instead of embedded "inside" them. For example, the max_bucket aggregation uses the buckets_path to specify a metric embedded inside a sibling aggregation:
POST /_search { "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "max_monthly_sales": { "max_bucket": { "buckets_path": "sales_per_month>sales" } } } }

Special Paths
Instead of pathing to a metric, buckets_path can use a special "_count" path. This instructs the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document count of each bucket, instead of a specific metric:
POST /_search { "aggs": { "my_date_histo": { "date_histogram": { "field":"timestamp", "interval":"day" }, "aggs": { "the_movavg": { "moving_avg": { "buckets_path": "_count" } } } } } }
The buckets_path can also use "_bucket_count" and path to a multi-bucket aggregation to use the number of buckets returned by that aggregation in the pipeline aggregation instead of a metric. For example, a bucket_selector can be used here to filter out buckets which contain no buckets for an inner terms aggregation:
POST /sales/_search { "size": 0, "aggs": { "histo": { "date_histogram": { "field": "date", "interval": "day" }, "aggs": { "categories": { "terms": { "field": "category" } }, "min_bucket_selector": { "bucket_selector": { "buckets_path": { "count": "categories._bucket_count" }, "script": { "inline": "params.count != 0" } } } } } } }
Dealing with dots in agg names
An alternate syntax is supported to cope with aggregations or metrics which have dots in the name, such as the 99.9th percentile. This metric may be referred to as:
"buckets_path": "my_percentile[99.9]"
Dealing with gaps in the data
Data in the real world is often noisy and sometimes contains gaps — places where data simply doesn’t exist. This can occur for a variety of reasons, the most common being:
 Documents falling into a bucket do not contain a required field
 There are no documents matching the query for one or more buckets
 The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value. Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the first value because there is no previous value, Holt-Winters moving averages need "warmup" data to begin calculating, etc.)
Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing data is encountered. All pipeline aggregations accept the gap_policy parameter. There are currently two gap policies to choose from:
 skip
 This option treats missing data as if the bucket does not exist. It will skip the bucket and continue calculating using the next available value.
 insert_zeros
 This option will replace missing values with a zero (0) and pipeline aggregation computation will proceed as normal.
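The two gap policies can be illustrated on a plain list of per-bucket values. This is only a sketch of the behavior described above: apply_gap_policy is a hypothetical helper, and representing a missing bucket value as None is an assumption made for the example.

```python
def apply_gap_policy(values, gap_policy="skip"):
    """Return the series a pipeline computation would see under a gap policy."""
    if gap_policy == "skip":
        # Treat a missing value as if the bucket did not exist.
        return [v for v in values if v is not None]
    if gap_policy == "insert_zeros":
        # Replace each missing value with zero and carry on as normal.
        return [0.0 if v is None else v for v in values]
    raise ValueError("unknown gap_policy: " + repr(gap_policy))


series = [10.0, None, 30.0]
print(apply_gap_policy(series, "skip"))
print(apply_gap_policy(series, "insert_zeros"))
```

With skip the series shrinks to two values; with insert_zeros it keeps its length and the gap becomes 0.0, which can noticeably pull down averages and derivatives.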
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
An avg_bucket aggregation looks like this in isolation:
{ "avg_bucket": { "buckets_path": "the_sum" } }
Table 1. avg_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to find the average for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
The following snippet calculates the average of the total monthly sales:
POST /_search { "size": 0, "aggs": { "sales_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "avg_monthly_sales": { "avg_bucket": { "buckets_path": "sales_per_month>sales" } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "avg_monthly_sales": { "value": 328.33333333333333 } } }
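The value reported by avg_bucket can be checked by hand from the per-bucket sums in the response. A minimal Python sketch, with the three bucket values copied from the sample response and avg_bucket as a hypothetical helper name:

```python
def avg_bucket(values):
    """Arithmetic mean of one metric value per bucket."""
    return sum(values) / len(values)


monthly_sales = [550.0, 60.0, 375.0]  # "sales" value of each monthly bucket
print(avg_bucket(monthly_sales))
```

The result is 985 / 3, which agrees with the avg_monthly_sales value in the response up to floating-point precision.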
This functionality is experimental and may be changed or removed completely in a future release.
A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).
A derivative aggregation looks like this in isolation:
"derivative": { "buckets_path": "the_sum" }
Table 2. derivative Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to find the derivative for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
The following snippet calculates the derivative of the total monthly sales:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } }, "sales_deriv": { "derivative": { "buckets_path": "sales" } } } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 }, "sales_deriv": { "value": -490.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 }, "sales_deriv": { "value": 315.0 } } ] } } }
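No derivative is reported for the first bucket, since at least two data points are needed to calculate a derivative.
Derivative value units are implicitly defined by the sales aggregation and the parent histogram; here that is sales per month.
The number of documents in the bucket is represented by the doc_count field.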
A second order derivative can be calculated by chaining the derivative pipeline aggregation onto the result of another derivative pipeline aggregation as in the following example which will calculate both the first and the second order derivative of the total monthly sales:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } }, "sales_deriv": { "derivative": { "buckets_path": "sales" } }, "sales_2nd_deriv": { "derivative": { "buckets_path": "sales_deriv" } } } } } }
And the following may be the response:
{ "took": 50, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 }, "sales_deriv": { "value": -490.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 }, "sales_deriv": { "value": 315.0 }, "sales_2nd_deriv": { "value": 805.0 } } ] } } }
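Chaining can be pictured as feeding one derivative's output into another. The sketch below assumes the derivative is the plain difference between a bucket's value and the previous bucket's value, and uses None for buckets where no derivative can be computed; derivative is a hypothetical helper, not an Elasticsearch API.

```python
def derivative(values):
    """Difference to the previous bucket; None where it cannot be computed."""
    out = []
    prev = None
    for cur in values:
        out.append(None if prev is None or cur is None else cur - prev)
        prev = cur
    return out


monthly_sales = [550.0, 60.0, 375.0]
first = derivative(monthly_sales)   # first-order derivative
second = derivative(first)          # derivative of the derivative
print(first)
print(second)
```

The first derivative needs two data points and the second needs two first derivatives, so only the third bucket carries a second-order value: 315 minus -490, i.e. 805.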
The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response, normalized_value, which reports the derivative value in the desired x-axis units. In the example below we calculate the derivative of the total sales per month, but ask for the derivative in units of sales per day:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } }, "sales_deriv": { "derivative": { "buckets_path": "sales", "unit": "day" } } } } } }
And the following may be the response:
{ "took": 50, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 }, "sales_deriv": { "value": -490.0, "normalized_value": -15.806451612903226 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 }, "sales_deriv": { "value": 315.0, "normalized_value": 11.25 } } ] } } }
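The normalized values can be reproduced by dividing each derivative by the distance between the two bucket keys, expressed in the requested unit. This is a sketch under that assumption (it is consistent with the 31-day and 28-day gaps between the keys in the response); normalized_derivatives is a hypothetical helper.

```python
DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds, matching "unit": "day"

# (bucket key in epoch milliseconds, monthly sales sum) from the sample response
buckets = [
    (1420070400000, 550.0),
    (1422748800000, 60.0),
    (1425168000000, 375.0),
]


def normalized_derivatives(buckets, unit_ms):
    """Derivative per unit of x-axis distance between consecutive buckets."""
    out = []
    for (key0, value0), (key1, value1) in zip(buckets, buckets[1:]):
        units_between = (key1 - key0) / unit_ms
        out.append((value1 - value0) / units_between)
    return out


print(normalized_derivatives(buckets, DAY_MS))
```

January has 31 days and February 2015 has 28, so the two results are -490 / 31 and 315 / 28.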
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
A max_bucket aggregation looks like this in isolation:
{ "max_bucket": { "buckets_path": "the_sum" } }
Table 3. max_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to find the maximum for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
The following snippet calculates the maximum of the total monthly sales:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "max_monthly_sales": { "max_bucket": { "buckets_path": "sales_per_month>sales" } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "max_monthly_sales": { "keys": ["2015/01/01 00:00:00"], "value": 550.0 } } }
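Because max_bucket reports key(s) as well as the value, several buckets can be returned when they tie for the maximum. A small Python sketch of that selection, with max_bucket as a hypothetical helper operating on (key, value) pairs taken from the response:

```python
def max_bucket(buckets):
    """Return the maximum value and the keys of every bucket that reaches it."""
    best = max(value for _, value in buckets)
    return {"keys": [key for key, value in buckets if value == best], "value": best}


buckets = [
    ("2015/01/01 00:00:00", 550.0),
    ("2015/02/01 00:00:00", 60.0),
    ("2015/03/01 00:00:00", 375.0),
]
print(max_bucket(buckets))
```

Replacing max with min gives the corresponding selection for the smallest value.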
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
A min_bucket aggregation looks like this in isolation:
{ "min_bucket": { "buckets_path": "the_sum" } }
Table 4. min_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to find the minimum for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
The following snippet calculates the minimum of the total monthly sales:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "min_monthly_sales": { "min_bucket": { "buckets_path": "sales_per_month>sales" } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "min_monthly_sales": { "keys": ["2015/02/01 00:00:00"], "value": 60.0 } } }
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which calculates the sum across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
A sum_bucket aggregation looks like this in isolation:
{ "sum_bucket": { "buckets_path": "the_sum" } }
Table 5. sum_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to find the sum for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
The following snippet calculates the sum of all the total monthly sales buckets:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "sum_monthly_sales": { "sum_bucket": { "buckets_path": "sales_per_month>sales" } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "sum_monthly_sales": { "value": 985.0 } } }
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
A stats_bucket aggregation looks like this in isolation:
{ "stats_bucket": { "buckets_path": "the_sum" } }
Table 6. stats_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to calculate stats for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
The following snippet calculates the stats for the total monthly sales buckets:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "stats_monthly_sales": { "stats_bucket": { "buckets_path": "sales_per_month>sales" } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "stats_monthly_sales": { "count": 3, "min": 60.0, "max": 550.0, "avg": 328.3333333333333, "sum": 985.0 } } }
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
This aggregation provides a few more statistics (sum of squares, standard deviation, etc.) compared to the stats_bucket aggregation.
An extended_stats_bucket aggregation looks like this in isolation:
{ "extended_stats_bucket": { "buckets_path": "the_sum" } }
Table 7. extended_stats_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to calculate stats for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
 sigma  The number of standard deviations above/below the mean to display  Optional  2
The following snippet calculates the extended stats for the total monthly sales buckets:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "stats_monthly_sales": { "extended_stats_bucket": { "buckets_path": "sales_per_month>sales" } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "stats_monthly_sales": { "count": 3, "min": 60.0, "max": 550.0, "avg": 328.3333333333333, "sum": 985.0, "sum_of_squares": 446725.0, "variance": 41105.55555555556, "std_deviation": 202.74505063146563, "std_deviation_bounds": { "upper": 733.8234345962646, "lower": -77.15676792959795 } } } }
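The extended statistics can be recomputed from the three bucket values. The sketch below assumes a population variance (sum of squares divided by the count, minus the squared mean) and bounds of the mean plus or minus sigma standard deviations; those assumptions agree with the magnitudes in the sample response, and extended_stats is a hypothetical helper.

```python
import math


def extended_stats(values, sigma=2.0):
    """Summary statistics over one metric value per bucket."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Population variance; clamp tiny negative results caused by rounding.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats([550.0, 60.0, 375.0])
print(stats)
```

Note that the lower bound is negative here: two standard deviations (about 405.5) exceed the mean (about 328.3).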
This functionality is experimental and may be changed or removed completely in a future release.
A sibling pipeline aggregation which calculates percentiles across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
A percentiles_bucket aggregation looks like this in isolation:
{ "percentiles_bucket": { "buckets_path": "the_sum" } }
Table 8. percentiles_bucket Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  The path to the buckets we wish to calculate percentiles for (see the section called “buckets_path Syntax” for more details)  Required
 gap_policy  The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details)  Optional  skip
 format  Format to apply to the output value of this aggregation  Optional  null
 percents  The list of percentiles to calculate  Optional  [ 1, 5, 25, 50, 75, 95, 99 ]
The following snippet calculates the percentiles for the total monthly sales buckets:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "percentiles_monthly_sales": { "percentiles_bucket": { "buckets_path": "sales_per_month>sales", "percents": [ 25.0, 50.0, 75.0 ] } } } }
 

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 } } ] }, "percentiles_monthly_sales": { "values" : { "25.0": 375.0, "50.0": 375.0, "75.0": 550.0 } } } }
The Percentile Bucket returns the nearest input data point that is not greater than the requested percentile; it does not interpolate between data points.
The percentiles are calculated exactly and are not an approximation (unlike the Percentiles Metric). This means the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of datapoints in a single percentiles_bucket.
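An exact, non-interpolating percentile over a sorted in-memory list can be sketched as follows. The indexing rule used here (round the fractional rank to the nearest index, halves rounding up) is an assumption chosen because it reproduces the sample response; it is not confirmed as the rule Elasticsearch applies. percentiles_bucket is a hypothetical helper and expects a non-empty list.

```python
import math


def percentiles_bucket(values, percents):
    """Pick an existing data point for each percentile, without interpolation."""
    data = sorted(values)  # the whole series is held in memory
    result = {}
    for percent in percents:
        # Assumed rule: nearest index to the fractional rank, halves round up.
        index = math.floor(percent / 100.0 * (len(data) - 1) + 0.5)
        result[str(float(percent))] = data[index]
    return result


print(percentiles_bucket([550.0, 60.0, 375.0], [25.0, 50.0, 75.0]))
```

Every returned value is one of the inputs, and the sorted copy of the data is what drives the memory cost mentioned above.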
This functionality is experimental and may be changed or removed completely in a future release.
Given an ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average value of that window. For example, given the data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], we can calculate a simple moving average with a window size of 5 as follows:
 (1 + 2 + 3 + 4 + 5) / 5 = 3
 (2 + 3 + 4 + 5 + 6) / 5 = 4
 (3 + 4 + 5 + 6 + 7) / 5 = 5
 etc
Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, such as stock prices or server metrics. The smoothing can be used to eliminate high-frequency fluctuations or random noise, which allows lower-frequency trends, such as seasonality, to be more easily visualized.
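The sliding-window arithmetic from the example can be written in a few lines of Python. This sketch only emits a value once a full window is available, which matches the worked example but is an assumption about edge handling; simple_moving_average is a hypothetical helper.

```python
def simple_moving_average(values, window):
    """Mean of each full window as it slides across the series."""
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(simple_moving_average(data, 5))  # [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Ten input points with a window of 5 produce six averages, one per window position.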
A moving_avg
aggregation looks like this in isolation:
{ "moving_avg": { "buckets_path": "the_sum", "model": "holt", "window": 5, "gap_policy": "insert_zeros", "settings": { "alpha": 0.8 } } }
Table 9. moving_avg Parameters
Parameter Name  Description  Required  Default Value
 buckets_path  Path to the metric of interest (see the section called “buckets_path Syntax” for more details)  Required
 model  The moving average weighting model that we wish to use  Optional  simple
 gap_policy  Determines what should happen when a gap in the data is encountered.  Optional
 window  The size of window to "slide" across the histogram.  Optional  5
 minimize  If the model should be algorithmically minimized. See Minimization for more details  Optional  false for most models
 settings  Model-specific settings, contents which differ depending on the model specified.  Optional
moving_avg aggregations must be embedded inside of a histogram or date_histogram aggregation. They can be embedded like any other metric aggregation:
POST /_search { "size": 0, "aggs": { "my_date_histo":{ "date_histogram":{ "field":"timestamp", "interval":"day" }, "aggs":{ "the_sum":{ "sum":{ "field": "lemmings" } }, "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum" } } } } } }
Moving averages are built by first specifying a histogram or date_histogram over a field. You can then optionally add normal metrics, such as a sum, inside of that histogram. Finally, the moving_avg is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see the section called “buckets_path Syntax” for a description of the syntax for buckets_path).
The moving_avg aggregation includes five different moving average "models": simple, linear, ewma, holt and holt_winters. The main difference is how the values in the window are weighted. As datapoints become "older" in the window, they may be weighted differently. This will affect the final average for that window.
Models are specified using the model parameter. Some models may have optional configurations which are specified inside the settings parameter.
The simple model calculates the sum of all values in the window, then divides by the size of the window. It is effectively a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means the values from a simple moving average tend to "lag" behind the real data.
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "simple" } } }
A simple model has no special settings to configure.
The window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small-scale fluctuations.
In contrast, a simple moving average with a larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long-term trends. It also tends to "lag" behind the actual data by a substantial amount.
The linear model assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce the "lag" behind the data’s mean, since older points have less influence.
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "linear" } } }
A linear model has no special settings to configure.
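Linear weighting can be illustrated with a standard linearly weighted mean, in which the oldest point in the window gets weight 1 and the newest gets weight n. This is a textbook formulation used as an assumption here; the exact weights Elasticsearch applies may differ, and linear_weighted_mean is a hypothetical helper.

```python
def linear_weighted_mean(window_values):
    """Weighted mean of one window, ordered oldest to newest."""
    weights = range(1, len(window_values) + 1)  # 1 for oldest, n for newest
    weighted_sum = sum(w * v for w, v in zip(weights, window_values))
    return weighted_sum / sum(weights)


print(linear_weighted_mean([1.0, 2.0, 3.0]))  # (1*1 + 2*2 + 3*3) / 6
```

For a rising series the result sits above the plain mean (2.0 in this case), which is the reduced lag the linear model is meant to provide.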
Like the simple model, window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small-scale fluctuations.
In contrast, a linear moving average with a larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long-term trends. It also tends to "lag" behind the actual data by a substantial amount, although typically less than the simple model.
The ewma model (aka "single-exponential") is similar to the linear model, except older datapoints become exponentially less important, rather than linearly less important. The speed at which the importance decays can be controlled with an alpha setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the moving average. This tends to make the moving average track the data more closely but with less smoothing.
The default value of alpha is 0.3, and the setting accepts any float from 0-1 inclusive.
The EWMA model can be minimized.
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "ewma", "settings" : { "alpha" : 0.5 } } } }
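The effect of alpha can be seen in a standard single-exponential recurrence, where each new value is blended with the running average. This sketch assumes the average is seeded with the first value in the window; ewma is a hypothetical helper and is not taken from the Elasticsearch source.

```python
def ewma(window_values, alpha=0.3):
    """Exponentially weighted mean of one window, ordered oldest to newest."""
    average = window_values[0]  # assumed seed: the oldest value
    for value in window_values[1:]:
        # Larger alpha gives more weight to the newest value.
        average = alpha * value + (1 - alpha) * average
    return average


window = [1.0, 2.0, 3.0]
print(ewma(window, alpha=0.5))  # 2.25
print(ewma(window, alpha=1.0))  # 3.0
```

With alpha at 1.0 the result is simply the newest value; smaller values pull the result back towards older data, which is the extra smoothing described above.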
The holt model (aka "double exponential") incorporates a second exponential term which tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The double exponential model calculates two values internally: a "level" and a "trend".
The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series. The trend calculation looks at the difference between the current and last value (i.e. the slope, or trend, of the smoothed data). The trend value is also exponentially weighted.
Values are produced by multiplying the level and trend components.
The default value of alpha is 0.3 and beta is 0.1. The settings accept any float from 0-1 inclusive.
The Holt-Linear model can be minimized.
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "holt", "settings" : { "alpha" : 0.5, "beta" : 0.5 } } } }
In practice, the alpha value behaves very similarly in holt as in ewma: small values produce more smoothing and more lag, while larger values produce closer tracking and less lag. The effect of beta is often difficult to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger values emphasize short-term trends. This will become more apparent when you are predicting values.
The holt_winters model (aka "triple exponential") incorporates a third exponential term which tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend" and "seasonality".
The level and trend calculation is identical to holt. The seasonal calculation looks at the difference between the current point and the point one period earlier.
Holt-Winters requires a little more hand-holding than the other moving averages. You need to specify the "periodicity" of your data: e.g. if your data has cyclic trends every 7 days, you would set period: 7. Similarly, if there was a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned for future enhancements.
There are two varieties of Holt-Winters: additive and multiplicative.
Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This means that your window must always be at least twice the size of your period. An exception will be thrown if it isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm does not backcast.
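The window constraint is easy to check before sending a request. The helper below is a hypothetical client-side sketch of the rule stated above, not part of any Elasticsearch client; it raises if the window is too small and otherwise returns how many leading buckets will carry no value.

```python
def check_holt_winters_window(window, period):
    """Validate window against period; return the number of cold-start buckets."""
    cold_start = 2 * period  # two full periods are needed to bootstrap
    if window < cold_start:
        raise ValueError(
            "window (%d) must be at least twice the period (%d)" % (window, period)
        )
    return cold_start


print(check_holt_winters_window(window=30, period=7))  # 14
```

A window of 30 with a period of 7 passes the check, and the first 14 buckets would have no moving average value.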
Because the "cold start" obscures what the moving average looks like, the rest of the Holt-Winters images are truncated to not show the "cold start". Just be aware this will always be present at the beginning of your moving averages!
Additive seasonality is the default; it can also be specified by setting "type": "add". This variety is preferred when the seasonal effect is additive to your data. E.g. you could simply subtract the seasonal effect to "deseasonalize" your data into a flat trend.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive. The default value of period is 1.
The additive Holt-Winters model can be minimized.
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "holt_winters", "settings" : { "type" : "add", "alpha" : 0.5, "beta" : 0.5, "gamma" : 0.5, "period" : 7 } } } }
Figure 10. Holt-Winters moving average with window of size 120, alpha = 0.5, beta = 0.7, gamma = 0.3, period = 30
Multiplicative is specified by setting "type": "mult". This variety is preferred when the seasonal effect is multiplied against your data. E.g. if the seasonal effect is x5 the data, rather than simply adding to it.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive. The default value of period is 1.
The multiplicative Holt-Winters model can be minimized.
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the mult Holt-Winters pads all values by a very small amount (1*10^-10) so that all values are non-zero. This affects the result, but only minimally. If your data is non-zero, or you prefer to see NaN when zeros are encountered, you can disable this behavior with pad: false.
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "holt_winters", "settings" : { "type" : "mult", "alpha" : 0.5, "beta" : 0.5, "gamma" : 0.5, "period" : 7, "pad" : true } } } }
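The difference between the two varieties, and the reason for padding, can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the seasonal offsets or factors are handed in directly rather than estimated, the function name deseasonalize is hypothetical, and padding both operands is a simplification of what the text describes.

```python
PAD = 1e-10  # small constant mirroring the padding amount described above


def deseasonalize(values, seasonal, kind="add", pad=True):
    """Remove a repeating seasonal component from values.

    seasonal holds one offset (kind="add") or one factor (kind="mult")
    per position in the period; both are hypothetical inputs.
    """
    if not seasonal:
        raise ValueError("seasonal must contain at least one entry")
    if kind not in ("add", "mult"):
        raise ValueError("kind must be 'add' or 'mult'")
    result = []
    for i, value in enumerate(values):
        s = seasonal[i % len(seasonal)]
        if kind == "add":
            # Additive: the seasonal effect is simply subtracted.
            result.append(value - s)
            continue
        if pad:
            # Nudge both operands away from zero (an assumption of this sketch).
            value, s = value + PAD, s + PAD
        # Without padding, a zero seasonal factor yields NaN instead of raising.
        result.append(value / s if s != 0 else float("nan"))
    return result


print(deseasonalize([11, 9, 11, 9], [1, -1]))                            # additive
print(deseasonalize([20.0, 5.0], [2.0, 0.5], kind="mult", pad=False))   # multiplicative
```

Both calls flatten their input to a constant series of 10, the first by subtraction and the second by division.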
All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the current smoothed, moving average. Depending on the model and parameter, these predictions may or may not be accurate.
Predictions are enabled by adding a predict
parameter to any moving average aggregation, specifying the number of
predictions you would like appended to the end of the series. These predictions will be spaced out at the same interval
as your buckets:
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "window" : 30, "model" : "simple", "predict" : 10 } } }
The simple, linear and ewma models all produce "flat" predictions: they essentially converge on the mean of the last value in the series, producing a flat line.
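A rough way to picture a "flat" prediction for the simple model is to repeat the last moving-average value for every predicted bucket. The Python sketch below does exactly that; the function name flat_predictions is hypothetical and the code is an illustration, not the Elasticsearch implementation.

```python
def flat_predictions(values, window, predict):
    """Repeat the mean of the trailing window once per predicted bucket."""
    if not values or window < 1:
        raise ValueError("need at least one value and a positive window")
    tail = values[-window:]
    last_average = sum(tail) / len(tail)
    return [last_average] * predict


# The last three values average to 5.0, so every prediction is 5.0.
print(flat_predictions([1, 2, 3, 4, 5, 6], window=3, predict=4))
```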
In contrast, the holt
model can extrapolate based on local or global constant trends. If we set a high beta
value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end
of the series was heading in a downward direction):
Figure 12. Holt-Linear moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.8
In contrast, if we choose a small beta, the predictions are based on the global constant trend. In this series, the global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope:
Figure 13. Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.1
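The effect of beta on predictions can be reproduced with a small Python sketch. It uses textbook double exponential smoothing and extrapolates with level + h * trend; the function name holt_forecast, the seeding from the first two values, and the toy series are all assumptions for illustration rather than the Elasticsearch implementation.

```python
def holt_forecast(values, alpha, beta, predict):
    """Smooth values with Holt's method, then extrapolate predict steps."""
    if len(values) < 2:
        raise ValueError("need at least two values to seed level and trend")
    level = values[0]
    trend = values[1] - values[0]
    for v in values[1:]:
        prev_level = level
        level = alpha * v + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Each prediction extends the final level along the final trend.
    return [level + h * trend for h in range(1, predict + 1)]


# A series that rises steadily and then dips for the last two buckets.
series = list(range(11)) + [9, 8]
local = holt_forecast(series, alpha=0.5, beta=0.8, predict=3)
overall = holt_forecast(series, alpha=0.5, beta=0.1, predict=3)
print(local)    # heads down, following the recent dip
print(overall)  # heads up, following the long-run rise
```

With the large beta the trend term reacts to the final dip and the predictions fall; with the small beta the long-run rise dominates and the predictions climb.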
The holt_winters
model has the potential to deliver the best predictions, since it also incorporates seasonal
fluctuations into the model:
Figure 14. Holt-Winters moving average with window of size 120, predict = 25, alpha = 0.8, beta = 0.2, gamma = 0.7, period = 30
Some of the models (EWMA, Holt-Linear, Holt-Winters) require one or more parameters to be configured. Parameter choice can be tricky and sometimes non-intuitive. Furthermore, small deviations in these parameters can sometimes have a drastic effect on the output moving average.
For that reason, the three "tunable" models can be algorithmically minimized. Minimization is a process where parameters are tweaked until the predictions generated by the model closely match the output data. Minimization is not foolproof and can be susceptible to overfitting, but it often gives better results than hand-tuning.
Minimization is disabled by default for ewma and holt, while it is enabled by default for holt_winters.
Minimization is most useful with Holt-Winters, since it helps improve the accuracy of the predictions. EWMA and Holt-Linear are not great predictors, and are mostly used for smoothing data, so minimization is less useful on those models.
Minimization is enabled/disabled via the minimize
parameter:
{ "the_movavg":{ "moving_avg":{ "buckets_path": "the_sum", "model" : "holt_winters", "window" : 30, "minimize" : true, "settings" : { "period" : 7 } } } }
When enabled, minimization will find the optimal values for alpha, beta and gamma. The user should still provide appropriate values for window, period and type.
Minimization works by running a stochastic process called simulated annealing. This process will usually generate a good solution, but is not guaranteed to find the global optimum. It also requires some amount of additional computational power, since the model needs to be rerun multiple times as the values are tweaked. The runtime of minimization is linear in the size of the window being processed: excessively large windows may cause latency.
Finally, minimization fits the model to the last n values, where n = window. This generally produces better forecasts into the future, since the parameters are tuned around the end of the series. It can, however, generate poorer fitting moving averages at the beginning of the series.
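To give a feel for how simulated annealing tunes a smoothing parameter, here is a minimal Python sketch that searches for an EWMA alpha minimizing the one-step-ahead squared error. Everything in it is an assumption for illustration: the cost function, the cooling schedule, the step size, the starting point of 0.5 and the function names are not taken from Elasticsearch.

```python
import math
import random


def ewma_one_step_error(values, alpha):
    """Sum of squared errors when the EWMA is used as the next-value forecast."""
    level = values[0]
    error = 0.0
    for v in values[1:]:
        error += (v - level) ** 2
        level = alpha * v + (1 - alpha) * level
    return error


def anneal_alpha(values, steps=500, start_temp=1.0, seed=0):
    """Search alpha in [0, 1] with a basic simulated annealing loop."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    current = 0.5              # hypothetical starting point
    current_cost = ewma_one_step_error(values, current)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Linear cooling: the temperature shrinks towards zero but stays positive.
        temp = start_temp * (1 - step / steps)
        # Propose a nearby alpha, clamped to the valid range.
        candidate = min(1.0, max(0.0, current + rng.uniform(-0.1, 0.1)))
        cost = ewma_one_step_error(values, candidate)
        # Always accept improvements; accept worse moves with a probability
        # that falls as the temperature drops.
        if cost < current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost


values = [float(i) for i in range(20)]
alpha, cost = anneal_alpha(values)
print(alpha, cost)
```

Because the model is re-evaluated once per step over the whole input, the cost of the search grows with the number of values, which is the same reason large windows make minimization slower.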
This functionality is experimental and may be changed or removed completely in a future release.
A parent pipeline aggregation which calculates the cumulative sum of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).
A cumulative_sum
aggregation looks like this in isolation:
{ "cumulative_sum": { "buckets_path": "the_sum" } }
Table 10. cumulative_sum Parameters
Parameter Name | Description | Required | Default Value
--- | --- | --- | ---
buckets_path | The path to the buckets we wish to find the cumulative sum for (see the section called “buckets_path Syntax” for more details) | Required |
format | Format to apply to the output value of this aggregation | Optional | null
The following snippet calculates the cumulative sum of the total monthly sales:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "sales": { "sum": { "field": "price" } }, "cumulative_sales": { "cumulative_sum": { "buckets_path": "sales" } } } } } }

And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "sales": { "value": 550.0 }, "cumulative_sales": { "value": 550.0 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "sales": { "value": 60.0 }, "cumulative_sales": { "value": 610.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "sales": { "value": 375.0 }, "cumulative_sales": { "value": 985.0 } } ] } } }
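The cumulative_sales values in the response are just a running total of the sales values. As a quick sanity check, the same numbers can be reproduced in Python with itertools.accumulate (the variable names are only for illustration):

```python
from itertools import accumulate

# The three monthly "sales" values from the response above.
monthly_sales = [550.0, 60.0, 375.0]

# A running total, one entry per bucket.
cumulative_sales = list(accumulate(monthly_sales))
print(cumulative_sales)  # → [550.0, 610.0, 985.0]
```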
This functionality is experimental and may be changed or removed completely in a future release.
A parent pipeline aggregation which executes a script which can perform per-bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.
A bucket_script
aggregation looks like this in isolation:
{ "bucket_script": { "buckets_path": { "my_var1": "the_sum", "my_var2": "the_value_count" }, "script": "params.my_var1 / params.my_var2" } }
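Here, my_var1 is the name of the variable for this buckets path to use in the script, and the_sum is the path to the metric to use for that variable.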
Table 11. bucket_script Parameters
Parameter Name | Description | Required | Default Value
--- | --- | --- | ---
script | The script to run for this aggregation. The script can be inline, file or indexed (see Scripting for more details) | Required |
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see the section called “buckets_path Syntax” for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null
The following snippet calculates the percentage of t-shirt sales compared to total sales each month:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "total_sales": { "sum": { "field": "price" } }, "tshirts": { "filter": { "term": { "type": "tshirt" } }, "aggs": { "sales": { "sum": { "field": "price" } } } }, "tshirtpercentage": { "bucket_script": { "buckets_path": { "tShirtSales": "tshirts>sales", "totalSales": "total_sales" }, "script": "params.tShirtSales / params.totalSales * 100" } } } } } }
And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "total_sales": { "value": 550.0 }, "tshirts": { "doc_count": 1, "sales": { "value": 200.0 } }, "tshirtpercentage": { "value": 36.36363636363637 } }, { "key_as_string": "2015/02/01 00:00:00", "key": 1422748800000, "doc_count": 2, "total_sales": { "value": 60.0 }, "tshirts": { "doc_count": 1, "sales": { "value": 10.0 } }, "tshirtpercentage": { "value": 16.666666666666664 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "total_sales": { "value": 375.0 }, "tshirts": { "doc_count": 1, "sales": { "value": 175.0 } }, "tshirtpercentage": { "value": 46.666666666666664 } } ] } } }
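What bucket_script does per bucket can be mimicked in a few lines of Python. This is only a sketch: buckets are flattened into plain dictionaries keyed by path, the script is a Python callable instead of a script string, and the function name and the decision to emit None for a bucket with a missing value are assumptions.

```python
def bucket_script(buckets, buckets_path, script):
    """Evaluate script once per bucket with variables resolved from buckets_path."""
    results = []
    for bucket in buckets:
        # Resolve each script variable to the metric value it points at.
        params = {var: bucket.get(path) for var, path in buckets_path.items()}
        if any(value is None for value in params.values()):
            results.append(None)  # no value for buckets with a gap
        else:
            results.append(script(params))
    return results


# Metric values taken from the response above, flattened by path.
buckets = [
    {"total_sales": 550.0, "tshirts>sales": 200.0},
    {"total_sales": 60.0, "tshirts>sales": 10.0},
    {"total_sales": 375.0, "tshirts>sales": 175.0},
]
percentages = bucket_script(
    buckets,
    {"tShirtSales": "tshirts>sales", "totalSales": "total_sales"},
    lambda params: params["tShirtSales"] / params["totalSales"] * 100,
)
print(percentages)
```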
This functionality is experimental and may be changed or removed completely in a future release.
A parent pipeline aggregation which executes a script which determines whether the current bucket will be retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value. If the script language is expression then a numeric return value is permitted. In this case 0.0 will be evaluated as false and all other values will evaluate to true.
Note: The bucket_selector aggregation, like all pipeline aggregations, executes after all other sibling aggregations. This means that using the bucket_selector aggregation to filter the returned buckets in the response does not save on execution time running the aggregations.
A bucket_selector
aggregation looks like this in isolation:
{ "bucket_selector": { "buckets_path": { "my_var1": "the_sum", "my_var2": "the_value_count" }, "script": "params.my_var1 > params.my_var2" } }
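Here, my_var1 is the name of the variable for this buckets path to use in the script, and the_sum is the path to the metric to use for that variable.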
Table 12. bucket_selector Parameters
Parameter Name | Description | Required | Default Value
--- | --- | --- | ---
script | The script to run for this aggregation. The script can be inline, file or indexed (see Scripting for more details) | Required |
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see the section called “buckets_path Syntax” for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
The following snippet only retains buckets where the total sales for the month is more than 200:
POST /sales/_search { "size": 0, "aggs" : { "sales_per_month" : { "date_histogram" : { "field" : "date", "interval" : "month" }, "aggs": { "total_sales": { "sum": { "field": "price" } }, "sales_bucket_filter": { "bucket_selector": { "buckets_path": { "totalSales": "total_sales" }, "script": "params.totalSales > 200" } } } } } }
And the following may be the response:
{ "took": 11, "timed_out": false, "_shards": ..., "hits": ..., "aggregations": { "sales_per_month": { "buckets": [ { "key_as_string": "2015/01/01 00:00:00", "key": 1420070400000, "doc_count": 3, "total_sales": { "value": 550.0 } }, { "key_as_string": "2015/03/01 00:00:00", "key": 1425168000000, "doc_count": 2, "total_sales": { "value": 375.0 } } ] } } }
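The filtering step can be pictured as a plain predicate applied to every bucket after the metrics have been computed. The Python sketch below mirrors the script in the request above (params.totalSales > 200); the function name, the flattened bucket dictionaries and the use of a Python callable in place of a script are assumptions for illustration.

```python
def bucket_selector(buckets, buckets_path, script):
    """Keep only the buckets for which script returns a truthy value."""
    kept = []
    for bucket in buckets:
        # Resolve each script variable to the metric value it points at.
        params = {var: bucket[path] for var, path in buckets_path.items()}
        if script(params):
            kept.append(bucket)
    return kept


# Monthly totals: January, February and March 2015.
buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "total_sales": 550.0},
    {"key_as_string": "2015/02/01 00:00:00", "total_sales": 60.0},
    {"key_as_string": "2015/03/01 00:00:00", "total_sales": 375.0},
]
kept = bucket_selector(
    buckets,
    {"totalSales": "total_sales"},
    lambda params: params["totalSales"] > 200,
)
print([bucket["key_as_string"] for bucket in kept])
```

The February bucket is dropped because its total of 60 does not exceed 200. Note that all three totals are still computed before the filter runs, which is why the selector does not save execution time.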
This functionality is experimental and may be changed or removed completely in a future release.
Serial differencing is a technique where values in a time series are subtracted from themselves at different time lags or periods. For example, the datapoint f(x) = f(x_t) - f(x_{t-n}), where n is the period being used.
A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.
By calculating the first-difference, we detrend the data (e.g. remove a constant, linear trend). We can see that the data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn’t seem to exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the previous value +/- a random amount. This insight allows selection of further tools for analysis.
Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.
The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.
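The two-step differencing described above can be tried out with a short Python sketch. To keep the numbers easy to follow it uses a synthetic series with a linear trend and a cycle of period 3 instead of 30, and no noise; the function name serial_diff and the choice to emit None where no earlier value exists are assumptions for illustration.

```python
def serial_diff(values, lag=1):
    """Subtract from each value the value lag positions earlier."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    result = []
    for i, current in enumerate(values):
        earlier = values[i - lag] if i >= lag else None
        if current is None or earlier is None:
            result.append(None)  # not enough history for this position
        else:
            result.append(current - earlier)
    return result


# Linear trend (slope 2) plus a repeating cycle of period 3.
cycle = [0, 5, -5]
series = [2 * i + cycle[i % 3] for i in range(12)]

first = serial_diff(series, lag=1)   # removes the linear trend
second = serial_diff(first, lag=3)   # removes the period-3 cycle
print(first)
print(second)
```

After the second pass every position with enough history is zero, which is the noise-free equivalent of the stationary series described in the text.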
A serial_diff
aggregation looks like this in isolation:
{ "serial_diff": { "buckets_path": "the_sum", "lag": 7 } }
Table 13. serial_diff Parameters
Parameter Name | Description | Required | Default Value
--- | --- | --- | ---
buckets_path | Path to the metric of interest (see the section called “buckets_path Syntax” for more details) | Required |
lag | The historical bucket to subtract from the current value. E.g. a lag of 7 will subtract the value 7 buckets ago from the current value. Must be a positive, non-zero integer | Optional | 1
gap_policy | Determines what should happen when a gap in the data is encountered | Optional | insert_zero
format | Format to apply to the output value of this aggregation | Optional | null
serial_diff
aggregations must be embedded inside of a histogram
or date_histogram
aggregation:
POST /_search { "size": 0, "aggs": { "my_date_histo": { "date_histogram": { "field": "timestamp", "interval": "day" }, "aggs": { "the_sum": { "sum": { "field": "lemmings" } }, "thirtieth_difference": { "serial_diff": { "buckets_path": "the_sum", "lag" : 30 } } } } } }
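A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals.
A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc).
Finally, we specify a serial_diff aggregation which uses "the_sum" metric as its input.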
Serial differences are built by first specifying a histogram
or date_histogram
over a field. You can then optionally
add normal metrics, such as a sum
, inside of that histogram. Finally, the serial_diff
is embedded inside the histogram.
The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see the section called “buckets_path Syntax” for a description of the syntax for buckets_path).
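The nesting described above (histogram, then metric, then serial_diff pointing at the metric) can be assembled programmatically. The Python sketch below only builds the request body as a dictionary and prints it as JSON; the helper name build_serial_diff_request and its parameters are hypothetical, and sending the request is left to whichever client you use.

```python
import json


def build_serial_diff_request(date_field, metric_field, lag, interval="day"):
    """Build a search body with a serial_diff embedded in a date_histogram."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return {
        "size": 0,
        "aggs": {
            "my_date_histo": {
                # Step 1: the enclosing histogram over a date field.
                "date_histogram": {"field": date_field, "interval": interval},
                "aggs": {
                    # Step 2: a normal metric inside the histogram.
                    "the_sum": {"sum": {"field": metric_field}},
                    # Step 3: the serial_diff, pointing at the sibling metric.
                    "the_diff": {
                        "serial_diff": {"buckets_path": "the_sum", "lag": lag}
                    },
                },
            }
        },
    }


body = build_serial_diff_request("timestamp", "lemmings", lag=30)
print(json.dumps(body, indent=2))
```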