Pipeline Aggregations

Warning

This functionality is experimental and may be changed or removed completely in a future release.

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree. There are many different types of pipeline aggregation, each computing different information from other aggregations, but these types can be broken down into two families:

Parent
A family of pipeline aggregations that is provided with the output of its parent aggregation and is able to compute new buckets or new aggregations to add to existing buckets.
Sibling
Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a new aggregation which will be at the same level as the sibling aggregation.

Pipeline aggregations can reference the aggregations they need to perform their computation by using the buckets_path parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the buckets_path Syntax section below.

Pipeline aggregations cannot have sub-aggregations, but depending on the type they can reference another pipeline in the buckets_path, allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative (i.e. a derivative of a derivative).

Note

Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation will be included in the final output.

buckets_path Syntax

Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the buckets_path parameter, which follows a specific format:

AGG_SEPARATOR       =  '>' ;
METRIC_SEPARATOR    =  '.' ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH                =  <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;

For example, the path "my_bucket>my_stats.avg" will path to the avg value in the "my_stats" metric, which is contained in the "my_bucket" bucket aggregation.

Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the aggregation tree. For example, this moving average is embedded inside a date_histogram and refers to a "sibling" metric "the_sum":

POST /_search
{
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"timestamp",
                "interval":"day"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "lemmings" } 
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } 
                }
            }
        }
    }
}

The metric is called "the_sum"

The buckets_path refers to the metric via a relative path "the_sum"

buckets_path is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets instead of embedded "inside" them. For example, the max_bucket aggregation uses the buckets_path to specify a metric embedded inside a sibling aggregation:

POST /_search
{
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "max_monthly_sales": {
            "max_bucket": {
                "buckets_path": "sales_per_month>sales" 
            }
        }
    }
}

buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the sales_per_month date histogram.

Special Paths

Instead of pathing to a metric, buckets_path can use a special "_count" path. This instructs the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document count of each bucket, instead of a specific metric:

POST /_search
{
    "aggs": {
        "my_date_histo": {
            "date_histogram": {
                "field":"timestamp",
                "interval":"day"
            },
            "aggs": {
                "the_movavg": {
                    "moving_avg": { "buckets_path": "_count" } 
                }
            }
        }
    }
}

By using _count instead of a metric name, we can calculate the moving average of document counts in the histogram

The buckets_path can also use "_bucket_count" and path to a multi-bucket aggregation to use the number of buckets returned by that aggregation in the pipeline aggregation instead of a metric. For example, a bucket_selector can be used here to filter out buckets which contain no buckets for an inner terms aggregation:

POST /sales/_search
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "categories._bucket_count" 
            },
            "script": {
              "inline": "params.count != 0"
            }
          }
        }
      }
    }
  }
}

By using _bucket_count instead of a metric name, we can filter out the histo buckets that contain no buckets for the categories aggregation

Dealing with dots in agg names

An alternate syntax is supported to cope with aggregations or metrics which have dots in the name, such as the 99.9th percentile. This metric may be referred to as:

"buckets_path": "my_percentile[99.9]"

Dealing with gaps in the data

Data in the real world is often noisy and sometimes contains gaps — places where data simply doesn’t exist. This can occur for a variety of reasons, the most common being:

  • Documents falling into a bucket do not contain a required field
  • There are no documents matching the query for one or more buckets
  • The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value. Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the first value because there is no previous value, a Holt-Winters moving average needs "warmup" data to begin calculating, etc.)

Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing data is encountered. All pipeline aggregations accept the gap_policy parameter. There are currently two gap policies to choose from:

skip
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue calculating using the next available value.
insert_zeros
This option will replace missing values with a zero (0) and pipeline aggregation computation will proceed as normal.
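
For example, a derivative could be told to replace gaps with zeros rather than skipping them; this is a minimal sketch reusing the sales example from the sections below:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": { "field": "price" }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales",
                        "gap_policy": "insert_zeros"
                    }
                }
            }
        }
    }
}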

Avg Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

An avg_bucket aggregation looks like this in isolation:

{
    "avg_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 1. avg_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the average for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the average of the total monthly sales:

POST /_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "avg_monthly_sales": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>sales" 
      }
    }
  }
}

buckets_path instructs this avg_bucket aggregation that we want the (mean) average value of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "avg_monthly_sales": {
          "value": 328.33333333333333
      }
   }
}
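
For reference, this value is simply the mean of the three monthly sums shown above: (550.0 + 60.0 + 375.0) / 3 = 328.33.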

Derivative Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).

Syntax

A derivative aggregation looks like this in isolation:

"derivative": {
  "buckets_path": "the_sum"
}

Table 2. derivative Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the derivative for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


First Order Derivative

The following snippet calculates the derivative of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales" 
                    }
                }
            }
        }
    }
}

buckets_path instructs this derivative aggregation to use the output of the sales aggregation for the derivative

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               } 
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "sales_deriv": {
                  "value": -490.0 
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2, 
               "sales": {
                  "value": 375.0
               },
               "sales_deriv": {
                  "value": 315.0
               }
            }
         ]
      }
   }
}

No derivative for the first bucket since we need at least 2 data points to calculate the derivative

Derivative value units are implicitly defined by the sales aggregation and the parent histogram, so in this case the units would be $/month, assuming the price field has units of $.

The number of documents in the bucket is represented by the doc_count field
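
For example, February's derivative is 60.0 - 550.0 = -490.0 and March's is 375.0 - 60.0 = 315.0.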

Second Order Derivative

A second order derivative can be calculated by chaining the derivative pipeline aggregation onto the result of another derivative pipeline aggregation as in the following example which will calculate both the first and the second order derivative of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales"
                    }
                },
                "sales_2nd_deriv": {
                    "derivative": {
                        "buckets_path": "sales_deriv" 
                    }
                }
            }
        }
    }
}

buckets_path for the second derivative points to the name of the first derivative

And the following may be the response:

{
   "took": 50,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               } 
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "sales_deriv": {
                  "value": -490.0
               } 
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "sales_deriv": {
                  "value": 315.0
               },
               "sales_2nd_deriv": {
                  "value": 805.0
               }
            }
         ]
      }
   }
}

No second derivative for the first two buckets since we need at least 2 data points from the first derivative to calculate the second derivative
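
For example, March's second derivative is 315.0 - (-490.0) = 805.0, the change between the two first-derivative values.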

Units

The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response, normalized_value, which reports the derivative value in the desired x-axis units. In the below example we calculate the derivative of the total sales per month, but ask for the derivative in units of sales per day:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales",
                        "unit": "day" 
                    }
                }
            }
        }
    }
}

unit specifies what unit to use for the x-axis of the derivative calculation

And the following may be the response:

{
   "took": 50,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               } 
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "sales_deriv": {
                  "value": -490.0, 
                  "normalized_value": -15.806451612903226 
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "sales_deriv": {
                  "value": 315.0,
                  "normalized_value": 11.25
               }
            }
         ]
      }
   }
}

value is reported in the original units of per month

normalized_value is reported in the desired units of per day
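
For example, there are 31 days between the January and February bucket keys, so February's normalized_value is -490.0 / 31 ≈ -15.81 sales per day; there are 28 days between February and March, so March's is 315.0 / 28 = 11.25.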

Max Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A max_bucket aggregation looks like this in isolation:

{
    "max_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 3. max_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the maximum for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the maximum of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "max_monthly_sales": {
            "max_bucket": {
                "buckets_path": "sales_per_month>sales" 
            }
        }
    }
}

buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "max_monthly_sales": {
          "keys": ["2015/01/01 00:00:00"], 
          "value": 550.0
      }
   }
}

keys is an array of strings since the maximum value may be present in multiple buckets

Min Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A min_bucket aggregation looks like this in isolation:

{
    "min_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 4. min_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the minimum for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the minimum of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "min_monthly_sales": {
            "min_bucket": {
                "buckets_path": "sales_per_month>sales" 
            }
        }
    }
}

buckets_path instructs this min_bucket aggregation that we want the minimum value of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "min_monthly_sales": {
          "keys": ["2015/02/01 00:00:00"], 
          "value": 60.0
      }
   }
}

keys is an array of strings since the minimum value may be present in multiple buckets

Sum Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which calculates the sum across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A sum_bucket aggregation looks like this in isolation:

{
    "sum_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 5. sum_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the sum for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the sum of all the total monthly sales buckets:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "sum_monthly_sales": {
            "sum_bucket": {
                "buckets_path": "sales_per_month>sales" 
            }
        }
    }
}

buckets_path instructs this sum_bucket aggregation that we want the sum of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "sum_monthly_sales": {
          "value": 985.0
      }
   }
}

Stats Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A stats_bucket aggregation looks like this in isolation:

{
    "stats_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 6. stats_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate stats for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the stats for the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "stats_monthly_sales": {
            "stats_bucket": {
                "buckets_path": "sales_per_month>sales" 
            }
        }
    }
}

buckets_path instructs this stats_bucket aggregation that we want to calculate stats for the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "stats_monthly_sales": {
         "count": 3,
         "min": 60.0,
         "max": 550.0,
         "avg": 328.3333333333333,
         "sum": 985.0
      }
   }
}

Extended Stats Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

This aggregation provides a few more statistics (sum of squares, standard deviation, etc) compared to the stats_bucket aggregation.

Syntax

An extended_stats_bucket aggregation looks like this in isolation:

{
    "extended_stats_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 7. extended_stats_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate stats for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null
sigma | The number of standard deviations above/below the mean to display | Optional | 2


The following snippet calculates the extended stats for the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "stats_monthly_sales": {
            "extended_stats_bucket": {
                "buckets_path": "sales_per_month>sales" 
            }
        }
    }
}

buckets_path instructs this extended_stats_bucket aggregation that we want to calculate stats for the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "stats_monthly_sales": {
         "count": 3,
         "min": 60.0,
         "max": 550.0,
         "avg": 328.3333333333333,
         "sum": 985.0,
         "sum_of_squares": 446725.0,
         "variance": 41105.55555555556,
         "std_deviation": 202.74505063146563,
         "std_deviation_bounds": {
           "upper": 733.8234345962646,
           "lower": -77.15676792959795
         }
      }
   }
}

Percentiles Bucket Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A sibling pipeline aggregation which calculates percentiles across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A percentiles_bucket aggregation looks like this in isolation:

{
    "percentiles_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 8. percentiles_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate percentiles for (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null
percents | The list of percentiles to calculate | Optional | [ 1, 5, 25, 50, 75, 95, 99 ]


The following snippet calculates the percentiles for the total monthly sales buckets:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "percentiles_monthly_sales": {
            "percentiles_bucket": {
                "buckets_path": "sales_per_month>sales", 
                "percents": [ 25.0, 50.0, 75.0 ] 
            }
        }
    }
}

buckets_path instructs this percentiles_bucket aggregation that we want to calculate percentiles for the sales aggregation in the sales_per_month date histogram.

percents specifies which percentiles we wish to calculate, in this case, the 25th, 50th and 75th percentiles.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "percentiles_monthly_sales": {
        "values" : {
            "25.0": 375.0,
            "50.0": 375.0,
            "75.0": 550.0
         }
      }
   }
}

percentiles_bucket Implementation

The Percentile Bucket returns the nearest input data point that is not greater than the requested percentile; it does not interpolate between data points.

The percentiles are calculated exactly, not approximated (unlike the Percentiles metric aggregation). This means the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of data-points in a single percentiles_bucket.
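
For example, given the monthly sales from the request above, the sorted inputs are [60.0, 375.0, 550.0]; the 50th percentile is reported as the actual data point 375.0 rather than an interpolated value.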

Moving Average Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

Given an ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average value of that window. For example, given the data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], we can calculate a simple moving average with a window size of 5 as follows:

  • (1 + 2 + 3 + 4 + 5) / 5 = 3
  • (2 + 3 + 4 + 5 + 6) / 5 = 4
  • (3 + 4 + 5 + 6 + 7) / 5 = 5
  • etc

Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, such as stock prices or server metrics. The smoothing can be used to eliminate high frequency fluctuations or random noise, which allows the lower frequency trends to be more easily visualized, such as seasonality.

Syntax

A moving_avg aggregation looks like this in isolation:

{
    "moving_avg": {
        "buckets_path": "the_sum",
        "model": "holt",
        "window": 5,
        "gap_policy": "insert_zero",
        "settings": {
            "alpha": 0.8
        }
    }
}

Table 9. moving_avg Parameters

Parameter Name | Description | Required | Default Value
buckets_path | Path to the metric of interest (see the section called “buckets_path Syntax” for more details) | Required | -
model | The moving average weighting model that we wish to use | Optional | simple
gap_policy | Determines what should happen when a gap in the data is encountered | Optional | insert_zeros
window | The size of window to "slide" across the histogram | Optional | 5
minimize | If the model should be algorithmically minimized (see the section called “Minimization” for more details) | Optional | false for most models
settings | Model-specific settings; contents differ depending on the model specified | Optional | -


moving_avg aggregations must be embedded inside a histogram or date_histogram aggregation. They can be embedded like any other metric aggregation:

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{                
            "date_histogram":{
                "field":"timestamp",
                "interval":"day"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "lemmings" } 
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } 
                }
            }
        }
    }
}

A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals

A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)

Finally, we specify a moving_avg aggregation which uses "the_sum" metric as its input.

Moving averages are built by first specifying a histogram or date_histogram over a field. You can then optionally add normal metrics, such as a sum, inside of that histogram. Finally, the moving_avg is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see the section called “buckets_path Syntax” for a description of the syntax for buckets_path).

Models

The moving_avg aggregation includes five different moving average "models". The main difference is how the values in the window are weighted. As data-points become "older" in the window, they may be weighted differently. This will affect the final average for that window.

Models are specified using the model parameter. Some models may have optional configurations which are specified inside the settings parameter.

Simple

The simple model calculates the sum of all values in the window, then divides by the size of the window. It is effectively a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means the values from a simple moving average tend to "lag" behind the real data.

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "simple"
        }
    }
}

A simple model has no special settings to configure

The window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small scale fluctuations:

Figure 1. Moving average with window of size 10

images/pipeline_movavg/movavg_10window.png

In contrast, a simple moving average with larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount:

Figure 2. Moving average with window of size 100

images/pipeline_movavg/movavg_100window.png

Linear

The linear model assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce the "lag" behind the data’s mean, since older points have less influence.

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "linear"
        }
    }
}

A linear model has no special settings to configure

Like the simple model, window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small scale fluctuations:

Figure 3. Linear moving average with window of size 10

images/pipeline_movavg/linear_10window.png

In contrast, a linear moving average with larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount, although typically less than the simple model:

Figure 4. Linear moving average with window of size 100

images/pipeline_movavg/linear_100window.png

EWMA (Exponentially Weighted)

The ewma model (aka "single-exponential") is similar to the linear model, except older data-points become exponentially less important, rather than linearly less important. The speed at which the importance decays can be controlled with an alpha setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the moving average. This tends to make the moving average track the data more closely but with less smoothing.

The default value of alpha is 0.3, and the setting accepts any float from 0-1 inclusive.
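
For reference, the textbook single-exponential recurrence that this model is based on is shown below, where x_t is the raw value at time t and s_t is the smoothed value (a sketch of the underlying math; the exact implementation may differ in details such as initialization):

    s_t = \alpha \cdot x_t + (1 - \alpha) \cdot s_{t-1}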

The EWMA model can be Minimized

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "ewma",
            "settings" : {
                "alpha" : 0.5
            }
        }
    }
}

Figure 5. EWMA with window of size 10, alpha = 0.2

images/pipeline_movavg/single_0.2alpha.png

Figure 6. EWMA with window of size 10, alpha = 0.7

images/pipeline_movavg/single_0.7alpha.png

Holt-Linear

The holt model (aka "double exponential") incorporates a second exponential term which tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The double exponential model calculates two values internally: a "level" and a "trend".

The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series. The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the smoothed data). The trend value is also exponentially weighted.

Values are produced by multiplying the level and trend components.
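
For reference, the textbook double-exponential (Holt) recurrences are shown below; Elasticsearch's exact formulation, including how the level and trend components are combined into the output value, may differ from this sketch:

    l_t = \alpha \cdot x_t + (1 - \alpha)(l_{t-1} + b_{t-1})
    b_t = \beta (l_t - l_{t-1}) + (1 - \beta) b_{t-1}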

The default value of alpha is 0.3 and beta is 0.1. The settings accept any float from 0-1 inclusive.

The Holt-Linear model can be Minimized

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "holt",
            "settings" : {
                "alpha" : 0.5,
                "beta" : 0.5
            }
        }
    }
}

In practice, the alpha value behaves very similarly in holt as it does in ewma: small values produce more smoothing and more lag, while larger values produce closer tracking and less lag. The effect of beta is often difficult to see: small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger values emphasize short-term trends. This will become more apparent when you are predicting values.

Figure 7. Holt-Linear moving average with window of size 100, alpha = 0.5, beta = 0.2

images/pipeline_movavg/double_0.2beta.png

Figure 8. Holt-Linear moving average with window of size 100, alpha = 0.5, beta = 0.7

images/pipeline_movavg/double_0.7beta.png

Holt-Winters

The holt_winters model (aka "triple exponential") incorporates a third exponential term which tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend" and "seasonality".

The level and trend calculations are identical to holt. The seasonal calculation looks at the difference between the current point and the point one period earlier.

Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity" of your data: e.g. if your data has cyclic trends every 7 days, you would set period: 7. Similarly if there was a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned for future enhancements.

There are two varieties of Holt-Winters: additive and multiplicative.

"Cold Start"

Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This means that your window must always be at least twice the size of your period. An exception will be thrown if it isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm does not backcast.

Figure 9. Holt-Winters showing a "cold" start where no values are emitted

images/pipeline_movavg/triple_untruncated.png

Because the "cold start" obscures what the moving average looks like, the rest of the Holt-Winters images are truncated to not show the "cold start". Just be aware this will always be present at the beginning of your moving averages!

Additive Holt-Winters

Additive seasonality is the default; it can also be specified by setting "type": "add". This variety is preferred when the seasonal effect is additive to your data. E.g. you could simply subtract the seasonal effect to "de-seasonalize" your data into a flat trend.

The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive. The default value of period is 1.
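
For reference, the textbook additive Holt-Winters recurrences are shown below, where p is the period (a sketch of the underlying math, not necessarily Elasticsearch's exact formulation):

    l_t = \alpha (x_t - s_{t-p}) + (1 - \alpha)(l_{t-1} + b_{t-1})
    b_t = \beta (l_t - l_{t-1}) + (1 - \beta) b_{t-1}
    s_t = \gamma (x_t - l_t) + (1 - \gamma) s_{t-p}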

The additive Holt-Winters model can be Minimized

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "holt_winters",
            "settings" : {
                "type" : "add",
                "alpha" : 0.5,
                "beta" : 0.5,
                "gamma" : 0.5,
                "period" : 7
            }
        }
    }
}

Figure 10. Holt-Winters moving average with window of size 120, alpha = 0.5, beta = 0.7, gamma = 0.3, period = 30

images/pipeline_movavg/triple.png

Multiplicative Holt-Winters

Multiplicative is specified by setting "type": "mult". This variety is preferred when the seasonal effect is multiplied against your data. E.g. if the seasonal effect is 5x the data, rather than simply adding to it.

The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive. The default value of period is 1.

The multiplicative Holt-Winters model can be Minimized

Warning

Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the mult Holt-Winters pads all values by a very small amount (1e-10) so that all values are non-zero. This affects the result, but only minimally. If your data is non-zero, or you prefer to see NaN when zeros are encountered, you can disable this behavior with pad: false

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "holt_winters",
            "settings" : {
                "type" : "mult",
                "alpha" : 0.5,
                "beta" : 0.5,
                "gamma" : 0.5,
                "period" : 7,
                "pad" : true
            }
        }
    }
}

Prediction

All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the current smoothed, moving average. Depending on the model and parameters, these predictions may or may not be accurate.

Predictions are enabled by adding a predict parameter to any moving average aggregation, specifying the number of predictions you would like appended to the end of the series. These predictions will be spaced out at the same interval as your buckets:

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "window" : 30,
            "model" : "simple",
            "predict" : 10
        }
    }
}

The simple, linear and ewma models all produce "flat" predictions: they essentially converge on the mean of the last value in the series, producing a flat line:

Figure 11. Simple moving average with window of size 10, predict = 50

images/pipeline_movavg/simple_prediction.png

In contrast, the holt model can extrapolate based on local or global constant trends. If we set a high beta value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end of the series was heading in a downward direction):

Figure 12. Holt-Linear moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.8

images/pipeline_movavg/double_prediction_local.png

In contrast, if we choose a small beta, the predictions are based on the global constant trend. In this series, the global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope:

Figure 13. Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.1

images/pipeline_movavg/double_prediction_global.png

The holt_winters model has the potential to deliver the best predictions, since it also incorporates seasonal fluctuations into the model:

Figure 14. Holt-Winters moving average with window of size 120, predict = 25, alpha = 0.8, beta = 0.2, gamma = 0.7, period = 30

images/pipeline_movavg/triple_prediction.png

Minimization

Some of the models (EWMA, Holt-Linear, Holt-Winters) require one or more parameters to be configured. Parameter choice can be tricky and sometimes non-intuitive. Furthermore, small deviations in these parameters can sometimes have a drastic effect on the output moving average.

For that reason, the three "tunable" models can be algorithmically minimized. Minimization is a process where parameters are tweaked until the predictions generated by the model closely match the output data. Minimization is not foolproof and can be susceptible to overfitting, but it often gives better results than hand-tuning.

Minimization is disabled by default for ewma and holt_linear, while it is enabled by default for holt_winters. Minimization is most useful with Holt-Winters, since it helps improve the accuracy of the predictions. EWMA and Holt-Linear are not great predictors, and are mostly used for smoothing data, so minimization is less useful for those models.

Minimization is enabled/disabled via the minimize parameter:

{
    "the_movavg":{
        "moving_avg":{
            "buckets_path": "the_sum",
            "model" : "holt_winters",
            "window" : 30,
            "minimize" : true,
            "settings" : {
                "period" : 7
            }
        }
    }
}

Minimization is enabled with the minimize parameter

When enabled, minimization will find the optimal values for alpha, beta and gamma. The user should still provide appropriate values for window, period and type.

Warning

Minimization works by running a stochastic process called simulated annealing. This process will usually generate a good solution, but is not guaranteed to find the global optimum. It also requires some amount of additional computational power, since the model needs to be re-run multiple times as the values are tweaked. The run-time of minimization is linear in the size of the window being processed: excessively large windows may cause latency.

Finally, minimization fits the model to the last n values, where n = window. This generally produces better forecasts into the future, since the parameters are tuned around the end of the series. It can, however, generate poorer fitting moving averages at the beginning of the series.

Cumulative Sum Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A parent pipeline aggregation which calculates the cumulative sum of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).

Syntax

A cumulative_sum aggregation looks like this in isolation:

{
    "cumulative_sum": {
        "buckets_path": "the_sum"
    }
}

Table 10. cumulative_sum Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the cumulative sum for (see the section called “buckets_path Syntax” for more details) | Required | -
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the cumulative sum of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "cumulative_sales": {
                    "cumulative_sum": {
                        "buckets_path": "sales" 
                    }
                }
            }
        }
    }
}

buckets_path instructs this cumulative sum aggregation to use the output of the sales aggregation for the cumulative sum

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               },
               "cumulative_sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "cumulative_sales": {
                  "value": 610.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "cumulative_sales": {
                  "value": 985.0
               }
            }
         ]
      }
   }
}
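
For reference, each cumulative_sales value above is the running total of the monthly sales: 550.0, then 550.0 + 60.0 = 610.0, then 610.0 + 375.0 = 985.0.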

Bucket Script Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A parent pipeline aggregation which executes a script that can perform per-bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.

Syntax

A bucket_script aggregation looks like this in isolation:

{
    "bucket_script": {
        "buckets_path": {
            "my_var1": "the_sum", 
            "my_var2": "the_value_count"
        },
        "script": "my_var1 / my_var2"
    }
}

Here, my_var1 is the name of the variable for this buckets_path to use in the script, and the_sum is the path to the metric to use for that variable.

Table 11. bucket_script Parameters

Parameter Name | Description | Required | Default Value
script | The script to run for this aggregation. The script can be inline, file or indexed (see Scripting for more details) | Required | -
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null


The following snippet calculates the ratio percentage of t-shirt sales compared to total sales each month:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "t-shirts": {
                  "filter": {
                    "term": {
                      "type": "t-shirt"
                    }
                  },
                  "aggs": {
                    "sales": {
                      "sum": {
                        "field": "price"
                      }
                    }
                  }
                },
                "t-shirt-percentage": {
                    "bucket_script": {
                        "buckets_path": {
                          "tShirtSales": "t-shirts>sales",
                          "totalSales": "total_sales"
                        },
                        "script": "params.tShirtSales / params.totalSales * 100"
                    }
                }
            }
        }
    }
}

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 200.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 36.36363636363637
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "total_sales": {
                   "value": 60.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 10.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 16.666666666666664
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 175.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 46.666666666666664
               }
            }
         ]
      }
   }
}
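
The gap_policy parameter from Table 11 controls what the script does when a value referenced in buckets_path is missing for a bucket. As a minimal sketch (assuming the insert_zeros policy from the section called “Dealing with gaps in the data” is desired, so missing values are treated as 0 instead of the bucket being skipped), the t-shirt-percentage aggregation could be written as:

"t-shirt-percentage": {
    "bucket_script": {
        "buckets_path": {
          "tShirtSales": "t-shirts>sales",
          "totalSales": "total_sales"
        },
        "script": "params.tShirtSales / params.totalSales * 100",
        "gap_policy": "insert_zeros"
    }
}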

Bucket Selector Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

A parent pipeline aggregation which executes a script that determines whether the current bucket will be retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value. If the script language is expression then a numeric return value is permitted; in this case, 0.0 will be evaluated as false and all other values will evaluate to true.

Note: The bucket_selector aggregation, like all pipeline aggregations, executes after all other sibling aggregations. This means that using the bucket_selector aggregation to filter the buckets returned in the response does not save any execution time in running the aggregations.

Syntax

A bucket_selector aggregation looks like this in isolation:

{
    "bucket_selector": {
        "buckets_path": {
            "my_var1": "the_sum", 
            "my_var2": "the_value_count"
        },
        "script": "params.my_var1 > params.my_var2"
    }
}

Here, my_var1 is the name of the variable to use in the script, and the_sum is the path to the metric to use for that variable.

Table 12. bucket_selector Parameters

Parameter Name | Description | Required | Default Value
script | The script to run for this aggregation. The script can be inline, file or indexed (see Scripting for more details) | Required | -
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see the section called “buckets_path Syntax” for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see the section called “Dealing with gaps in the data” for more details) | Optional | skip


The following snippet only retains buckets where the total sales for the month are more than 200:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_bucket_filter": {
                    "bucket_selector": {
                        "buckets_path": {
                          "totalSales": "total_sales"
                        },
                        "script": "params.totalSales > 200"
                    }
                }
            }
        }
    }
}

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               }
            }
         ]
      }
   }
}

The bucket for 2015/02/01 00:00:00 has been removed as its total sales were less than 200
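
The script is not limited to metric values. Assuming the special _count path (see the section called “buckets_path Syntax”), which refers to a bucket’s doc_count, a bucket_selector can also filter buckets by how many documents they contain. A minimal sketch, with illustrative aggregation and variable names:

"bucket_count_filter": {
    "bucket_selector": {
        "buckets_path": {
          "count": "_count"
        },
        "script": "params.count > 2"
    }
}

This would retain only buckets containing more than two documents.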

Serial Differencing Aggregation

Warning

This functionality is experimental and may be changed or removed completely in a future release.

Serial differencing is a technique where values in a time series are subtracted from themselves at different time lags or periods. For example, each datapoint f(x_t) is replaced by f(x_t) - f(x_{t-n}), where n is the period being used.

A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
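
To make this concrete: applying a period (lag) of 1 to the series 10, 12, 15, 14 produces the differences 2, 3, -1; the first bucket has no earlier value to subtract, so no difference is emitted for it.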

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear trend). We can see that the data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn’t seem to exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the previous value +/- a random amount. This insight allows selection of further tools for analysis.

Figure 15. Dow Jones plotted and made stationary with first-differencing

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

Figure 16. Lemmings data plotted and made stationary with 1st and 30th differences

Syntax

A serial_diff aggregation looks like this in isolation:

{
    "serial_diff": {
        "buckets_path": "the_sum",
        "lag": "7"
    }
}

Table 13. serial_diff Parameters

Parameter Name | Description | Required | Default Value
buckets_path | Path to the metric of interest (see the section called “buckets_path Syntax” for more details) | Required | -
lag | The historical bucket to subtract from the current value. E.g. a lag of 7 will subtract the value from 7 buckets earlier from the current value. Must be a positive, non-zero integer | Optional | 1
gap_policy | Determines what should happen when a gap in the data is encountered | Optional | insert_zeros
format | Format to apply to the output value of this aggregation | Optional | null


serial_diff aggregations must be embedded inside of a histogram or date_histogram aggregation:

POST /_search
{
   "size": 0,
   "aggs": {
      "my_date_histo": {                  
         "date_histogram": {
            "field": "timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "lemmings"     
               }
            },
            "thirtieth_difference": {
               "serial_diff": {                
                  "buckets_path": "the_sum",
                  "lag" : 30
               }
            }
         }
      }
   }
}

A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals

A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)

Finally, we specify a serial_diff aggregation which uses "the_sum" metric as its input.

Serial differences are built by first specifying a histogram or date_histogram over a field. You can then optionally add normal metrics, such as a sum, inside of that histogram. Finally, the serial_diff is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see the section called “buckets_path Syntax” for a description of the syntax for buckets_path).
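
The 1st-then-30th differencing described for the lemmings data above can be expressed by chaining two serial_diff aggregations, since a pipeline aggregation may reference another pipeline in its buckets_path. A minimal sketch (the aggregation names are illustrative):

POST /_search
{
   "size": 0,
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": { "field": "lemmings" }
            },
            "first_difference": {
               "serial_diff": {
                  "buckets_path": "the_sum",
                  "lag": 1
               }
            },
            "thirtieth_difference": {
               "serial_diff": {
                  "buckets_path": "first_difference",
                  "lag": 30
               }
            }
         }
      }
   }
}

Here thirtieth_difference is computed from the output of first_difference rather than from the raw sum, removing first the constant linear trend and then the 30-day cycle.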