GCP – Controlling metric ingestion with Google Cloud Managed Service for Prometheus
By default, Google Cloud Monitoring accepts and processes all well-formed metrics sent to a metric ingestion endpoint. Under certain circumstances, however, metric generation can be prolific, leading to unnecessary expense; this is especially true for verbose metrics of no particular utility. To control costs, platform users need a way to manage the flow of metrics prior to ingestion so that only relevant and useful metrics are processed and billed for.
Managed Service for Prometheus, which uses Cloud Monitoring under the hood, charges on a per-sample basis. Therefore, controlling the number of samples ingested is crucial to managing costs. There are two main ways to do this: filtering input or adjusting the length of the sampling period.
As a simple example, extending the sampling interval can dramatically reduce the number of samples ingested and thus the cost.
Changing a 10-second sampling period to a 30-second sampling period can reduce your sample volume by 66%, without any significant loss of information.
Changing a 10-second sampling period to a 60-second sampling period can reduce your sample volume by 83%.
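To see where these percentages come from, here is a quick back-of-the-envelope calculation of samples ingested per target per day at each interval:

86,400 seconds/day ÷ 10 s = 8,640 samples/day
86,400 seconds/day ÷ 30 s = 2,880 samples/day  (1 − 2,880/8,640 ≈ 66% fewer samples)
86,400 seconds/day ÷ 60 s = 1,440 samples/day  (1 − 1,440/8,640 ≈ 83% fewer samples)

In general, moving from a shorter interval to a longer one reduces sample volume by roughly 1 − (old interval / new interval).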
For additional information about how samples are counted and how the sampling period affects the number of samples, see Pricing examples based on samples ingested.
In a previous blog post, we discussed metrics management, how to identify high-cost metrics, and how to reduce cardinality. In this blog post, we look at some of the options for managing metrics ingestion and go over a few practical examples of using Prometheus to save on costs.
Using Kubernetes Custom Resources to control ingestion
The process of configuring Managed Service for Prometheus ingestion is very similar for rate reduction and time-series filtering, as both are implemented via configurations applied to either the PodMonitoring or ClusterPodMonitoring custom resources. The ClusterPodMonitoring resource provides the same interface as the PodMonitoring resource but does not limit discovered Pods to a given namespace. In most scenarios, it’s better to use the PodMonitoring resource; ClusterPodMonitoring is normally reserved for targets that are not naturally namespace-scoped, such as kube-state-metrics, which reports information about the cluster from the Kubernetes API.
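For reference, a ClusterPodMonitoring manifest is almost identical to a PodMonitoring manifest; the differences are the kind and the absence of a namespace scope. The following is a minimal sketch, and the example-exporter label is an illustrative assumption rather than a value from the official examples:

apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
metadata:
  name: example-cluster-scrape  # cluster-scoped, so no namespace in metadata
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-exporter  # matches Pods in any namespace
  endpoints:
  - port: metrics
    interval: 60s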
Extending the scrape interval
The examples that follow modify the scrape interval of a PodMonitoring resource, thus reducing the frequency of sample collection. You can use the same basic techniques to modify ClusterPodMonitoring resources and ingestion filters. Examples of ingestion filter modification will be shown later in this article. Since increasing the scrape interval is the simplest way to throttle metric collection, we demonstrate that first.
The following manifest defines a PodMonitoring resource, prom-example, in the NAMESPACE_NAME namespace. The resource uses a Kubernetes label selector to find all Pods in that namespace that have the label app.kubernetes.io/name with the value prom-example. The matching Pods are scraped on a port named metrics, on the /metrics HTTP path, every 30 seconds. To extend the scrape interval from 30 seconds to 60 seconds, change the interval value from 30s to 60s, as shown in the manifest below.
kubectl apply the following:
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: prom-example
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: prom-example
  endpoints:
  - port: metrics
    interval: 60s # formerly 30s
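If you save the manifest to a file (the file name here is just a placeholder), you can apply it to the namespace that contains your Pods:

$ kubectl -n NAMESPACE_NAME apply -f pod-monitoring.yaml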
After making changes of this sort, it can be helpful to review the status of your modifications by enabling the target status feature and then checking the status of the scrape targets in the PodMonitoring or ClusterPodMonitoring resources. To enable the feature, set the features.targetStatus.enabled value within the OperatorConfig resource to true. Note: turn off target status when you are done with it, as it is quite noisy and therefore expensive to operate continuously.
To enable target status, kubectl apply the following:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  targetStatus:
    enabled: true
Once the feature is enabled, the Status.Endpoint Statuses field appears on every valid PodMonitoring or ClusterPodMonitoring resource after a few seconds.
If there is a PodMonitoring resource with the name prom-example in the NAMESPACE_NAME namespace, then status can be verified by running the following command:
$ kubectl -n NAMESPACE_NAME describe podmonitorings/prom-example
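Once you have finished reviewing scrape-target status, you can turn the feature back off by reapplying the same OperatorConfig with the flag set to false:

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  targetStatus:
    enabled: false  # disable target status once you are done troubleshooting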
Filtering input
An alternative to reducing the sampling rate is to filter inbound metrics before they are ingested. The standard way to prevent time series from being processed by the collector is to use Prometheus relabeling rules with a keep action for an allowlist or a drop action for a denylist. These rules go in the metricRelabeling section of an endpoint in your PodMonitoring or ClusterPodMonitoring resource; a complete example follows the fragments below.
The following metric relabeling rule drops any time series whose metric name begins with foo_bar_, foo_baz_, or foo_qux_:
metricRelabeling:
- action: drop
  regex: foo_(bar|baz|qux)_.+
  sourceLabels: [__name__]
The following metric relabeling rule uses a regular expression to specify which metrics to keep based on the name of the metric. For example, metrics whose name begins with kube_daemonset_ are kept.
metricRelabeling:
- action: keep
  regex: kube_(daemonset|deployment|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler)_.+
  sourceLabels: [__name__]
You can also set a rule to manage specific time series based on a label value. For example, the following metric relabeling rule uses a regular expression to filter out all time series where the value for the label “direction” starts with “destination”:
metricRelabeling:
- action: drop
  regex: destination.*
  sourceLabels: [direction]
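To show how these fragments fit into a complete resource, here is the prom-example PodMonitoring from earlier with the first drop rule added; relabeling rules sit under the endpoint they apply to:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: prom-example
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: prom-example
  endpoints:
  - port: metrics
    interval: 60s
    metricRelabeling:
    - action: drop
      regex: foo_(bar|baz|qux)_.+
      sourceLabels: [__name__]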
As the examples above demonstrate, it’s simple to create allow/deny lists of metrics to control ingestion selection.
Deciding whether to reduce metrics ingestion by content filtering or by extending the scrape interval can be complicated. Extending the interval affects every metric an endpoint exposes, while filtering is more selective and offers finer control; filtering also requires more thought and planning. A rule of thumb is to set scrape intervals to 30 seconds before applying content filtering, but to analyze the cost/benefit before moving a scrape interval to 60 seconds. Having a clear use case for either 30-second or 60-second data collection is very helpful in the decision-making process.
Links in this article:
- Blog Post: Google Cloud Blog
- Resource documentation: GitHub
- Resource documentation: GitHub
- Pricing examples: Google Cloud
- Label documentation: Kubernetes.io
- How to enable/disable: Google Cloud
- Relabeling documentation: Prometheus.io
- Interface documentation: Google Cloud