Thanos: Aggregating Multiple Prometheus Instances

Oct 6, 2023 • 5 min

This is the concluding article in our three-part series on Thanos monitoring.


In the previous article, we finished by configuring Thanos Query as a Grafana datasource, allowing us to query multiple Prometheus instances from a single place.

Now, let’s explore how Thanos aggregates multiple Prometheus instances. We will discuss two relevant cases: within a single Kubernetes cluster and across multiple Kubernetes clusters.

Thanos and Multiple Prometheus Instances in a Single K8s Cluster

Case Introduction

Let’s consider a Kubernetes cluster with several namespaces used by different teams. Each team has its own Prometheus collecting its metrics, and a system Prometheus collects the cluster-level metrics.

The goal is to have all our metrics accessible in Grafana, regardless of which Prometheus holds them. Let’s see how to configure our Prometheis (did you know the plural of Prometheus is Prometheis?) and Thanos to achieve this. To keep this article simple, we will not set up long-term metric storage with Thanos, only the global view.

Deploying the Prometheis with the Thanos sidecar

For the system Prometheus, we’ll simply use the kube-prometheus-stack. We configure it to target only the PodMonitors and ServiceMonitors that do not have a prometheus label. This way, we distinguish what should be monitored by the system Prometheus from what should be monitored by the team instances.

prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchExpressions:
        - key: prometheus
          operator: DoesNotExist
    podMonitorSelector:
      matchExpressions:
        - key: prometheus
          operator: DoesNotExist
    thanos:
      version: 0.31.0

For each team, we’ll set up a dedicated Prometheus instance that targets only the PodMonitors and ServiceMonitors whose prometheus label matches the team’s name:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: team1
  namespace: monitoring
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: team1
  podMonitorSelector:
    matchLabels:
      prometheus: team1
  thanos:
    version: 0.31.0
  serviceAccountName: kube-prometheus-stack-prometheus

As a test, let’s add a PodMonitor that scrapes the team1 Prometheus itself:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    prometheus: team1
  name: prom-team1
  namespace: monitoring
spec:
  jobLabel: app.kubernetes.io/name
  podMetricsEndpoints:
    - path: /metrics
      port: web
  selector:
    matchLabels:
      operator.prometheus.io/name: team1

Configuring Thanos Query to Target the Sidecars

We’ll deploy Thanos using the bitnami/thanos chart, as discussed in the previous post. Our Prometheus instances are deployed by the operator, so they are reachable through the prometheus-operated service it creates automatically. We can use Thanos’ DNS Service Discovery to target all our Prometheis at once with the following values:

query:
  dnsDiscovery:
    sidecarsService: "prometheus-operated"
    sidecarsNamespace: "monitoring"
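Under the hood, this dnsDiscovery option simply points Thanos Query at the gRPC port of the prometheus-operated headless service through DNS SRV records. As a rough sketch (the exact flag rendered by the chart may differ), it amounts to declaring the store explicitly:

query:
  stores:
    # DNS SRV discovery of every Thanos sidecar behind the prometheus-operated service
    - 'dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local'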

Checking the Results on the WebUI

When accessing the Thanos Query interface, under the “Stores” tab, we can see our different Thanos sidecars:

Sidecars visible in the Thanos Query Stores

And we can test a query to ensure all Prometheis are queried:

PromQL query across various Prometheis

All that’s left is to add the Thanos Query as a Prometheus-type datasource in Grafana.
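This can also be done declaratively through the kube-prometheus-stack values rather than in the Grafana UI. Here is a minimal sketch, assuming a bitnami/thanos release named thanos in the monitoring namespace exposing its HTTP port on 9090; adjust the URL to your actual service name and port:

grafana:
  additionalDataSources:
    - name: Thanos
      type: prometheus
      access: proxy
      # Hypothetical URL: HTTP service of the Thanos Query deployed by bitnami/thanos
      url: http://thanos-query.monitoring.svc.cluster.local:9090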

Thanos and Multiple K8s Clusters Monitored by Prometheus

Case Introduction

Now, instead of a global view over the Prometheis within a single cluster, we want to target multiple Kubernetes clusters. Our Thanos Query must be able to reach the Thanos StoreAPI of the other clusters. The StoreAPI is exposed over gRPC, which runs on HTTP/2, so for simplicity we can expose it through an Ingress. In our example, we’ll use ingress-nginx as the ingress controller.

Deploying Prometheus and the Thanos Sidecar Exposed via an Ingress

We install Prometheus with the kube-prometheus-stack chart, using the following values (shown here for cluster-b):

prometheus:
  prometheusSpec:
    externalLabels:
      cluster: cluster-b
    thanos:
      version: 0.31.0
  thanosIngress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      cert-manager.io/cluster-issuer: letsencrypt
    hosts:
      - thanos-grpc.cluster-b.example.com
    pathType: Prefix
    tls:
      - secretName: thanos-sidecar-tls
        hosts:
          - thanos-grpc.cluster-b.example.com

Here are two key takeaways:

  • We set an external label with the cluster name; this allows us to use the Grafana Kubernetes dashboards in multi-cluster mode
  • We set the nginx.ingress.kubernetes.io/backend-protocol annotation to GRPC so that ingress-nginx speaks gRPC to the backend service

Configure Thanos Query to retrieve metrics from all clusters

We deploy Thanos with the following values, pointing it at the two sidecars exposed via an Ingress over HTTPS:

query:
  stores:
    - 'thanos-grpc.cluster-b.example.com:443'
    - 'thanos-grpc.cluster-c.example.com:443'
  extraFlags:
    - --grpc-client-tls-secure
queryFrontend:
  enabled: false

When we look at the Thanos Query web interface, we clearly see our two sidecars:

Clusters' sidecars visible in the Thanos Query Stores

We would also like to integrate the sidecars from our initial cluster. The problem is that the --grpc-client-tls-secure option applies to all stores, while our first sidecars do not use TLS, so it won’t work. There is an open issue about supporting per-store TLS. Until it is implemented, we can use the Thanos Query we just created as an intermediary and add it to the configuration of the Thanos Query we set up previously:

query:
  dnsDiscovery:
    sidecarsService: "prometheus-operated"
    sidecarsNamespace: "monitoring"
  stores:
    - 'dnssrv+_grpc._tcp.thanos-external-query-grpc.monitoring.svc.cluster.local'

The Thanos Query also exposes the StoreAPI, allowing us to target it the same way we target a sidecar or a store gateway. We use DNS Service Discovery again to point to the gRPC service of our other Thanos Query instance.

We also add a cluster external label to the Prometheis deployed in the first part, just as we did for the new clusters (see the sketch below).
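As a minimal sketch, assuming we name the first cluster cluster-a, this means setting externalLabels in the kube-prometheus-stack values:

prometheus:
  prometheusSpec:
    externalLabels:
      cluster: cluster-a   # hypothetical name for the first cluster

and in each team’s Prometheus resource:

spec:
  externalLabels:
    cluster: cluster-a

Once everything is redeployed, the Thanos Query interface clearly shows all our instances: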

All instances in the Thanos Query Stores

Configure multi-cluster Kubernetes dashboards and check the result on Grafana

The final step involves configuring the Kubernetes dashboards deployed by the kube-prometheus-stack chart to be multi-cluster with the following values:

grafana:
  enabled: true
  sidecar:
    dashboards:
      multicluster:
        global:
          enabled: true
        etcd:
          enabled: true

We redeploy the chart and then add Thanos as a Grafana datasource:

Adding Thanos as a Grafana datasource

Now we can explore the Kubernetes dashboards that have a cluster filter, allowing us to visualize the metrics of all our instances from the same datasource.

Kubernetes Dashboard with a cluster variable

With that, we wrap up our in-depth exploration of Thanos in this trilogy. See you soon for more monitoring topics!

