Getting started with Prometheus Federation in Docker

Carlos Alcaide
6 min read · Apr 3, 2022

Introduction

If you have a small project with a few containers deployed on a cluster, it’s quite easy to set up a Prometheus server to scrape all the metrics from the cluster nodes and containers, and then configure Grafana to show all the important data in a friendlier way.

But what if you have several projects to monitor, or several environments? You could have your production environment in one cluster and your development environment in a completely different cluster, but want to monitor both of them.

You could also have all of the above and more, like databases in virtual machines, services deployed anywhere… With a Prometheus server deployed alongside every single “space” of services, imagine building a dashboard that has to pull from all those Prometheus servers; administering it could drive you crazy. In this case, Prometheus Federation can make your life easier.

Prometheus Federation allows you to aggregate metrics coming from different Prometheus servers into one central server, which can then serve them to the rest of your services, Grafana for example. Federation does not scrape every metric from every server in the federated network; the metrics to be scraped can be filtered, so the central server only asks for the interesting metrics defined in its configuration.
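Under the hood, federation is just an HTTP endpoint exposed by every Prometheus server: the central server periodically scrapes /federate on each secondary with one or more selectors. You can try it by hand with curl (a sketch, using a hypothetical secondary server on localhost:9090 and a hypothetical job name):

curl -G 'http://localhost:9090/federate' \
  --data-urlencode 'match[]={job="some-job"}'

The response is the current value of every time series matching the selector, in the usual Prometheus text exposition format.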

Use case example diagram.

For example, a Prometheus Federation setup could be the one in the image. There we have three different environments: one with two services, A.1 and A.2, and its own Prometheus server scraping their metrics; another with a database hosted on a virtual machine, both of them offering metrics to a second Prometheus server; and a third with a service B and an API gateway serving metrics to a third Prometheus server. These Prometheus servers are configured as secondary servers in a federated network, in which a central/primary Prometheus server continuously scrapes the selected metrics from them. This way, if the administrator of these environments wants to access a certain metric, he or she just needs to go to the central Prometheus server and ask for it.

Hands-on scenario

Let’s put this into practice with a very simple scenario where we have three different environments, each with one Prometheus server. Then we have a central Prometheus server that scrapes the metrics collected by the other servers, thus setting up a federated network of Prometheus servers.

For the sake of simplicity, in this scenario we will use the public Prometheus demo endpoints as targets. These endpoints can be consulted at https://demo.do.prometheus.io/.

The scenario described above should look like this:

Hands-on scenario diagram.

Hands-on

First we will create a docker-compose file that brings up four Prometheus containers. As we are setting all of this up on the same machine, we need to publish a different port for each server; in my case I use ports 9090 to 9093. Another thing to take into account is that each server needs its own configuration file, so I created the following directory structure and mounted each configuration file into its corresponding container:

Prometheus configuration files structure.
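Since the image only shows it, here is the same structure reconstructed from the volume mounts of the compose file below (I assume the compose file is named docker-compose.yaml and sits at the root):

.
├── docker-compose.yaml
└── config/
    ├── central/
    │   └── prometheus.yaml
    ├── development/
    │   └── prometheus.yaml
    ├── production/
    │   └── prometheus.yaml
    └── staging/
        └── prometheus.yaml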

Taking all this into consideration, the docker-compose file should look like the following:

version: '3.2'

networks:
  monitoring:
    driver: bridge

services:
  prometheus.production:
    image: prom/prometheus:latest
    container_name: prometheus_production
    ports:
      - "9091:9090"
    command:
      - --config.file=/etc/prometheus/prometheus.yaml
    volumes:
      - ./config/production/prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
    networks:
      - monitoring

  prometheus.development:
    image: prom/prometheus:latest
    container_name: prometheus_development
    ports:
      - "9092:9090"
    command:
      - --config.file=/etc/prometheus/prometheus.yaml
    volumes:
      - ./config/development/prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
    networks:
      - monitoring

  prometheus.staging:
    image: prom/prometheus:latest
    container_name: prometheus_staging
    ports:
      - "9093:9090"
    command:
      - --config.file=/etc/prometheus/prometheus.yaml
    volumes:
      - ./config/staging/prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
    networks:
      - monitoring

  prometheus.central:
    image: prom/prometheus:latest
    container_name: prometheus_central
    ports:
      - "9090:9090"
    command:
      - --config.file=/etc/prometheus/prometheus.yaml
    volumes:
      - ./config/central/prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
    networks:
      - monitoring
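With the compose file and all four configuration files in place, the whole network comes up with a single command, run from the directory that contains the compose file:

docker compose up -d

Running docker compose ps afterwards should show the four containers up, with the central server published on port 9090 and the secondaries on 9091 to 9093.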

Once we have the containers defined, we need to write the configuration files for the secondary servers. All of them are similar to each other; the only differences are a label to identify each one externally and the target it scrapes metrics from. An example of these configuration files looks as follows:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    secondary: 'development-prometheus'

scrape_configs:
  - job_name: 'alertmanager-development'
    scrape_interval: 5s
    static_configs:
      - targets:
          - "demo.do.prometheus.io:9093"

We have added the external label to better identify which server a metric comes from; in this case we could say the metric comes from the secondary server where the development environment is hosted.

The second important parameter is the target configured in the job. I have configured each server to scrape different metrics from different endpoints, so we can later better identify whether the federation works properly or not. Here is a list of the different jobs and endpoints used in this example:

  • Development: job 'alertmanager-development', scraping the demo Alertmanager at demo.do.prometheus.io:9093 (the configuration shown above).
  • Production: job 'production-node-exporter', scraping the demo Node Exporter.
  • Staging: job 'staging-random', scraping one of the other demo services.
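As a reference, here is a sketch of what the production server’s configuration file could look like under the same structure; note that the Node Exporter port (9100) is its usual default and an assumption on my part, not something fixed by this setup:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    secondary: 'production-prometheus'

scrape_configs:
  - job_name: 'production-node-exporter'
    scrape_interval: 5s
    static_configs:
      - targets:
          # 9100 is the node exporter's default port (assumed here)
          - "demo.do.prometheus.io:9100"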

The last step is to configure the federation itself. This is done in the central server’s configuration file, where it is added as just another job to scrape metrics from:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    primary: 'central-prometheus'

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['demo.do.prometheus.io:3000']

  - job_name: 'federation'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="alertmanager-development"}'
        - '{job="production-node-exporter"}'
        - '{job="staging-random"}'
    static_configs:
      - targets:
          - "prometheus.production:9090"
          - "prometheus.development:9090"
          - "prometheus.staging:9090"

As can be seen, the job name is “federation”, but this is not mandatory. Another setting to take into account is the “honor_labels” field: set to true, it resolves possible conflicts when scraping external metrics by keeping the original labels; set to false, conflicts are resolved by renaming the conflicting labels.
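To make the difference concrete, here is what happens (with illustrative label values) to a federated sample whose job label collides with the job label the central server attaches to the federation scrape:

# Exposed by the secondary server's /federate endpoint:
alertmanager_http_requests_in_flight{job="alertmanager-development"}

# Stored by the central server with honor_labels: false;
# the conflicting label is renamed:
alertmanager_http_requests_in_flight{job="federation", exported_job="alertmanager-development"}

# Stored with honor_labels: true; the original label is kept:
alertmanager_http_requests_in_flight{job="alertmanager-development"}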

Scraping every possible metric from all the secondary servers in a federated network can mean a huge quantity of data, and it is very likely that most of it will never be used in the central server. For that we have the “params” field, where we can write expressions to filter the data to be scraped from the targeted servers. In this case the expressions correspond to the job defined on each server.
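The 'match[]' parameter accepts any instant vector selector, not just exact job matches, so hypothetical variations of the filter could look like this:

params:
  'match[]':
    # every job whose name ends in -development
    - '{job=~".*-development"}'
    # every metric whose name starts with node_
    - '{__name__=~"node_.*"}'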

Lastly, we need to configure the servers that are part of the federated network as the job’s targets (remember to include the port, and not to include the protocol, http or https).

And there it is! We have now deployed a federated network of Prometheus servers, each retrieving metrics from a different endpoint, and all the metrics gathered together in the central server.

Federated network targets.

An example of the federation working properly is retrieving a metric from one of the secondary servers directly from the central server. In this case we can look for a metric from the development environment, which is scraping an Alertmanager:

alertmanager_http_requests_in_flight

If we look for this metric in the central server, we can see that it has been correctly scraped from the development environment’s Prometheus:

Metric retrieved from one of the secondary servers, viewed on the central server.
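The same check can also be done from the command line through the central server’s HTTP API (assuming the port mapping from the compose file above, with the central server on localhost:9090):

curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=alertmanager_http_requests_in_flight'

The JSON response should carry the external label secondary="development-prometheus", confirming that the sample really travelled through the development server.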

Conclusions

  • This feature of Prometheus can be very useful when we have to manage a set of services with many different Prometheus servers; basically, it makes life easier.
  • Filtering is a powerful help when creating the federated network, but it has to be done very carefully, as some important metric may be forgotten or not scraped correctly because of the filter.
  • It may be useful to gather all the important data into one server with very well prepared persistence (periodic backups, high availability, etc.) and leave the secondary servers with less preparation.
