Prometheus

COMPONENTS

The Prometheus ecosystem consists of multiple components, many of which are optional:

  • the main Prometheus server which scrapes and stores time series data
  • client libraries for instrumenting application code
  • a push gateway for supporting short-lived jobs
  • special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
  • an alertmanager to handle alerts
  • various support tools (Grafana, API clients, the Prometheus web UI)

INSTALL

sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
cd /tmp/
wget https://github.com/prometheus/prometheus/releases/download/v2.7.1/prometheus-2.7.1.linux-amd64.tar.gz
tar -xvf prometheus-2.7.1.linux-amd64.tar.gz
cd prometheus-2.7.1.linux-amd64/
sudo mv {console*,prometheus.yml} /etc/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus
sudo mv {prometheus,promtool} /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
cat <<'EOF' | sudo tee /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus
curl localhost:9090
curl localhost:9090/metrics

GENERAL

listen: Prometheus listens on port 9090
alertmanager: listens on port 9093
pull monitoring:

  • the monitoring solution pulls data from its targets
  • scrapes endpoints every N seconds to collect data
  • less "live" than push-based monitoring
  • can run from anywhere
  • accesses metrics from a web directory on each endpoint -> ${target}/metrics
  • limited if the environment uses complex networking or extensive firewalls
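
The pull model is driven by the scrape_configs section of prometheus.yml; each job lists the endpoints Prometheus should pull from. A minimal sketch (the job name and target address are placeholders):

```yaml
scrape_configs:
  - job_name: 'node'            # attached to scraped series as job="node"
    scrape_interval: 15s        # how often to pull from these targets
    static_configs:
      - targets: ['192.168.1.10:9100']   # endpoint serving /metrics
```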

Prometheus data

Time series data

Time series data consists of a series of values associated with different points in time.

Single data point vs time series
You could track a single data point, such as the current outdoor temperature:

  Outdoor temp: 0C/32F

However, if you write down the temperature once every hour, that's a time series:

  8AM  -6C/21F
  9AM  -3C/26F
  10AM -2C/28F
  11AM  0C/32F

This means Prometheus not only tracks the current value of each metric but also changes to each metric over time.

metric name - every metric in Prometheus has a metric name. The metric name refers to the general feature of a system or application that is being measured. Note that the metric name merely refers to the feature being measured. Metric names do not point to a specific data value but potentially a collection of many values.
metric labels - Prometheus uses labels to provide a dimensional data model. This means we can use labels to specify additional things, such as which node's CPU usage is being represented. A unique combination of a metric name and a set of labels identifies a particular set of time-series data. This example uses a label called cpu to refer to usage of a specific CPU: node_cpu_seconds_total{cpu="0"}
metric types - refer to different ways in which exporters represent the metric data they provide. Metric types are not represented in any special way in a Prometheus server, but it is important to understand them in order to properly interpret your metrics.

Metric types:

COUNTER
A counter is a single number that can only increase or be reset to zero. Counters represent cumulative values. Total HTTP requests served:
+- 0
+- 12
+- 86
+- 2121

node_cpu_seconds_total[1m] - the series of values of this metric over the last 1m (a range vector).

GAUGE
A gauge is a single number that can increase and decrease over time. Current active HTTP requests:
+-45
+-12
+-10
+-56

HISTOGRAM
A histogram counts the number of observations/events that fall into a set of configurable buckets, each with its own separate time series. A histogram uses the le label ("less than or equal", the bucket's upper bound) to differentiate between buckets. The example below provides the number of HTTP requests whose duration falls into each bucket.

http_request_duration_seconds_bucket{le="0.3"}
http_request_duration_seconds_bucket{le="0.6"}
http_request_duration_seconds_bucket{le="1.0"}

Histograms also include separate metric names exposing the _sum of all observed values and the total _count of events.
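
Scraped together, a histogram's series look roughly like this (hypothetical values). Note that buckets are cumulative: each le bucket counts all observations at or below its bound, and the +Inf bucket always equals _count:

```
http_request_duration_seconds_bucket{le="0.3"} 24054
http_request_duration_seconds_bucket{le="1.0"} 33444
http_request_duration_seconds_bucket{le="+Inf"} 34512
http_request_duration_seconds_sum 8953.332
http_request_duration_seconds_count 34512
```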

SUMMARY
A summary is similar to a histogram, but it exposes metrics in the form of quantiles instead of buckets. While buckets divide values based on specific boundaries, quantiles divide values based on the percentiles into which they fall.

This value represents the request duration below which 95% of all requests fall, i.e. the boundary of the top 5% longest requests.

http_request_duration_seconds{quantile="0.95"}

Like histograms, summaries also expose the _sum and _count metrics.

Querying

time-series selector = metric_name + label -> node_cpu_seconds_total{cpu="0"}

label matching
= equals -> node_cpu_seconds_total{cpu="0"}
!= not equal -> node_cpu_seconds_total{cpu!="0"}
=~ regex -> node_cpu_seconds_total{cpu=~"1*"}
!~ doesn’t match regex -> node_cpu_seconds_total{cpu!~"1*"}

node_cpu_seconds_total{mode=~"user|system"} -> only user and system

expressions types:
instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
range vector - a set of time series containing a range of data points over time for each time series (e.g. metric_name + labels + [Ntime])

You can use the offset modifier to provide a time offset and select data from the past, with or without a range selector.
node_cpu_seconds_total offset 1h - select the CPU usage from one hour ago
node_cpu_seconds_total[5m] offset 1h - select CPU usage values over a five-minute period one hour ago

Operators allow you to perform calculations based upon your metric data.

arithmetic binary operators:

+ addition  
- subtraction  
* multiplication  
/ division  
% modulo  
^ exponentiation  

Matching rules
When using operators, Prometheus uses matching rules to determine how to combine or compare records from two sets of data. By default, records only match if all of their labels match.

Use the following keywords to control matching behaviour:

ignoring(<label list>) - ignore the specified labels when matching
on(<label list>) - use only the specified labels when matching

node_cpu_seconds_total{mode="system"} + ignoring(mode) node_cpu_seconds_total{mode="user"}
node_cpu_seconds_total{mode="system"} + on(cpu) node_cpu_seconds_total{mode="user"}

Comparison Binary operators
By default, comparison operators filter results to only those where the comparison evaluates as true.

== equal
!= not equal
> greater than
< less than
>= greater than or equal
<= less than or equal

node_cpu_seconds_total == 0        # return only records with a value of 0
node_cpu_seconds_total == bool 0   # return all records, with the value set to 1 where the comparison is true and 0 where it is false

Logical/Set binary operators

and - intersection (prints the records of vector1 that match records in vector2, for the query vector1 and vector2)
or - union (prints all records, to be precise vector1 + vector2, for the query vector1 or vector2)
unless - complement (prints the records of vector1 that have no match in vector2, for the query vector1 unless vector2)

These operators use labels to compare records. For example, and returns only records where the set of labels in one set of results is matched by labels in the other set. or is used to print the union of both vectors (similar to OR logic in ELK). unless is used to print only records which are not present in the other vector.

Aggregation Operators
They combine multiple values into a single value.

sum - add all values together
min - select the smallest value
max - select the largest value
avg - calculate the average of all values
stddev - calculate the population standard deviation over all values
stdvar - calculate the population standard variance over all values
count - count the number of values
count_values - count the number of values with the same value
bottomk - the smallest k elements
topk - the largest k elements
quantile - calculate the quantile over the values

avg(node_cpu_seconds_total{mode="idle"}) - the average idle time across all CPUs

<aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)
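
The by/without clauses control which labels survive aggregation; a few sketches against the node exporter's CPU metric:

```
sum without(cpu) (node_cpu_seconds_total)       # collapse the cpu label, keep the rest
avg by (mode) (node_cpu_seconds_total)          # one result per mode, drop other labels
topk(3, node_cpu_seconds_total{mode="user"})    # the three largest values
```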

Functions
Functions provide a wide array of built-in functionality to aid in the process of writing queries.
abs() - calculate the absolute value
clamp_max() - returns values, but replaces them with a maximum value if they exceed that value
clamp_min() - returns values, but replaces them with a minimum value if they are less than that value

The rate function
rate() is a particularly useful function for tracking the average per-second rate of increase of a time-series value. For example, this function is useful for alerting when a particular metric "spikes" or increases abnormally quickly. It should only be used with counters. It is best suited for alerting and for graphing slow-moving counters. For fast-moving counters, use irate().
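
A couple of sketches of rate() in practice:

```
rate(node_cpu_seconds_total{mode="user"}[5m])    # average per-second increase over 5m
irate(node_cpu_seconds_total{mode="user"}[5m])   # based on the last two samples only
```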

HTTP_API

Prometheus provides an HTTP API you can use to execute queries and obtain results using HTTP requests. This API is a useful way to interact with Prometheus, especially if you are building your own custom tools that require access to Prometheus data.

Querying via the HTTP API
You can run queries via the API at /api/v1/query on the Prometheus server.

curl PROMETHEUS_URL/api/v1/query?query=node_cpu_seconds_total

/api/v1/query
/api/v1/labels
/api/v1/rules
/api/v1/alerts
/api/v1/targets
/api/v1/status/config
/api/v1/status/flags
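
Queries with label matchers contain characters that need URL encoding; a command sketch using curl against an assumed local server (-G sends the encoded data as GET query parameters):

```shell
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_cpu_seconds_total{mode="user"}[5m])'
```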

Visualization

In Prometheus, visualization refers to the creation of visual representations of your metric data, such as charts, graphs, dashboards, etc. There are multiple tools that can help you visualize your metric data. Some tools are built into Prometheus - the expression browser and console templates. Other external tools, such as Grafana, can be integrated with Prometheus to visualize Prometheus data.

Console templates allow you to create visualization consoles using the Go templating language. The Prometheus server serves consoles based upon these templates.
Template files are stored at the location defined by the --web.console.templates argument (/etc/prometheus/consoles). You can view templates by accessing the /consoles/<template file name> endpoint on your Prometheus server. You can find some example template files located in /etc/prometheus/consoles.

Adding linux boxes -> https://grafana.com/docs/grafana-cloud/quickstart/noagent_linuxnode/
Adding windows boxes -> https://devconnected.com/windows-server-monitoring-using-prometheus-and-wmi-exporter/

Grafana

Grafana is an open-source analytics and monitoring tool. Grafana can connect to Prometheus, allowing you to build visualizations and dashboards to display your Prometheus metric data. With Grafana, you can:

  • Access Prometheus data using queries
  • Display query results using a variety of different panels (graphs, gauges, tables, etc.)
  • Collect multiple panels into a single dashboard

Port used - 3000

Exporters

Exporters provide the metric data that is collected by the Prometheus server. An exporter is any application that exposes data in a format Prometheus can read. The scrape_configs section in prometheus.yml configures the Prometheus server to regularly connect to each exporter and collect metric data.

Apache exporter

sudo useradd -M -r -s /bin/false apache_exporter
wget https://github.com/Lusitaniae/apache_exporter/releases/download/v0.9.0/apache_exporter-0.9.0.linux-amd64.tar.gz
tar -zxvf apache_exporter-0.9.0.linux-amd64.tar.gz
sudo cp apache_exporter-0.9.0.linux-amd64/apache_exporter /usr/local/bin/
sudo chown apache_exporter:apache_exporter /usr/local/bin/apache_exporter

cat <<EOF | sudo tee /etc/systemd/system/apache_exporter.service
[Unit]
Description=Prometheus Apache Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=apache_exporter
Group=apache_exporter
Type=simple
ExecStart=/usr/local/bin/apache_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start apache_exporter
sudo systemctl status apache_exporter
sudo systemctl enable apache_exporter

Node exporter

sudo useradd -M -r -s /bin/false node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -zxvf node_exporter-1.1.2.linux-amd64.tar.gz
sudo cp node_exporter-1.1.2.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

cat <<EOF | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# test metrics on target
curl localhost:9100/metrics

Instances

Instances are individual endpoints Prometheus scrapes. Usually, an instance is a single application or process being monitored. Prometheus automatically adds an instance label to metrics.

Jobs

Prometheus also automatically adds labels for jobs. A job is a collection of instances, all sharing the same purpose. For example, a job might refer to a collection of multiple replicas of a single application.

+- job: api-server
  +- instance 1: 1.2.4.4:5670
  +- instance 2: 1.2.4.4:5671
  +- instance 3: 5.6.7.8:5670

Scrape Meta-metrics

Prometheus automatically creates metric data about scrapes for each instance. For each instance scrape, Prometheus stores a sample in the following time series:

  • up{job="<job-name>", instance="<instance-id>"} -> 1 if the instance is healthy, i.e. reachable, or 0 if the scrape failed (a good metric for monitoring)
  • scrape_duration_seconds{job="<job-name>", instance="<instance-id>"} -> number of seconds the scrape took to complete
  • scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}
  • scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}
  • scrape_series_added{job="<job-name>", instance="<instance-id>"}
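
The up meta-metric makes scrape health easy to query; for example:

```
up == 0            # instances whose most recent scrape failed
avg by (job) (up)  # fraction of healthy instances per job
```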

Application Monitoring

We have already set up monitoring for a Linux server using node exporter. However, you can also use Prometheus to monitor specific applications using other exporters. This is known as application monitoring. There are a variety of ways to monitor applications:

  • use an existing exporter
  • use prometheus pushgateway for batch processing and short-lived jobs
  • use client libraries to build an exporter into your custom applications
  • code your own client libraries or exporters

Pushgateway

The Prometheus server uses a pull method to collect metrics, meaning Prometheus reaches out to exporters to pull data. Exporters do not reach out to Prometheus.

However, there are some use cases where a push method is necessary,such as monitoring of batch job processes.

Prometheus Pushgateway serves as a middle-man for these use cases.

  • Clients push metric data to PushGateway
  • Prometheus server pulls metrics from Pushgateway,just like any other exporter

Pushgateway uses port 9091.

When to Use Pushgateway
The Prometheus documentation recommends using Pushgateway only for very specific use cases. These usually involve service-level batch jobs. A batch job's process exits when processing is finished. It is unable to serve metrics once the job is complete. It should not need to wait for a scrape from the Prometheus server in order to provide metrics; therefore, such jobs need a way to push metrics at the appropriate time.

INSTALLING

sudo useradd -M -r -s /bin/false pushgateway
wget https://github.com/prometheus/pushgateway/releases/download/v1.4.1/pushgateway-1.4.1.linux-amd64.tar.gz
tar -zxvf pushgateway-1.4.1.linux-amd64.tar.gz
sudo cp pushgateway-1.4.1.linux-amd64/pushgateway /usr/local/bin/
sudo chown pushgateway:pushgateway /usr/local/bin/pushgateway

cat <<EOF | sudo tee /etc/systemd/system/pushgateway.service
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
User=pushgateway
Group=pushgateway
Type=simple
ExecStart=/usr/local/bin/pushgateway

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start pushgateway
sudo systemctl status pushgateway
sudo systemctl enable pushgateway
# check pushgateway metrics
curl localhost:9091/metrics

In prometheus.yml, set honor_labels: true for the Pushgateway job. This setting controls how Prometheus handles conflicts between labels that are already present in scraped data and labels that Prometheus would attach server-side ("job" and "instance" labels, manually configured target labels, and labels generated by service discovery implementations).

Pushing data to Pushgateway
To push metrics to Pushgateway, simply send the metric data over HTTP to the Pushgateway API:

/metrics/job/some_job/instance/some_instance  ## with an instance label
/metrics/job/some_job                         ## without an instance label

The request body should contain data formatted like any other Prometheus exporter metric

# TYPE some_metric counter
# HELP some_metric Example metric
some_metric{label="val1"} 42
# TYPE another_metric gauge
another_metric 12

The Prometheus client libraries also include functionality for pushing data via pushgateway.
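
The exposition-format payload can also be assembled in shell before pushing; a sketch with a hypothetical metric name (the curl line is commented out so nothing is pushed without a running Pushgateway):

```shell
# Assemble a gauge sample in the Prometheus text exposition format.
payload='# TYPE backup_duration_seconds gauge
backup_duration_seconds 42.7'
echo "$payload"
# To actually push it to a local Pushgateway:
# echo "$payload" | curl --data-binary @- http://localhost:9091/metrics/job/backup/instance/db1
```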

Sending multiple metrics in one command:

cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/my_job/instance/my_instance
# TYPE temperature gauge
temperature{location="room1"} 31
temperature{location="room2"} 33
# TYPE my_metric gauge
# HELP my_metric An example
my_metric 5
EOF

Recording rules

Recording rules allow you to pre-compute the values of expressions and queries and save the results as their own separate set of time-series data. Recording rules are evaluated on a schedule, executing an expression and saving the result as a new metric.

Recording rules are especially useful when you have complex or expensive queries that are run frequently. For example, by saving pre-computed results using a recording rule, the expression doesn't need to be re-evaluated every time someone opens a dashboard.

Recording rules are configured using YAML. Create them by placing YAML files in the location specified by rule_files in prometheus.yml. When creating or changing recording rules, reload the configuration the same way you would when changing prometheus.yml.

groups:
- name: my_rule_group
  rules:
  - record: my_custom_metric
    expr: up{job="My Job"}

Recording rule configuration format

groups:
  # The name of the group. Must be unique within a file.
- name: <string>
  # How often rules in the group are evaluated.
  [ interval: <duration> | default = global.evaluation_interval ]
  rules:
    # The name of the metric where results will be stored.
  - record: <string>
    # The expression or query used to calculate the value.
    expr: <string>
    # Labels to add or overwrite before storing the result.
    labels:
      [ <labelname>: <labelvalue> ]

Checking rule syntax: promtool check rules <rule_filename>

Alertmanager

Alertmanager is an application that runs in a separate process from the Prometheus server. It is responsible for handling alerts sent to it by clients such as the Prometheus server. Alerts are notifications that are triggered automatically by metric data. Alertmanager does the following:

  • deduplicating alerts when multiple clients send the same alert
  • grouping multiple alerts together when they happen around the same time
  • routing alerts to the proper destination, such as email, or another alerting application such as PagerDuty or OpsGenie

Alertmanager doesn't create alerts or determine when alerts need to be sent based on metric data. Prometheus handles that step and forwards the resulting alerts to Alertmanager.

Installing

Port used 9093

sudo useradd -M -r -s /bin/false alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
tar -zxvf alertmanager-0.22.2.linux-amd64.tar.gz
sudo cp alertmanager-0.22.2.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}
sudo mkdir -p /etc/alertmanager
sudo cp alertmanager-0.22.2.linux-amd64/alertmanager.yml /etc/alertmanager/
sudo chown -R alertmanager:alertmanager /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
# optional
sudo mkdir -p /etc/amtool

cat <<EOF | sudo tee /etc/amtool/config.yml
alertmanager: http://localhost:9093
EOF

cat <<EOF | sudo tee /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --storage.path /var/lib/alertmanager/

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl status alertmanager
sudo systemctl enable alertmanager
amtool config show 

Configuration

Alertmanager is configured in much the same way as the Prometheus server. The location of the configuration file is defined by the --config.file command-line flag when running Alertmanager. Alertmanager configurations are reloaded similarly to Prometheus server configurations.

There are multiple ways to reload the Alertmanager configuration:

  • restart Alertmanager
  • send a SIGHUP signal to the Alertmanager process | sudo killall -HUP alertmanager
  • send an HTTP POST request to the /-/reload endpoint

HA and Alertmanager
You can run Alertmanager in an HA configuration with multiple Alertmanager instances. These instances will work with one another to deduplicate and group alerts, even if the alerts are sent to different instances. Use the --cluster flags to configure a multi-instance Alertmanager cluster. Note that the Prometheus server should be aware of each Alertmanager instance; don't load-balance traffic between the Prometheus server and Alertmanager. The cluster port for an HA Alertmanager is 9094.
Note: Members of an HA Alertmanager cluster share alerts, but each member's configuration must be maintained manually so that all members have the same setup.

Alerting Rules

Alerting rules provide a way to define the conditions and content of Prometheus alerts. They allow you to use Prometheus expressions to define conditions that will trigger an alert based upon your metric data. Alerting rules are created and configured in the same way as recording rules. Specify one or more locations for rule files in prometheus.yml under rule_files. Create rules by defining them within YAML files in the appropriate location.

an example

groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

Any alert will have associated labels: __name__:<name of the expr metric>, instance:<instance>, job:<jobname>.

Fetching alerts from CLI

# from alertmanager
curl -s localhost:9093/api/v2/alerts | python -m json.tool
# from prometheus
curl -s localhost:9090/api/v1/alerts | python -m json.tool

Alert states:

  • inactive = not yet pending or firing
  • pending = the condition is true, but not yet for long enough to fire
  • firing = the condition has been active for longer than the defined for clause threshold
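
These states can also be inspected from PromQL through the built-in ALERTS metric:

```
ALERTS{alertstate="firing"}   # all currently firing alerts
ALERTS{alertstate="pending"}  # alerts still inside their for window
```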

Testing rules before restarting Prometheus: promtool check rules <PATH_TO_RULE>

Managing Alerts in Alertmanager

Prometheus simply triggers alerts based on data. Once you have created some alerting rules, you can use Alertmanager to provide more sophisticated management of your alerts.

Routing: Alertmanager implements a routing tree, which is represented in the route block in your Alertmanager config file alertmanager.yml. The routing tree controls the logic of how and when alerts will be sent.

Grouping allows you to combine multiple alerts into a single notification. Example: multiple servers just went down. Instead of getting an email for each server, you get one alert email telling you how many servers went down. Use the group_by block in your routes to set which labels will be used to group alerts together.

Inhibition allows you to suppress/mute an alert if another alert is already firing. Example: your data center just lost network connectivity. Instead of getting alerts for every application that is unreachable, the network connectivity alert suppresses all of these other alerts. Use the inhibit_rules block in your Alertmanager config to set matchers for which alerts will inhibit which other alerts.

Silences are a simple way to temporarily turn off certain notifications. Example: your infra is having widespread issues you are already aware of. You use silences to temporarily stop the alerts while you fix the issue.

Silences are configured through the Alertmanager web UI.
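
Silences can also be managed from the CLI with amtool (a sketch against a running Alertmanager; the matcher, duration, and comment are examples):

```shell
# Create a 2h silence for the InstanceDown alert
amtool silence add alertname=InstanceDown --duration 2h --comment "planned maintenance"
# List and expire silences
amtool silence query
amtool silence expire <silence-id>
```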

example of matching

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
  - receiver: 'web.hook'
    # one can group alerts which have the same labels
    group_by: ['service']
    
    # regex which alertnames will be evaluated
    # DEPRECATED, use matchers
    # match_re:
    #  alertname: 'Server.*Down'
    matchers:
    - alertname =~ "Server.*Down"
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

    # a list of matchers for which one or more alerts have to exist for the inhibition to take effect
  - source_matchers: [ alertname =~ "Server.*Down" ]
    # a list of matchers that have to be fulfilled by the target alerts to be muted
    target_matchers: [ alertname = "DownstreamServerDown" ]

Prometheus and HA

HA systems are systems that are resilient and durable. They are capable of operating for long periods of time without failure. Making Prometheus highly available is actually a fairly simple process. You can run multiple Prometheus servers so that if one fails, there are others still working. Simply run multiple Prometheus servers with the same configuration. These servers don't need to talk to each other. Each server scrapes metrics separately from all the exporters based on the same configuration. This simplicity is one advantage of the pull-based method of gathering metrics!

Federation

Federation is the process of one Prometheus server scraping some or all time-series data from another Prometheus server.

Hierarchical Federation - higher-level Prometheus servers collect time-series data from multiple lower-level servers. Use case: multiple data centers, each with its own internal Prometheus server(s) monitoring things within the data center. A higher-level Prometheus server scrapes and aggregates data across all data centers.

Cross-Service Federation - with this setup, a Prometheus server monitoring one service or set of services scrapes selected data from another server monitoring a different set of services, so queries and alerts can be run on the combined data set. Use case: one Prometheus server monitors server metrics, and another Prometheus server monitors application metrics. The application-monitoring Prometheus server scrapes CPU usage data from the server-monitoring Prometheus server, allowing queries that take into account both CPU usage and application-based metrics.

Configuring Federation
Federation can be set up by configuring a Prometheus server to scrape the /federate endpoint on another Prometheus server.

an example

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s

    honor_labels: true
    metrics_path: '/federate' ## used instead of the default /metrics

    # narrow the time-series data you are retrieving with the `match[]` parameter
    # match on metric names and/or labels to retrieve a subset of the data
    # in this case, all records matching `job="prometheus"` or whose name matches `job:.*`
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'

    static_configs:
      - targets:
        - 'source-prometheus-1:9090'
        - 'source-prometheus-2:9090'
        - 'source-prometheus-3:9090'

Some Prometheus Server Security Considerations

The Prometheus server doesn't provide authentication out of the box. Anyone who can access the server's HTTP endpoints has access to all of your time-series data. The Prometheus server doesn't provide TLS encryption either. Non-encrypted communications between clients and Prometheus are vulnerable to eavesdropping and man-in-the-middle attacks. If your Prometheus endpoints are open to a network with potentially untrusted clients, you can add your own security layer on top of the Prometheus server using a reverse proxy.

Unsecured traffic and endpoints
Be sure to consider all potentially unsecured traffic in your Prometheus setup. If your network configuration would allow untrusted users to gain access to sensitive components or data, you may need to take steps to secure your Prometheus setup.

Reverse proxy
A reverse proxy acts as a middleman between clients and a server. You can implement security features on top of Prometheus using a reverse proxy. You can use any simple web server, such as Apache or Nginx, for this purpose.
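
A minimal sketch of an Nginx reverse proxy adding basic auth and TLS in front of Prometheus (the hostname, certificate paths, and htpasswd file are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name prometheus.example.com;

    ssl_certificate     /etc/nginx/certs/prometheus.crt;
    ssl_certificate_key /etc/nginx/certs/prometheus.key;

    location / {
        auth_basic           "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:9090;
    }
}
```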

Alertmanager Security Considerations
Like the Prometheus server, Alertmanager doesn't provide authentication or TLS encryption. Use a reverse proxy to add your own security layer, if needed.

Pushgateway Security Considerations
Pushgateway likewise doesn't provide authentication or TLS encryption. Again, you can add your own security layer with a reverse proxy.

Exporter Security Considerations
Every exporter is different. Many exporters do provide authentication and/or TLS encryption. Check the documentation for your exporters to learn more about basic security features. Without security, data provided by exporters can be read by anyone with access to the /metrics endpoint.

Client Libraries

Prometheus client libraries provide an easy way to add instrumentation to your code in order to monitor your applications with Prometheus. Client libraries provide functionality that allows you to:

  • collect and record metric data in your code
  • provide a /metrics endpoint, turning your application into a Prometheus exporter so Prometheus can scrape metrics from your application

There are existing client libraries for many popular programming languages and frameworks. You can also code your own client libraries. Prometheus supports the following official client libraries, although there are many third-party client libraries for other languages:

  • Go
  • Java/Scala
  • Python
  • Ruby