Prometheus (5): Alertmanager Configuration and Prometheus Configuration Notes
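1 Overview
Alerting with Prometheus consists of two independent parts.
Alerting rules in the Prometheus server send alerts to Alertmanager. Alertmanager then manages those alerts.
This includes:
- silencing,
- inhibition,
- aggregation,
- and sending out notifications via methods such as email, PagerDuty and HipChat.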
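To make the Prometheus half of this split concrete, here is a minimal sketch of an alerting rule file that the Prometheus server could load. The group name, alert name, labels and annotation text are illustrative assumptions and do not come from the original article.

```yaml
# rules.yml: a hypothetical rule file referenced from prometheus.yml via rule_files.
groups:
  - name: example-availability        # assumed group name
    rules:
      - alert: InstanceDown           # becomes the alertname label in Alertmanager
        expr: up == 0                 # the scrape target did not respond
        for: 5m                       # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
```

Once the rule fires, Prometheus pushes the alert to Alertmanager, which decides whether and how to notify.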
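2 Alertmanager
2.1 Grouping
Grouping categorizes alerts of a similar nature into a single notification.
This is especially useful during larger outages, when many systems fail at once and hundreds to thousands of alerts may fire simultaneously.
For example:
- Dozens or hundreds of instances of a service are running in a cluster when a network partition occurs.
- Half of the service instances can no longer reach the database. Alerting rules in Prometheus were configured to send one alert for each service instance that cannot communicate with the database, so hundreds of alerts are sent to Alertmanager.
- As a user, you only want to receive a single page while still being able to see exactly which service instances were affected. Without grouping, the same information arrives as many scattered notifications.
- Alertmanager can therefore be configured to group alerts by their cluster and alertname labels, so that it sends a single compact notification.
How to configure:
Grouping of alerts, the timing of the grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.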
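2.2 Inhibition
Inhibition is the concept of suppressing notifications for certain alerts if certain other alerts are already firing.
For example:
- An alert is firing that reports that an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning this cluster while that alert is firing.
This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.
How to configure:
- Inhibitions are configured through the Alertmanager configuration file.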
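2.3 Silencing
Silences are a straightforward way to mute alerts for a given period of time.
A silence is configured based on matchers, just like the routing tree.
- Incoming alerts are checked against the equality or regular-expression matchers of each active silence.
- If an alert matches all matchers of a silence, no notifications are sent out for that alert.
How to configure:
- Silences are configured in the web interface of the Alertmanager.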
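2.4 Client behavior
Alertmanager has special requirements for the behavior of its clients. These are only relevant for advanced use cases in which Prometheus is not used to send alerts.
2.5 High Availability
Alertmanager supports configuration to create a cluster for high availability. This is configured using the --cluster-* flags.
It is important not to load balance traffic between Prometheus and its Alertmanagers, but instead to point Prometheus to a list of all Alertmanagers.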
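On the Prometheus side, "a list of all Alertmanagers" might look like the following sketch of the alerting section of prometheus.yml. The three hostnames are hypothetical; 9093 is Alertmanager's default listen port.

```yaml
# prometheus.yml (excerpt): list every cluster member instead of a load balancer address.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1.example.org:9093'   # hypothetical hosts
            - 'alertmanager-2.example.org:9093'
            - 'alertmanager-3.example.org:9093'
```

Prometheus then sends each alert to every listed instance, and the Alertmanager cluster takes care of deduplicating the resulting notifications.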
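3 Configuration
Alertmanager is configured via command-line flags and a configuration file.
- Command-line flags configure immutable system parameters. To view all available flags, run alertmanager -h.
- The configuration file defines inhibition rules, notification routing and notification receivers.
The visual editor can assist in building routing trees.
Alertmanager can reload its configuration file at runtime.
- If the new configuration is not well-formed, the changes are not applied and an error is logged.
- A configuration reload is triggered by sending a SIGHUP signal to the process or by sending an HTTP POST request to the /-/reload endpoint.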
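3.1 Configuration file
Use the --config.file flag to specify which configuration file to load:
./alertmanager --config.file=simple.yml
The configuration file is written in YAML format. Brackets indicate that a parameter is optional. For non-list parameters, the value is set to the specified default.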
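Generic placeholders are defined as follows:
- <duration>: a duration matching the regular expression ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0). The simplest form is a single number and unit, [0-9]+(ms|[smhdwy]). For example: 1d, 1h30m, 5m, 10s
- <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
- <labelvalue>: a string of unicode characters
- <filepath>: a valid path in the current working directory
- <boolean>: a boolean that can take the values true or false
- <string>: a regular string
- <secret>: a regular string that is a secret, such as a password
- <tmpl_string>: a string that is template-expanded before usage
- <tmpl_secret>: a string that is template-expanded before usage and that is a secret
- <int>: an integer value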
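The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections.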
global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "/" ]
  [ pagerduty_url: <string> | default = "" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "/" ]
  [ wechat_api_url: <string> | default = "/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]
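As a rough illustration of how the top-level sections fit together, the following is a minimal sketch of a complete configuration file. Every host name, address, credential and receiver name in it is a placeholder assumption, not a value from the original article.

```yaml
# alertmanager.yml: minimal sketch with placeholder values throughout.
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'replace-me'
  smtp_require_tls: true

# Custom notification templates, loaded with a wildcard on the last path component.
templates:
  - 'templates/*.tmpl'

# The root route has no matchers, so it catches every alert.
route:
  receiver: 'team-email'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'oncall@example.org'   # SMTP settings are inherited from global
```

The route and receivers sections are the required core; templates, inhibit_rules and mute_time_intervals can be added as needed.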
3.2 <route>
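A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.
Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it must not have any matchers configured). The alert then traverses the child nodes.
- If continue is set to false, traversal stops at the first matching child.
- If continue is set to true on a matching node, the alert continues matching against subsequent sibling nodes.
- If an alert does not match any child of a node, it is handled based on the configuration parameters of the current node.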
[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# DEPRECATED: Use matchers below.
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use matchers below.
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that an alert has to fulfill to match the node.
matchers:
  [ - <matcher> ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section.
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  [ - <string> ...]

# Zero or more child routes.
routes:
  [ - <route> ... ]
Example
# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    matchers:
    - team="frontend"
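The example above leaves continue at its default of false. As a complementary sketch, the fragment below shows one way continue: true could be used so that a single alert reaches more than one receiver. The 'audit-log' receiver name is a hypothetical addition; the other names are reused from the example.

```yaml
route:
  receiver: 'default-receiver'
  routes:
    # No matchers, so this child matches every alert. Because continue is
    # true, the alert is also evaluated against the sibling routes below.
    - receiver: 'audit-log'          # hypothetical receiver
      continue: true
    # continue defaults to false here: a matching alert stops at this node.
    - receiver: 'database-pager'
      matchers:
        - service=~"mysql|cassandra"
    - receiver: 'frontend-pager'
      matchers:
        - team="frontend"
```

With this layout an alert carrying service=mysql is delivered to both 'audit-log' and 'database-pager'. Note the side effect: since every alert matches the first child, nothing falls back to 'default-receiver' any more.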
3.3 <mute_time_interval>
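A mute_time_interval specifies a named interval of time that can be referenced in the routing tree to mute particular routes at particular times of the day.
name: <string>
time_intervals:
  [ - <time_interval> ... ]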
3.4 <time_interval>
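A time_interval contains the actual definition of an interval of time. The syntax supports the following fields:
- times:
  [ - <time_range> ...]
  weekdays:
  [ - <weekday_range> ...]
  days_of_month:
  [ - <days_of_month_range> ...]
  months:
  [ - <month_range> ...]
  years:
  [ - <year_range> ...]
All fields are lists.
Within each non-empty list, at least one element must be satisfied for the field to match.
If a field is left unspecified, any value matches that field. For an instant of time to match a complete time interval, all fields must match.
All definitions are in UTC; other time zones are currently not supported.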
3.4.1 time_range
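Ranges are inclusive of the start time and exclusive of the end time, which makes it easy to represent times that start or end on hour boundaries.
For example, start_time: '17:00' and end_time: '24:00' begins at 17:00 and finishes immediately before 24:00.
times:
- start_time: HH:MM
  end_time: HH:MM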
3.4.2 days_of_month_range
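A list of numerical days in the month. Days begin at 1. Negative values, which count from the end of the month, are also accepted.
For example:
- -1 during January represents January 31.
- ['1:5', '-3:-1']. A range that extends past the start or end of the month is clamped.
- ['1:31'] specified during February clamps the actual end date to 28 or 29, depending on leap years. Inclusive on both ends.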
3.4.3 month_range
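A list of calendar months identified by a case-insensitive name (e.g. 'January') or by number,
where:
- January = 1.
Ranges are also accepted.
For example:
- ['1:3', 'may:august', 'december']. Inclusive on both ends.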
3.4.4 year_range
A numerical list of years. Ranges are accepted.
For example:
- ['2020:2022', '2030']. Inclusive on both ends.
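Putting sections 3.3 and 3.4 together, a mute time interval and a route that references it could be sketched as follows. The interval names, the receiver names and the chosen hours are illustrative assumptions; the weekday names follow the <weekday_range> field listed in the schema above.

```yaml
mute_time_intervals:
  # Matches outside 09:00-17:00 UTC on weekdays, plus the whole weekend.
  - name: offhours
    time_intervals:
      - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
        weekdays: ['monday:friday']
      - weekdays: ['saturday', 'sunday']
  # Matches the last three days of December in any year.
  - name: year-end
    time_intervals:
      - days_of_month: ['-3:-1']
        months: ['december']

route:
  receiver: 'default-receiver'       # the root route itself cannot be muted
  routes:
    - receiver: 'frontend-pager'
      matchers:
        - team="frontend"
      mute_time_intervals:
        - offhours
        - year-end
```

The times are quoted so that the YAML parser treats them as plain strings. Within one time_interval all specified fields must match, while the entries of the time_intervals list are alternatives.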
3.5 <inhibit_rule>
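An inhibition rule mutes an alert (the target) that matches one set of matchers when an alert (the source) exists that matches another set of matchers.
Target and source alerts must have the same label values for the label names in the equal list.
Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in equal are missing from both the source and the target alert, the inhibition rule applies.
To prevent an alert from inhibiting itself, an alert that matches both the target and the source side of a rule cannot be inhibited by alerts for which the same is true (including itself).
It is nevertheless recommended to choose target and source matchers so that an alert never matches both sides. This is much easier to reason about and does not trigger the special case.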
# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
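The "entire cluster is unreachable" scenario from section 2.2 could be expressed roughly as below. The alert name ClusterUnreachable and the cluster label are assumptions made for illustration only.

```yaml
inhibit_rules:
  # While ClusterUnreachable fires for a cluster, mute every other alert
  # that carries the same value for the cluster label.
  - source_matchers:
      - alertname="ClusterUnreachable"     # hypothetical alert name
    target_matchers:
      - alertname!="ClusterUnreachable"    # keeps the source out of the target set
    equal: ['cluster']
```

The source and target matchers are deliberately disjoint, so no alert can match both sides. Because a missing label counts as an empty value, this sketch would also mute alerts without a cluster label if the source alert lacked one as well, so the rule only behaves as intended when the source alert always sets cluster.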
3.6 <http_config>
An http_config allows configuring the HTTP client that a receiver uses to communicate with HTTP-based API services.
# Note that `basic_auth` and `authorization` options are mutually exclusive.

# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Optional `Authorization` header configuration.
authorization:
  # Sets the authentication type.
  [ type: <string> | default: Bearer ]
  # Sets the credentials. It is mutually exclusive with
  # `credentials_file`.
  [ credentials: <secret> ]
  # Sets the credentials with the credentials read from the configured file.
  # It is mutually exclusive with `credentials`.
  [ credentials_file: <filename> ]

# Optional OAuth 2.0 configuration.
# Cannot be used at the same time as basic_auth or authorization.
oauth2:
  [ <oauth2> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]

# Configures the TLS settings.
tls_config:
  [ <tls_config> ]
3.6.1 oauth2
OAuth 2.0 authentication using the client credentials grant type.
Alertmanager fetches an access token from the specified endpoint using the given client ID and client secret.
client_id: <string>
[ client_secret: <secret> ]

# Read the client secret from a file.
# It is mutually exclusive with `client_secret`.
[ client_secret_file: <filename> ]

# Scopes for the token request.
scopes:
  [ - <string> ... ]

# The URL to fetch the token from.
token_url: <string>

# Optional parameters to append to the token URL.
endpoint_params:
  [ <string>: <string> ... ]
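As a sketch of where an oauth2 block sits in practice, the fragment below attaches it to the http_config of a webhook receiver. The receiver name, URLs, client ID, scope, file path and the audience parameter are all hypothetical, and the url field of the webhook integration is not covered by this article.

```yaml
receivers:
  - name: 'ops-webhook'                       # hypothetical receiver
    webhook_configs:
      - url: 'https://hooks.example.org/alertmanager'
        http_config:
          oauth2:
            client_id: 'alertmanager'
            # Read the secret from a file instead of embedding it in the config.
            client_secret_file: /etc/alertmanager/secrets/oauth-client-secret
            token_url: 'https://auth.example.org/oauth2/token'
            scopes:
              - 'alerts.write'
            endpoint_params:
              audience: 'hooks.example.org'
```

Since oauth2 cannot be combined with basic_auth or authorization, this http_config sets neither of them.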
3.6.2 <tls_config>
A tls_config allows configuring TLS connections.
# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# .1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]
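A combined http_config with basic authentication and client-certificate TLS might look like the following sketch. All paths, host names and the receiver name are placeholder assumptions, and the url field belongs to the webhook integration, which this article does not describe.

```yaml
receivers:
  - name: 'internal-webhook'                  # hypothetical receiver
    webhook_configs:
      - url: 'https://alerts.internal.example.org/hook'
        http_config:
          basic_auth:
            username: 'alertmanager'
            password_file: /etc/alertmanager/secrets/webhook-password
          tls_config:
            # Private CA that signed the server certificate.
            ca_file: /etc/alertmanager/tls/internal-ca.pem
            # Client certificate and key presented to the server.
            cert_file: /etc/alertmanager/tls/client.pem
            key_file: /etc/alertmanager/tls/client-key.pem
            server_name: 'alerts.internal.example.org'
            insecure_skip_verify: false       # the default, shown for clarity
```

Leaving insecure_skip_verify at false keeps server certificate validation on, which is what ca_file is for.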
3.7 <receiver>
A receiver is a named configuration of one or more notification integrations.
Note: As part of lifting the past moratorium on new receivers, it was agreed that, in addition to the existing requirements, new notification integrations must have a committed maintainer with push access.
# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]
3.7.1 <email_config>
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
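For illustration, an email receiver that overrides a few of these defaults could be written as below. The addresses, the receiver name and the subject template are assumptions; the SMTP connection settings are taken to come from the global section.

```yaml
receivers:
  - name: 'team-email'                        # hypothetical receiver
    email_configs:
      - to: 'oncall@example.org'
        # Overrides global.smtp_from for this receiver only.
        from: 'alerts@example.org'
        # Also send a notification when the alerts are resolved.
        send_resolved: true
        headers:
          # Replaces the subject set by the default implementation.
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
```

Fields that are omitted here, such as smarthost and the auth_* settings, fall back to their global.smtp_* counterparts.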
3.8 Other
For everything else, see the official documentation.
/
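Copyright notice: this article, "Prometheus (5): Alertmanager Configuration and Prometheus Configuration Notes", was contributed by a site user and represents only the author's views. To repost it, contact the author and cite the source: http://www.roclinux.cn/b/1693412746a220467.html. This site only provides information storage, does not own the content and assumes no related legal liability. Content found to involve plagiarism, infringement or violations of law will be removed once verified.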