Monitoring Systems

The three essentials

Metrics, monitoring, alerting

What to monitor

Host level

CPU, memory, disk space, processes

Applications

Error and success rates; service failures and restarts; performance and latency of responses; resource usage

Network

Connectivity; error rates and packet loss; latency; bandwidth utilization

Server pool resources

Pooled resource usage; scaling adjustment indicators; degraded instances

External dependency metrics

Service status and availability; success and error rates; run rate and operational costs; resource exhaustion

How to collect metrics

The golden signals

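Latency

Latency tells you how long a task takes to complete.

Traffic

Traffic tells you how busy the system is.

Errors

How errors are classified and managed.

Saturation

How heavily resources are being utilized.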

Arguably, any component can be monitored through these four signals.
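As a rough illustration of how the four signals might be derived for a single component, the sketch below summarizes one observation window of request records. The `Request` record, the `golden_signals` function, and the use of in-flight requests divided by capacity as the saturation measure are assumptions made for this example, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    ok: bool           # whether the request succeeded


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window as the four golden signals."""
    total = len(requests)
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        # Latency: how long tasks take
        "latency_avg_s": sum(durations) / total if total else 0.0,
        "latency_max_s": max(durations) if total else 0.0,
        # Traffic: how busy the system is
        "traffic_rps": total / window_s,
        # Errors: share of requests that failed
        "error_rate": errors / total if total else 0.0,
        # Saturation: how much of the (assumed) capacity is in use
        "saturation": in_flight / capacity,
    }
```

Averages hide outliers, so a real setup would likely track latency percentiles as well; the average and maximum are used here only to keep the sketch short.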

Notes by component

Server components

To measure CPU, the following measurements might be appropriate:

Latency: Average or maximum delay in CPU scheduler
Traffic: CPU utilization
Errors: Processor specific error events, faulted CPUs
Saturation: Run queue length
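For the saturation item, one cheap approximation is the load average divided by the number of cores. This is only a sketch: the helper names `cpu_saturation` and `read_cpu_saturation` are made up for this example, `os.getloadavg()` is available on Unix-like systems only, and load average is a rough proxy for run queue length (on Linux it also counts tasks in uninterruptible sleep).

```python
import os


def cpu_saturation(load_1m, cores):
    """Runnable tasks per core; values above 1.0 suggest tasks are waiting for a CPU."""
    return load_1m / cores


def read_cpu_saturation():
    """Read the 1-minute load average from the OS (Unix only) and normalize it."""
    load_1m, _, _ = os.getloadavg()
    return cpu_saturation(load_1m, os.cpu_count() or 1)
```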

Applications and services

Latency: The time to complete requests
Traffic: Number of requests per second served
Errors: Application errors that occur when processing client requests or accessing resources
Saturation: The percentage or amount of resources currently being used

Service groups

The aggregate metrics for a group of application services, taken as a whole.

External resources

Latency: Time it takes to receive a response from the service or to provision new resources from a provider
Traffic: Amount of work being pushed to an external service, the number of requests being made to an external API
Errors: Error rates for service requests
Saturation: Amount of account-restricted resources used (instances, API requests, acceptable cost, etc.)

End-to-end monitoring

Latency: The time to complete user requests
Traffic: Number of user requests per second
Errors: Errors that occur when processing client requests or accessing resources
Saturation: The percentage or amount of resources currently being used

Monitoring and alerting

Components of a monitoring system

Agents and data exporters

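An agent runs independently on each machine, collects basic data, and continuously sends updates to a remote endpoint.

Either push or pull can be used. With push, the agent needs to know where the server is and initiates the communication. With pull, the agent exposes its data on a known endpoint and the server comes to fetch it.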
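The pull model can be sketched with nothing but the Python standard library: the agent side serves its current values on a `/metrics` path, and the server side fetches and parses that endpoint. The plain `name value` line format, the `/metrics` path, the sample values, and all function names here are assumptions chosen for illustration; real exporters usually follow a richer format.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample values an agent might have collected.
METRICS = {"cpu_load_1m": 0.42, "disk_used_bytes": 1024}


def render_metrics(metrics):
    """Render metrics as 'name value' lines, one metric per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def parse_metrics(text):
    """Parse 'name value' lines back into a dict of floats."""
    return {name: float(value) for name, value in (line.split() for line in text.splitlines())}


class MetricsHandler(BaseHTTPRequestHandler):
    """Agent side of a pull model: expose current values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape(url):
    """Server side of a pull model: fetch the endpoint and parse it."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode())


def demo():
    """Start an agent on a free local port, scrape it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: let the OS pick
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

In a push model the roles flip: the agent would hold the server's address and send the rendered text itself, which is why it needs to be configured with the server's location.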

Metrics ingress

For push-based systems, the metrics ingress endpoint is a central location on the network where each monitoring agent or stats aggregator sends its collected data. The endpoint should be able to authenticate and receive data from a large number of hosts simultaneously. Ingress endpoints for metrics systems are often load balanced or distributed at scale both for reliability and to keep up with high volumes of traffic.

For pull-based systems, the corresponding component is the polling mechanism that reaches out and parses the metrics endpoints exposed on individual hosts. This has some of the same requirements, but some responsibilities are reversed. For instance, if individual hosts implement authentication, the metrics gathering process must be able to provide the correct credentials to log in and access the secure endpoint.

Data management layer

Stores time series data and exposes ways to query it.
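To make "store time series and expose queries" concrete, here is a minimal in-memory sketch. The `TimeSeriesStore` class and its `append` and `query` methods are hypothetical names for this example; a real data layer would add persistence, retention, and labels.

```python
import bisect
from collections import defaultdict


class TimeSeriesStore:
    """Keeps each named series sorted by timestamp and supports range queries."""

    def __init__(self):
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def append(self, name, ts, value):
        """Insert a point, keeping the series ordered even if points arrive late."""
        i = bisect.bisect_right(self._times[name], ts)
        self._times[name].insert(i, ts)
        self._values[name].insert(i, value)

    def query(self, name, start, end):
        """Return (ts, value) pairs with start <= ts <= end."""
        times = self._times.get(name, [])
        values = self._values.get(name, [])
        lo = bisect.bisect_left(times, start)
        hi = bisect.bisect_right(times, end)
        return list(zip(times[lo:hi], values[lo:hi]))
```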

Visualization layer

This needs little explanation.

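Alerting and threshold configuration

Set thresholds and raise an alert once they are reached. Where to set them is a balance you have to find yourself.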
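One common way to balance sensitivity against noise is to alert only when the threshold is exceeded for several samples in a row. The sketch below assumes that approach; the function name `breaches` and the `min_consecutive` parameter are invented for this example.

```python
def breaches(samples, threshold, min_consecutive):
    """Return True once `min_consecutive` samples in a row exceed the threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0  # a sample at or below resets the run
        if run >= min_consecutive:
            return True
    return False
```

Raising `min_consecutive` makes the alert slower but less likely to fire on a brief spike; lowering the threshold does the opposite.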

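Black-box and white-box monitoring

Black-box monitoring takes the business (outside) point of view; white-box monitoring carries more of the internal monitoring detail.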

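Alert levels by importance

Highest: pages

Require an emergency response.

Secondary: notifications

Send an email or open a ticket.

Lowest: logging

Only written to the log.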
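The three levels above amount to a routing table from severity to delivery channel. A minimal sketch, assuming hypothetical channel names (`pager`, `email`, `ticket`, `log`) that a real system would map to actual integrations:

```python
from enum import Enum


class Severity(Enum):
    PAGE = 1    # highest: someone must respond immediately
    NOTIFY = 2  # secondary: an email or a ticket is enough
    LOG = 3     # lowest: only written to the log


# Hypothetical channel names. The notes say "email or ticket";
# both are listed here purely for illustration.
CHANNELS = {
    Severity.PAGE: ["pager"],
    Severity.NOTIFY: ["email", "ticket"],
    Severity.LOG: ["log"],
}


def route(severity):
    """Return the channels an alert of this severity is delivered to."""
    return CHANNELS[severity]
```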

Monitoring distributed microservices

Scalable monitoring

Granularity and sampling

Both must be controllable.
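Controlling granularity often means downsampling: collapsing fine-grained points into coarser time buckets. The sketch below averages points per bucket; the `downsample` name, the `(timestamp, value)` tuple layout, and the choice of averaging are assumptions for this example.

```python
def downsample(points, bucket_s):
    """Average (ts, value) points into fixed-width time buckets of `bucket_s` seconds."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket_s  # align the timestamp to its bucket start
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return [(start, total / count) for start, (total, count) in sorted(buckets.items())]
```

For example, five points collected at 0, 10, 60, 70, and 125 seconds collapse into three one-minute buckets when `bucket_s` is 60.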

Aggregating data across many different units

Distributed tracing

Providing the ability to respond

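Track the four golden signals for each layer of components

Maintain a complete big picture

Drill down into specific problems

This also needs little explanation: drilling down narrows the problem until it can be solved.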

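Mitigate or even resolve problems

Integrate directly with a remediation platform so that, when a problem occurs, it can push a switch change, and so on.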

https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting