监控系统
三剑客
metrics monitor alert
监控什么
主机级别
CPU Memory Disk space Processes
应用程序
Error and success rates Service failures and restarts Performance and latency of responses Resource usage
网络相关
Connectivity Error rates and packet loss Latency Bandwidth utilization
服务池资源
Pooled resource usage Scaling adjustment indicators Degraded instances
外部依赖的度量
Service status and availability Success and error rates Run rate and operational costs Resource exhaustion
如何采集Metrics
黄金指标
延迟
延迟可以知道一个任务需要多久
流量
知道系统的繁忙程度
错误
错误的分类和管理
饱和
资源的利用率
可以说,任何的组件都可以通过这四个指标进行监控
组件说明
服务器组件
To measure CPU, the following measurements might be appropriate:
Latency: Average or maximum delay in CPU scheduler Traffic: CPU utilization Errors: Processor specific error events, faulted CPUs Saturation: Run queue length
应用程序与服务
Latency: The time to complete requests Traffic: Number of requests per second served Errors: Application errors that occur when processing client requests or accessing resources Saturation: The percentage or amount of resources currently being used
服务组
就是一组应用服务的整体指标
外部资源
Latency: Time it takes to receive a response from the service or to provision new resources from a provider Traffic: Amount of work being pushed to an external service, the number of requests being made to an external API Errors: Error rates for service requests Saturation: Amount of account-restricted resources used (instances, API requests, acceptable cost, etc.)
端到端的监控
Latency: The time to complete user requests Traffic: Number of user requests per second Errors: Errors that occur when processing client requests or accessing resources Saturation: The percentage or amount of resources currently being used
监控和报警
监控系统的组成
agent和data exporter
agent 在机器上独立运行,采集一些基础数据,不断地更新数据到远端。
可以采用pull or push的方式都可以。push的话,agent需要知道server的位置,并进行通信。pull的话,自己暴露在一个特定的endpoint上,服务端来拉。
metrics ingress
For push-based systems, the metrics ingress endpoint is a central location on the network where each monitoring agent or stats aggregator sends its collected data. The endpoint should be able to authenticate and receive data from a large number of hosts simultaneously. Ingress endpoints for metrics systems are often load balanced or distributed at scale both for reliability and to keep up with high volumes of traffic.
For pull-based systems, the corresponding component is the polling mechanism that reaches out and parses the metrics endpoints exposed on individual hosts. This has some of the same requirements, but some responsibilities are reversed. For instance, if individual hosts implement authentication, the metrics gathering process must be able to provide the correct credentials to log in and access the secure endpoint.
数据管理层
存时序数据,并有一些查询方式暴露出来。
可视化层
这个不用多说
报警和阈值设置功能
设定一定的阈值,到达之后,进行报警。这个需要自己平衡。
黑盒和白盒监控
黑盒站在业务视角,百合更多的有监控细节。
报警级别区分重要性
最高 pages
应急处理
次要通知
发邮件或者工单
打印日志
only 打印
分布式微服务监控
可伸缩的监控
力度和采样
要能控制
多种不同单位的数据集合
分布式tracer
提供响应能力
每一层组件设置四个黄金指标
有一个完整的大图
具体问题可以下钻
这个也不用多说,能够细化解决问题
缓解甚至解决问题
直接对接一下处理平台,出现问题,推送开关等等
https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting