Monitoring System Overview
Monitoring system
  The three pillars: metrics, monitoring, alerting

What to monitor
  Host level
    CPU
    Memory
    Disk space
    Processes
  Applications
    Error and success rates
    Service failures and restarts
    Performance and latency of responses
    Resource usage
  Network
    Connectivity
    Error rates and packet loss
    Latency
    Bandwidth utilization
  Server pool resources
    Pooled resource usage
    Scaling adjustment indicators
    Degraded instances
  External dependency metrics
    Service status and availability
    Success and error rates
    Run rate and operational costs
    Resource exhaustion

How to collect metrics

Golden signals
  Latency: how long a task takes to complete
  Traffic: how busy the system is
  Errors: how errors are classified and managed
  Saturation: how heavily resources are utilized
Arguably, any component can be monitored through these four signals.

Component notes
  Server components
    To measure CPU, the following measurements might be appropriate:
    Latency: Average or maximum delay in the CPU scheduler
    Traffic: CPU utilization
    Errors: Processor-specific error events, faulted CPUs
    Saturation: Run queue length
  Applications and services
    Latency: The time to complete requests
    Traffic: Number of requests served per second
    Errors: Application errors that occur when processing client requests or accessing resources
    Saturation: The percentage or amount of resources currently being used
...
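As a rough illustration of the four golden signals for an application or service, the sketch below summarizes one time window of requests. The `Request` record, the `golden_signals` function, and the `in_use`/`capacity` parameters are hypothetical names chosen for this example; a real system would more likely read these values from its metrics library or collector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    duration_ms: float  # time taken to complete the request
    ok: bool            # False if the request ended in an application error


def golden_signals(requests: List[Request], window_s: float,
                   in_use: int, capacity: int) -> dict:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if window_s <= 0 or capacity <= 0:
        raise ValueError("window_s and capacity must be positive")
    durations = [r.duration_ms for r in requests]
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: the time to complete requests
        "latency_avg_ms": sum(durations) / len(durations) if durations else 0.0,
        "latency_max_ms": max(durations, default=0.0),
        # Traffic: number of requests served per second
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed
        "error_rate": failed / len(requests) if requests else 0.0,
        # Saturation: percentage of a pooled resource currently in use
        "saturation_pct": 100.0 * in_use / capacity,
    }


# Assumed sample data: four requests observed over a 2-second window,
# with 30 of 40 worker slots busy.
sample = [Request(120.0, True), Request(80.0, True),
          Request(400.0, False), Request(200.0, True)]
signals = golden_signals(sample, window_s=2.0, in_use=30, capacity=40)
print(signals)
```

The same four keys could be computed for other components (for example, run queue length as CPU saturation), which is what makes the golden signals a convenient common vocabulary.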