title: 2.12.进程监控
order: 17
icon: lightbulb
一、环境
主机名 | IP地址 | 系统 | 说明 |
localhost | 192.168.11.61 | Ubuntu 20.04 | docker安装的prometheus |
test | 192.168.11.62 | Ubuntu 20.04 | 安装process-exporter对这台服务器的进程监控 |
1、环境搭建
docker安装
略
docker-compose安装
略
二、进程监控
1、process exporter功能
如果想要对主机的进程进行监控,例如chronyd,sshd等服务进程以及自定义脚本程序运行状态监控。我们使用node exporter就不能实现需求了,此时就需要使用process exporter来做进程状态的监控。
项目地址:https://github.com/ncabatoff/process-exporter
2、二进制安装(二选一)
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
tar zxvf process-exporter-0.7.10.linux-amd64.tar.gz
mkdir /opt/prometheus -p
mv process-exporter-0.7.10.linux-amd64 /opt/prometheus/process_exporter
ls -l /opt/prometheus/process_exporter
创建用户
useradd -M -s /usr/sbin/nologin prometheus
更改exporter文件夹权限
chown prometheus:prometheus -R /opt/prometheus
创建配置文件
监控所有进程
cat >>/opt/prometheus/process_exporter/process.yml<<"EOF"
process_names:
- name: "{{.Comm}}" # 匹配模板
cmdline:
- '.+' # 匹配名称
EOF
创建systemd
cat <<"EOF" >/etc/systemd/system/process_exporter.service
[Unit]
Description=process_exporter
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/process_exporter/process-exporter -config.path=/opt/prometheus/process_exporter/process.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
启动
systemctl daemon-reload
systemctl start process_exporter
加入到开机自启动
systemctl enable process_exporter
检查
systemctl status process_exporter
启动不了检查日志
journalctl -u process_exporter -f
2、docker安装
创建数据目录
mkdir /data/process_exporter -p
cd /data/process_exporter
创建配置文件
Process-Exporter 的做法是配置需要监控的进程的名称,他会去搜索该进程从而得到其需要的监控信息,其实也就是我们常做的 ps -efl | grep xxx 命令来查看对应的进程。
- 监控所有进程
cat >>process.yml <<"EOF"
process_names:
- name: "{{.Comm}}" # 匹配模板
cmdline:
- '.+' # 匹配所有名称
EOF
- 监控指定进程
process_names:
# - name: "{{.Comm}}"
# cmdline:
# - '.+'
- name: "{{.Matches}}"
cmdline:
- 'nginx' #唯一标识
- name: "{{.Matches}}"
cmdline:
- 'mongod'
- name: "{{.Matches}}"
cmdline:
- 'mysqld'
- name: "{{.Matches}}"
cmdline:
- 'redis-server'
docker运行
docker run -d --rm -p 9256:9256 --privileged -v /proc:/host/proc -v `pwd`:/config --name process-exporter ncabatoff/process-exporter --procfs /host/proc -config.path /config/process.yml
检查
http://192.168.11.62:9256/metrics
配置说明
匹配模板
3、Prometheus设置
cd /data/docker-prometheus
使用cat追加
cat >> prometheus/prometheus.yml <<"EOF"
- job_name: 'process'
scrape_interval: 30s
scrape_timeout: 15s
static_configs:
- targets: ['192.168.11.62:9256']
EOF
重新加载配置
curl -X POST http://localhost:9090/-/reload
检查
http://192.168.11.61:9090/targets?search=
4、metrics说明
namedprocess_
namedprocess_namegroup_states{state="Zombie"} 查看僵尸
# 上下文切换数量
# Counter
namedprocess_namegroup_context_switches_total
# CPU user/system 时间(秒)
# Counter
namedprocess_namegroup_cpu_seconds_total
# 主要页缺失次数
# Counter
namedprocess_namegroup_major_page_faults_total
# 次要页缺失次数
# Counter
namedprocess_namegroup_minor_page_faults_total
# 内存占用(byte)
# Gauge
namedprocess_namegroup_memory_bytes
# 同名进程数量
# Gauge
namedprocess_namegroup_num_procs
# 同名进程状态分布
# Gauge
namedprocess_namegroup_states
# 线程数量
# Gauge
namedprocess_namegroup_num_threads
# 启动时间戳
# Gauge
namedprocess_namegroup_oldest_start_time_seconds
# 打开文件描述符数量
# Gauge
namedprocess_namegroup_open_filedesc
# 打开文件数 / 允许打开文件数
# Gauge
namedprocess_namegroup_worst_fd_ratio
# 读数据量(byte)
# Counter
namedprocess_namegroup_read_bytes_total
# 写数据量(byte)
# Counter
namedprocess_namegroup_write_bytes_total
# 内核wchan等待线程数量
# Gauge
namedprocess_namegroup_threads_wchan
常用指标
指标名 | 解释 |
namedprocess_namegroup_num_procs | 运行的进程数 |
namedprocess_namegroup_states | Running/Sleeping/Other/Zombie状态的进程数 |
namedprocess_namegroup_cpu_seconds_total | 获取/proc/[pid]/stat 进程CPU utime、stime状态时间 |
namedprocess_namegroup_read_bytes_total | 获取/proc/[pid]/io 进程读取字节数 |
namedprocess_namegroup_write_bytes_total | 获取/proc/[pid]/io 进程写入字节数 |
namedprocess_namegroup_memory_bytes | 获取进程使用的内存字节数 |
namedprocess_namegroup_open_filedesc | 获取进程使用的文件描述符数量 |
namedprocess_namegroup_thread_count | 运行的线程数 |
namedprocess_namegroup_thread_cpu_seconds_total | 获取线程CPU状态时间 |
namedprocess_namegroup_thread_io_bytes_total | 获取线程IO字节数 |
5、触发器
Prometheus配置
# 报警(触发器)配置
rule_files:
- "alert.yml"
- "rules/*.yml"
添加触发器(告警规则)
cat > prometheus/rules/process.yml <<"EOF"
groups:
- name: process
rules:
- alert: 进程数多告警
expr: sum(namedprocess_namegroup_states) by (instance) > 1000
for: 1m
labels:
severity: warning
annotations:
summary: "进程数超过1000"
description: "服务器当前有{{ $value }}个进程"
- alert: 僵尸进程数告警
expr: sum by(instance, groupname) (namedprocess_namegroup_states{state="Zombie"}) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "有僵尸进程数"
description: "进程{{ $labels.groupname }}有{{ $value }}个僵尸进程"
- alert: 进程重启告警
expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
for: 15s
labels:
severity: warning
annotations:
summary: "进程重启"
description: "进程{{ $labels.groupname }}在{{ $value }}秒前重启过"
- alert: 进程退出告警
expr: max by(instance, groupname) (delta(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^java.*|^nginx.*"}[1d])) < 0
for: 1m
labels:
severity: warning
annotations:
summary: "进程退出"
description: "进程{{ $labels.groupname }}退出了"
EOF
重新加载配置
curl -X POST http://localhost:9090/-/reload
检查
http://192.168.11.61:9090/alerts?search=
6、Doshboard
https://grafana.com/grafana/dashboards/8378-system-processes-metrics/
问题
下面2个图形显示不正常
process-exporter 升级到 0.5.0后 ,`namedprocess_namegroup_cpu_user_seconds_total`和`namedprocess_namegroup_cpu_system_seconds_total`合为一个指标名`namedprocess_namegroup_cpu_seconds_total`
`namedprocess_namegroup_cpu_user_seconds_tota`l变成`namedprocess_namegroup_cpu_seconds_total{mode="system"}`
`namedprocess_namegroup_cpu_system_seconds_total`变成`namedprocess_namegroup_cpu_seconds_total{mode="user"}`
指标 | 监控项含义 | 单位 | 说明 |
namedprocess_namegroup_cpu_seconds_total{mode="system"} | 当前内核空间占用CPU百分比。 | % | 系统上下文切换的消耗。如果该监控项数值比较高,则说明服务器开了太多的进程或线程。 |
namedprocess_namegroup_cpu_seconds_total{mode="user"} | 当前用户空间占用CPU百分比。 | % | 用户进程对CPU的消耗。 |
解决
Top processes by System CPU cores used图形修改如下:
topk(5,
rate(namedprocess_namegroup_cpu_seconds_total{mode="system",groupname=~"$processes",instance=~"$host"}[$interval])
or
(
irate(namedprocess_namegroup_cpu_seconds_total{mode="system",groupname=~"$processes",instance=~"$host"}[5m])))
Top processes by Total CPU cores used图形修改如下
topk(5,sum by (groupname,instance) (rate(namedprocess_namegroup_cpu_seconds_total{groupname=~"$processes",instance=~"$host"}[$interval]))
or
sum by (groupname,instance) (irate(namedprocess_namegroup_cpu_seconds_total{groupname=~"$processes",instance=~"$host"}[5m])))
或图形改名为Top processes by User CPU cores used
用户进程cpu使用率排名
topk(5,
rate(namedprocess_namegroup_cpu_seconds_total{mode="user",groupname=~"$processes",instance=~"$host"}[$interval])
or
(
irate(namedprocess_namegroup_cpu_seconds_total{mode="user",groupname=~"$processes",instance=~"$host"}[5m])))
修改完成后,图行正常
三、我的微信
如果碰到问题,可以随时加我微信,谢谢
评论区