Manage Alerts
This topic describes alert items across several dimensions, including business continuity, cluster availability, and machine load, and provides a resolution for each.
In the following examples, all variables are prefixed with $. Replace them according to your business environment. For example, replace $job_name with the corresponding Job Name in the Prometheus configuration, and $fe_leader with the IP address of the Leader FE.
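If you keep the expressions in a script rather than editing them by hand, the substitution can be sketched as follows. This is a minimal illustration using Python's standard string.Template; the job name starrocks_cluster and the helper render are hypothetical and not part of StarRocks or Prometheus.

```python
from string import Template

# Expression copied from the BE Service Suspension alert in this topic,
# with the $job_name placeholder left intact.
BE_DEAD_ALERT = Template(
    'node_info{type="be_node_num", job="$job_name",state="dead"} > 1'
)


def render(template: Template, **values: str) -> str:
    """Fill in every $variable; raises KeyError if one is left without a value."""
    return template.substitute(**values)


# "starrocks_cluster" is a hypothetical job name; use the one from your
# Prometheus configuration.
expr = render(BE_DEAD_ALERT, job_name="starrocks_cluster")
print(expr)
```

Using substitute rather than safe_substitute means a forgotten variable fails loudly instead of leaving a literal $job_name in the rule.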
Service Suspension Alerts
FE Service Suspension
PromSQL
count(up{group="fe", job="$job_name"}) < 3
Alert Description
An alert is triggered when the number of active FE nodes falls below a specified value. You can adjust this value based on the actual number of FE nodes.
Resolution
Try to restart the suspended FE node.
BE Service Suspension
PromSQL
node_info{type="be_node_num", job="$job_name",state="dead"} > 1
Alert Description
An alert is triggered when more than one BE node is suspended.
Resolution
Try to restart the suspended BE node.
Machine Load Alerts
BE CPU Alert
PromSQL
(1-(sum(rate(starrocks_be_cpu{mode="idle", job="$job_name",instance=~".*"}[5m])) by (job, instance)) / (sum(rate(starrocks_be_cpu{job="$job_name",host=~".*"}[5m])) by (job, instance))) * 100 > 90
Alert Description
An alert is triggered when BE CPU Utilization exceeds 90%.
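The expression divides the rate of idle CPU time by the rate of total CPU time, subtracts the result from 1, and converts it to a percentage. The same arithmetic can be sketched in Python, assuming hypothetical counter readings taken at the start and end of the 5-minute window. Both rates are divided by the same window length, so it cancels out and only the counter deltas matter.

```python
def cpu_utilization_percent(idle_start: float, idle_end: float,
                            total_start: float, total_end: float) -> float:
    """Return (1 - idle share of CPU time) * 100 over one window."""
    idle_delta = idle_end - idle_start
    total_delta = total_end - total_start
    if total_delta <= 0:
        raise ValueError("total CPU time must increase over the window")
    return (1 - idle_delta / total_delta) * 100


THRESHOLD_PERCENT = 90  # same threshold as the alert expression

# Hypothetical counter readings for two BE nodes (not real measurements).
busy = cpu_utilization_percent(1000, 1125, 10000, 12000)
calm = cpu_utilization_percent(1000, 1250, 10000, 12000)

print(busy, busy > THRESHOLD_PERCENT)
print(calm, calm > THRESHOLD_PERCENT)
```

With these illustrative numbers, the first node is above the 90% threshold and would trigger the alert, while the second is below it.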
Resolution
Check whether there are large queries or large-scale data loading and forward the details to the support team for further investigation.
- Use the top command to check resource usage by processes.
  top -Hp $be_pid
- Use the perf command to collect and analyze performance data.
  # Execute the command for 1-2 minutes, and terminate it by pressing CTRL+C.
  sudo perf top -p $be_pid -g >/tmp/perf.txt
In emergencies, to quickly restore service, you can try to restart the corresponding BE node after preserving the stack. An emergency here refers to a situation where the BE node's CPU utilization remains abnormally high, and no effective means are available to reduce CPU usage.