Configure alerts
Netdata's health watchdog is highly configurable, with support for dynamic thresholds, hysteresis, alert templates, and more. You can tweak any of the existing alerts based on your infrastructure's topology or specific monitoring needs, or create new entities.
You can use health alerts in conjunction with any of Netdata's collectors (see the supported collector list) to monitor the health of your systems, containers, and applications in real time.
While you can see active alerts both on the local dashboard and Netdata Cloud, all health alerts are configured per node via individual Netdata Agents. If you want to deploy a new alert across your infrastructure, you must configure each node with the same health configuration files.
Reload health configuration
You don’t need to restart the Netdata Agent between changes to health configuration files, such as specific health entities. Instead, you can use netdatacli and the reload-health option to prevent gaps in metrics collection.
sudo netdatacli reload-health
If netdatacli doesn't work on your system, send a SIGUSR2 signal to the daemon, which reloads health configuration without restarting the entire process.
killall -USR2 netdata
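The two reload methods above can be combined into a small fallback. This is only a minimal sketch: the function name reload_netdata_health is hypothetical, and it assumes that sudo is available and that netdatacli exits with a non-zero status when it cannot reach the daemon.

```shell
# Hypothetical helper: reload health configuration, preferring netdatacli.
reload_netdata_health() {
  # Try the CLI first; if it fails, fall back to signalling the daemon.
  sudo netdatacli reload-health || sudo killall -USR2 netdata
}
```

Call reload_netdata_health after saving a health configuration file.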
Edit health configuration files
You can configure the Agent's health watchdog service by editing files in two locations:
- The [health] section in netdata.conf. By editing the daemon's behavior, you can disable health monitoring altogether, run health checks more or less often, and more. See daemon configuration for a table of all the available settings, their default values, and what they control.
- The individual .conf files in health.d/. These health entity files are organized by the type of metric they’re performing calculations on or their associated collector. You should edit these files using the edit-config script. For example: sudo ./edit-config health.d/cpu.conf.
Navigate to your Netdata config directory and use edit-config to make changes to any of these files.
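As an illustration of the first location, a [health] section in netdata.conf might look like the fragment below. The second option name and both values are assumptions based on common Agent defaults and can differ between versions, so check the daemon configuration reference before relying on them.

```conf
[health]
    # Keep health monitoring on (set to no to disable it entirely).
    enabled = yes
    # Assumed option name: minimum interval, in seconds, between health check runs.
    run at least every seconds = 10
```

Changes to netdata.conf itself generally take effect only after the Agent restarts, unlike the files in health.d/, which can be reloaded.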
Edit individual alerts
For example, to edit the cpu.conf health configuration file, run:
sudo ./edit-config health.d/cpu.conf
Each health configuration file contains one or more health entities, which always begin with alarm: or template:.
For example, here is the first health entity in health.d/cpu.conf:
template: 10min_cpu_usage
on: system.cpu
class: Utilization
type: System
component: CPU
lookup: average -10m unaligned of user,system,softirq,irq,guest
units: %
every: 1m
warn: $this > (($status >= $WARNING) ? (75) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (95))
delay: down 15m multiplier 1.5 max 1h
summary: CPU utilization
info: Average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
to: sysadmin
To tune this alert to trigger warning and critical alerts at a lower CPU utilization, change the warn and crit lines to the values of your choosing. For example:
warn: $this > (($status >= $WARNING) ? (60) : (75))
crit: $this > (($status == $CRITICAL) ? (75) : (85))
Save the file and reload Netdata's health configuration to apply your changes.
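The conditional in the warn and crit lines gives the alert hysteresis: the threshold that raises the alert is higher than the one that keeps it raised, so the alert does not flap when the value hovers near a single limit. The shell sketch below mimics that logic for the tuned warn line above. The function name warn_active and the simplified 0/1 status flag are illustrative assumptions, not Netdata's internal representation.

```shell
# Illustrative only: mimic  warn: $this > (($status >= $WARNING) ? (60) : (75))
# $1 = 1 if the alert is already in warning (or worse), 0 otherwise
# $2 = current 10-minute average CPU utilization, as an integer percentage
warn_active() {
  if [ "$1" -ge 1 ]; then
    threshold=60   # already raised: stays raised while the value is above 60
  else
    threshold=75   # not raised: only raised once the value exceeds 75
  fi
  [ "$2" -gt "$threshold" ]
}

warn_active 0 70 && echo "raised" || echo "clear"    # prints "clear"
warn_active 1 70 && echo "raised" || echo "clear"    # prints "raised"
```

The same reading at 70% therefore gives a different result depending on whether the alert was already active, which is the point of the two-threshold form.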
Disable or silence alerts
Alerts and notifications can be disabled permanently via configuration changes, or temporarily via the health management API. The available options are described below.
Disable all alerts
In the [health] section of netdata.conf, set enabled to no, and restart the Agent.
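In netdata.conf, the setting described above would look like this minimal fragment (other [health] options are omitted):

```conf
[health]
    # Turn off health monitoring for this Agent.
    enabled = no
```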
Disable some alerts
In the [health] section of netdata.conf, set enabled alarms to a simple pattern that excludes one or more alerts. For example, enabled alarms = !oom_kill * loads all alerts except oom_kill.
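Written out as a netdata.conf fragment, the pattern above would look like the following sketch. The comments describe how simple patterns are generally evaluated.

```conf
[health]
    # A leading ! negates a pattern, and patterns are matched left to right,
    # so the exclusion for oom_kill must come before the catch-all *.
    enabled alarms = !oom_kill *
```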
You can also edit the file where the alert is defined, comment out its definition, and reload Netdata's health configuration.
Silence an individual alert
You can stop receiving notifications for an individual alert by changing the to: line to silent.
to: silent
This action requires that you reload Netdata's health configuration.
Temporarily disable alerts at runtime
When you regularly need to stop all or some alerts from triggering during certain periods (for instance, when running backups), you can use the health management API. The API allows you to issue commands to control the health engine's behavior without changing configuration or restarting the Agent.
Temporarily silence notifications at runtime
If you want health checks to keep running and alerts to keep triggering, but notifications to be suppressed temporarily, you can use the health management API. The API allows you to issue commands to control the health engine's behavior without changing configuration or restarting the Agent.
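Both runtime options above go through the same health management API endpoint, so one small wrapper can cover them. This is a hedged sketch, not an official tool: the function name netdata_health_cmd is hypothetical, and it assumes the Agent listens on localhost:19999 and stores its API token in /var/lib/netdata/netdata.api.key. Both can differ on your installation.

```shell
# Assumed defaults; override them through the environment if your setup differs.
NETDATA_URL="${NETDATA_URL:-http://localhost:19999}"
NETDATA_KEY_FILE="${NETDATA_KEY_FILE:-/var/lib/netdata/netdata.api.key}"

# Hypothetical helper: send one command to the Agent's health management API.
# $1 is the command, for example "DISABLE ALL", "SILENCE ALL" or "RESET".
netdata_health_cmd() {
  # --get plus --data-urlencode appends ?cmd=... to the URL with the space encoded.
  curl --silent --get \
    --header "X-Auth-Token: $(cat "$NETDATA_KEY_FILE")" \
    --data-urlencode "cmd=$1" \
    "$NETDATA_URL/api/v1/manage/health"
}
```

For example, netdata_health_cmd "DISABLE ALL" before a backup stops alerts from triggering, netdata_health_cmd "SILENCE ALL" keeps checks running but suppresses notifications, and netdata_health_cmd "RESET" returns the health engine to normal afterwards.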