Alerts That Actually Page You: A Practical Guide to Not Crying Wolf
Most monitoring alerts are noise. Here's how to design threshold + duration alerts that wake you up for real problems and stay quiet the rest of the time — with real homelab and WordPress examples.
A confession to start: I am the reason a friend of mine briefly stopped trusting his monitoring stack.
He'd asked me to help set up alerts for his three-node Proxmox cluster. I went a little overboard. CPU above 80%, memory above 75%, disk I/O latency above 50ms, network throughput above some number I picked because it sounded reasonable. By the end of the first night his phone had buzzed eleven times. By the end of the first week he had every Netwarden notification muted and was checking the dashboard manually every morning like it was 2008.
That's the failure mode nobody warns you about. The cost of a bad alert isn't a wasted notification. The cost is that the next real alert lands on a phone with notifications off, on a dashboard nobody opens, in an inbox routed to a folder labeled "Monitoring (probably nothing)."
So this is the post I wish I'd written before I gave him that monitoring config. It's the working version of "what's an alert actually for, and how do you design one that earns its right to ring at 3 AM."
What an alert is for
An alert is a contract between you and your monitoring system. You agree to take it seriously when it fires. The system agrees to only fire when something actually deserves your attention.
Every alert that violates that contract erodes the contract. Five false alarms in a row and you stop reading the subject line. Twenty in a row and you mute the channel. Fifty and you delete the integration.
So the question for every alert you set up isn't "could this metric ever indicate a problem?" Almost any metric could, in some scenario, indicate a problem. The question is: if this alert fires right now, would I want to be interrupted to look at it?
If the honest answer is "not really, I'd just snooze it," that alert should not exist. Or it should not page. Or it needs a duration filter. Or it needs to be a daily summary, not a real-time notification. We'll get to all of those.
The two failure modes
Alerts fail in two directions and both are bad.
Cry-wolf failure. The alert fires for things that don't matter. CPU spiked for ten seconds. Disk hit 81%. Response time was 1.4 seconds for one request. Each of these is a real measurement. None of them is a real problem. You learn to ignore the alert. Then a real problem fires and you ignore that one too.
Silent failure. The alert doesn't fire when it should. Your contact form is broken because a plugin update mangled a hook, but the homepage still returns 200, so your "uptime monitoring" reports green. Your Pi NAS is at 94% disk for the third week running, but the threshold is 95%, so you've never been told. The site is technically up but the database is timing out 30% of queries.
Most people, when they get burned by a silent failure, respond by adding more alerts. That fixes silent failure and creates cry-wolf. Then they get tired of the noise, raise the thresholds, and create silent failure again. The pendulum swings back and forth and the monitoring system slowly loses your trust either way.
The fix isn't more alerts or fewer alerts. The fix is better-designed alerts.
The four ingredients of an alert that earns its keep
Every good alert has four things. Skip any of them and the alert will misbehave.
1. A metric that maps to a real outcome
"CPU above 80%" is a metric. "The site is too slow for users" is an outcome. They are not the same thing. CPU at 80% during a known-busy window with response times still under 200ms is fine. CPU at 40% with response times spiking to 4 seconds is a problem.
When you can, alert on the outcome. Response time. HTTP error rate. Whether the agent is reporting at all. Whether a specific service (PHP-FPM, MySQL, the WP cron) is up. Whether free disk space has dropped below the point where the application can still write.
When you can't alert on the outcome directly, alert on the metric that's closest to the outcome and combine it with duration (next ingredient).
2. A duration filter
This is the single most underrated setting in any alerting system. Almost no useful alert should fire on the first sample.
A duration filter says: "this condition has to be true continuously for at least N minutes before you bother me." Five minutes is a good default for most things. Fifteen for noisy metrics. One minute for things that are genuinely catastrophic if true (host completely unreachable, agent stopped reporting).
Why this matters in practice:
- A Proxmox node hits 91% CPU for fourteen seconds during a nightly backup. Without a duration filter, it pages you. With a 5-minute filter, it doesn't.
- A WordPress plugin auto-updates at 4 AM and PHP-FPM OOMs and restarts. The site is down for forty seconds. Without a duration filter, you wake up. With a 2-minute filter, you find out at breakfast — which is fine, because it already self-recovered.
- A Pi's swap usage trickles up over a week. There's no "moment" when it goes wrong. A duration filter doesn't matter here at all — what you want is a low threshold (say, swap > 100MB for 30 minutes) so it warns you days before the problem actually bites.
In Netwarden, every threshold alert has a "for at least" field. Use it. The default I reach for is 5 minutes. Five minutes is long enough to filter almost all transient noise and short enough that you still find out about real outages quickly.
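If it helps to see the mechanics, here is a minimal sketch of how a threshold-plus-duration check works. This is hypothetical code, not Netwarden's implementation, and it assumes one sample per minute:

    # Hypothetical threshold + "for at least" evaluation, one sample per minute.
    def should_fire(samples, threshold, duration_minutes):
        window = samples[-duration_minutes:]
        if len(window) < duration_minutes:
            return False  # not enough history to decide yet
        # Fire only if every sample in the window breaches the threshold.
        return all(value > threshold for value in window)

    # A 14-second CPU spike shows up as one high sample and never fires;
    # a sustained problem breaches every sample in the window and does.
    cpu_samples = [35, 40, 91, 38, 36, 37, 40]
    print(should_fire(cpu_samples, threshold=80, duration_minutes=5))  # False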
3. A severity that matches reality
Not every alert deserves to ring at 3 AM. Be honest about which ones do.
I think about it as three tiers, not five. Five tiers always collapse to "critical" and "we'll get to it" anyway.
- Page-me-now. Site is down. Host is unreachable. Agent stopped reporting from a production box. Disk is full and the app can't write. These ring my phone, they trigger push notifications, they email me. If I'm on a beach this is the alert I look at.
- Tell-me-today. Disk is at 80% with no immediate crisis. PHP error count is unusually high but the site is still up. A plugin update is available with a security flag. These email me. They show up on the dashboard. They do not ring at 3 AM.
- Show-me-on-the-dashboard. Failed login attempts are above baseline but not at attack levels. CPU is trending up week-over-week. A plugin update is available with no security flag. These don't notify at all. I see them when I open the dashboard, which is enough.
The tiers map to channels naturally. Page-me-now uses email + mobile push (and a webhook into whatever incident tracker you have, if you have one). Tell-me-today is email only. Show-me-on-the-dashboard is no notification, just a dashboard widget.
If everything is page-me-now, nothing is.
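If you like seeing the mapping written down, here is one way to express it. This is an illustrative sketch with made-up channel and severity names, not Netwarden's settings format:

    # Hypothetical severity-to-channel map; every name here is illustrative only.
    CHANNELS = {
        "page-me-now": ["push", "email", "webhook"],   # rings the phone
        "tell-me-today": ["email"],                    # read at breakfast
        "show-me-on-dashboard": [],                    # no notification at all
    }

    def notify(severity, message):
        for channel in CHANNELS.get(severity, []):
            print(f"[{channel}] {message}")  # stand-in for a real delivery call

    notify("tell-me-today", "Disk on wp-host at 82% for 30 minutes")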
4. A reason to exist that you can name out loud
This is the one most people skip. Before you save an alert, say out loud what you'll do when it fires.
"When this alert fires, I will..."
If you can't finish the sentence, the alert shouldn't exist. Or it should be a dashboard, not an alert. "When this alert fires I will... look at the dashboard to see if it's a real problem" is not a runbook; it's a snooze button with extra steps.
Good answers look like:
- "...SSH into the host and check what's eating disk."
- "...check whether MySQL is up, and if not, restart the WP host."
- "...look at the WordPress plugin update log to see what auto-updated last night."
- "...check whether the backup job is running long again."
If you have an answer like that for every alert, your alerts are doing something. If you don't, prune them.
Worked examples on stuff people actually run
Abstract advice gets you halfway. Here are alerts I'd actually configure for the kinds of setups Netwarden users tend to have.
A three-node Proxmox cluster
What's worth paging on:
- Any node is unreachable for 2 minutes. This is page-me-now. Push notification, email, the works. Hosts go offline for real reasons and you want to know fast.
- Disk usage above 90% on the storage pool for 10 minutes. This isn't urgent in the "respond now" sense, but it's a tell-me-today — you've got hours, maybe a day, before something can't write. Email is fine.
- A specific VM is down for 5 minutes. If you've marked the VM as one you care about (Home Assistant, the Plex box, whatever), you want to know when it's actually gone, not when it rebooted itself in 40 seconds.
What's not worth alerting on at all:
- CPU above any threshold. Proxmox CPU is supposed to spike. If you have a workload where sustained CPU starvation actually matters, alert on the outcome (response time of the service running on it), not the CPU.
- Memory above 85%. Linux uses memory. Caches fill it. The number you actually care about is "available memory after caches/buffers" and even that's only useful with a long duration filter.
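Written out as data, that alert set is small. The following is an illustrative sketch; the metric and field names are mine, not Netwarden's configuration schema:

    # Illustrative alert definitions for the three-node cluster above.
    PROXMOX_ALERTS = [
        {"metric": "host_reachable", "condition": "== false", "for_minutes": 2,
         "severity": "page-me-now", "runbook": "Check power and network, then the node."},
        {"metric": "storage_pool_used_pct", "condition": "> 90", "for_minutes": 10,
         "severity": "tell-me-today", "runbook": "SSH in and find what's eating the pool."},
        {"metric": "vm_running:home-assistant", "condition": "== false", "for_minutes": 5,
         "severity": "page-me-now", "runbook": "Restart the VM from the Proxmox UI first."},
    ]
    # Deliberately absent: any raw CPU or memory threshold.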
A small WordPress site (one host, your own)
What's worth paging on:
- The site is down for 2 minutes. Uptime check failing, agent reporting service down, whatever flavor your stack supports. This is the alert.
- Response time above 5 seconds for 5 minutes. Site is technically up but unusable. Different problem, same severity.
- PHP-FPM is not running for 1 minute. This often dies and gets auto-restarted. One minute of "not running" means it didn't come back.
What's worth daily emails on:
- Plugin or core security updates available. Don't page me. Tell me at breakfast.
- Failed login attempts spiked above baseline. Useful to know about, almost never urgent.
- Disk above 80%. You have time.
What's not worth alerting on:
- The number of plugins installed. The list of themes. Any "configuration" metric that doesn't change unless you change it.
- "PHP errors went up by 1." Define a real threshold based on your baseline.
A small fleet of client WordPress sites (a freelancer's setup)
This is where alert design starts to matter most, because cry-wolf scales with site count. If each site fires one false alert per week and you have fifteen sites, that's a noisy week.
The trick here is to tier ruthlessly. Only "site down" alerts page. Everything else is a daily digest, even things you'd page on for your own site.
Why? Because for a single site, "PHP errors are elevated" is interesting. For fifteen sites, "PHP errors are elevated on three sites today" is a Monday-morning task, not a 3 AM page.
In practice that looks like:
- Page-me-now: site is down, host is unreachable. Per site, with a 2-minute duration filter.
- Tell-me-today: a daily summary of every yellow signal across the fleet. Plugin updates pending, failed login spikes, error rate elevations, slow response time trending up.
Most people figure this out the hard way after one bad week. You don't have to.
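As a sketch of that routing logic (only site-down pages; everything else waits for the digest), with made-up signal names and print statements standing in for real delivery:

    from collections import defaultdict

    PAGE_WORTHY = {"site_down", "host_unreachable"}

    def route(signals):
        # signals: (site, signal_name) pairs collected over the day
        digest = defaultdict(list)
        for site, signal in signals:
            if signal in PAGE_WORTHY:
                print(f"PAGE NOW: {site}: {signal}")  # push + email, immediately
            else:
                digest[site].append(signal)           # held for the daily email
        return dict(digest)

    today = [("clienta.example", "plugin_update_pending"),
             ("clientb.example", "site_down"),
             ("clientc.example", "failed_login_spike")]
    print(route(today))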
Testing alerts without spamming yourself
Once you've designed an alert, you need to know it actually fires when the condition is met. Otherwise you're taking it on faith — which is fine right up until the day you find out the alert was misconfigured and the silent failure was just a typo in the threshold.
A few ways to do this without setting your phone on fire:
Use a dummy alert with a very low threshold first. Set "CPU > 1% for 1 minute" on a single host. Confirm it fires. Confirm it shows up on email and push. Now you know the channel works. Delete the dummy alert.
Trigger the underlying condition deliberately, in a controlled way. Want to test the disk-full alert? dd if=/dev/zero of=/tmp/big bs=1M count=10000 on a non-production host. Want to test the "site is down" alert? Stop nginx for two minutes. The point is to verify the chain end to end: condition triggers, evaluation fires, notification arrives, you can acknowledge it.
Use the webhook channel as a test bed. Outbound webhooks are great for this — you can point one at a service like webhook.site and see exactly what payload Netwarden sends, including for alerts you don't want hitting your real email. It's also how you'd verify integration with your own scripts or incident tracker.
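If you would rather capture the payload on a box you control instead of webhook.site, a few lines of Python will do it. This is a generic local test receiver, nothing Netwarden-specific:

    # Minimal local webhook receiver for inspecting alert payloads during testing.
    # Point the outbound webhook at http://<this-host>:8000/ and watch the console.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CaptureHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            print(f"Received webhook:\n{body.decode('utf-8', errors='replace')}")
            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), CaptureHandler).serve_forever()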
What you should not do is wait for production to test your alerts. Alerts that have never fired in test are not alerts. They're hopes.
Alert hygiene: the boring part that matters most
Setting up alerts is the fun part. Keeping them honest is the boring part. Schedule fifteen minutes once a month for this and you'll thank yourself a year from now.
Here's the checklist:
- Which alerts fired in the last 30 days, and which led to action? If an alert fired five times and you took action zero times, it's a false alert. Either tighten it or delete it.
- Which alerts have not fired in 90 days? Either nothing has gone wrong (in which case, fine) or the alert is broken (in which case, find out before production does). Test the silent ones occasionally.
- Which problems happened that you didn't have an alert for? Every postmortem is an opportunity to add a new alert — but only if you can name what you'd do when it fires.
- Which alerts have flapped repeatedly? Flapping is the system telling you the threshold or duration is wrong. Netwarden has flapping protection built in (it'll suppress repeated re-triggers within a window) but the underlying alert config is still wrong and you should fix it.
- Which alerts have a runbook? Which don't? A runbook can be one sentence in the description field. "If this fires, check $thing." Better than nothing.
The single most useful question, the one that separates a working alert system from a slowly-rotting one: for each alert, what would have to be true for me to delete it? If you can't answer, you can't audit.
When to delete an alert
People are bad at deleting alerts. There's always a "but what if" scenario. Delete them anyway.
Delete an alert when:
- It hasn't fired in 6 months and the underlying condition has been tested.
- It's fired more than five times without prompting any action.
- It's been replaced by a better-targeted alert and you forgot to remove the old one.
- The system it monitors no longer exists. (You'd be surprised. This is the most common one in long-lived setups.)
- You can't articulate what you'd do when it fires.
Deleting alerts is not "reducing coverage." It's reducing noise. If you find yourself missing one, add it back — better-designed this time.
Putting it together
Here's the practical version of everything above, compressed to a checklist you can run before saving any alert:
- Does this alert correspond to something I'd want to be interrupted for? If not, downgrade the channel or kill it.
- Is there a duration filter? If not, add one. Default to 5 minutes.
- What channel? Page-me-now (push + email + maybe webhook), tell-me-today (email), or show-me-on-the-dashboard (no notification)?
- What will I do when it fires? Write it down. One sentence is enough.
- How will I know it works? Test it once before relying on it.
- When would I delete this? If you can't answer, you can't audit.
If your monitoring system is currently paging you more than two or three times a week and at most one of them is a real problem, you don't have an alerting strategy. You have alert fatigue with a vendor logo on it. The fix is not a new tool. The fix is going through every alert you have and applying the six questions above.
You'll be amazed how many alerts you delete on the first pass, and how much more you trust the ones that survive.
Get started with one alert
If you're staring at all this and wondering where to begin, the answer is: one alert.
Install Netwarden on the one host you actually care about — curl -sSL get.netwarden.com | bash — and set up exactly one alert: "this host is down for 2 minutes." Configure the channel (email, mobile push, both). Test it by stopping the agent. Confirm it fires. Confirm it clears.
That's it. That's day one. Add more alerts only when you have a real reason — usually right after something broke and you wished you'd known sooner. Built that way, your alert list stays honest, because every alert came from a real incident, not from a checklist.
The full setup walkthrough lives in the alerts setup docs. The short version is: install the agent, pick a metric, set a threshold and a duration, choose a channel. That's the whole product surface. The hard part isn't the tool. The hard part is the design discipline I just described — and that's portable. Take it with you to whatever monitoring stack you end up using.
Free tier covers one host, which is enough for everything in this post. When you outgrow it, Solo is $9.90/month for five hosts and Pro is $29.90/month for twenty-five.
Keep Reading
- The Small-Team Monitoring Playbook — How to run a serious monitoring setup when "the team" is one or two people, without enterprise overhead.
- Monitor a Proxmox Cluster Without Datadog — A worked example of the alert design above on a real homelab cluster.
- Alerts Setup Guide — Step-by-step instructions for configuring threshold alerts, duration filters, and channels in Netwarden.
Ready to set up your first alert? Start free with Netwarden — one host, no credit card.