Free guide · 18 min read
The WordPress Agency Uptime Checklist
What to monitor, what to alert on, and what to ignore — for agencies running 5–50 client sites.
Why most WordPress monitoring advice is wrong for you
Most monitoring guides are written for one of two audiences. The first is the lone hobbyist running one personal blog — "just install Jetpack and go to bed." The second is the enterprise SRE team monitoring 500 microservices with three pagers and a Datadog bill that exceeds the GDP of a small island nation.
You are neither. You're a freelancer or a small agency. You have between 5 and 50 client sites. You have no DevOps engineer. You have a finite monthly budget that needs to come back as profit. Every alert you set up is a future 2am phone call you might receive. Every alert you skip is a future client email you might receive. The job is to find the line.
This checklist is the line — calibrated for that audience. Twenty-five concrete things to monitor (or deliberately not monitor), broken into three tiers, plus a section on what to skip on purpose. By the end, you should know exactly what to set up on Monday morning and what to ignore until your business actually needs it.
A note on cost: every recommendation here works on Netwarden's Solo plan at $9.90/month for 5 hosts, or the Pro plan at $29.90/month for 25 hosts. If your stack is similar, the total monthly cost should match those numbers — no surprise add-ons, no per-metric billing.
If you'd rather read the full monitoring rationale first, the WordPress monitoring cornerstone covers the why behind these choices.
Tier 1 — uptime + the four things every site needs
These are the things every site should monitor regardless of size. If you do nothing else, do these. Most can be set up in under 30 minutes per site.
1. HTTP uptime check (1–5 minute cadence)
What: an HTTP GET against the site's home page (or a dedicated /health endpoint if WordPress permits one). Anything other than 2xx is a failure.
Threshold: alert when 2 consecutive checks fail — that's roughly 2–10 minutes of confirmed downtime depending on your cadence. One failure can be a transient network blip on the monitoring side.
Cadence: 5 minutes is fine for most agency sites. 1 minute is the floor. Below that you're paying for noise — true downtime that lasts under a minute is usually invisible to your users anyway.
Tier: critical. Page yourself on this one.
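The 2-consecutive-failure rule is easy to get subtly wrong, so here is a minimal sketch of it. It assumes your checker records one HTTP status code per check, with None standing in for a request that never got a response (timeout, refused connection). The function names are illustrative and not part of any Netwarden API.

```python
from typing import Optional, Sequence


def is_failure(status: Optional[int]) -> bool:
    """Anything other than a 2xx response counts as a failed check."""
    return status is None or not (200 <= status < 300)


def should_alert(history: Sequence[Optional[int]], required: int = 2) -> bool:
    """Alert only when the most recent `required` checks all failed."""
    if len(history) < required:
        return False
    return all(is_failure(status) for status in history[-required:])
```

A single 500 followed by a 200 never alerts, which is the point: one failure can be a blip on the monitoring side.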
2. SSL certificate expiration alerting
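What: check the site's TLS certificate daily and alert when it has fewer than 14 days remaining. A weekly check is too coarse for the 3-day critical threshold, because a certificate can go from healthy to expired between two checks.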
Why it matters at agency scale: at 30 sites, the math says at least one certificate comes up for renewal in any given month. Let's Encrypt auto-renewal works until it doesn't: a misconfigured renewal hook, a moved domain, or an Apache reload that never ran, and you're calling the client at 8am on a Sunday.
Threshold: warning at 14 days, critical at 3 days. Anything under 3 days means you're already in the danger zone. Let's Encrypt can't renew the certificate until the config error is fixed, and 3 days is not much time to find it.
Tier: warning at 14 days, critical at 3 days.
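If you want to see what the certificate check amounts to, here is a rough sketch using only Python's standard library. It assumes the site serves TLS on port 443. The function names and the "ok" label are illustrative; only the 14-day and 3-day thresholds come from this checklist.

```python
import socket
import ssl
import time


def days_until_expiry(hostname: str, port: int = 443, timeout: float = 10.0) -> float:
    """Fetch the server certificate and return the days left before notAfter.

    An already-expired certificate fails verification and raises
    ssl.SSLCertVerificationError; treat that as critical as well.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400


def classify_cert(days_left: float) -> str:
    """Map days remaining onto the tiers used in this checklist."""
    if days_left < 3:
        return "critical"
    if days_left < 14:
        return "warning"
    return "ok"
```

Run days_until_expiry from a scheduler of your choice and feed the result to classify_cert. The network call is kept separate from the threshold logic so the thresholds can be tested without touching a live site.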
3. DNS resolution monitoring
What: a separate check that resolves the site's hostname from a third-party resolver and confirms it returns the expected A or AAAA record.
Why: if the client's DNS provider has an outage (Route 53, Cloudflare, GoDaddy; they all have them), the HTTP uptime check above fails in a way that looks like the host is down. A dedicated DNS check tells you "this isn't your fault, call the DNS provider."
Threshold: alert on 2 consecutive lookup failures.
Tier: warning. Most DNS outages are short and resolved by the provider before you can do anything; the alert is mostly so you know what's happening when the client calls.
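The comparison step of the DNS check can be sketched as follows. One loud caveat: Python's standard library only talks to the system resolver, so this sketch does not cover the "third-party resolver" part. Querying a specific public resolver needs a DNS library or your monitoring agent. The function names and the idea of keeping an expected-address list per site are assumptions for illustration.

```python
import socket
from typing import Iterable, Set


def resolve_addresses(hostname: str) -> Set[str]:
    """Return the A/AAAA addresses the system resolver gives for hostname.

    A failed lookup raises socket.gaierror, which counts as a failed check.
    """
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def dns_matches(resolved: Iterable[str], expected: Iterable[str]) -> bool:
    """True when at least one resolved address is one you expect."""
    return bool(set(resolved) & set(expected))
```

Matching on "at least one expected address" rather than an exact set keeps the check quiet for sites behind a CDN that rotates addresses. Tighten it if the client has a single fixed IP.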
4. Disk space on the WP host
What: disk free percentage on the volume hosting the WordPress files and database.
Why this is non-negotiable for WP: WordPress will silently fail to upload media, write logs, or save post revisions when disk fills up. The site will appear to work for read-only visitors while the admin is broken. You'll find out from the client.
Threshold: warning at 80%, critical at 90%. At 90% on a df -h snapshot you typically have hours, not days. WP's wp-content/uploads directory grows in bursts rather than steadily: a single client uploading a 4K product video can fill 5% of a small VPS.
Tier: warning at 80%, critical at 90%.
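For reference, the disk check boils down to a few lines. This is a sketch using Python's standard library, with illustrative function names. The path you pass in should be on the volume that holds the WordPress files and database.

```python
import shutil


def used_percent(path: str = "/") -> float:
    """Percent of the volume in use, computed the way df reports Use%.

    used + free leaves out blocks reserved for root, which is closer
    to what df shows than dividing by the raw volume size.
    """
    usage = shutil.disk_usage(path)
    return usage.used / (usage.used + usage.free) * 100


def classify_disk(percent_used: float) -> str:
    """Warning at 80% used, critical at 90% used."""
    if percent_used >= 90:
        return "critical"
    if percent_used >= 80:
        return "warning"
    return "ok"
```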
Tier 2 — WordPress-specific
This is the tier that separates "I monitor uptime" from "I actually understand my client sites." Most of these require an agent on the host (or the WordPress plugin). Eight items.
5. PHP version end-of-life tracking
What: the host's installed PHP version, compared against the official supported-version list.
Why: PHP versions reach end of life on a rolling schedule (7.4, then 8.0, then 8.1). A site running EOL PHP gets no security patches and gradually becomes incompatible with newer plugin versions. At agency scale, a quiet php -v audit catches the slow drift on managed-hosting clients who forgot to enable the auto-update toggle.
Threshold: alert when running version reaches "security fixes only" status (~12 months from EOL); page when it reaches EOL.
Tier: info → warning → critical, transitioning over months.
6. Plugin and theme update availability — security only
What: the WordPress plugin reports the count of plugin/theme updates marked as security updates by the WordPress.org repo.
Why this is separate from "any update": general updates are nice-to-have. Security updates patch known exploits already being scanned for. Treating them the same is how compromised plugins get into client sites.
Threshold: alert immediately on any security update for an active plugin. Critical tier — you want this as a "fix this today" event, not a digest.
Tier: critical for security updates. Skip alerts on general updates entirely (see item 18 in the don't-bother list).
7. WP-Cron failure detection
What: the WordPress plugin checks whether wp-cron.php has been triggered within the expected window. If WP-Cron stops firing, scheduled tasks (post-publish, plugin maintenance, daily summaries, transient cleanup) silently stop happening.
Why this matters: WP-Cron is fragile. It depends on site traffic to fire (unless you've moved it to a real cron). A low-traffic client site whose WP-Cron stalls gets a slow drift toward weirdness — failed scheduled posts, unrotated transients, out-of-date sitemaps.
Threshold: alert when WP-Cron hasn't fired in 6 hours on a site that normally fires it within 1 hour.
Tier: warning.
8. Failed-login spike detection
What: the rate of failed wp-login.php POSTs and xmlrpc.php POSTs to the site, measured per minute.
Why: brute-force attempts against WordPress have been industrial for a decade. Most fail and are background noise. A spike — 50× normal — is either a credential-stuffing run or a botnet picking your client's site for the hour. You want to know.
Threshold: warning at 5× the trailing 7-day average, critical at 50×. Don't alert on raw counts; use a baseline-multiplier approach because every site has different normal traffic.
Tier: warning to critical depending on multiplier.
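The baseline-multiplier approach can be sketched like this. It assumes you already have per-minute failed-login rates for the trailing 7 days. The floor parameter is an addition of this sketch, not something the checklist prescribes: it stops a site with a near-zero baseline from alerting on a handful of mistyped passwords.

```python
from statistics import mean
from typing import Sequence


def classify_login_rate(
    current_per_min: float,
    baseline_per_min: Sequence[float],
    floor: float = 1.0,
) -> str:
    """Compare the current failed-login rate with the trailing average.

    Warning at 5x the baseline, critical at 50x.
    """
    average = mean(baseline_per_min) if baseline_per_min else 0.0
    baseline = max(average, floor)  # assumed floor, tune per site
    ratio = current_per_min / baseline
    if ratio >= 50:
        return "critical"
    if ratio >= 5:
        return "warning"
    return "ok"
```

Because the threshold is a ratio, the same rule works for a site that normally sees 2 failed logins a minute and one that sees 200.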
9. Slow query log threshold
What: the agent's MySQL collector tracks queries exceeding 2 seconds. Alert when the rate of slow queries crosses a multiple of its baseline.
Why this beats just "MySQL is up": the database can be technically up while one runaway plugin query is locking a row and stalling every page-load behind it. Slow-query rate is the leading indicator. By the time TTFB spikes, you've already lost 5 minutes of visitor experience.
Threshold: warning if slow queries/sec exceed 5× the trailing 24-hour median.
Tier: warning.
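Here is a small sketch of the median-based threshold, assuming you have the last 24 hours of slow-query-rate samples as a plain list. The function name and the floor value are assumptions of the sketch. The floor only exists so that a perfectly clean baseline of zero still produces a usable threshold.

```python
from statistics import median
from typing import Sequence


def slow_query_warning(
    current_rate: float,
    last_24h_rates: Sequence[float],
    multiplier: float = 5.0,
    floor: float = 0.01,
) -> bool:
    """True when the current rate exceeds 5x the trailing 24-hour median."""
    if not last_24h_rates:
        return False  # no baseline yet, so nothing to compare against
    baseline = max(median(last_24h_rates), floor)
    return current_rate > multiplier * baseline
```

The median is used rather than the mean because one bad burst during last night's backup should not raise the baseline and hide today's problem.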
10. Core WordPress version drift
What: the WordPress plugin reports the core version. Compare against the latest stable from api.wordpress.org.
Why: auto-updates can be disabled by host policies, by plugins like "Easy Updates Manager," or by a client who panicked once. Sites silently fall behind on core security updates.
Threshold: info if more than one minor version behind; warning if more than one major version behind.
Tier: info → warning over weeks. Don't page on this; it's a digest item.
11. PHP error count
What: count of PHP fatal errors / warnings / notices in the past 5 minutes (from PHP-FPM's error log or the host's /var/log/php* files via the agent).
Why: a sudden surge in PHP errors usually means a bad plugin update, a malformed wp-config.php change, or an OOM event in PHP-FPM. None of those will show up in an HTTP uptime check until pages start returning 500s.
Threshold: alert at any sustained increase over baseline. PHP errors should be near-zero on a healthy site; if you're seeing 50/min routinely, fix that before you set the alert.
Tier: warning.
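If you ever need to do this count by hand, a sketch like the following works on any log that uses PHP's default message prefixes ("PHP Fatal error:", "PHP Warning:", and so on). It assumes you have already selected the lines from the last 5 minutes. The function name is illustrative.

```python
import re
from collections import Counter
from typing import Iterable

# Severity prefixes that PHP writes into its error log by default.
_PHP_LEVEL = re.compile(r"PHP (Fatal error|Parse error|Warning|Notice|Deprecated):")


def count_php_errors(lines: Iterable[str]) -> Counter:
    """Count log lines per PHP severity level; other lines are ignored."""
    counts: Counter = Counter()
    for line in lines:
        match = _PHP_LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Compare the totals against what the site normally produces. A jump in "Fatal error" right after a plugin update is the pattern to look for.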
12. WP_DEBUG accidentally on in production
What: the WordPress plugin reports whether WP_DEBUG is currently true. This is a setup-leakage check, not an ongoing health metric.
Why: developers leave WP_DEBUG = true in production occasionally, especially after debugging a one-off issue. The result is information disclosure (stack traces visible to attackers) and a slight performance penalty. Catch this once at setup, then on every plugin install.
Threshold: any value of true in production. Critical from a security standpoint, but in practice it's a one-off fix.
Tier: warning the first time it's seen, then info on the dashboard.
Tier 3 — host & database
Five items. These are the host-level checks that every reasonable monitoring agent picks up automatically. Listed here so you know what thresholds to set.
13. CPU sustained at high utilization
What: CPU percentage averaged over a window.
Threshold: warning if 1-minute average is over 80% for 10 minutes. Critical if over 95% for 5 minutes. The duration filter matters — backups, indexing, and rebuild jobs spike CPU briefly and you don't want pages from them.
Tier: warning to critical based on sustained duration.
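The duration filter is the part people skip, so here is a minimal sketch of it. It assumes one CPU sample per minute, oldest first. The function names are illustrative.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """True when the last `minutes` one-minute samples are all above threshold."""
    if len(samples) < minutes:
        return False
    return all(sample > threshold for sample in samples[-minutes:])


def classify_cpu(samples: Sequence[float]) -> str:
    """Critical: over 95% for 5 minutes. Warning: over 80% for 10 minutes."""
    if sustained_above(samples, 95.0, 5):
        return "critical"
    if sustained_above(samples, 80.0, 10):
        return "warning"
    return "ok"
```

A 3-minute spike to 99% during a backup returns "ok" here, which is exactly the behavior you want.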
14. Memory pressure (free RAM trending toward zero)
What: available memory as a percentage of total. Pay attention to the cliff, not the slope.
Threshold: warning at 90% used; critical at 95% used or any swap-in/swap-out activity on a host that should not be swapping.
Why duration matters: under sustained memory pressure, PHP-FPM workers start dying or stalling long before the host itself goes down. TTFB spikes first, then 502s appear. The lead indicator is "free RAM dropped below threshold and stayed there."
Tier: warning at 90%, critical at 95% sustained.
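On a Linux host, the numbers behind this check come from /proc/meminfo. The sketch below parses that file's text and applies the thresholds above. It uses MemAvailable rather than MemFree, since page cache that the kernel can reclaim should not count as "used". Swap detection is left as a flag you supply, and the function names are illustrative.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: value in kB}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: Dict[str, int]) -> float:
    """Percent of RAM in use, based on MemAvailable."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total * 100


def classify_memory(used_percent: float, swapping: bool = False) -> str:
    """Warning at 90% used; critical at 95% used or any swap activity."""
    if used_percent >= 95 or swapping:
        return "critical"
    if used_percent >= 90:
        return "warning"
    return "ok"
```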
15. MySQL connections approaching max_connections
What: the percentage of max_connections currently in use. The agent's MySQL collector reports this directly.
Why: if you hit max_connections, every new request that tries to talk to the DB returns "Too many connections" until something dies. That looks like a 500 to the visitor.
Threshold: warning at 80%, critical at 95%.
Tier: warning to critical.
16. MySQL slow_queries_per_second trending up
This is item 9 from a different angle — same metric, but here you're looking at it as a trailing trend, not a single threshold. If your slow-query count creeps up week-over-week, you're accumulating tech debt. Add a weekly digest, not a real-time alert.
Tier: info digest (weekly).
17. Backup recency check (the host's backup, not yours)
What: if the host runs a daily/weekly backup (Kinsta, WP Engine, SiteGround, your own VPS cron), confirm the last successful backup is within the expected window.
Why this matters at scale: at 30 client sites you cannot manually verify "did the backup run last night" each morning. A monitor on the backup-completion log file (or an exposed endpoint if the host provides one) does it for you.
Threshold: alert if last successful backup is older than 26 hours (assuming daily) or 8 days (assuming weekly).
Tier: warning.
If your host doesn't expose a backup-status endpoint, pair this with a small script that runs nightly and curls a heartbeat URL on Netwarden. If the heartbeat doesn't arrive, alert. That's a cheap, reliable backup-of-the-backup-monitor pattern.
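The recency rule and the heartbeat script can be sketched in a few lines. How you learn the time of the last successful backup depends on the host, so this sketch takes it as an input. The heartbeat URL is whatever your monitoring tool gives you; nothing here assumes a specific Netwarden endpoint, and the function names are illustrative.

```python
import urllib.request
from datetime import datetime, timedelta

# Maximum acceptable age of the last successful backup, per schedule.
MAX_AGE = {"daily": timedelta(hours=26), "weekly": timedelta(days=8)}


def backup_is_stale(last_success: datetime, now: datetime, schedule: str = "daily") -> bool:
    """True when the last successful backup is older than the allowed window."""
    return now - last_success > MAX_AGE[schedule]


def send_heartbeat(url: str, timeout: float = 10.0) -> int:
    """Ping the heartbeat URL (hypothetical) and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```

In the nightly script, call send_heartbeat only when backup_is_stale returns False. A missing heartbeat then means either the backup failed or the script never ran, and both deserve an alert.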
What NOT to monitor (yet)
Eight items deliberately on the don't-bother list. Listed so you can tell a client "no, we don't need that" with confidence.
18. Don't alert on "any plugin update is available"
Why: at 30 sites, you'll get 4–10 alerts per day. They're not actionable in real time and they erode the signal value of every other alert. Roll non-security plugin updates into a weekly maintenance pass.
Do this instead: fold that pass into a monthly maintenance retainer, with a 2-hour block where you batch-update everything across all clients. Charge for it.
19. Don't run synthetic transaction monitoring
Why: "load the home page, click into a product, add to cart, check out, verify confirmation" tests are wonderful for an enterprise. For a freelancer with a wedding photographer client, they cost more in tooling than the entire year's monitoring budget and break every time the client tweaks the contact form.
Do this instead: if a client really needs end-to-end checks, use a single uptime check pointed at a critical endpoint (the contact form's success page, the cart endpoint) and call it good.
20. Don't pay for multi-region uptime monitoring
Why: unless your client has European customers complaining their site is unreachable, paying $25/mo to confirm Netwarden's check fails from Singapore and Frankfurt is not adding signal. One reliable monitoring location with a confirmed-from-second-source DNS check (item 3) handles 99% of real outages.
Do this instead: one monitoring location. If a global client demands geo coverage, charge them for it as a discrete line item.
21. Don't track Real User Monitoring (RUM) yet
Why: RUM is a wonderful product for SaaS companies optimizing checkout funnels. It is overkill for a 12-site WordPress freelancer. The data is high-volume, the dashboards are noisy, and the "actionable insight per dollar" ratio is poor.
Do this instead: a single TTFB measurement from the agent's HTTP check is enough signal for the agency tier.
22. Don't track Core Web Vitals as monitoring
Why: CWV are Google ranking signals, not site-health alerts. They're best handled in your monthly client report (PageSpeed Insights screenshot, Search Console export) — not as a real-time dashboard widget. They drift slowly, don't page anyone, and you can't fix them at 2am.
Do this instead: include CWV trends in the monthly retainer report. Use Search Console.
23. Don't aggregate logs from client sites yet
Why: log aggregation (Loki, Papertrail, Datadog Logs) makes sense for engineering teams that need to grep across hundreds of services. For 12 WordPress sites, tail -f /var/log/nginx/error.log over SSH or the host's built-in log viewer covers 95% of incidents. Log aggregation costs $0.50–$2 per GB and most of that is dead weight.
Do this instead: PHP error count alert (item 11) tells you when to SSH in and read logs by hand. That's the right granularity at this scale.
24. Don't trust ML "anomaly detection" alerts (yet)
Why: machine-learning anomaly detection sells well at enterprise scale where every percentage-point reduction in false positives saves a salary. At your scale, the false-positive rate of every ML system you can afford is higher than the false-positive rate of well-tuned threshold rules. The cure is worse than the disease.
Do this instead: threshold + duration filters. They're boring, they work, and they don't page you because the moon was full.
25. Don't pay for "100+ integrations" you'll never use
Why: monitoring vendors love advertising integration counts. You will use 2–4 of them at your scale: email alerts, mobile push, maybe an outbound webhook, maybe a ticketing system once you cross 30 clients. Pay for the breadth you actually use.
Alerting hygiene for agencies
Five rules that aren't items in the checklist above but matter just as much.
Severity tiering, used consistently across all clients. Critical wakes you up. Warning lands in an email digest read in the morning. Info sits on the dashboard. If you call everything critical, you'll start ignoring everything.
Per-client routing. Some clients want every blip; some are paying you for the dashboard once a month. Configure it. Don't send a brand-new client every single warning during the first month — you'll train them to ignore your emails.
Quiet hours and maintenance windows. When you push updates to a client's site, silence the alerts for that host for 30 minutes. Do not skip this. Every alert that fires during a planned change burns trust in the alert system.
The two-week false-positive rule. If an alert fires twice in two weeks and neither was actionable, delete it. A monitoring system is not a museum. The alerts that survive are the ones that have earned their place.
White-label considerations (Pro tier). If you're invoicing $200+/month per client, the Pro plan at $29.90/mo for 25 sites is cheap insurance, and the white-label dashboard means you can include a "your monitoring dashboard" link in the monthly retainer email — branded as the agency, not Netwarden. That's a per-client retention argument worth the upgrade alone.
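The two-week false-positive rule above is mechanical enough to automate. Here is a sketch, assuming you log each firing of an alert rule together with a yes/no note on whether it led to any action. That log format is an assumption of the sketch, not a Netwarden feature.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def should_delete_alert(
    firings: Iterable[Tuple[datetime, bool]],
    now: datetime,
    window: timedelta = timedelta(days=14),
) -> bool:
    """Apply the two-week rule to one alert rule.

    firings holds (fired_at, was_actionable) pairs. The rule is a delete
    candidate when it fired at least twice inside the window and none of
    those firings was actionable.
    """
    recent = [actionable for fired_at, actionable in firings if now - fired_at <= window]
    return len(recent) >= 2 and not any(recent)
```

Run it over every rule during the monthly review and delete whatever it flags.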
For a deeper take on alert design, the Alerts That Actually Page You post covers the philosophy.
Putting it together: real cost examples
A: 1-person freelancer at 12 client sites
Plan: Pro at $29.90/month. 12 hosts used out of 25.
Budget: ~$2.50 per client per month, well below the cost of a single hour of your time when an alert misfires.
Setup time: ~4 hours total (20 minutes per site to install the agent and the WP plugin, plus configure shared alert templates).
Monthly maintenance: 30 minutes (review the digest, kill any noisy alerts, batch the weekly plugin updates).
B: 2-person agency at 35 sites
Plan: Pro at $29.90/month for the first 25 hosts, plus an overage for the remaining 10 sites. Or, if you're past 25 and growing, ask sales about a custom Pro tier.
Budget: ~$1 per client per month at this scale.
Setup time: ~12 hours total, paid for by the first client retainer that didn't churn because you caught a downtime before they did.
Routing: the primary on-call alternates weekly between you and your partner. Critical alerts page both of you. Warnings go to a shared inbox you triage on Mondays.
C: established agency at 50+ sites
You're past the scope of this checklist. At this scale you've earned the cost of a real on-call rotation tool, multi-region uptime, log aggregation, and probably a part-time ops person. The items in tiers 1–3 are still the foundation, but you have the volume to justify the next layer up.
For the agency play at 25–50 sites, the agency playbook post walks through the specific Pro tier configuration including white-label and per-client tagging.
The 25-point at-a-glance checklist
Print this page or save it as a PDF (your browser's print dialog handles both). Item numbers are on the left, grouped by tier.
Tier 1 — uptime + the four every site needs
- [ ] 1. HTTP uptime check, 5-min cadence, 2-fail confirmation
- [ ] 2. SSL cert expiration: warning at 14 days, critical at 3
- [ ] 3. DNS resolution check from a separate resolver
- [ ] 4. Disk space: warning at 80%, critical at 90%
Tier 2 — WordPress-specific
- [ ] 5. PHP version vs EOL: info → warning → critical over months
- [ ] 6. Security-only plugin updates: critical, immediate
- [ ] 7. WP-Cron firing within expected window
- [ ] 8. Failed-login rate spike: warning at 5×, critical at 50× baseline
- [ ] 9. Slow query rate: warning at 5× the 24-hour median
- [ ] 10. Core WP version drift: info → warning
- [ ] 11. PHP error count surge: warning above baseline
- [ ] 12. WP_DEBUG = true in production: warning, one-time fix
Tier 3 — host & database
- [ ] 13. CPU > 80% for 10 min (warning), > 95% for 5 min (critical)
- [ ] 14. Memory > 90% used (warning), > 95% or swap activity (critical)
- [ ] 15. MySQL connections > 80% of max (warning), > 95% (critical)
- [ ] 16. MySQL slow-query trend: weekly digest
- [ ] 17. Backup recency: warning if last backup older than expected window
Don't bother (yet)
- [ ] 18. General plugin update alerts → batch weekly instead
- [ ] 19. Synthetic transaction monitoring → skip
- [ ] 20. Multi-region uptime → skip unless customer-facing demand
- [ ] 21. Real User Monitoring (RUM) → skip
- [ ] 22. Core Web Vitals as alerts → put in monthly report instead
- [ ] 23. Log aggregation → skip until volume justifies
- [ ] 24. ML anomaly detection → skip
- [ ] 25. Pay for "100+ integrations" → use the 2–4 you actually use
Next step
Install the Netwarden agent on one client site to see the dashboard in action — the Free tier covers 1 host so you can try it on your most-loved client without committing.
curl -sSL get.netwarden.com | bash
Then add the WordPress plugin on the same site (search "Netwarden" in the WP plugin directory). Tier 2 items light up automatically.
If you already manage 5 or more sites, jump straight to Solo at $9.90 or Pro at $29.90. By the math in the cost examples above, either plan pays for itself inside the first month.
For the deeper context, the WordPress monitoring cornerstone and the agency playbook cover the why and the how respectively.
— The Netwarden Team