
The Small-Team and Homelab Monitoring Playbook

A pragmatic, opinionated guide to monitoring 1-25 hosts without standing up Prometheus, Grafana, and a queue. Covers what to watch, what to skip, and how to set alerts that don't wake you up for nothing.

Netwarden Team · April 20, 2026 · 13 min read
monitoring · small teams · homelab · self-hosted · alerts · getting started

Most monitoring writing is aimed at SRE teams running 4,000 hosts and an error budget spreadsheet. That's not the audience I have in mind here.

This is for the engineer with 1-25 hosts. A solo dev with a side project on a couple of VPSes. A small IT shop running on-prem servers, a Proxmox node, and a few Windows boxes. A homelabber with a NUC, a Synology, and a Raspberry Pi by the router. People who need to know when something is actually broken — and who do not want to spend a weekend wiring Prometheus, Alertmanager, Grafana, node_exporter, and a notification gateway just to find out their disk filled up.

This post is the playbook I wish someone had handed me. It covers what to monitor, what to deliberately ignore, the minimum-viable stack, how to install the agent in five minutes, and how to set thresholds that page you for real problems and stay quiet the rest of the time.

It's also the cornerstone the rest of our blog hangs off — when other posts say "set up monitoring first," they mean this post.

What "monitoring" actually means at this scale

There are four jobs a monitoring system has to do for a small operation. In order of importance:

  1. Tell you when a host is down. Not metrics — just up or down. If the agent stops checking in, somebody needs to know.
  2. Tell you when a host is about to be down. Disk filling, memory pressure, runaway process, swap thrashing. The boring stuff that takes services out.
  3. Tell you when a service on the host is unhappy. MySQL replica lag climbing, a Postgres connection pool maxed out, a Docker container in a restart loop, a VM stuck in a paused state.
  4. Give you enough history to debug after the fact. When you get the page at 7am and want to know "was this building for an hour or did it just snap?" — you need a timeline.

That's it. At 25 hosts and below, anything beyond those four jobs is overhead, not value. You don't need distributed tracing. You don't need a service mesh dashboard. You probably don't need APM. You definitely don't need to forward 40GB of logs a day to a central store.

What you should monitor (the actual list)

For each host, here is the floor:

  • Reachability. The agent checks in, or it doesn't. If it doesn't, that's an alert.
  • CPU. Sustained high usage for several minutes. Not a 5-second spike.
  • Memory. Used vs available. Watch swap usage on hosts that have swap.
  • Disk. Per-mount usage and free bytes. Free bytes matter more than percentage on small filesystems.
  • Network. Interface up/down. Throughput is nice to have, mostly diagnostic.
  • Processes. Total count and a count of zombies. A spike either way is interesting.
  • System updates available. Specifically security updates. You want to know.

If you run containers, add:

  • Container state per host. Running, stopped, restarting. A container in a restart loop is a real signal.
  • Container resource use. CPU and memory of the heavier containers.

If you run virtualization, add:

  • VM state. Running, paused, shut down. KVM/libvirt and Proxmox both expose this cleanly.

If you run databases, add the obvious things per engine:

  • MySQL/MariaDB. Connections, slow queries, replication lag if you have a replica.
  • PostgreSQL. Connections, transaction rate, replication lag if you have a replica.

If you run WordPress, monitor the host like any other Linux box and use the WordPress plugin to surface site-level state (plugin updates pending, PHP version, comment queue, etc.) on top of the agent's host metrics.

That's the whole list. Anywhere from a dozen to a few dozen things per host depending on what's installed. You do not need to roll your own collectors for any of this — the agent handles all of it via auto-discovery, which I'll get to in a minute.
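If you ever want to spot-check the floor by hand, on a box with no agent yet, it all maps to standard commands. A quick sketch for Linux (GNU coreutils assumed; the update check is Debian/Ubuntu-specific):

uptime                                  # load averages: rough CPU pressure
free -h                                 # memory and swap
df -h --output=target,pcent,avail       # per-mount percent used and free bytes
ip -brief link                          # interface up/down
ps -eo stat= | awk '/^Z/{z++} END{print z+0, "zombies"}'    # zombie count
apt list --upgradable 2>/dev/null | grep -c security        # rough security-update count (Debian/Ubuntu)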

What you should NOT monitor (yet)

Trying to monitor everything is the fastest way to ignore your monitoring system entirely. Here is what is genuinely fine to leave off at this scale:

  • Per-process CPU for every process. You don't need it. You need to know when a process is eating the box, which the top-process metric handles.
  • Detailed I/O per device for every disk. The disk-free signal catches the cases that actually matter to small ops. I/O profiling is a debug tool, not an alert source.
  • Application-level RUM, Core Web Vitals, synthetic browser tests. These are real things, but they're a different product from infra monitoring, and they're not what is going to take you down at 25 hosts.
  • Anomaly detection, ML-driven baselines, "predictive" alerts. At small scale you don't have enough traffic to baseline meaningfully, and the false-positive rate of statistical anomaly detection on noisy time series is brutal. Threshold alerts on the right metrics outperform anomaly detection at this size every time.
  • Logs, in bulk, centralized. Fine to ship logs from one or two important services to a central place. Shipping every log line from every host into a search index is an enormous amount of work and storage for very little operational return at 25 hosts.

You can add any of this later. Most of it you'll never need.

The minimum-viable stack

Here is the smallest stack that does the four jobs:

  • One monitoring agent installed on every host. Single binary, runs as a systemd service (or launchd/Windows service), checks in over HTTPS. Reports host metrics, container state, VM state, and database metrics if those are present.
  • One web UI to look at dashboards, configure alerts, and see history.
  • One delivery channel for alerts that you actually check. For most people that's email plus a mobile push notification, so you don't have to be at a laptop to find out something's on fire.
  • One outbound webhook if you also want to feed alerts into something else you already run.

That's it. No queue, no time-series database to administer, no scrape configs, no relabeling rules, no dashboards-as-code repo, no Grafana provisioning, no Alertmanager routing tree.

This is the shape Netwarden is. We're not the only tool that fits this shape, but if you'd rather not run the underlying TSDB and the dashboard and the alert router yourself, this is what we're optimized for. Free tier covers one host, and the paid tiers are flat per-host pricing — no per-metric billing, no per-event-ingested billing.

Installing the agent (the actual five minutes)

On Linux and macOS:

curl -sSL get.netwarden.com | bash

That command pulls the install script, picks the right package format (RPM/DEB/tarball/Homebrew), drops the agent binary in place, registers it as a system service, and starts it. It will prompt for the API key and tenant ID, or you can pass them as flags if you're scripting it.

You'll find the API key in the Netwarden UI under your account once you sign up. Linking a host to your account is one paste.

Within about thirty seconds of the agent starting, the host shows up in your dashboard with CPU, memory, disk, network, and process metrics flowing. If Docker or Podman is running on the box, you get container metrics automatically. If MySQL/MariaDB or PostgreSQL is running, the agent finds the socket and starts collecting database metrics — no manual config file, no exporters to install separately. Same for libvirt/KVM and Proxmox VMs. Same for WordPress, with the WP plugin installed on the site.

This is what "auto-discovery" actually means. The agent looks at what's running on the host and turns on the right collectors. You do not write a YAML file listing your services.
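To make that concrete, discovery boils down to probing for well-known sockets and binaries. A purely illustrative sketch, not Netwarden's actual code, with socket paths that vary by distro:

[ -S /var/run/docker.sock ]              && echo "containers: Docker detected"
[ -S /var/run/mysqld/mysqld.sock ]       && echo "database: MySQL/MariaDB detected"
[ -S /var/run/postgresql/.s.PGSQL.5432 ] && echo "database: PostgreSQL detected"
command -v virsh >/dev/null 2>&1         && echo "vms: libvirt/KVM detected"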

Run the same command on every host you own. That's the whole install.
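If "every host" means a loop, SSH works fine. The --api-key and --tenant flags below are placeholder names, not confirmed CLI flags; check the installation docs for the real ones:

# Hypothetical fleet rollout; flag names are illustrative.
for host in web1 web2 db1 nas pi; do
  ssh "$host" "curl -sSL get.netwarden.com | bash -s -- \
    --api-key '$NETWARDEN_API_KEY' --tenant '$NETWARDEN_TENANT'"
done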

For the full installation reference (firewall rules, offline installs, Windows specifics), see the installation docs and the getting-started walkthrough.

Picking thresholds that don't page you for nothing

This is the part everyone gets wrong, and it has nothing to do with the tool. It's a discipline question.

Three rules.

Rule 1: alert on duration, not on instantaneous values

"CPU above 80%" is a useless alert. CPU goes above 80% for a second every time you run anything heavy. "CPU above 90% sustained for 10 minutes" is a real alert, because that's not a spike — that's a problem.

Every threshold alert should have a "for how long" clause. As a default, 5-10 minutes for performance metrics, 1-2 minutes for availability metrics. The agent went silent? Page after 2 minutes — that's enough time to rule out a normal blip. Memory above 90%? Page after 10 minutes — anything shorter is a process briefly being a process.
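Mechanically, a duration clause just means consecutive breaching samples, not one spike. A sketch of what "above 90% for 10 minutes" computes, assuming one-minute sampling with vmstat:

# Fire only after ten consecutive one-minute samples over the line.
breaches=0
while true; do
  idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')   # 15th column = idle %
  usage=$((100 - idle))
  if [ "$usage" -gt 90 ]; then breaches=$((breaches + 1)); else breaches=0; fi
  if [ "$breaches" -ge 10 ]; then
    echo "ALERT: CPU above 90% for 10 consecutive minutes"
    breaches=0
  fi
  sleep 60
done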

Rule 2: alert on free bytes, not just percentages, for disks

"Disk above 85%" fires for weeks before anything actually bad happens, on a 4TB disk. Then everyone ignores it. Then on the 250GB disk it fires and the box is genuinely full an hour later.

Use percentage AND free bytes, both with thresholds. Page when either crosses the line. On big disks, percentage crosses its threshold long before free space is actually low. On small disks, free bytes is the leading indicator.
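A sketch of the dual check, assuming GNU df:

# Fire when EITHER free bytes or percent used crosses its line.
MIN_FREE_GB=5
MAX_PCT=90
df -B1G --output=target,avail,pcent | tail -n +2 |
while read -r mount avail pct; do
  avail=${avail%G}   # strip the unit suffix df appends
  pct=${pct%\%}
  if [ "$avail" -lt "$MIN_FREE_GB" ] || [ "$pct" -gt "$MAX_PCT" ]; then
    echo "ALERT: $mount at ${pct}% used, ${avail}G free"
  fi
done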

Rule 3: every alert needs to map to a thing you would do at 2am

Before you save an alert, ask: if this fires at 2am, what do I do? If the honest answer is "I check the dashboard, see it's already recovered, and go back to sleep" — that's not an alert. That's a chart. Move it to the dashboard and delete the alert.

If the honest answer is "I'd want to know in the morning" — that's an alert with no push notification. Email it. Look at it after coffee.

If the honest answer is "I'd actually get up and do something" — that's a push-notification alert.

The mobile push channel is what makes this work, by the way. You can be away from the laptop and still get the one alert that actually matters. Without that, "alert on email only" turns into "I check email when I get to a computer," which is not being on call.

A concrete starter alert set for 1-25 hosts

Copy this almost verbatim. Tune from there.

Per host (push):

  • Agent has not checked in for 2 minutes.
  • CPU above 90% sustained for 10 minutes.
  • Memory above 90% sustained for 10 minutes.
  • Disk free below 5GB on any mount, OR disk above 90%.
  • Swap above 50% sustained for 10 minutes (only on hosts with swap).

Per host (email only):

  • Security updates available.
  • Process count exceeds (whatever's normal + a buffer) for an hour.

Containers (push):

  • Container restarted more than 3 times in 10 minutes (it's in a crash loop).

Databases (push):

  • MySQL/Postgres replication lag above 60 seconds for 5 minutes.
  • Database connections above 90% of max_connections for 5 minutes.

Databases (email only):

  • Slow query rate above 1.5× its baseline for an hour.

That's roughly 8-12 active alerts per host, all of which map to "yes, I would do something." If any of them fire often without anything actually being wrong — raise the threshold or extend the duration. Don't add more alerts.
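For reference, the by-hand versions of the container and database checks above. The agent collects all of these for you; the replication syntax shown is MySQL 8.0.22+ (SHOW SLAVE STATUS on older versions):

# Containers stuck in a restart loop.
docker ps --filter status=restarting --format '{{.Names}} {{.Status}}'

# Replication lag, run on the replica.
mysql -e 'SHOW REPLICA STATUS\G' | grep -i seconds_behind
psql -tAc "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"

# Connections vs max_connections.
mysql -e "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections'"
psql -tAc "SELECT count(*) || ' of ' || current_setting('max_connections') FROM pg_stat_activity"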

For the deeper version of this — the philosophy plus the per-service thresholds — read the alerting cornerstone: Alerts that actually page you.

Dashboards: don't overthink them

You need three.

  1. A fleet overview. One row per host, status indicator, top metrics. The "is everything OK right now" view. This is the page you keep open on a second monitor.
  2. A per-host detail page. Drill in when something goes red. CPU/memory/disk/net charts, top processes, top containers, recent events.
  3. One dashboard per important workload. A WordPress dashboard with site-up, plugin updates, PHP errors. A database dashboard with connections and replication lag. A Proxmox dashboard with VM state across the cluster.

That's it. You can build all three in the drag-and-drop dashboard editor in well under an hour with the included widget set (about 17 widget types — CPU, memory, disk, network, process, container, VM, MySQL, Postgres, system updates, plus chart and gauge primitives). Resist the urge to make a fourth dashboard for something nobody ever opens. Nobody opens dashboards-for-completeness.

Specific scenarios

This is the bridge into the rest of the playbook. If your setup matches one of the scenario posts on this blog (a Raspberry Pi home server, a Proxmox cluster, a WordPress host), that post drills in deeper. Each one is a tighter version of this playbook for that specific shape.

Operating the system

Once it's running, the maintenance is small.

Weekly (5 minutes). Glance at the alert history. Anything that fired more than 3 times without a real incident: raise the threshold or add duration. Anything that should have fired and didn't: add it.

Monthly (15 minutes). Look at the dashboard. Are there hosts you forgot you had? Decommission them. Are there services running that aren't being monitored? Often this is "I started a Docker stack on the side and the agent is happily monitoring it but I never built a dashboard tile for it." Add the tile.

Quarterly (30 minutes). Re-read your alert thresholds. Anything that hasn't fired in 90 days: keep only if it's a thing you would want to fire. Delete the rest. Less is more.

That's the entire ops loop. Compare to running Prometheus + Alertmanager + Grafana + node_exporter + a notification gateway: each of those needs upgrades, config drift catches you, scrape configs go stale, dashboards rot. At small scale, the agent-plus-hosted-UI shape genuinely does work better.

If running it yourself matters to you (it does to me on my own homelab), the self-hosted Bun binary is in preview — same agent, same UI, single binary, runs on your own box. It's not GA yet, no firm date, but it's coming.

Where to go from here

Pick one host. Run curl -sSL get.netwarden.com | bash. Watch metrics show up. Add the starter alert set above. Set the push channel on your phone. Wait a week.

A week in, you'll have a real feel for what fires too often (raise the threshold), what never fires (delete it), and what you wish you had (add it). The whole point is that you can iterate cheaply because the system is small enough to actually understand.

That's the playbook. Everything else on this blog is a deeper cut on one piece of it.


