The original was posted on /r/homeassistant by /u/netingle on 2026-06-29 09:44:39+00:00.
I recently had a couple of months of parental leave and I wanted a project I could do, at home, without it being too much like “work”. So I set about getting a really comprehensive observability setup for my Home Assistant stack, and in the process learnt a ton about HAOS, Home Assistant, ZigBee and more.
Disclaimers: I work at Grafana Labs. I used a lot of Claude Code to build this, but this post is not about AI and was not written by AI! Everything in the post can be done with the open source Grafana, Alloy, Prometheus and Loki.
“Layer 0”: The Hardware and Operating System
First I wanted to understand how “busy” my Home Assistant box was - I was using a decade-plus-old Intel NUC, and was noticing some lagginess in the UI.
We do publish a Home Assistant app to gather telemetry but it didn’t have the standard Prometheus node-exporter enabled. I added this but quickly realised the values it reported were plain wrong. You see, HAOS apps run inside the Supervisor's managed Docker environment, which does not permit mounting the host's /proc or /sys filesystems into a container. Those mounts are required for accurate node-exporter metrics (CPU, memory, disk, network) at the host level. This is a known, deliberate restriction.
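To make the /proc dependency concrete, here is a minimal Python sketch of the kind of arithmetic behind a host CPU figure: take two readings of the aggregate "cpu" line from /proc/stat and turn the counter deltas into a busy fraction. This is an illustration only, not node-exporter's actual code; the function names and the sample lines are made up for the example.

```python
def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user/nice, so it is left out.
    ticks = [int(value) for value in fields[1:9]]
    idle = ticks[3] + ticks[4]  # idle + iowait
    return idle, sum(ticks)


def cpu_busy_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent non-idle between two /proc/stat samples."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    return 1.0 - (idle1 - idle0) / total_delta


# Made-up samples standing in for two reads of /proc/stat a few seconds apart.
SAMPLE_BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
SAMPLE_AFTER = "cpu  1300 0 600 8500 600 0 0 0 0 0"

if __name__ == "__main__":
    print(f"busy: {cpu_busy_fraction(SAMPLE_BEFORE, SAMPLE_AFTER):.0%}")
```

Whatever /proc the exporter can see is what these counters describe, which is why the host's filesystems need to be visible for host-level numbers.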
To get around this I deployed Alloy (which embeds node-exporter and the Prometheus agent) in a plain old docker container, and by using DOCKER_HOST=ssh://root@homeassistant.local I was able to version control the docker compose setup and use a GitOps-style deployment “pipeline” from my laptop. The downside of this approach is the weekly “your system is unsupported” messages you get in Home Assistant…
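As a rough sketch of that laptop-driven deployment, the snippet below wraps `docker compose up -d` with DOCKER_HOST pointed at the Home Assistant box over SSH. Only the DOCKER_HOST value comes from this post; the compose file name and the helper functions are assumptions for illustration, not the actual pipeline.

```python
import os
import subprocess

# Value taken from the post; everything else here is illustrative.
DOCKER_HOST = "ssh://root@homeassistant.local"


def compose_invocation(compose_file: str = "docker-compose.yml"):
    """Build the (command, environment) pair for a remote `docker compose up`."""
    # Copy the current environment and point the Docker CLI at the remote host.
    env = dict(os.environ, DOCKER_HOST=DOCKER_HOST)
    cmd = ["docker", "compose", "-f", compose_file, "up", "-d"]
    return cmd, env


def deploy(compose_file: str = "docker-compose.yml") -> None:
    """Apply the version-controlled compose file to the remote Docker daemon."""
    cmd, env = compose_invocation(compose_file)
    subprocess.run(cmd, env=env, check=True)
```

Calling `deploy()` from a checkout of the repository would apply whatever compose file is committed there, which is what makes the setup feel GitOps-like.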
With this in place, gathering host-level metrics and logs, I was able to pretty quickly see my machine’s CPU was quite busy, perhaps explaining the sluggishness I was experiencing. I upgraded to a newer, faster mini PC (a GMKtec M7 Ultra) but hit a new snag - it came with a cheap SSD that failed within days. Having the journal (i.e. kernel logs) stored off-machine meant I could quickly diagnose what went wrong and get a new (expensive, Samsung) SSD same-day delivered.
Finally, I also took a look at the temperature of things like the CPU - and found there was far too much heat in my server rack. I whacked a big Noctua fan I had left over from my 3D-printed wind tunnel project on top of the rack to pull out the heat and saw an immediate difference.
"Standard" Linux Dashboard
“Layer 1”: Docker (for apps)
I run a ton of HA apps - Zigbee2MQTT, Z-Wave JS, VSCode, ESPHome Builder, Music Assistant etc. I wanted to see how they were contributing to system load, and start debugging why tracks were skipping with Music Assistant. For that I needed per-container system telemetry and logs. Typically you’d use cAdvisor to translate the metrics into Prometheus format and luckily Alloy embeds that too. Again it needs to mount parts of the host filesystem that running as an app wouldn’t allow - but running Alloy as a “bare” container does.
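For a feel of how cumulative per-container counters become a "who is using the CPU" ranking, here is a small Python sketch. It assumes two scrapes of a cumulative CPU-seconds counter per container (cAdvisor exposes one as container_cpu_usage_seconds_total) and divides the delta by the scrape interval, which is roughly what a PromQL rate() does. The container names and numbers are invented.

```python
def cpu_cores_used(first: dict, second: dict, interval_seconds: float) -> dict:
    """Average CPU cores used per container between two scrapes, busiest first."""
    rates = {}
    for name, end in second.items():
        start = first.get(name)
        if start is None or end < start:
            # New container or counter reset: no usable delta for this interval.
            continue
        rates[name] = (end - start) / interval_seconds
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Invented cumulative CPU seconds from two scrapes taken 60 seconds apart.
SCRAPE_1 = {"music_assistant": 100.0, "zigbee2mqtt": 50.0}
SCRAPE_2 = {"music_assistant": 130.0, "zigbee2mqtt": 53.0, "esphome": 1.0}

if __name__ == "__main__":
    for name, cores in cpu_cores_used(SCRAPE_1, SCRAPE_2, 60.0).items():
        print(f"{name}: {cores:.2f} cores")
```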
Beyond that I wanted to explore the latest Grafana features (sparklines in tables!) with some pretty Docker dashboards, which I built with Claude and GCX, the agent-first CLI for Grafana. This allowed me to get a good overview of which container was driving system load and drill into individual containers - and helped me correlate the skipping Music Assistant tracks with CPU spikes. This helped motivate the upgrade to the faster mini PC, which reduced the skipping but didn’t completely eliminate it…
My pretty Docker dashboard... NB aggregate network bandwidth is incorrect due to so many containers running in the host network namespace.
“Layer 2”: Home Assistant
Thankfully this step was much more straightforward, as Home Assistant already exposes decent Prometheus metrics. Enabling these in the config and setting up Alloy to collect the metrics was pretty trivial. My main use for these was collecting long-term, high-resolution metrics from a set of cheap Aqara ZigBee temperature sensors I had attached to every radiator in my house with a 3D-printed bracket. At first these were constantly dropping off the network, which led me down the path of collecting a ton of Zigbee2MQTT telemetry, but that’s another story…
Once I had these metrics for a few months, I was able to dial in my central heating - tweaking each radiator's valves so they all come up to temperature at roughly the same rate. British central heating systems are weird. When Google decided to screw over Nest customers I moved to Hive, and I went a step further by having my hot water controllable from Home Assistant. This led me to my first Home Assistant PR - adding Prometheus metrics for the water_heater domain. Though I wouldn’t recommend Hive - the integration is cloud-based and the auth tokens expire every month or so…
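One way to put a number on "comes up to temperature at the same rate" is to measure how long each radiator takes to reach a target temperature after the heating switches on. The Python sketch below does that with linear interpolation between samples. The room names, readings and 40 °C target are invented, and this is not necessarily how the balancing was actually done.

```python
def seconds_to_reach(samples, target_c):
    """Time at which a radiator first reaches target_c, or None if it never does.

    samples: list of (seconds_since_heating_on, temperature_c), sorted by time.
    """
    if not samples:
        return None
    if samples[0][1] >= target_c:
        return float(samples[0][0])
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if c0 < target_c <= c1:
            # Linear interpolation between the two samples around the target.
            return t0 + (target_c - c0) * (t1 - t0) / (c1 - c0)
    return None


# Invented readings: (seconds since heating on, degrees Celsius).
RADIATORS = {
    "hall": [(0, 20.0), (300, 30.0), (600, 50.0)],
    "bedroom": [(0, 19.0), (300, 24.0), (600, 34.0), (900, 44.0)],
}

if __name__ == "__main__":
    for room, samples in RADIATORS.items():
        print(room, seconds_to_reach(samples, 40.0))
```

A large gap between rooms suggests which valves to adjust before re-measuring on the next heating cycle.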
Our central heating during the European heatwave(s)
“Layer 3”: Apps - Unifi & Unpoller
A bunch of people have already put in the work to package existing software up as Home Assistant Apps; one such app is Unpoller, a Prometheus exporter for Unifi’s products. It’s a really great piece of software, and comes with high-quality Grafana dashboards. Installing it and scraping it with Alloy was dead simple.
Unifi logs were a bit trickier; Alloy can accept and forward syslog to Loki, but it took me ages to figure out how to configure Unifi to send them. There is a “SIEM Integration” under integrations that gets you logs for camera detections etc - but to get the firewall logs you need to look under “Network > Settings > CyberSecure > Traffic Logging”. With these logs I can see which of the devices from my IOT subnet are contacting the internet, and start to lock them down - a topic for a future post.
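The "which IOT devices talk to the internet" question boils down to a simple filter once source and destination addresses have been pulled out of the firewall logs. The Python sketch below assumes that extraction has already happened (the Unifi log format is not shown here) and uses a hypothetical 192.168.30.0/24 IOT subnet with made-up flows.

```python
import ipaddress
from collections import defaultdict

# Hypothetical IOT subnet; substitute the real one.
IOT_SUBNET = ipaddress.ip_network("192.168.30.0/24")


def internet_destinations(flows, iot_subnet=IOT_SUBNET):
    """Map each IOT device to the sorted public addresses it has contacted.

    flows: iterable of (source_ip, destination_ip) strings.
    """
    seen = defaultdict(set)
    for src, dst in flows:
        src_ip = ipaddress.ip_address(src)
        dst_ip = ipaddress.ip_address(dst)
        # Keep only traffic that starts in the IOT subnet and leaves for a
        # globally routable address.
        if src_ip in iot_subnet and dst_ip.is_global:
            seen[src].add(dst)
    return {src: sorted(dsts) for src, dsts in seen.items()}


# Made-up flows for illustration.
FLOWS = [
    ("192.168.30.10", "8.8.8.8"),
    ("192.168.30.10", "192.168.1.5"),
    ("192.168.30.11", "1.1.1.1"),
    ("192.168.1.20", "8.8.8.8"),
    ("192.168.30.10", "1.1.1.1"),
]

if __name__ == "__main__":
    for device, destinations in internet_destinations(FLOWS).items():
        print(device, destinations)
```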
The excellent bundled Unifi dashboards
“Layer 3”: Apps - Zigbee2MQTT
As I mentioned above, when I started using Zigbee devices I had a particularly unreliable mesh; I tried a bunch of fixes (for another post...) but along the way I added native Prometheus instrumentation to Zigbee2MQTT. There have been attempts at this in the past, and a bunch of projects adding “non-native” instrumentation (i.e. instrumenting at the MQTT layer), but I wanted access to internal stats such as queue length, retries, failures etc. This has given me some really useful (and pretty) dashboards:
This is still a work-in-progress; the PRs can be found on GitHub (https://github.com/Koenkk/zigbee-herdsman/pull/1751 & https://github.com/Koenkk/zigbee2mqtt/pull/31645), I’ve built a Docker image (docker.io/tomwilkie/zigbee2mqtt-prometheus-amd64:2.12.0-dev) and there’s an HA app definition on the z2m PR.
Prototype z2m dashboards
Wrapping Up & Next Steps
Getting all my telemetry into one place, and having the history going back months, has helped me dig into the weird and wonderful intermittent 4am failures in my Home Assistant install. This has helped me improve the reliability of the whole setup and in turn improve the WAF. It’s also been fun to learn more about how it all works under the covers!
I still need to finish off the z2m PRs. After that I’d love to start collecting traces and profiles of Home Assistant, in particular automations, and see if I can drive down e.g. the latency from motion detection to lights turning on. There’s also a bunch more telemetry to collect from the various apps I'm using, and other bits of infrastructure I've got. I'd love to get the logs from all my ESPHome devices into Loki.