submitted 11 months ago* (last edited 11 months ago) by Fermiverse@kbin.social to c/selfhosted@lemmy.world

I use some batch scripts in my Proxmox installation. They sit in cron.hourly and cron.daily, checking for viruses and the RAM/CPU load of my LXC containers. An email is sent when a condition is met.
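For flavor, a stripped-down sketch of the kind of hourly check I mean (container ID, threshold and address are made up; assumes jq, bc and a working mail setup):

#!/bin/sh
# /etc/cron.hourly/lxc-cpu-check: mail me if a container is pegged
container_id=101
cpu=$(pvesh get /cluster/resources --output-format json | jq -r --arg k "lxc/$container_id" '.[] | select(.id == $k) | .cpu')
# pvesh reports cpu as a 0..1 fraction of available cores
if [ "$(echo "$cpu > 0.9" | bc -l)" -eq 1 ]; then
  echo "LXC $container_id CPU at $cpu" | mail -s "proxmox alert: lxc $container_id" admin@example.com
fi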

What are your tips or solutions that don't put unnecessary load on disk I/O or CPU time? Let's keep it simple.

Edit: a lot of great input about possible solutions. In addition, TIL that "keep it simple" means a lot of different things to different people. 😉

top 28 comments
[-] FrostyCaveman@lemm.ee 11 points 11 months ago

Prometheus, Loki and Grafana.

And so so many Prometheus metric exporters.

Observability is such an endless rabbit hole, it’s so easy for me to spend huge amounts of time accomplishing not that much lol. But very enjoyable and cool to see it all come together.

My pro tips: using Kubernetes actually makes this stuff a heck of a lot easier to set up, thanks to the common patterns k8s has - there are lots of powerful, turnkey Helm charts out there that make it all easy. Another tip: use Prometheus service discovery if you can. Also, Loki/Promtail is actually quite easy to set up, but LogQL queries can be very tricky. Just be warned, observability is a full-time hobby in itself lol
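To give an idea of how turnkey it is — the kube-prometheus-stack chart pulls in Prometheus, Alertmanager, Grafana and node-exporter in one go (the release and namespace names here are just placeholders):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace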

[-] kill_dash_nine@lemm.ee 2 points 11 months ago* (last edited 11 months ago)

I'm in a similar boat, except I just run everything in standard Docker containers and use Telegraf, Influx, and Grafana for everything. I've moved mostly to Discord notifications for alerts. Whenever I run into a problem scenario, I figure out how to monitor it, add it via Telegraf, and add an alert. I'm still just using Grafana alerts, but that works fine for my home lab.

Even better if I can automate fixes for those problems. One of the best things I did was to monitor all of my network devices and all major hops. If I have internet or network issues, I know exactly where the problem is without having to troubleshoot. Lots of dpinger and shell scripts feeding data into Telegraf.
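For the shell-script side, Telegraf's exec input does the heavy lifting — a minimal sketch (the wrapper script path is hypothetical; it would need to print InfluxDB line protocol):

[[inputs.exec]]
  # run a dpinger wrapper script on every collection interval
  commands = ["/usr/local/bin/ping_latency.sh"]
  timeout = "5s"
  data_format = "influx"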

[-] SheeEttin@lemmy.world 8 points 11 months ago

I'll keep it very simple: I don't.

If I'm trying to do something and I notice an issue, then I'll investigate it. But if it's not affecting anything, is it really a problem?

[-] mea_rah@lemmy.world 4 points 11 months ago

I was kind of the same, but I still collected metrics, because I just love graphs.

Over time I ended up setting alerts for failures I wish I'd known about earlier. Some examples:

  • HDD monitoring - usually a drive shows signs of failure a couple of days before it dies, so I have time to shop around for a replacement. With no alert set, I'd probably only notice once both sides of a mirror had failed, which would mean a couple of days of downtime, a lot of work restoring backups, and very little time to find a drive at a reasonable price. (A minimal cron check for this case is sketched after this list.)
  • networking issues - especially VPN; it's much better to know it's broken before you leave the house
  • some core services like DNS - with two AdGuard instances it's much better to be alerted when one is down than to realize you suddenly have no DNS when both fail, and you can't even google anything without messing with your connection settings
  • SSD writes - same as HDDs, but here the alert fires at around 90% of the TBW lifetime claimed by the manufacturer, and I tend to replace them proactively, since they're usually system disks without a mirror - no valuable data, but again an extended unplanned downtime
  • CPU usage maxed out for a long time - I had one service fail in a way that consumed 100% of all cores. It had no impact on other services because the process scheduler did its job, but I ended up burning kilowatt-hours of electricity as this went unnoticed for weeks. This was before energy prices went up, but it was still noticeable power consumption. (I had a dual-CPU server back then, which drew a lot of juice when maxed out.)
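For the HDD case above, even a dumb cron script gets you most of the way if you don't want a full metrics stack — a sketch (device list and address are placeholders; assumes smartmontools and a working mail setup):

#!/bin/sh
# mail an alert if any drive stops reporting healthy SMART status
for dev in /dev/sda /dev/sdb; do
  if ! smartctl -H "$dev" | grep -q PASSED; then
    echo "SMART health check failed on $dev" | mail -s "disk alert: $dev" admin@example.com
  fi
done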
[-] H2iK@lemmy.world 2 points 11 months ago

What do you use to collect these metrics?

[-] mea_rah@lemmy.world 2 points 11 months ago

I use Telegraf for most of the metrics.
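A minimal telegraf.conf sketch of what that looks like (the InfluxDB URL is a placeholder; the smart input needs smartmontools and usually sudo):

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]

# host basics: CPU, memory, disk usage
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]

# SMART data for the HDD/SSD alerts mentioned above
[[inputs.smart]]
  use_sudo = true
  attributes = true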

[-] m_randall@sh.itjust.works 6 points 11 months ago

I've used monit for maybe two decades now. Works great and is simple to use.

https://mmonit.com/monit/
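A taste of the syntax, since simplicity is the selling point — a minimal /etc/monit/monitrc sketch (thresholds, address and pidfile path are just examples):

set daemon 120
set alert admin@example.com

check system $HOST
  if loadavg (5min) > 4 then alert
  if memory usage > 85% then alert

check filesystem rootfs with path /
  if space usage > 80% then alert

check process sshd with pidfile /var/run/sshd.pid
  start program = "/etc/init.d/ssh start"
  if failed port 22 protocol ssh then alert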

[-] Fermiverse@kbin.social 0 points 11 months ago

Nice, will take a look at it

[-] vegetaaaaaaa@lemmy.world 5 points 11 months ago* (last edited 11 months ago)

https://github.com/awesome-foss/awesome-sysadmin#monitoring

I use netdata (agent only, not the cloud/SaaS stuff)

[-] NonDollarCurrency@monero.town 4 points 11 months ago

I use Zabbix to monitor everything. The agent on each device uses around 30 MB of memory, and with the Linux templates it can monitor just about everything on the server.

[-] Fermiverse@kbin.social 4 points 11 months ago

Does Zabbix continuously poll and store data in a database, or is live data used for indication and/or triggers?

[-] packetloss@lemmy.world 2 points 11 months ago

Zabbix stores all its data in a PostgreSQL or MySQL database. However, there are two ways Zabbix agents work: passive mode or active mode.

Passive agent = a "poller" process on the Zabbix server sends a request to the agent, asking for values of the items it monitors (based on the templates applied to the host). Depending on how many hosts you monitor and how many poller processes the Zabbix server is configured to start, you may run into a situation where requests queue up because the poller processes are too busy. Increasing the number of poller processes fixes this, but it also adds load to your DB, since each poller process connects to the DB to write data, and each consumes a certain amount of memory. Too many and you'll run out of RAM or bog down your DB.

Active agent = a "trapper" process on the Zabbix server listens for item values sent to it by the agents. Each agent queries the Zabbix server for the templates applied to its host and figures out which items it's supposed to monitor. The agent then collects those items on its own schedule, without the Zabbix server requesting them, and sends the values in. This puts a lot less load on the Zabbix server.

Item values are not read from the DB to fire triggers: when a value arrives that matches a trigger's expression, the trigger activates. Live values drive triggers and trigger actions (alerts).
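The difference is visible right in zabbix_agentd.conf — a sketch (hostnames are placeholders):

# passive mode: which server may poll this agent
Server=zabbix.example.com

# active mode: where this agent sends its values
ServerActive=zabbix.example.com
Hostname=myhost.example.com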

[-] ThorrJo@lemmy.sdf.org 2 points 11 months ago

Not the above guy but I believe it's a database.

[-] Decronym@lemmy.decronym.xyz 3 points 11 months ago* (last edited 11 months ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

  • DNS: Domain Name Service/System
  • HA: Home Assistant automation software, or High Availability
  • SSD: Solid State Drive mass storage
  • VPN: Virtual Private Network

4 acronyms in this thread; the most compressed thread commented on today has 14 acronyms.

[Thread #11 for this sub, first seen 19th Jul 2023, 17:40] [FAQ] [Full list] [Contact] [Source code]

[-] easeKItMAn@lemmy.world 3 points 11 months ago* (last edited 11 months ago)

I set up custom bash scripts collecting information (df, docker JSON, smartctl, etc.). They either parse existing JSON or assemble JSON strings, and push the result to the Home Assistant REST API via cron. In Home Assistant the data is turned into sensors and displayed, and HA sends a message when a sensor fails.
Info served in HA:

  • HDD/SSD (size, smartctl errors, spin up/down, temperature, etc.)
  • Availability/health of docker services
  • CPU usage/RAM/temperature
  • Network interface/throughput/speed/connections
  • fail2ban jails

I try to keep my servers as barebones as possible - additional services/apps put strain on CPU/RAM. It turns out most of the data needed for monitoring is either already available (docker JSON, smartctl JSON) or can be grabbed easily, e.g.

df -Pht ext4 | tail -n +2 | awk '{ print $1 }'
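The push side is then just curl against HA's REST API — a sketch (URL, sensor name and token variable are placeholders; HA wants a long-lived access token in the header):

#!/bin/sh
# report root filesystem usage as a Home Assistant sensor
usage=$(df -P / | tail -n 1 | awk '{ print $5 }' | tr -d '%')
curl -s -X POST "http://homeassistant.local:8123/api/states/sensor.server_disk_usage" \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"state\": \"$usage\", \"attributes\": {\"unit_of_measurement\": \"%\"}}"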

It was fun learning and defining what must be monitored or not, and building a custom interface in HA.

[-] Fermiverse@kbin.social 3 points 11 months ago* (last edited 11 months ago)

That's basically the way I do it.

pvesh get /cluster/resources --output-format json-pretty | jq -r --arg k "lxc/$container_id" '.[] | select(.id == $k) | .name, .mem, .maxmem, .cpu'

An example using pvesh in Proxmox - the data is there, you just have to use it. I also prefer the barebones approach.

[-] easeKItMAn@lemmy.world 2 points 11 months ago

At last we keep it simple ;)

[-] antihumanitarian@lemmy.world 2 points 11 months ago

I use netdata; it's very good at digesting thousands of metrics into something actionable. The cloud portion is proprietary, but you can toggle off the data collection. I did turn the cloud portion on, though, so I get email notifications when something breaks. Might sound counter to the self-hosted mantra, but a self-hosted monitoring system isn't very helpful when your own systems go down.

[-] tychosmoose@lemm.ee 2 points 11 months ago

Monit for simple stuff and daemon restart on failure. LibreNMS for SNMP polling, graphing, logging, & alerting.

[-] ratz@chatsubo.hiteklolife.net 1 points 11 months ago

Might be a bit more complex than what you want, but I love Prometheus + Alertmanager and a nice sexy Grafana dashboard
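The Alertmanager half is driven by plain rule files — a minimal sketch (group/alert names and the 5m window are arbitrary):

groups:
  - name: basics
    rules:
      - alert: InstanceDown
        # fires when a scrape target has been unreachable for 5 minutes
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"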

[-] Taleya@aussie.zone 1 points 11 months ago

Nagios. Core, but I've worked with it for years and am kinda masochistic. (Currently tying it into an iDRAC6)
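For the uninitiated: wiring in a device like that is just another pair of object definitions — a sketch (the address is made up, and a real iDRAC check would go through SNMP/IPMI plugins rather than plain ping):

define host {
    use        generic-host
    host_name  idrac6
    address    192.168.1.120
}

define service {
    use                  generic-service
    host_name            idrac6
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}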

[-] TheInsane42@lemmy.world 4 points 11 months ago

Same here, nagios4 from the debian distro.

I really need to check if it still works. ;)

[-] titey@lemmy.home.titey.net 2 points 11 months ago

Nagios FTW!

[-] pythia@lemmy.dbzer0.com 1 points 11 months ago

I used monitorix a long time ago; now netdata.

[-] Elw@lemmy.sdf.org 1 points 11 months ago

Have used both Zabbix and Prometheus for this. Highly recommend Prometheus over Zabbix if you can afford the time to learn it. It's a bit weird at first, but it's so much easier to extend and manage than Zabbix, in my experience.

You will need to set up Grafana to go along with Prometheus. But, again, it’s so flexible that you will end up being happier with it.
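The learning curve starts gently enough — a minimal prometheus.yml sketch scraping one node_exporter (the target host is a placeholder):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      # node_exporter's default port
      - targets: ["server1.lan:9100"]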

[-] tko@tkohhh.social 1 points 11 months ago

Regarding your edit: people are answering the question you posed in your post title, not necessarily giving you advice about how you should do it.

[-] ThorrJo@lemmy.sdf.org 1 points 11 months ago

I'm ever so slowly teaching myself Zabbix, need something full-featured because I also need monitoring for my hosting clients etc
