this post was submitted on 14 Dec 2023
38 points (89.6% liked)

Selfhosted


Hi, I have a bunch of Raspberry Pis hosting all kinds of stuff, and I want a monitoring solution for all of it. What would you recommend?

My goal is to have an overview of CPU load, network load, and CPU temperature, and to see what's going on inside my Docker containers, as I have everything dockerized. I'd like the solution to be open source, accessible from a web browser, and to have nice load graphs with history. I don't want to spend too much time setting it up.

All my Pis are running Raspberry Pi OS, which is Debian-based.

all 37 comments
[–] [email protected] 25 points 11 months ago (2 children)

The standard solution would be Grafana + Prometheus on one server and a node exporter running on each Pi. You then register the node exporters in Prometheus and use Prometheus as a data source for Grafana. There you build a dashboard showing whatever metrics you want. It can also show some information from the Docker socket, like the number of running/stopped containers and such.
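As a rough sketch of the "register the node exporters" step, the snippet below renders a minimal `prometheus.yml` for a list of Pis. The host names and job name are hypothetical, and it assumes each node exporter listens on its default port, 9100.

```python
from typing import Iterable

NODE_EXPORTER_PORT = 9100  # node exporter's default listen port


def render_scrape_config(hosts: Iterable[str], job_name: str = "pis",
                         scrape_interval: str = "15s") -> str:
    """Render a minimal prometheus.yml that scrapes node exporter on each host."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "scrape_configs:",
        f"  - job_name: {job_name}",
        "    static_configs:",
        "      - targets:",
    ]
    for host in hosts:
        # One target per Pi, in host:port form.
        lines.append(f"          - '{host}:{NODE_EXPORTER_PORT}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host names; replace with your own Pis.
    print(render_scrape_config(["pi1.local", "pi2.local"]))
```

The output can be saved as `prometheus.yml`; Grafana is then pointed at the Prometheus server rather than at the individual Pis.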

[–] [email protected] 4 points 11 months ago

Add in Alertmanager and hook it up to Slack. Get notified whenever containers or systems are misbehaving.

[–] [email protected] 1 points 11 months ago
[–] [email protected] 10 points 11 months ago (1 children)

Seconding Netdata for the temps and load info, and Portainer for Docker monitoring. Netdata gives you more real-time info than even Glances. Portainer is an easy way to look at logs and such; I don’t use it to manage containers, as I prefer the command line for that. Netdata will give you some Docker info, but not logs.

[–] [email protected] 1 points 11 months ago* (last edited 11 months ago)

I use Ansible for management; I just want to see nice graphs and maybe get alerts when things go south. Thanks for the recommendation.

[–] [email protected] 9 points 11 months ago (1 children)

I use Netdata; it's quick and easy, but I don't think it monitors Docker containers specifically.

[–] [email protected] 1 points 11 months ago

I'll take a look, thanks!

[–] [email protected] 7 points 11 months ago (1 children)

Grafana, InfluxDB, and Telegraf agents. Easy to set up, with barely any configuration required. Everything you asked for is in the default Telegraf agent config. There are dashboards with plenty of examples on Grafana's website.

[–] [email protected] 2 points 11 months ago (2 children)

What's the difference between Prometheus and Telegraf? Why do you prefer Telegraf?

[–] [email protected] 2 points 11 months ago* (last edited 11 months ago) (1 children)

My work environments use Prometheus, node-exporter, and Grafana. At home I use Telegraf, InfluxDB, and Grafana (and Prometheus for other app-specific metrics). The biggest reason I went with Telegraf and InfluxDB at home is that Prometheus scrapes data from the configured clients (pull), while Telegraf sends the data to InfluxDB on the configured interval (push). When I started my homelab adventure I had 2 VMs in the cloud and 2 Pis at home, and having Telegraf send the data in to my Pis, rather than having something go out and scrape it, made that remote setup a lot easier. I had InfluxDB set up behind a reverse proxy with auth, so Telegraf was sending data over TLS and only needed to authenticate to a single endpoint. That is the major difference to me, but each one also has its own set of exporters and plugins for tailoring the data, depending on what you want.
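To make the pull side concrete: when Prometheus scrapes a client, it fetches plain-text metrics over HTTP. Below is a minimal sketch of parsing that text format. The sample payload in the usage section is made up, and the parser is deliberately simplified: it ignores escaping inside label values and skips lines that carry an optional timestamp.

```python
import re
from typing import Dict, List, Tuple

# Metric name, optional {labels}, then a single value.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Parse a simplified subset of the Prometheus text exposition format."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # unsupported line shape (e.g. trailing timestamp)
        name, labels, value = match.groups()
        samples.append((name, dict(_LABEL.findall(labels or "")), float(value)))
    return samples


if __name__ == "__main__":
    # Made-up payload in the shape a node exporter scrape returns.
    payload = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""
    for sample in parse_metrics(payload):
        print(sample)
```

In the pull model the server initiates this fetch, which is why every client has to be reachable from the Prometheus host; in the push model the client only needs an outbound connection.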

[–] [email protected] 0 points 11 months ago

Ok, great to know, thanks!

[–] [email protected] 2 points 11 months ago (1 children)

InfluxDB is a “time series” database for storing metrics: temperatures, RAM usage, and CPU usage with timestamps. Telegraf is the client-side agent that sends those metrics to the database (in InfluxDB line protocol by default). Prometheus does pretty much the same thing but is a bit too bloated for my liking, so I went back to Influx.
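For illustration, a point pushed to InfluxDB is normally written in InfluxDB line protocol: a measurement name, optional tags, one or more fields, and a timestamp. The formatter below is a simplified sketch, not Telegraf's own serializer. It handles only numeric and boolean fields, does not escape backslashes, and the measurement, tag, and field names in the usage section are made up.

```python
from typing import Dict, Union

FieldValue = Union[int, float, bool]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, FieldValue], timestamp_ns: int) -> str:
    """Format one point as a line of (simplified) InfluxDB line protocol."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys, as recommended for write performance
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    parts = []
    for key, value in fields.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = f"{value}i"  # integers carry an 'i' suffix
        else:
            rendered = repr(float(value))
        parts.append(_escape(key, ",= ") + "=" + rendered)
    return f"{head} {','.join(parts)} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical measurement and tag names.
    print(to_line_protocol("cpu_temp", {"host": "pi1"},
                           {"celsius": 54.2}, 1702566300000000000))
```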

[–] [email protected] 2 points 11 months ago

Prometheus is bloated?

[–] [email protected] 5 points 11 months ago

Netdata is easy to set up and detects a lot of things on its own, like databases and ntpd and...

[–] [email protected] 3 points 11 months ago (1 children)

I didn't see it recommended yet: Uptime Kuma is really simple if you just want to monitor the basics, like whether a URL works, or ping, TCP, etc., without an agent.

It doesn't do CPU/memory style metrics, but I find myself checking it more often because of how simple it is.
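An agentless check of this kind boils down to trying to reach the service from the outside. Here is a minimal sketch of a TCP port check using only the standard library. It illustrates the idea and is not how Uptime Kuma itself is implemented; the host and port in the usage section are placeholders.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder target; point this at one of your own services.
    print(tcp_check("127.0.0.1", 22))
```

Run on a schedule from a separate machine, a check like this says whether a service is reachable, but nothing about CPU or memory on the host.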

[–] [email protected] 0 points 11 months ago (1 children)

I need CPU and other metrics because recently one of my Docker containers got infected with DDoS software, and the CPU spike was the telltale sign.

[–] [email protected] 1 points 11 months ago (1 children)

Omg I have CPU spikes on my Raspberry Pi. Maybe it's infected too, and how would I ever find out?

Is there some software I can run to check?

[–] [email protected] 0 points 11 months ago (1 children)

Are they small spikes spread across time or large chunks of heavy load, like 80%+ load for hours? If it's the first, then probably it's just normal operation. Otherwise check your running processes and start tracking what's going on during high loads.
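The distinction above, short spikes versus long stretches of heavy load, can be checked mechanically once samples are collected. This is a minimal sketch that assumes CPU usage is sampled at a fixed interval. The 80% threshold comes from the comment above; the 60-second interval and one-hour minimum duration are arbitrary choices for illustration.

```python
from typing import Sequence


def longest_high_run(samples: Sequence[float], threshold: float = 80.0) -> int:
    """Length of the longest run of consecutive samples at or above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest


def is_sustained_load(samples: Sequence[float], interval_s: float = 60.0,
                      threshold: float = 80.0,
                      min_duration_s: float = 3600.0) -> bool:
    """True if load stayed at or above threshold for at least min_duration_s."""
    return longest_high_run(samples, threshold) * interval_s >= min_duration_s


if __name__ == "__main__":
    spiky = [10, 95, 12, 90, 15] * 30        # brief, isolated spikes
    heavy = [20] * 5 + [97] * 90 + [30] * 5  # 90 minutes of heavy load
    print(is_sustained_load(spiky), is_sustained_load(heavy))
```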

[–] [email protected] 1 points 11 months ago (1 children)

I would say it's 100% load for maybe 3 minutes, so maybe it's normal.

It makes my system overload so my PiHole stops processing.

But it sounds like maybe it's normal and a background service using too much sometimes?

[–] [email protected] 0 points 10 months ago

Maybe normal, maybe not. What software do you run there?

[–] [email protected] 2 points 11 months ago

I'm kind of loving Zabbix, but not sure if it's the right solution for your needs. I'd say it would definitely work, but does take a bit of setup initially. This article is interesting, and seems to have a lot of what you want. Not sure if you want to do all of this. https://opensource.com/article/23/3/build-raspberry-pi-dashboard-appsmith

[–] [email protected] 1 points 11 months ago (1 children)

Zabbix can do everything you’re asking and can be connected to Grafana if you want custom visualisations. Most importantly, it contextualises what you need to know on the dashboard, as in it only tells you about things that require your attention.

You’re of course able to dive into the data and look at raw values or graphs if you wish, and can build custom dashboards too.

I’ve used it in both home lab and production scenarios, monitoring small to mid-size private clouds, including Windows and Linux hosts, Docker, backups, SAN arrays, switches, VMware vSphere, firewalls, the lot. It’s extremely powerful and not terribly manual to set up.

If metrics are all you want and you aren’t too fussed about the proactive monitoring focus, Netdata is a great option for getting up and running quickly.

[–] [email protected] 1 points 11 months ago (1 children)

Mmm, forgot about Zabbix, they're actually from my home country and I used to know some people there.

[–] [email protected] 2 points 11 months ago* (last edited 11 months ago) (1 children)

As someone who used Zabbix and stopped because it's an absolute pig, I'm going to recommend Checkmk.

[–] [email protected] 1 points 11 months ago* (last edited 11 months ago) (2 children)

As someone who used Checkmk Raw (the free edition) and stopped because it's an absolute bastard to set up, I'm going to recommend Zabbix.

I'll specify. Installing Checkmk's agent on the servers was fine, but service discovery is completely unreliable (I could never get through a full discovery before it just hung on something unknown) and the web GUI is very difficult to navigate. Unfortunately most of the documentation only covers the paid version, and community support is pretty bad.

Switching to Zabbix was almost plug-and-play by comparison.

[–] [email protected] 11 points 11 months ago

As someone reading this thread, I'm stuck in an endless loop.

[–] [email protected] 1 points 11 months ago

> As someone who used Checkmk Raw (the free edition) and stopped because it’s an absolute bastard to set up, I’m going to recommend Zabbix.

I feel the same way about Zabbix. I'm not going to pretend that Checkmk is super intuitive, but I guess I got over the hump, because I don't find it difficult to set up, use, or manage.

The reality is that for a homelab setup, something like Uptime Kuma would probably fit the bill for most people. I wanted something like Zabbix or Checkmk for all the detailed stats that OP is looking for. The longer I use it, the more I realize I don't need them 99% of the time, and I should probably give something like Uptime Kuma a try myself.

[–] [email protected] 1 points 11 months ago

Glances should be right up your alley.

[–] [email protected] 1 points 11 months ago
[–] [email protected] 1 points 11 months ago

collectd on the host, sending data to an OpenWrt Docker container. You can view graphs in the OpenWrt LuCI UI.

[–] [email protected] 1 points 10 months ago

I don't use Docker, so this may not help you, but I find bpytop and SSH work just fine. :)

[–] [email protected] 1 points 11 months ago* (last edited 10 months ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

| Fewer Letters | More Letters |
|---|---|
| DNS | Domain Name Service/System |
| PiHole | Network-wide ad-blocker (DNS sinkhole) |
| SAN | Storage Area Network |
| SSL | Secure Sockets Layer, for transparent encryption |
| TLS | Transport Layer Security, supersedes SSL |

3 acronyms in this thread; the most compressed thread commented on today has 7 acronyms.

[Thread #353 for this sub, first seen 14th Dec 2023, 15:25]

[–] [email protected] 1 points 11 months ago

It's old-fashioned and maybe difficult to set up initially, but I really like Munin.