Keep your software up and running on the Raspberry Pi

Part of the Raspberry Pi Reliability series.

Even if your Raspberry Pi’s hardware, network connection, and OS remain stable, your software might not be — it may not handle intermittent WiFi connections well, it may slowly leak resources, or it might just get into an odd, broken state occasionally.

This post reviews some options for keeping a service running persistently on a Raspberry Pi. Most solutions apply to other Linux systems, too.

Keeping your service running

The information in this post is, to the best of my knowledge, current as of November 2023. It should work on Raspberry Pi OS versions bullseye and bookworm, at least, but I make no promises.

Systemd services: set the Restart option

For services managed by systemd, set the Restart option to always or on-failure. This goes in the [Service] section of your .service file; you can see an example in my pi-fm-player repo.

Related options, like RestartSec= (how long systemd waits before restarting) and the StartLimit* settings discussed below, further customize this behavior; see systemd.service(5) for the full list.
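For instance, a minimal sketch of a unit using these options (the description, binary path, and delay below are placeholders, not taken from pi-fm-player):

[Unit]
Description=My example service

[Service]
ExecStart=/usr/local/bin/my-service
# Restart whenever the process exits uncleanly; use "always" to also restart
# after clean exits.
Restart=on-failure
# Wait 5 seconds between restart attempts (the default is 100ms).
RestartSec=5

[Install]
WantedBy=multi-user.target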

Systemd services: reboot the Pi if the service keeps failing

You can also tell systemd to reboot the machine if your service fails to [re]start too many times within a time window. Set StartLimitIntervalSec and StartLimitBurst, then set StartLimitAction=reboot in the [Unit] block of the .service file. See this file for an example.
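As a sketch, with illustrative thresholds rather than recommendations, the relevant [Unit] settings look like this:

[Unit]
Description=My example service
# If the service has to be (re)started more than 5 times within 10 minutes...
StartLimitIntervalSec=600
StartLimitBurst=5
# ...stop trying and reboot the machine instead.
StartLimitAction=reboot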

You could alternatively use FailureAction to reboot the Pi immediately after your service fails once.

This can be dangerous! I’ve had this reboot a system mid-upgrade because my service was broken during the upgrade. If the service gets into an always-failing state, it may be very hard to log back into the Pi and disable this — your timing has to be just right!

The Raspberry Pi watchdog

My post about mitigating hardware/firmware issues goes into more detail about setting up the (little-known) Raspberry Pi watchdog.

Once you have the watchdog set up, you may consider setting the pidfile option in watchdog.conf so the watchdog detects when your service's process dies and reboots the system. See watchdog.conf(5) for more.
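As a sketch, the relevant watchdog.conf line looks like this (the path is an example; your service has to actually write a pidfile there):

# /etc/watchdog.conf (excerpt)
# Read the PID from this file and verify that process still exists; if it
# doesn't, the watchdog daemon reboots the system.
pidfile = /var/run/pifm-player.pid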

This can be dangerous! If the system reboots during an upgrade, or your service stops working and needs manual intervention, you could easily find yourself with a Pi that doesn't stay online long enough for you to fix the issue.

I personally feel this is a pretty big hammer, and if your service is managed by systemd there are better options available, but I’m listing it here for completeness.

The hack: just restart your software periodically

Sometimes you might want to just restart your service on a regular basis, regardless of whether it looks healthy to systemd. Two approaches you can take are a cron job (or systemd timer) that runs systemctl restart on a schedule, and setting RuntimeMaxSec= in the unit's [Service] section so systemd terminates the service after it has run for a given time (with Restart= set, it then starts right back up). Both are sketched below.
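Sketches of both, with example names and schedules:

# Option 1: a system crontab entry (note the user field) that restarts the
# service every night at 04:00.
0 4 * * *  root  systemctl restart pifm-player.service >/dev/null

# Option 2: in the [Service] section of the unit, have systemd terminate the
# service after 86400 seconds (24 hours) of runtime; with Restart= set, it
# comes right back up.
RuntimeMaxSec=86400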

Monitoring for OOMs and disk space issues

I collect logs remotely for the vast majority of my Pis and servers, which gives me a centralized spot to implement alerting for “out of disk space” and “out of memory” errors.

See the Remote Logging post in this series for instructions on setting that up.

Monitoring your service

You may wish to be alerted when your service stops working, even if it should get fixed automatically using one of the techniques listed above.

I use Uptime Kuma to monitor my home’s services, so this advice is written with it in mind, but similar techniques ought to work with other monitoring solutions.

To be alerted if/when a systemd-managed service stops running, I set up:

*  *  *  *  *  root  runner -job-name "pifm-check" -timeout 10 -retries 2 -retry-delay 15 -success-notify "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=" -- systemctl is-active --quiet pifm-player.service  >/dev/null

This is a little dense; let’s break it down. The job runs every minute as root (it lives in a system crontab, hence the user field). runner wraps the actual check: -job-name labels it for reporting, -timeout bounds how long it can run, and -retries/-retry-delay re-run a failed check a couple of times before giving up. If the check ultimately succeeds, -success-notify hits the Uptime Kuma push monitor URL, keeping that monitor up. The check itself, after the --, is systemctl is-active --quiet, which exits nonzero when pifm-player.service isn’t active; >/dev/null keeps cron from emailing output on every run.

To be alerted if/when a Docker-managed service is unhealthy, I set up:

*  *  *  *  *  runner -job-name "healthcheck-gluetun" -timeout 30 -success-notify "http://192.168.1.10:9001/api/push/ghijkl?status=up&msg=OK&ping=" -- bash -c "[ \"$(docker container inspect --format '{{.State.Health.Status}}' gluetun 2>&1)\" == \"healthy\" ]" >/dev/null

Here’s a breakdown of this (admittedly complex) line: docker container inspect prints the gluetun container’s health check status, and the bracketed test succeeds only when that status is healthy. runner wraps the check with a timeout and, on success, pings the Uptime Kuma push monitor; if the container is unhealthy or missing (2>&1 folds the error message into the comparison, which then fails), no heartbeat is sent and the push monitor eventually alerts.

If the Docker container you want to monitor doesn’t have a health check, you can follow this pattern instead to simply check whether it’s running:

[ \"$(docker container inspect --format '{{.State.Status}}' iperf3-server 2>&1)\" == \"running\" ]

To be alerted when one of my own services stops working, regardless of whether the failure causes the program to exit, I:

  1. Modify my program to support sending a heartbeat periodically if and only if everything is working as expected. You can see an example of this in nut_influx_connector, which uses my new Golang heartbeat library.
  2. Set up a push monitor in Uptime Kuma.
  3. Deploy the new software, using the Uptime Kuma push monitor’s heartbeat URL.

This allows monitoring for more subtle issues, like transient network issues, that might cause the service to stop working for a while but don’t actually cause the process to exit.
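The shape of the pattern, sketched here in shell for illustration (the real implementations do this in-process, in Go; the URL is a placeholder and do_work stands in for the service’s actual job):

#!/usr/bin/env bash
# Heartbeat pattern: ping the Uptime Kuma push monitor only when the work
# actually succeeded, so the monitor alerts when heartbeats stop arriving.
HEARTBEAT_URL="http://192.168.1.10:9001/api/push/example?status=up&msg=OK&ping="

do_work() {
  # Stand-in for whatever the service really does; return nonzero on failure.
  true
}

while true; do
  if do_work; then
    curl -fsS --max-time 10 "$HEARTBEAT_URL" >/dev/null || true
  fi
  sleep 60
done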

To be alerted when a cron job encounters errors, I use my runner tool. runner is a comprehensive wrapper for cron jobs that can, among other things, give a job a name and a timeout, retry it when it fails, and notify a URL (like an Uptime Kuma push monitor) when it succeeds.

To monitor dataflows from collectors like nut_influx_connector and ecobee_influx_connector, I:

  1. Set up an Uptime Kuma push monitor.
  2. Write a script, on a separate machine (typically the one running InfluxDB), that queries Influx and verifies data points have been written to the expected measurement recently. The script should exit 0 if all’s well and 1 if the data flow doesn’t seem healthy. Here’s an example:
#!/usr/bin/env bash
set -euo pipefail

# Count the points this device wrote to Influx in the last ~2 minutes, via the
# InfluxDB 1.x HTTP query API; jq digs the count out of the JSON response.
POINTS=$(curl -s --max-time 15 -G 'http://192.168.1.10:8086/query' --data-urlencode "db=dzhome" --data-urlencode "q=SELECT COUNT(\"moving_avg_dB\") FROM \"two_weeks\".\"noise_level\" WHERE (\"device_name\" = 'shopsonosnoisecontrol_1') AND time > now()-121s" | jq '.results[0].series[0].values[0][1]')
if [ -z "$POINTS" ] || [ "$POINTS" = "null" ]; then
        exit 1
fi
# we aim to log 4 points per second, so in 120 seconds we should have a couple hundred points at the very least...
if (( POINTS < 200 )); then
        exit 1
fi

exit 0

  3. Execute that script every minute via my user’s crontab, wrapped with runner as described above to send a heartbeat when the data flow is healthy:

*  *  *  *  *  runner -success-notify "https://myserver.mytailnet.ts.net:9001/api/push/abcdef?status=up&msg=OK&ping=" -- /home/cdzombak/scripts/healthchecks/shopsonosnoisecontrol-influx-healthcheck.sh  >/dev/null

See Also: Considerations for a long-running Raspberry Pi.