Keep your software up and running on the Raspberry Pi
Part of the Raspberry Pi Reliability series.
Even if your Raspberry Pi’s hardware, network connection, and OS remain stable, your software might not be — it may not handle intermittent WiFi connections well, it may slowly leak resources, or it might just get into an odd, broken state occasionally.
This post reviews some options for keeping a service running persistently on a Raspberry Pi. Most solutions apply to other Linux systems, too.
Keeping your service running
The information in this post is, to the best of my knowledge, current as of November 2023. It should work on Raspberry Pi OS versions bullseye and bookworm, at least, but I make no promises.
Systemd services: set the Restart option
For services managed by systemd, set the Restart option to always or on-failure. This goes in the [Service] section of your .service file; you can see an example in my pi-fm-player repo, and a minimal sketch below.
Related options that further customize this behavior:
- RestartSec, also in the [Service] section
- StartLimitIntervalSec and StartLimitBurst, in the [Unit] section of the .service file
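As a rough sketch, the [Service] portion of such a unit file might look like the following (the ExecStart path is a placeholder, not taken from the pi-fm-player repo; the [Unit] options are shown in the next section's sketch):
[Service]
ExecStart=/usr/local/bin/pifm-player
# restart the service whenever it exits, waiting 5 seconds between attempts
Restart=always
RestartSec=5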
Systemd services: reboot the Pi if the service keeps failing
You can also tell systemd to reboot the machine if your service fails to [re]start too many times within a time window. Set StartLimitIntervalSec and StartLimitBurst, then set StartLimitAction=reboot in the [Unit] block of the .service file. See this file for an example.
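A rough sketch of such a [Unit] block, under the (arbitrary) assumption that 5 failed starts within 10 minutes should trigger a reboot:
[Unit]
Description=pifm-player
# if the service must be [re]started more than 5 times within 600 seconds...
StartLimitIntervalSec=600
StartLimitBurst=5
# ...reboot the machine
StartLimitAction=reboot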
You could alternatively use FailureAction to reboot the Pi immediately after your service fails once.
This can be dangerous! I’ve had this reboot a system mid-upgrade because my service was broken during the upgrade. If the service gets into an always-failing state, it may be very hard to log back into the Pi and disable this — your timing has to be just right!
The Raspberry Pi watchdog
My post about mitigating hardware/firmware issues goes into more detail about setting up the (little-known) Raspberry Pi watchdog.
Once you have the watchdog set up, you may consider setting the pidfile option in watchdog.conf to detect when your service's process dies and reboot the system. See watchdog.conf(5) for more.
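For example, assuming your service writes its PID to /run/pifm-player.pid (a hypothetical path), the relevant line in /etc/watchdog.conf would look roughly like:
# reboot if no running process matches the PID stored in this file
pidfile = /run/pifm-player.pid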
This can be dangerous! If the system reboots during an upgrade, or your service stops working and needs manual intervention, you could easily find yourself with a Pi that doesn't stay online long enough for you to fix the issue.
I personally feel this is a pretty big hammer, and if your service is managed by systemd there are better options available, but I’m listing it here for completeness.
The hack: just restart your software periodically
Sometimes you might want to just restart your service on a regular basis, regardless of whether it looks healthy to systemd. Two approaches you can take are:
- For a systemd service, in the [Service] section, set Restart=always and RuntimeMaxSec=3600s. (Of course, change RuntimeMaxSec to your desired restart period; a sketch of this approach follows this list.)
- You can always use a cron job to schedule a periodic restart of some software. (As an example, you might place 0 5 * * * root systemctl restart pifm-player.service in /etc/cron.d/pifm-restart.)
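For the systemd approach, here's a minimal sketch of a drop-in override for the pifm-player service used in the earlier examples (created via sudo systemctl edit pifm-player.service):
# /etc/systemd/system/pifm-player.service.d/override.conf
[Service]
Restart=always
# forcibly restart the service after it has been running for one hour
RuntimeMaxSec=3600s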
Monitoring for OOMs and disk space issues
I collect logs remotely for the vast majority of my Pis and servers, which gives me a centralized spot to implement alerting for “out of disk space” and “out of memory” errors.
See the Remote Logging post in this series for instructions on setting that up.
Monitoring your service
You may wish to be alerted when your service stops working, even if it should get fixed automatically using one of the techniques listed above.
I use Uptime Kuma to monitor my home’s services, so this advice applies to it, but similar techniques ought to work with other monitoring solutions.
To be alerted if/when a systemd-managed service stops running, I set up:
- A push monitor in Uptime Kuma, which requires a heartbeat on a periodic basis and alerts if it stops.
- A simple cron job using my runner tool. For my Pi FM transmitter, this job is located at /etc/cron.d/pifm-check:
* * * * * root runner -job-name "pifm-check" -timeout 10 -retries 2 -retry-delay 15 -success-notify "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=" -- systemctl is-active --quiet pifm-player.service >/dev/null
This is a little dense; let’s break it down:
- * * * * * root runner: every minute, runner starts, under the root user
- -job-name "pifm-check": for logging purposes, this job is called "pifm-check"
- -timeout 10: each attempt to run systemctl is-active times out after 10 seconds
- -retries 2 -retry-delay 15: if the service isn't online, check again up to 2 times, with a 15-second delay between checks
- -success-notify "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=": if the check is successful, send a heartbeat to the Uptime Kuma monitor
- --: we're done configuring runner; now we tell it what to run
- systemctl is-active --quiet pifm-player.service: ask systemctl whether the pifm-player service is alive; this command will succeed (exit 0) if it is, and otherwise fail with a nonzero exit code
- >/dev/null: even if the systemctl command exits with an error code, discard runner's output
To be alerted if/when a Docker-managed service is unhealthy, I set up:
- A push monitor in Uptime Kuma, which requires a heartbeat on a periodic basis and alerts if it stops.
- A simple cron job using my runner tool, very similar to the one above. Here's an example crontab entry:
* * * * * runner -job-name "healthcheck-gluetun" -timeout 30 -success-notify "http://192.168.1.10:9001/api/push/ghijkl?status=up&msg=OK&ping=" -- bash -c "[ \"$(docker container inspect --format '{{.State.Health.Status}}' gluetun 2>&1)\" == \"healthy\" ]" >/dev/null
Here’s a breakdown of this (admittedly complex) line:
- * * * * * runner: every minute, runner starts, under whichever user owns this crontab
- -job-name "healthcheck-gluetun": for logging purposes, this job is called "healthcheck-gluetun"
- -timeout 30: each attempt to check the container's status times out after 30 seconds
- -success-notify "http://192.168.1.10:9001/api/push/ghijkl?status=up&msg=OK&ping=": if the check is successful, send a heartbeat to the Uptime Kuma monitor
- --: we're done configuring runner; now we tell it what to run
- bash -c: the remainder of this command is a Bash one-liner that uses comparison operators, which we run via bash -c
- [ \"$(docker container inspect --format '{{.State.Health.Status}}' gluetun 2>&1)\" == \"healthy\" ]: get the health status for the gluetun container; this test succeeds (exit 0) if it is healthy, and otherwise fails
- >/dev/null: even if the Docker health check fails, discard runner's output
If the Docker container you want to monitor doesn’t have a health check, you can follow this pattern instead to simply check whether it’s running:
[ \"$(docker container inspect --format '{{.State.Status}}' iperf3-server 2>&1)\" == \"running\" ]
To be alerted when one of my own services stops working, regardless of whether the failure causes the program to exit, I:
- Modify my program to support sending a heartbeat periodically if and only if everything is working as expected. You can see an example of this in nut_influx_connector, which uses my new Golang heartbeat library. (A minimal sketch of the idea follows this list.)
- Set up a push monitor in Uptime Kuma.
- Deploy the new software, using the Uptime Kuma push monitor's heartbeat URL.
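Whatever language the program is written in, the heartbeat itself is just an HTTP request to the push monitor's URL, sent only while things are working. Conceptually it's the equivalent of this shell sketch (the URL is the placeholder token from the examples above, and the health check is hypothetical):
# inside the program's periodic "everything is OK" path:
curl -fsS -m 10 "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=" >/dev/null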
This allows monitoring for more subtle problems, like transient network issues, that might cause the service to stop working for a while but don't actually cause the process to exit.
To be alerted when a cron job encounters errors, I use my runner tool. runner is a comprehensive wrapper for cron jobs that can:
- Retry them on error
- Log their output alongside diagnostic details
- Notify you of failures via email, Ntfy, or Discord
- Send heartbeats to Uptime Kuma or similar tools
- Silence their output when they’re successful, so cron only emails you on failures
To monitor dataflows from collectors like nut_influx_connector and ecobee_influx_connector, I:
1. Set up an Uptime Kuma push monitor.
2. Write a script, on a separate machine (typically the one running InfluxDB), that queries Influx and verifies data points have been written to the expected measurement recently. The script should exit 0 if all's well and 1 if the data flow doesn't seem healthy. Here's an example:
#!/usr/bin/env bash
set -euo pipefail
# count the points written to the noise_level measurement over the last ~2 minutes
POINTS=$(curl -s --max-time 15 -G 'http://192.168.1.10:8086/query' --data-urlencode "db=dzhome" --data-urlencode "q=SELECT COUNT(\"moving_avg_dB\") FROM \"two_weeks\".\"noise_level\" WHERE (\"device_name\" = 'shopsonosnoisecontrol_1') AND time > now()-121s" | jq '.results[0].series[0].values[0][1]')
# no response, or no points at all: the data flow is unhealthy
if [ -z "$POINTS" ] || [ "$POINTS" = "null" ]; then
  exit 1
fi
# we aim to log 4 points per second, so in 120 seconds we should have a couple hundred points at the very least...
if (( POINTS < 200 )); then
  exit 1
fi
exit 0
3. Execute that script every minute via my user's crontab, wrapped with runner as described above to send a heartbeat when the data flow is healthy:
* * * * * runner -success-notify "https://myserver.mytailnet.ts.net:9001/api/push/abcdef?status=up&msg=OK&ping=" -- /home/cdzombak/scripts/healthchecks/shopsonosnoisecontrol-influx-healthcheck.sh >/dev/null