Keep your software up and running on the Raspberry Pi
Part of the Raspberry Pi Reliability series.
Even if your Raspberry Pi’s hardware, network connection, and OS remain stable, your software might not be — it may not handle intermittent WiFi connections well, it may slowly leak resources, or it might just get into an odd, broken state occasionally.
This post reviews some options for keeping a service running persistently on a Raspberry Pi. Most solutions apply to other Linux systems, too.
Keeping your service running
The information in this post is, to the best of my knowledge, current as of November 2023. It should work on Raspberry Pi OS versions bullseye and bookworm, at least, but I make no promises.
Systemd services: set the Restart option
For services managed by systemd, set the Restart option to always or on-failure. This goes in the [Service] section of your .service file; you can see an example in my pi-fm-player repo, and a minimal sketch below.
Related options that further customize this behavior:
- RestartSec, also in the [Service] section
- StartLimitIntervalSec and StartLimitBurst, in the [Unit] section of the .service file
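As a rough sketch, the [Service] portion of such a unit file might look like the following (the ExecStart path is a placeholder, not taken from the pi-fm-player repo; the [Unit] options are shown in the next section's sketch):
[Service]
ExecStart=/usr/local/bin/pifm-player
# restart the service whenever it exits, waiting 5 seconds between attempts
Restart=always
RestartSec=5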
Systemd services: reboot the Pi if the service keeps failing
You can also tell systemd to reboot the machine if your service fails to [re]start too many times within a time window. Set StartLimitIntervalSec and StartLimitBurst, then set StartLimitAction=reboot in the [Unit] block of the .service file. See this file for an example.
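A rough sketch of such a [Unit] block, under the (arbitrary) assumption that 5 failed starts within 10 minutes should trigger a reboot:
[Unit]
Description=pifm-player
# if the service must be [re]started more than 5 times within 600 seconds...
StartLimitIntervalSec=600
StartLimitBurst=5
# ...reboot the machine
StartLimitAction=reboot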
You could alternatively use FailureAction to reboot the Pi immediately after your service fails once.
This can be dangerous! I’ve had this reboot a system mid-upgrade because my service was broken during the upgrade. If the service gets into an always-failing state, it may be very hard to log back into the Pi and disable this — your timing has to be just right!
The Raspberry Pi watchdog
My post about mitigating hardware/firmware issues goes into more detail about setting up the (little-known) Raspberry Pi watchdog.
Once you have the watchdog set up, you may consider setting the pidfile option in watchdog.conf to detect when your service's process dies and reboot the system. See watchdog.conf(5) for more.
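For example, assuming your service writes its PID to /run/pifm-player.pid (a hypothetical path), the relevant line in /etc/watchdog.conf would look roughly like:
# reboot if no running process matches the PID stored in this file
pidfile = /run/pifm-player.pid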
This can be dangerous! If the system reboots during an upgrade, or your service stops working and needs manual intervention, you could easily find yourself with a Pi that doesn't stay online long enough for you to fix the issue.
I personally feel this is a pretty big hammer, and if your service is managed by systemd there are better options available, but I’m listing it here for completeness.
The hack: just restart your software periodically
Sometimes you might want to just restart your service on a regular basis, regardless of whether it looks healthy to systemd. Two approaches you can take are:
- For a systemd service, in the [Service] section, set Restart=always and RuntimeMaxSec=3600s. (Of course, change RuntimeMaxSec to your desired restart period; a sketch of this approach follows this list.)
- You can always use a cron job to schedule a periodic restart of some software. (As an example, you might place 0 5 * * * root systemctl restart pifm-player.service in /etc/cron.d/pifm-restart.)
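For the systemd approach, here's a minimal sketch of a drop-in override for the pifm-player service used in the earlier examples (created via sudo systemctl edit pifm-player.service):
# /etc/systemd/system/pifm-player.service.d/override.conf
[Service]
Restart=always
# forcibly restart the service after it has been running for one hour
RuntimeMaxSec=3600s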
Monitoring for OOMs and disk space issues
I collect logs remotely for the vast majority of my Pis and servers, which gives me a centralized spot to implement alerting for “out of disk space” and “out of memory” errors.
See the Remote Logging post in this series for instructions on setting that up.
Monitoring your service
You may wish to be alerted when your service stops working, even if it should get fixed automatically using one of the techniques listed above.
I use Uptime Kuma to monitor my home’s services, so this advice applies to it, but similar techniques ought to work with other monitoring solutions.
To be alerted if/when a systemd-managed service stops running, I set up:
- A push monitor in Uptime Kuma, which requires a heartbeat on a periodic basis and alerts if it stops.
- A simple cron job using my runner tool. For my Pi FM transmitter, this job is located at /etc/cron.d/pifm-check:
* * * * * root runner -job-name "pifm-check" -timeout 10 -retries 2 -retry-delay 15 -success-notify "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=" -- systemctl is-active --quiet pifm-player.service >/dev/null
This is a little dense; let’s break it down:
- * * * * * root runner: every minute, runner starts, under the root user
- -job-name "pifm-check": for logging purposes, this job is called "pifm-check"
- -timeout 10: each attempt to run systemctl is-active times out after 10 seconds
- -retries 2 -retry-delay 15: if the service isn't online, check again up to 2 times, with a 15-second delay between checks
- -success-notify "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=": if the check is successful, send a heartbeat to the Uptime Kuma monitor
- --: we're done configuring runner; now we tell it what to run
- systemctl is-active --quiet pifm-player.service: ask systemctl whether the pifm-player service is alive; this command will succeed (exit 0) if it is, and otherwise fail with a nonzero exit code
- >/dev/null: even if the systemctl command exits with an error code, discard runner's output
To be alerted if/when a Docker-managed service is unhealthy, I set up:
- A push monitor in Uptime Kuma, which requires a heartbeat on a periodic basis and alerts if it stops.
- A simple cron job using my runner tool, very similar to the one above. Here's an example crontab entry:
* * * * * runner -job-name "healthcheck-gluetun" -timeout 30 -success-notify "http://192.168.1.10:9001/api/push/ghijkl?status=up&msg=OK&ping=" -- bash -c "[ \"$(docker container inspect --format '{{.State.Health.Status}}' gluetun 2>&1)\" == \"healthy\" ]" >/dev/null
Here’s a breakdown of this (admittedly complex) line:
- * * * * * runner: every minute, runner starts, under whichever user owns this crontab
- -job-name "healthcheck-gluetun": for logging purposes, this job is called "healthcheck-gluetun"
- -timeout 30: each attempt to check the container's status times out after 30 seconds
- -success-notify "http://192.168.1.10:9001/api/push/ghijkl?status=up&msg=OK&ping=": if the check is successful, send a heartbeat to the Uptime Kuma monitor
- --: we're done configuring runner; now we tell it what to run
- bash -c: the remainder of this command is a Bash one-liner that uses comparison operators, which we run via bash -c
- [ \"$(docker container inspect --format '{{.State.Health.Status}}' gluetun 2>&1)\" == \"healthy\" ]: get the health status for the gluetun container; this test succeeds (exit 0) if it is healthy, and otherwise fails
- >/dev/null: even if the Docker health check fails, discard runner's output
If the Docker container you want to monitor doesn’t have a health check, you can follow this pattern instead to simply check whether it’s running:
[ \"$(docker container inspect --format '{{.State.Status}}' iperf3-server 2>&1)\" == \"running\" ]
To be alerted when one of my own services stops working, regardless of whether the failure causes the program to exit, I:
- Modify my program to support sending a heartbeat periodically if and only if everything is working as expected. You can see an example of this in nut_influx_connector, which uses my new Golang heartbeat library. (A minimal sketch of the idea follows this list.)
- Set up a push monitor in Uptime Kuma.
- Deploy the new software, using the Uptime Kuma push monitor's heartbeat URL.
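Whatever language the program is written in, the heartbeat itself is just an HTTP request to the push monitor's URL, sent only while things are working. Conceptually it's the equivalent of this shell sketch (the URL is the placeholder token from the examples above, and the health check is hypothetical):
# inside the program's periodic "everything is OK" path:
curl -fsS -m 10 "http://192.168.1.10:9001/api/push/abcdef?status=up&msg=OK&ping=" >/dev/null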
This allows monitoring for more subtle problems, like transient network issues, that might cause the service to stop working for a while but don't actually cause the process to exit.
To be alerted when a cron job encounters errors, I use my runner tool. runner is a comprehensive wrapper for cron jobs that can:
- Retry them on error
- Log their output alongside diagnostic details
- Notify you of failures via email, Ntfy, or Discord
- Send heartbeats to Uptime Kuma or similar tools
- Silence their output when they’re successful, so cron only emails you on failures
To monitor dataflows from collectors like nut_influx_connector and ecobee_influx_connector, I:
1. Set up an Uptime Kuma push monitor.
2. Write a script, on a separate machine (typically the one running InfluxDB), that queries Influx and verifies data points have been written to the expected measurement recently. The script should exit 0 if all's well and 1 if the data flow doesn't seem healthy. Here's an example:
#!/usr/bin/env bash
set -euo pipefail
# count the points written to the noise_level measurement over the last ~2 minutes
POINTS=$(curl -s --max-time 15 -G 'http://192.168.1.10:8086/query' --data-urlencode "db=dzhome" --data-urlencode "q=SELECT COUNT(\"moving_avg_dB\") FROM \"two_weeks\".\"noise_level\" WHERE (\"device_name\" = 'shopsonosnoisecontrol_1') AND time > now()-121s" | jq '.results[0].series[0].values[0][1]')
# no response, or no points at all: the data flow is unhealthy
if [ -z "$POINTS" ] || [ "$POINTS" = "null" ]; then
  exit 1
fi
# we aim to log 4 points per second, so in 120 seconds we should have a couple hundred points at the very least...
if (( POINTS < 200 )); then
  exit 1
fi
exit 0
3. Execute that script every minute via my user's crontab, wrapped with runner as described above to send a heartbeat when the data flow is healthy:
* * * * * runner -success-notify "https://myserver.mytailnet.ts.net:9001/api/push/abcdef?status=up&msg=OK&ping=" -- /home/cdzombak/scripts/healthchecks/shopsonosnoisecontrol-influx-healthcheck.sh >/dev/null