Debugging an Intermittent Arduino/ESP8266 ISR Crash
Part of the Getting Started with ESP8266 + Arduino series.
I’ve been playing with Wemos D1 Mini ESP8266 boards lately, in preparation for some upcoming project ideas, and just because learning new things is fun. I have some ideas for a series of blog posts about these boards; debugging this issue seems like as good a place as any to start. It’s not quite beginner-level, but hopefully it’s helpful.
For context, I’m using the Arduino core for ESP8266, and using PlatformIO with CLion, rather than the Arduino IDE, as my development environment.
I’m working on this program (that link is to the specific commit I’m debugging here). It does a few things, listed here in no particular order:
- Loads a root certificate store from onboard flash memory, to use for TLS certificate validation.
- Connects to my home WiFi network.
- Sets the local clock via NTP, which is required for TLS certificate validation. (Once I wrap this demo up, I plan to do another focused on timekeeping, time sync, and timezones/DST on this platform.)
- Shows up on the network via multicast DNS (the thing that makes
.local
domains work on your network). - Every 30 seconds, fetches its Internet-facing IP address, via an HTTPS request to ip.dzdz.cz (a domain I use for deploying weird, one-off, personal projects). Then, it encodes a JSON object containing the Internet-facing IP and the WiFi-local IP, and sends that JSON payload to another HTTPS endpoint (I’m using Hookbin).
- Blinks the development board’s LED to indicate status: quickly on and off while connecting to WiFi, blink briefly every once in a while while connected, or slowly on and off to indicate that a failure occured loading the certificate store. (This is not strictly necessary right now, since I’m monitoring the board’s log messages via its USB connection to my computer; but when these are deployed in the field I want to know at a glance if a given project is connected or not.)
- Monitors the WiFi connection and updates the LED blinking rate to indicate when the connection is lost and regained.
These tasks are coordinated using this Arduino cooperative multitasking library, which looks quite capable — my project is only scratching the surface of its API, which I wish was better documented.
Aside: “Cooperative multitasking” means the operating system, or in this case the scheduler, does not interrupt a running task to give time to others; it’s up to each task to return quickly or take breaks to give other tasks a turn. (In the future I’d like to play with FreeRTOS on the ESP8266 platform, but I had to cut scope somewhere on these demos, and between my admittedly-cursory familiarity with Arduino and the large community it provides, I figured I should stick with Arduino for now.)
The problem
The HTTPS demo project crashes after running for several minutes. To start debugging this, I enabled crash decoding in the platformio device monitor
tool. (I describe that setup in a separate blog post.)
This allowed me, eventually, to capture a useful crash log.
An initial attempt
Searching the Internet for esp8266 "rst cause:2, boot mode:(3,6)"
, which is printed at the bottom of the exception, I found several results saying that the ESP8266 will reset like this if you don’t call yield()
often enough in long-running functions or inside loops. This is because the ESP8266’s WiFi stack needs processor time to run, and if it doesn’t get enough processor time, a watchdog timer will reset the chip.
“Easy enough,” I figured. This crash looked to be happening during HTTPS requests, which the documentation warns are hard work for this processor. So I assumed — wrongly — that maybe I could sprinkle some yields throughout this function. That did not fix the problem.
Diving deeper into the crash log
A few things jump out right away:
Illegal instruction
epc1=0x4022698c in __divsf3 at /workdir/repo/gcc-gnu/libgcc/config/xtensa/ieee754-sf.S:1036
First, the “illegal instruction” message where the program apparently crashed is odd. This definitely should exist and be usable.
Then, let’s take a look at the stack trace. At the top:
0x4010064c in timer1_isr_handler at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_timer.cpp:34
0x401002c2 in ledTimerISR() at src/main.cpp:257
0x40100698 in timer1_isr_handler at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_timer.cpp:44
So, this crash happened while in my LED blinking interrupt service routine (ISR). That’s not particularly surprising — as this guide to interrupts suggests, ISRs are somewhat touchy. To quote my dad, “It’s best to keep ISRs as simple and short as possible.” Thing is, I thought this ISR was pretty simple — it just toggles the LED and sets the timer to keep the LED on or off for the correct amount of time. We’ll come back to that.
It’s confusing to read the rest of this trace; I don’t know exactly how to interpret a stack trace produced during an ISR. The few functions on top have no relation to those at the bottom, and it’s ambiguous to me where the the normal program flow ends and the interrupt-related stack entries begin. I’m sure someone out there with experience on this platform knows exactly how to read this … but I don’t.
At the bottom of the stack trace:
0x40235f98 in br_ssl_engine_closed at /home/earle/src/esp-quick-toolchain/arduino/tools/sdk/ssl/bearssl/src/inner.h:2211
(inlined by) jump_handshake at /home/earle/src/esp-quick-toolchain/arduino/tools/sdk/ssl/bearssl/src/ssl/ssl_engine.c:1081
0x4023658d in br_ssl_engine_recvrec_ack at /home/earle/src/esp-quick-toolchain/arduino/tools/sdk/ssl/bearssl/src/ssl/ssl_engine.c:1206
0x4020b878 in thunk_br_ssl_engine_recvrec_ack at ??:?
I think this means that before the LED timer interrupt fired, the processor was doing some HTTPS stuff. This makes sense, since this is where the processor is spending the bulk of its time.
In the middle of this trace (snipped for brevity):
0x40106bec in spi_flash_read at ??:?
0x402182a2 in flash_hal_read(unsigned int, unsigned int, unsigned char*) at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/flash_hal.cpp:36
…
0x402134f0 in littlefs_impl::LittleFSFileImpl::read(unsigned char*, unsigned int) at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/libraries/LittleFS/src/LittleFS.h:392
The important thing here is that it’s trying to read something from flash memory. Again, I don’t know how to accurately interpret this stack trace, but this is interesting and relevant because:
- LittleFS is used to persist the certificate root store, so BearSSL needs to read it as part of a TLS handshake. This makes sense, and it looks to me like it’s busy doing just that when the crash occurs.
- ESP8266 ISRs need to be marked with the
IRAM_ATTR
attribute. According to the documentation, this is because “the flash may be in the middle of other operations when [interrupts] occur.”
It will be useful to fully understand that last comment, so let’s discuss…
ESP8266’s memory architecture
Here I refer to the ESP8266 datasheet and the ESP8266 technical reference. They don’t go into as much detail as I would like, but we can get a high-level understanding of what’s happening here.
The datasheet (in sections 3.1.2 & 3.1.3) notes:
- “There is no programmable ROM in the SoC. Therefore, user program must be stored in an external SPI flash.”
- “ESP8266EX uses external SPI flash to store user programs …”
The technical reference, in section 5.1, says “… SPI module is used to read the running program from external Flash to CPU CACHE …”
Unfortunately, no reference material I can find details exactly how that process works — for example, exactly when does the ESP8266 read user code from flash? How much does it read at a time?
But our code is stored in flash, connected to the processor via SPI. And, during an ISR, something else may be using flash memory, so our ISR — and everything it calls — needs to be in RAM already, so the processor can execute it. This is what the IRAM_ATTR
attribute does.
Armed with this knowledge, I think the stack trace sure looks like something in my ISR is touching the flash memory, but BearSSL is already using it. Hence the crash.
But what’s using the flash memory?
What’s touching the flash?
My ISR is already marked with IRAM_ATTR
, and the stack trace hints that it’s at least started executing, so I can assume that’s not the issue.
However, in that ISR I do call this function, setInterval
, in the timer library I was using. It’s not marked with IRAM_ATTR
, and neither is the setFrequency
function it calls.
It’s easy to verify that these calls are causing the problem: commenting out the ITimer.setInterval
calls in the ISR fixes the crash. The LED doesn’t blink, of course, but this means I’m on the right track.
This problem now seems easy enough to fix:
- First, start using a local copy of the library, so I can modify it.
- Then, mark those functions with
IRAM_ATTR
. (I also commented out a debug logging statement, which might touchSerial
, which we can’t do from an ISR.)
Really, these hacks indicate that I shouldn’t be using this library — at least not the way I’m trying to. (Spoiler: I get rid of the library soon.)
Rebuild, re-upload, and … it still crashes in the same way: TLS stuff at the bottom of the stack trace, then SPI flash stuff, then the LED blinking ISR.
Taking another look at the library’s code, something jumped out at me: it’s performing floating-point division. And at the very top of the decoded exception, I remember something about division, in a file named ieee754-sf.S
:
Illegal instruction
epc1=0x4022698c in __divsf3 at /workdir/repo/gcc-gnu/libgcc/config/xtensa/ieee754-sf.S:1036
IEEE754 is a standard for floating-point arithmetic, and that file looks like it’s a software implementation of at least some floating-point math routines. A quick search tells me that ESP8266 doesn’t have a floating-point unit (plus, “floating-point,” “IEEE754,” and related terms don’t even appear in the datasheet or technical reference manual).
So, at this point I feel safe assuming that:
- When the processor is very busy doing TLS stuff (which does not need anything floating-point-related, but needs a lot of processor time and RAM), this floating-point implementation is pushed out of RAM.
- When the processor is in the middle of reading the certificate store from flash, to support TLS certificate validation, nothing else can use the flash.
- If my ISR runs at that particular point, the ESP8266 needs to retrieve the implementation of
__divsf3
from flash, but it can’t.
Again, I do wish there were more documentation about exactly how the ESP8266 manages the user code cache in RAM, but this seems like a solid enough theory.
Actually fixing it
After reviewing the ESP8266TimerInterrupt library’s code a little more, I concluded:
- This seems overcomplicated, at least for my current use case.
- I don’t like this library very much; the documentation is somewhat confusing, and there appear to be two completely separate implementations, each with a different feature set.
- It’s not worth trying to fix the library to remove the floating-point math.
- I probably shouldn’t be calling this much external stuff from an ISR anyway.
With all that in mind, I rewrote the LED blink management code to use the timer-management functions from the ESP8266 Arduino core, directly; and also to avoid doing any floating-point math or calling into any external libraries from the interrupt service routine.
- Extracted the LED blink management to its own file, for code cleanliness
- Managed to avoid using any floating-point math at all
Verifying the fix
I’ll cheat, making the ISR fire more frequently to increase the chance of a crash:
diff --git a/src/main.cpp b/src/main.cpp
index b511378..943986d 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -47,11 +47,12 @@ BearSSL::Session tlsSession;
WiFiClientSecure wifiClient;
// LED (all times in milliseconds)
-#define CONNECTED_LED_TIME_ON (TASK_SECOND/5)
-#define CONNECTED_LED_TIME_OFF (TASK_SECOND*1.5)
-#define CONNECTING_LED_TIME_ON (TASK_SECOND/5)
+// DO NOT COMMIT
+#define CONNECTED_LED_TIME_ON (TASK_SECOND/20)
+#define CONNECTED_LED_TIME_OFF (TASK_SECOND/10)
+#define CONNECTING_LED_TIME_ON (TASK_SECOND/20)
#define CONNECTING_LED_TIME_OFF CONNECTING_LED_TIME_ON
-#define CERT_ERR_LED_TIME_ON (TASK_SECOND/1.5)
+#define CERT_ERR_LED_TIME_ON (TASK_SECOND/10)
#define CERT_ERR_LED_TIME_OFF CERT_ERR_LED_TIME_ON
// Tasks
Perfect. Applying this change before applying the actual bugfix, the program crashes reliably within 30 seconds.
Then, after applying the bugfix, it doesn’t crash! (It just blinks at an incredibly distracting rate.)
Lessons learned
I’ll try to be brief:
“Simple” means really, really simple. I thought my first interrupt service routine was simple, but it involved someone else’s code, some non-IRAM_ATTR
function calls, and floating-point math that turned out to be implemented in software. Ew.
Just call nothing from an ISR. Like, if you can possibly push any work out of the ISR, do it. Even if it feels like that work should take about three clock cycles.
No floating point either. I vaguely recall having to migrate something to fixed-point during EECS 373 because of (different) floating point issues on a microcontroller; I should’ve remembered that!
Not all Internet troubleshooting suggestions are accurate. In retrospect, it was quite optimistic to hope that spraying yield()
s all over my code would fix this, but that’s what a lot of Arduino forums said to do. They were wrong.
Be careful what libraries you let into your codebase. In day-to-day software development for desktops & servers, and even Raspberry Pis, I’m used to using external libraries. But this is a different environment, with much stricter limitations and less room for error or sloppiness. In just a week or so of playing with Arduino on the ESP8266, I’ve run into a few different Arduino libraries with various issues:
- Badly written code, with obvious bugs or shortcomings
- Poor, nonexistent, or confusing documentation
- Unexpected interactions, either with your my code or with other libraries
Wow, that is a lot of effort to make an LED blink.
Yes.
Appendix: noInterrupts()
Earlier, before tracking down the real cause, I tried the hacky approach of calling noInterrupts()
at the beginning of my ISR and interrupts()
at the end. This sometimes helps “fix” interrupt conflicts, but:
- It’s generally considered a hack, since it more-or-less papers over the actual problem.
- It doesn’t scale to larger projects.
Anyway, this approach couldn’t have worked here, since the processor needs interrupts to read from the flash memory over API. In retrospect, the best this could have accomplished would be to deadlock the whole system until a watchdog timer reset it.
In fact, this whole approach was doomed to make no difference: the underlying Arduino core already disables interrupts to the extent possible on ESP8266 when entering a timer ISR! (And, of course, it reenables them after the ISR exits.)