There is a class of bug in wireless embedded systems that is essentially invisible to conventional testing.
It does not manifest in a one-hour test, or a four-hour test, or even a twelve-hour test. It shows up at hour seventy-one of continuous operation, or hour two hundred, or after a specific sequence of state transitions that takes days to cycle through. By the time the bug manifests, the device has been running long enough that the engineer who introduced the change has moved on to other work, the test environment has accumulated hundreds of unrelated events, and reproducing the exact conditions that triggered the failure requires elaborate forensics.
These are stability bugs, and they are some of the most operationally damaging defects a wireless product can have, because they manifest in customer deployments rather than in the lab. They are also, by their nature, undetectable without sustained automated testing — because no human can reasonably watch a device for three days, and no functional test sequence captures the slow cumulative effects that produce them. This article is about what those bugs are, why they are structurally invisible to conventional testing, and what an automated stability testing programme actually looks like in practice.
What stability bugs actually are
To understand why stability testing deserves its own dedicated infrastructure, it helps to be specific about what kinds of bugs it catches. They fall into a few characteristic categories, each of which has a different mechanism but the same general property: the bug accumulates rather than triggering immediately.
Memory leaks are the most familiar example. A subsystem allocates a buffer for each incoming message and forgets to free it under some specific code path. Each individual message is processed correctly, the device responds normally, all functional tests pass. But over hours of operation, the leaked buffers accumulate. The free heap shrinks. Eventually, an allocation fails in a critical path, and the device crashes — or worse, enters an undefined state where some operations work and others do not, producing intermittent failures that are devastating to debug.
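The mechanism is easy to reproduce in a toy model. The sketch below is a hypothetical simulation, not firmware code: `BufferPool` stands in for the heap, and `handle_message` leaks one buffer on a rare path (here every thousandth message, an arbitrary choice). Every call returns normally, yet the pool slowly drains.

```python
class BufferPool:
    """Toy fixed-size pool standing in for a firmware heap (hypothetical)."""

    def __init__(self, size):
        self.free = size

    def alloc(self):
        if self.free == 0:
            raise MemoryError("pool exhausted")
        self.free -= 1

    def release(self):
        self.free += 1


def handle_message(pool, msg_id):
    """Process one message; the rare path returns without freeing its buffer."""
    pool.alloc()
    if msg_id % 1000 == 999:
        # Assumed rare path (e.g. a malformed frame): early return, no release.
        return "rejected"
    pool.release()
    return "ok"


def messages_until_failure(pool_size):
    """Count how many messages are handled before an allocation fails."""
    pool = BufferPool(pool_size)
    msg_id = 0
    while True:
        try:
            handle_message(pool, msg_id)
        except MemoryError:
            return msg_id
        msg_id += 1
```

With a 64-buffer pool, a 500-message functional test finishes with the pool still full, while `messages_until_failure(64)` returns 64000: the failure arrives only after 64 leaks have accumulated.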
Watchdog resets are a related category. A subsystem occasionally takes longer than expected to complete some operation, and on those occasions the watchdog timer fires and resets the device. The reset is silent from the customer’s perspective in the short term — a brief blink of LEDs, a momentary disconnection — but cumulative resets erode reliability and the underlying cause goes undiagnosed because each individual reset looks like noise.
Link-layer state corruption is a more subtle category. The wireless protocol stack maintains internal state — sequence numbers, encryption contexts, pairing information, connection parameters — and over many connection cycles, edge cases in the state management can leave the stack in a slightly wrong configuration. The device still works, mostly, but in some specific situation it fails to authenticate, or it sends a packet with a stale sequence number that the peer rejects, or it cannot establish a connection it should have been able to establish. The failures look random because the corruption that causes them happened thousands of operations earlier.
Resource exhaustion in protocol-specific tables is a fourth category. A bonding table fills up because old bonds are not pruned. A connection list grows without bound because disconnect handlers do not free entries under some race condition. A subscription list expands until the device runs out of slots for new subscribers. Each of these has its own specific mechanism, but the pattern is the same: the resource accumulates over time, and the failure mode appears only after the resource is exhausted.
Firmware degradation under sustained load is a fifth category, harder to characterise but no less real. The device works correctly when first powered on, but after sustained operation under load — many connections, much traffic, many state transitions — its performance degrades in ways that are hard to attribute to any single cause. Latencies creep up. Throughput drops. The reset that comes after a few days of operation feels like it cures the device, which is a strong hint that something is accumulating in the running state.
These categories are not exhaustive, but they share the structural property that distinguishes stability bugs from other firmware defects: they take time to manifest. No amount of fast functional testing catches them, because the conditions that trigger them simply have not occurred yet at the moment the functional tests complete.

Why manual testing structurally cannot find them
The reason stability bugs reach customers is straightforward once you think about it. Manual testers run tests for the duration of their working day, at most. A bug that takes seventy-two hours to manifest will never be discovered by a tester running tests during business hours, because the tester goes home and the device goes off, or sits idle, or gets reset for the next day’s testing. The continuous, sustained operation that the bug requires simply never happens in a manual test environment.
You might think that running the device overnight would solve this, and for some categories of bug it does. A leak that consumes buffers in a hot path will exhaust memory within hours and can be caught by an overnight run. But the bugs that cause the most operational damage are typically slower than that — they take days, not hours, because if they took hours they would have been caught by routine extended testing already. The bugs that survive routine testing and escape into production are precisely the ones whose timescale is long enough to outlast any test session a human is willing to supervise.
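The arithmetic behind that distinction is simple. The sketch below uses made-up numbers (a 20 KB free heap and a 32-byte leak per triggering event) purely for illustration; only the rate at which the faulty path runs differs between the two cases.

```python
def hours_to_exhaustion(free_bytes, leak_bytes_per_event, events_per_hour):
    """Hours until a steady leak consumes the free heap (ignores fragmentation)."""
    return free_bytes / (leak_bytes_per_event * events_per_hour)


FREE_HEAP = 20 * 1024  # assumed free heap after startup, in bytes
LEAK = 32              # assumed bytes leaked each time the faulty path runs

hot_path = hours_to_exhaustion(FREE_HEAP, LEAK, events_per_hour=200)
rare_path = hours_to_exhaustion(FREE_HEAP, LEAK, events_per_hour=8)
print(f"hot path:  {hot_path:.1f} h")   # 3.2 h, caught by an overnight run
print(f"rare path: {rare_path:.1f} h")  # 80.0 h, outlasts any supervised session
```

The same defect, triggered twenty-five times less often, moves from "found overnight" to "found by a customer".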
This is why stability testing has to be automated, and not just automated but designed for sustained unattended operation. A stability test is not a long version of a functional test. It is a fundamentally different kind of test, with different success criteria and different infrastructure requirements. It assumes from the start that no human will be watching, that runs will last for days or weeks, and that the interesting signals are slow trends rather than sharp events.
What automated stability testing actually looks like
The architecture of a stability test is structurally simple but operationally demanding. The test framework drives the device under test through a sustained workload — typically a representative pattern of operations that approximates what the device experiences in deployment, scaled up in intensity to compress what would take months in the field into a few days in the lab. The workload runs continuously while the framework collects telemetry from the device and from the test environment.
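One way to express such a workload is as a seeded, weighted stream of operations. This is a minimal sketch: the operation names and weights are invented for illustration, and a real mix would be derived from how the product is actually used in the field.

```python
import itertools
import random

# Hypothetical operation mix; weights approximate relative frequency in the field.
OPERATION_WEIGHTS = {
    "connect": 5,
    "send_data": 80,
    "disconnect": 5,
    "rebond": 1,
    "idle": 9,
}


def workload(seed, weights=OPERATION_WEIGHTS):
    """Yield an endless, reproducible stream of operation names."""
    rng = random.Random(seed)  # private RNG so the stream depends only on the seed
    names = list(weights)
    relative = [weights[name] for name in names]
    while True:
        yield rng.choices(names, weights=relative, k=1)[0]


# The same seed replays the same sequence, which matters when a failure
# at hour seventy-one has to be reproduced.
first = list(itertools.islice(workload(seed=42), 1000))
replay = list(itertools.islice(workload(seed=42), 1000))
assert first == replay
```

Recording the seed alongside each run keeps the workload reproducible. Scaling up intensity is then a matter of how quickly the harness consumes the stream, which this sketch leaves out.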
The telemetry is the heart of the test. Functional tests produce pass-or-fail results; stability tests produce continuous streams of measurements that get analysed for trends. The specific metrics depend on the device, but most stability suites collect at least the following: free heap memory over time, watchdog reset counts, protocol error counters, connection success and failure rates, latency distributions for representative operations, peripheral state counters, and any application-specific metrics that reflect the device’s internal health. Each of these is logged at a defined cadence — typically once per minute, occasionally more frequently for fast-changing metrics — and the resulting time series becomes the data product the test produces.
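A collector for such a time series can be quite small. The sketch below is an assumption-laden illustration: the metric names are hypothetical, `read_metrics` is a caller-supplied function standing in for whatever query the device supports, and the clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import csv
import io
import time

# Hypothetical metric names; a real suite would use whatever the firmware exposes.
FIELDS = ["t_s", "free_heap_bytes", "watchdog_resets", "protocol_errors",
          "connect_ok", "connect_fail"]


def collect(read_metrics, out, period_s=60.0, samples=3,
            clock=time.monotonic, sleep=time.sleep):
    """Poll read_metrics() every period_s seconds and write rows to a CSV stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    start = clock()
    for i in range(samples):
        # Sleep until the scheduled time so polling cost does not skew the cadence.
        delay = start + i * period_s - clock()
        if delay > 0:
            sleep(delay)
        row = {"t_s": round(clock() - start, 3)}
        row.update(read_metrics())
        writer.writerow(row)


# Demo with a fake clock and fake device so the example finishes instantly.
now = [0.0]


def fake_sleep(seconds):
    now[0] += seconds


def fake_device():
    return {"free_heap_bytes": 20480, "watchdog_resets": 0,
            "protocol_errors": 0, "connect_ok": 12, "connect_fail": 0}


buf = io.StringIO()
collect(fake_device, buf, clock=lambda: now[0], sleep=fake_sleep)
print(buf.getvalue())
```

Scheduling each sample against the start time, rather than sleeping a fixed interval after each poll, keeps timestamps from drifting over a multi-day run.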
A successful stability test is one where every metric remains stable across the test duration. Free heap stays roughly constant after an initial settling period. Watchdog resets stay at zero. Error counters stay at zero or grow only at the rate consistent with deliberately injected error conditions. Latencies stay within their established ranges. Any drift outside these bounds is a failure, even if the device is still nominally functional at the moment the drift is detected.
This framing, drift detection rather than event detection, is what makes stability testing different from other categories of testing. The interesting signals are gradual changes, not sudden ones. A free-heap chart that declines steadily from twenty kilobytes to nineteen kilobytes over seventy-two hours, a loss of roughly fourteen bytes per hour, is a leak signature, even though the device never crashed during the test. Catching that signature requires plotting the trend, applying statistical tests to distinguish drift from noise, and alerting when drift exceeds defined thresholds. None of this is difficult once the infrastructure is in place, but it requires a different mindset from pass-or-fail testing.
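As a concrete illustration, drift in free heap can be estimated with an ordinary least-squares fit. The sketch below is one simple approach among many, not a prescribed method: the three-sigma rule, the five-bytes-per-hour threshold, and the synthetic data are all assumptions chosen for the example.

```python
import math


def heap_drift(t_hours, heap_bytes):
    """Least-squares slope (bytes/hour) of free heap, with its standard error."""
    n = len(t_hours)
    mean_t = sum(t_hours) / n
    mean_h = sum(heap_bytes) / n
    sxx = sum((t - mean_t) ** 2 for t in t_hours)
    sxy = sum((t - mean_t) * (h - mean_h) for t, h in zip(t_hours, heap_bytes))
    slope = sxy / sxx
    intercept = mean_h - slope * mean_t
    sse = sum((h - (intercept + slope * t)) ** 2
              for t, h in zip(t_hours, heap_bytes))
    stderr = math.sqrt(sse / (n - 2) / sxx)
    return slope, stderr


def is_leak(t_hours, heap_bytes, max_loss_bytes_per_hour=5.0, sigmas=3.0):
    """Flag a downward trend that is both large enough to matter and clear of noise."""
    slope, stderr = heap_drift(t_hours, heap_bytes)
    return slope < -max_loss_bytes_per_hour and abs(slope) > sigmas * stderr


# Synthetic 72-hour run, one sample per hour: 20 KB falling to 19 KB,
# with +/-40 bytes of alternating jitter standing in for allocator noise.
hours = list(range(73))
jitter = [40 if i % 2 == 0 else -40 for i in hours]
leaky = [20 * 1024 - (1024 / 72) * t + j for t, j in zip(hours, jitter)]
flat = [20 * 1024 + j for j in jitter]

print(is_leak(hours, leaky))  # True
print(is_leak(hours, flat))   # False
```

The jitter in this example is larger than the hourly loss, so no single pair of adjacent samples reveals the leak; only the fitted trend across the whole run does.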

The instrumentation that makes this possible
The metrics that stability tests need to collect are not always exposed by the firmware out of the box. Most firmware projects do not natively expose continuous free-heap measurements, or per-subsystem error counters, or latency histograms for internal operations. Adding this instrumentation is part of building a stability testing capability, and it is worth treating it as a deliberate design effort rather than something added in response to specific test needs.
A useful framing is that the firmware should expose a set of health metrics, accessible through the same serial API that the test framework already uses for control, that collectively characterise the device’s internal state. Free heap. Stack high-water marks for each thread. Counts of every error condition the firmware can detect. Histograms of operation latencies. Sizes of every internal table. The resource footprint of each subsystem.
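On the test-framework side, consuming such metrics can be as simple as parsing one response line. The wire format below (a `HEALTH` prefix followed by `key=value` pairs) and every metric name are hypothetical, invented for this sketch; the article does not prescribe a format.

```python
def parse_health_line(line):
    """Parse 'HEALTH key=value ...' into a dict of integer metrics.

    The line format is an assumption for illustration, not a real protocol.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "HEALTH":
        raise ValueError(f"not a health report: {line!r}")
    metrics = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed metric: {token!r}")
        metrics[key] = int(value)
    return metrics


report = parse_health_line(
    "HEALTH heap_free=20480 stack_hwm.ble=412 err.crc=0 tbl.bonds=3 tbl.conns=1"
)
print(report["heap_free"], report["tbl.bonds"])  # 20480 3
```

A flat namespace of integer counters keeps the firmware side cheap and lets the framework log new metrics without code changes.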
These metrics are useful beyond stability testing. They help debug field issues. They support performance analysis. They feed into capacity planning for resource-constrained features. But their primary value is making stability tests possible: without them, the test framework cannot see the slow trends that characterise stability bugs, and the tests degenerate into pass-or-fail checks that miss the signals they were designed to catch.
The discipline that maintains useful instrumentation over time is treating health metrics as a first-class part of the firmware design. Every new subsystem should expose its own resource counters and error counters. Every new resource pool should have visible utilisation metrics. The set of exposed metrics should grow as the firmware grows, and the existing metrics should remain stable across firmware versions so that historical trend data remains comparable.
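That stability requirement can be checked mechanically. The sketch below assumes each firmware build can report the set of metric names it exposes; the names and versions shown are invented.

```python
def check_metric_compatibility(previous, current):
    """Compare metric names exposed by two firmware versions.

    Returns (removed, added). Any removed name breaks trend comparability;
    added names are expected as the firmware grows.
    """
    removed = sorted(set(previous) - set(current))
    added = sorted(set(current) - set(previous))
    return removed, added


# Hypothetical metric sets for two consecutive firmware versions.
v1 = {"heap_free", "err.crc", "tbl.bonds"}
v2 = {"heap_free", "err.crc_total", "tbl.bonds", "tbl.subs"}

removed, added = check_metric_compatibility(v1, v2)
print(removed)  # ['err.crc']
print(added)    # ['err.crc_total', 'tbl.subs']
```

A rename shows up as one removal plus one addition, which is exactly the case that silently breaks a historical trend chart.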
Where stability tests fit in the development cycle
Stability tests do not run on every commit, and they do not run nightly. Each run takes too long for that. A typical stability test runs for at least seventy-two hours, and a comprehensive one runs for a week or more. Running them frequently would consume test capacity that is better spent on faster categories of testing.
The right cadence for stability testing is on every release candidate, plus targeted runs after specific kinds of changes. A change that touches memory management, the protocol stack’s internal state, the connection handling code, or any other subsystem with stability implications should trigger a stability run before the change is considered shipped. This catches the regressions most likely to introduce the categories of bug stability testing exists to find.
Beyond that, a low-frequency continuous stability test — one that always has some run in flight, against the latest stable firmware — is valuable for catching bugs that nobody specifically suspects. The test rig is doing nothing else useful when there is no release candidate to validate, so dedicating it to ongoing stability monitoring fills the gap with productive work and produces a continuous stream of trend data that becomes valuable when something does eventually go wrong.
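The cadence described above (release candidates first, then targeted runs, then a background run when the rig is idle) can be written down as a small decision function. This is a sketch under assumptions: the path patterns are hypothetical and would need to match the real source layout.

```python
from fnmatch import fnmatch

# Hypothetical source paths for subsystems with stability implications.
SENSITIVE_PATTERNS = ["src/mem/*", "src/stack/*", "src/conn/*"]


def plan_stability_run(is_release_candidate, changed_files, rig_idle):
    """Decide which kind of stability run, if any, to start next."""
    if is_release_candidate:
        return "release-candidate"
    touched = [f for f in changed_files
               if any(fnmatch(f, p) for p in SENSITIVE_PATTERNS)]
    if touched:
        return "targeted"
    if rig_idle:
        return "continuous"  # keep a run in flight against the latest stable build
    return None


print(plan_stability_run(False, ["src/mem/pool.c"], rig_idle=False))  # targeted
print(plan_stability_run(False, ["docs/readme.md"], rig_idle=True))   # continuous
```

Encoding the policy this way makes the trigger rules reviewable alongside the code they protect.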
The data from these runs feeds into a stability dashboard that tracks key metrics across firmware versions. Looking at the heap-trend chart for the last twenty firmware versions tells you, at a glance, whether the team has been introducing slow regressions or holding steady. Looking at the watchdog-reset count over the same versions tells you whether the firmware is becoming more or less reliable. These trend views are what make the investment in stability testing visible to the rest of the organisation, and they are what justify continued investment over time.
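A first version of such a trend view needs little more than one summary row per firmware version. The sketch below assumes each completed run has already been reduced to a heap slope and a watchdog count; the version numbers, values, and the two-bytes-per-hour threshold are invented for illustration.

```python
def flag_regressions(history, max_slope_change=2.0):
    """Return versions whose heap slope or watchdog count worsened.

    history is an ordered list of (version, heap_slope_bytes_per_hour,
    watchdog_resets) tuples, one per completed stability run. Each version
    is compared only with the one immediately before it.
    """
    flagged = []
    for (_, prev_slope, prev_wdt), (version, slope, wdt) in zip(history, history[1:]):
        if prev_slope - slope > max_slope_change or wdt > prev_wdt:
            flagged.append(version)
    return flagged


# Invented run summaries; slopes are negative when free heap is shrinking.
history = [
    ("1.4.0", -0.3, 0),
    ("1.5.0", -0.2, 0),
    ("1.6.0", -14.2, 0),  # slow leak introduced
    ("1.7.0", -14.0, 1),  # leak persists, plus a watchdog reset
]
print(flag_regressions(history))  # ['1.6.0', '1.7.0']
```

Comparing only against the previous version is a simplification: it pinpoints where a regression was introduced, while an absolute threshold would still be needed to show that 1.7.0 carries the leak forward.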

The compounding value over years
The deeper benefit of a sustained stability testing programme is that it accumulates value over time in a way that other testing does not. Every stability bug that is caught and fixed becomes a permanent part of the firmware’s history. Every test run that completes successfully adds another data point to the trend record. Over months and years, the team builds a quantitative picture of the firmware’s stability that simply cannot be reconstructed by any other means.
This picture is operationally valuable. When a customer reports a field issue that might be stability-related, the team can compare the customer’s deployment characteristics to the patterns it has measured in the lab and form hypotheses based on data rather than guesses. When a release is being considered for a particularly demanding deployment, the team can examine the stability profile of the candidate firmware against the profile of previous firmware that has performed well or poorly in similar contexts. The data informs decisions in ways that intuition cannot.
It is also valuable for the team’s culture. A team that has a stability dashboard, that watches it, and that responds to drift in it develops a well-calibrated sense of the firmware’s reliability. The team’s sense of what good looks like is grounded in measured trends rather than in confidence about the latest release. Bugs that would have been classified as flaky and ignored become signals that get investigated. The team’s overall confidence in the product becomes more accurate, which leads to better decisions about when to ship and when to wait.
For wireless products that are deployed at scale and run continuously, stability is often the property that customers care about most. A device that works perfectly for the first hour and then starts failing in subtle ways over the next week is, from the customer’s perspective, simply a broken product. The test infrastructure that catches that pattern before deployment is the test infrastructure that protects the product’s reputation in the market. It is not optional for any team that intends to remain competitive in deployments where reliability matters.
needCode designs and delivers automated stability testing infrastructure for wireless embedded products, including health-metric instrumentation, sustained-workload generators, and trend-analysis dashboards. We have built stability test programmes across BLE mesh, LTE-connected IoT, and multi-protocol embedded engagements, and we know what it takes to make them produce trustworthy data over years of use. If you have stability concerns that conventional testing is not catching, we are happy to talk through what a programme tailored to your product would involve.
Further reading
- Anatomy of a Production OTA Pipeline — the release pipeline that integrates stability runs as a release-candidate gate; this post explains what the gate measures
- Semantic Versioning Isn’t Enough for Embedded SDKs — meaningful version semantics are what make a cross-version stability dashboard interpretable; without them, the trend chart this article describes degenerates into noise
- BLE OTA Firmware Updates: How to Ship Updates That Don’t Brick Devices — the production-deployment reliability that stability testing exists to protect; stability bugs are the bricking-class bugs OTA pipelines must avoid pushing
- Documentation as a Product: How Good SDKs Treat Docs as Code — same “first-class part of the firmware design” frame applied to instrumentation discipline; health metrics need to be maintained alongside the code, exactly the way that post argues docs do

