An OTA pipeline that has not been fault-injected has not been tested. Four properties, four phases, and the test matrix that decides whether a million devices in the field survive a bad update.

The most consequential code in any connected embedded product is the few thousand lines that sit between a signed image arriving over the network and the device booting that image successfully. If those lines work, no one notices. If they fail badly, an entire product fleet bricks itself in the field, and there is no remote recovery path because the recovery path is the thing that just failed.

OTA is the part of an embedded SDK that has the worst ratio of visibility to consequence. It runs once per update, often months apart, and either succeeds invisibly or causes one of the most expensive failure modes in connected products. Customers who are evaluating SDKs for production deployment know this, and the question they actually want answered is not “does your SDK have OTA,” but “have you broken your OTA pipeline in every place it can break, and proven that the device recovers.”

This post walks through what a production OTA pipeline has to do, how the pieces fit together as a state machine, and the fault injection matrix that determines whether the pipeline is genuinely production-grade or just appears to be in the happy path.


The four properties

A production OTA pipeline must satisfy four properties simultaneously. They are not independent choices — each one rules out specific implementation shortcuts that look attractive in early development, and the absence of any one of them is enough to make the pipeline unsafe in the field.

Atomicity. The transition from the old image to the new image must be a single, indivisible operation from the device’s perspective. There must be no observable state in which the device is “partway updated.” Either the new image is fully installed and running, or the old image is. This is the property that lets the pipeline survive power loss at any moment.

Authentication. Every image installed on the device must be cryptographically verified against a public key provisioned at manufacturing. The verification covers the entire image binary plus a structured header containing version, size, flags, and hash. The signing key never lives on the device; it lives in the manufacturer’s signing infrastructure, and the device only ever sees the public counterpart. Any unsigned image, any image with a broken signature, and any image signed by the wrong key is rejected before installation.

Rollback safety. If the new image is installed and fails to operate correctly, the device must be able to restore the previous image automatically, without requiring network access or operator intervention. This is the property that turns a bad update from a fleet-bricking event into a deferred update — annoying, but recoverable.

Resumability. The image download must survive transport interruptions. A device that has downloaded 80% of a 4 MB image and then loses its connection must be able to resume from that offset on the next attempt, not start over from zero. On real-world cellular and Wi-Fi links, an OTA pipeline that cannot resume cannot complete updates on a non-trivial fraction of devices.

These four properties are the design constraints that shape every implementation decision below. None of them is optional, and none of them can be added late.


The slot model: why swap wins

The flash layout for an OTA-capable device almost always uses two image slots: a primary slot, which the device boots from in normal operation, and a secondary slot, which receives the incoming image during an update. The bootloader is responsible for managing the transition between them.

MCUboot, the standard open-source bootloader for this role, supports three upgrade modes, and the choice between them is one of the most consequential design decisions in the OTA pipeline.

Overwrite-only writes the new image directly over the old one in the primary slot once the download into the secondary slot is complete. It has the smallest flash overhead, because no scratch area or swap-status metadata is needed, and the secondary slot can be erased and reused as soon as the copy finishes. A power loss during the copy is recoverable — the complete new image is still sitting in the secondary slot, and the bootloader restarts the overwrite on the next boot — but it is the only mode that provides no rollback: once the new image has booted, the old image is gone. Overwrite-only is appropriate only when flash budget is severely constrained and the consequences of an unrecoverable bad update are acceptable.

Swap is copy-then-overwrite with full rollback capability. The new image is written to the secondary slot, verified, and then the bootloader swaps the contents of the two slots through a sequence of sector-level moves. The old image ends up in the secondary slot rather than being lost, which is what makes rollback possible. Swap mode is the strongly preferred choice for production, because it is the only mode that satisfies all four properties cleanly.

Direct-XIP executes in place from whichever slot contains the most recent valid image. The bootloader does not move images between slots; it simply selects which slot to boot from. This is useful for external XIP flash where the cost of copying images between slots would be prohibitive, but it does not provide the same atomic-rollback guarantees as swap, and the flash layout requirements are more complex.

The reason swap mode wins for production is the rollback property. A device that boots a new image and then crashes — because of a regression that was not caught in testing, or a configuration mismatch with a specific deployment, or any of the dozens of subtle ways a firmware update can fail in the field — needs to recover automatically. Swap mode is the mechanism that provides this recovery, and it does so without requiring the application to be involved in the recovery decision.


The pipeline as a state machine

The pipeline itself proceeds through four phases, each with defined entry and exit conditions and defined behavior on power loss within the phase.

Phase 1: Image fetching. The device downloads the update image — compressed and signed — to the secondary slot in internal flash, or to a staging region in external flash. The download is resumable from an arbitrary offset, whether via HTTP range requests, CoAP block-wise transfer, or an offset field in an application-level protocol over MQTT. A power loss during fetching leaves the secondary slot containing a partial image; on the next boot, the bootloader detects that the secondary slot does not contain a valid image, and the device boots the prior image normally. The application can then resume the download from the last successfully written offset.

Phase 2: Cryptographic verification. Once the download is complete, the bootloader verifies the image signature using the public key provisioned into the device at manufacturing. Any signature verification failure aborts the update and triggers an alert to the management backend. A power loss during signature verification leaves the secondary slot with a complete-but-not-yet-verified image; on the next boot, the bootloader re-verifies and either proceeds or rejects.

Phase 3: Integrity check. A cryptographic hash of the image payload — SHA-256 or SHA-384 — is verified independently of the signature. Since the signature covers the hash, transport corruption will fail both checks; the reason the integrity check is a separate phase is that the failure modes call for different responses: a signature failure with an intact hash suggests a malicious or misrouted image and warrants an alert, while a hash failure suggests transport corruption that may be transient and is worth a retry.

Phase 4: Swap and commit. The bootloader atomically swaps the primary and secondary slots through a sequence of sector-level operations, each of which is journaled so that a power loss during the swap can be recovered on the next boot. After the swap, the new image runs and has a configurable confirmation window — typically expressed as a number of successful boots, or as an explicit application-level call — to assert itself as healthy. If the new image does not confirm within that window, the bootloader reverts on the next reset, swapping back to the prior image.

The four phases form a state machine in which every state has a defined recovery path on power loss. The states are not optional, and the recovery paths are not optional. A pipeline that has phases without defined recovery is a pipeline that will eventually leave a device in an unbootable state, and the only question is how many devices have to be in the field before that happens.


Why power-failure testing is the part that gets skipped

This is where most OTA pipelines that look fine in normal testing turn out not to be. A pipeline that has not been tested with deliberate power loss at each of its phases has not been tested for the failure modes that matter, because the field will exercise those failure modes whether the lab does or not.

Power-failure injection — deliberately cutting power to the device at controlled points during each of the four phases above, and verifying that the device subsequently boots a valid, known-good image — is the single test category that turns an OTA pipeline from “works in the happy path” into “demonstrably recoverable.” It belongs in the SDK’s own test suite, not in the customer’s. An SDK that ships OTA without it is asking every customer to build the same fault-injection rig independently, and most of them will not.

The pass criterion for power-failure tests is unforgiving and binary: in every test case, the device must boot to a valid image — either the new one if the update completed, or the prior one if it didn’t. Any case in which the device fails to boot, or boots to a corrupted image, is a fail. The reason for the binary criterion is that the field consequence of a failure is binary. A device that does not boot does not provide a path for the next update to fix the problem.

Running this kind of test campaign requires hardware infrastructure that most SDK teams do not initially have. A USB-controlled relay or PDU per board enables controlled power cycling. A debug probe per board provides post-mortem inspection of the device state after a forced reset. A serial capture harness records the boot sequence so failures can be diagnosed without re-running them. None of this is exotic, but all of it has to be present and automated, because a fault-injection campaign run manually is one that doesn’t run.

The phase that typically reveals problems is the swap itself. The swap state machine has to recover correctly from a power loss at any sector boundary, and a bug that only manifests at one specific boundary will eventually find a device in the field that loses power at exactly that moment. The investment in testing this exhaustively is what separates an OTA pipeline that has been validated from one that has merely been demonstrated.


The post-boot confirmation window

There is one specific property in Phase 4 that is worth dwelling on, because it is the most commonly mishandled and the one that distinguishes a production-ready OTA pipeline from one that just looks like it.

After the swap completes, the new image runs. At this point, the bootloader does not yet trust the new image — it is the candidate, not the confirmed image. The new image has a window in which it has to assert that it is healthy, typically by calling an MCUboot API such as boot_set_confirmed(). If the application does not make this call within the configured window — because the application crashes, hangs, or is misconfigured to the point of being unable to call any API — the bootloader detects this on the next reset and swaps back to the prior image.

This is the property that protects against bad updates that crash on first boot. Without the confirmation window, an update that crashes during initialisation would result in a device stuck in a boot loop with no recovery path. With the confirmation window, the same update results in two failed boots and then an automatic revert to the previous, known-working image.

The application’s responsibility in this window is non-trivial. The confirmation call should not happen immediately at boot — the application should run for long enough to verify that its critical subsystems are operational. A reasonable confirmation criterion is “the application has reached its main loop and successfully completed at least one cycle of its primary work,” not “the application has finished initialisation.” The exact criterion is product-specific, but the framing is universal: the confirmation window is for proving operational health, not boot health.


What this means for the SDK

OTA is the place where the SDK earns the trust that lets customers ship at scale. A customer evaluating whether to deploy automatic updates across a million-device fleet is asking whether your pipeline has been tested for every failure mode they can think of, plus the ones they cannot. An SDK that ships with the fault injection matrix already run on every release, with results published, with the swap state machine validated against the specific flash layouts the customer is using, is an SDK that gets the green light for that deployment.

An SDK that ships OTA as a feature without the test infrastructure to back it up is an SDK that is asking the customer to do the validation work — and most customers will not do it, which means the failures will surface in the field.

The boring outcome — every device updates successfully, no devices brick, no fleet-wide alerts — is what production OTA looks like when the pipeline is right. The dramatic outcome is what it looks like when one of the four properties was missing, or one of the four phases had an undefined recovery path, or the fault injection matrix had a gap that turned out to matter. The difference between the two is not luck. It is whether the pipeline was designed and tested with the assumption that every phase will lose power at the worst possible moment, and that the device has to recover regardless.


Ready to ship OTA updates your fleet can survive?

needCode builds production-grade embedded SDKs with fully validated OTA pipelines — fault injection tested, MCUboot integrated, and proven at scale. If you’re evaluating OTA infrastructure for a connected product deployment, let’s talk.

Book a free discovery call or get in touch


Further reading