Durable Execution: Why Your Workflows Should Survive Any Failure
How durable execution platforms like Temporal and Restate are making reliability a built-in primitive instead of something engineers hand-code into every application.
Every production system eventually has to answer the same question: what happens when something fails halfway through a multi-step operation?
A user signs up, you charge their card, send a welcome email, provision their account, and notify the sales team. The payment succeeds, the email sends, then the account provisioning service crashes. Now what?
Most teams solve this with retry logic, dead letter queues, idempotency keys, and compensation transactions — all hand-coded, all brittle, all different for every workflow. Durable execution platforms eliminate this entire class of problems.
What Is Durable Execution?
Durable execution means your code runs to completion regardless of failures. If a process crashes, a server reboots, or a network partition happens mid-execution, the workflow resumes exactly where it left off — with all local state preserved.
This isn't magic. It's event sourcing applied to code execution:
- Every side effect (API call, database write, timer) is recorded as an event
- On recovery, the framework replays these events to reconstruct state
- Your code doesn't know it was interrupted — it just continues
# Temporal workflow example (Python SDK)
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# The four activities and the User type are assumed to be defined elsewhere.

@workflow.defn
class UserOnboarding:
    @workflow.run
    async def run(self, user: User):
        # Each activity is durable: if we crash between
        # any two steps, we resume from the last completed one
        charge = await workflow.execute_activity(
            charge_payment,
            args=[user.payment_method],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )
        await workflow.execute_activity(
            send_welcome_email,
            args=[user.email],
            start_to_close_timeout=timedelta(seconds=10)
        )
        account = await workflow.execute_activity(
            provision_account,
            args=[user],
            start_to_close_timeout=timedelta(minutes=2)
        )
        await workflow.execute_activity(
            notify_sales,
            args=[user, account],
            start_to_close_timeout=timedelta(seconds=10)
        )
        return account
If the worker crashes after send_welcome_email but before provision_account, a new worker picks up the workflow and replays its event history. The recorded results of charge_payment and send_welcome_email are reused rather than re-executed, and execution continues from provision_account.
No dead letter queues. No manual reconciliation. No lost state.
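To make the replay idea concrete, here is a minimal sketch of the crash scenario above using a toy journal. It is not how Temporal or Restate are implemented: the Journal class, the JSON file, and the onboarding function are hypothetical names invented for illustration, and the "side effects" only append to a list.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


class Journal:
    """Toy journal: completed step results persisted to one JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.events: Dict[str, Any] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay: a recorded step returns its stored result and is not re-run.
        if name in self.events:
            return self.events[name]
        result = fn()  # first execution performs the side effect
        self.events[name] = result
        self.path.write_text(json.dumps(self.events))  # persist before continuing
        return result


def onboarding(journal: Journal, calls: List[str],
               crash_before: Optional[str] = None) -> str:
    """Run three steps in order; optionally simulate a crash before one of them."""
    result = ""
    for name in ("charge_payment", "send_welcome_email", "provision_account"):
        if name == crash_before:
            raise RuntimeError(f"worker crashed before {name}")

        def side_effect(step_name: str = name) -> str:
            calls.append(step_name)  # stands in for the real API call
            return f"{step_name}:ok"

        result = journal.step(name, side_effect)
    return result


def demo() -> List[str]:
    path = Path(tempfile.mkdtemp()) / "journal.json"
    calls: List[str] = []
    try:
        onboarding(Journal(path), calls, crash_before="provision_account")
    except RuntimeError:
        pass  # the first "worker" dies here
    onboarding(Journal(path), calls)  # a new "worker" reloads the journal
    return calls


print(demo())
```

Each side effect runs once across both attempts, because the second run reads the first two results from the journal. Note the gap this toy leaves open: a crash after the side effect but before the journal write would re-run that step, which is why activities still need to be idempotent.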
When You Need Durable Execution
Not every workflow needs this level of reliability. But these patterns are strong signals:
- Multi-step processes with side effects: Payment → fulfillment → notification chains
- Long-running operations: Processes that span hours or days (approval workflows, data migrations)
- Saga patterns: Distributed transactions that need compensation on failure
- Scheduled recurring work: Cron-style jobs that must run exactly once per scheduled interval and track state across runs
- Human-in-the-loop workflows: Processes that pause for human approval
The Landscape in 2026
Three platforms are dominating adoption:
Temporal
The most mature option. Production-proven at Netflix, Uber, Snap, and thousands of others. Key strengths:
- SDKs for Go, Java, Python, TypeScript, .NET
- Workflow-as-code model (write workflows in your language, not YAML/JSON)
- Built-in visibility UI for debugging running workflows
- Self-hosted or Temporal Cloud
- Battle-tested at massive scale (billions of workflows)
Restate
The newcomer optimized for modern cloud-native workloads. Key differentiators:
- Lightweight — runs as a sidecar, not a cluster
- Virtual objects with built-in state (no external database needed for workflow state)
- HTTP-native — services are just HTTP endpoints
- Lower operational overhead than Temporal
- Great for serverless and edge deployments
Inngest
Focused on event-driven workflows with a developer-first approach:
- Functions triggered by events, not explicit workflow definitions
- Built-in step functions with automatic retry
- Managed platform (no infrastructure to run)
- Strong TypeScript/Next.js ecosystem integration
Implementation Patterns
Pattern 1: Idempotent Activities
Every activity must be idempotent. The framework may retry activities on failure, and your code must handle duplicate execution safely:
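from temporalio import activity

# `db` and `stripe` are placeholders for your own async data-access layer
# and payment client; adapt the calls to the SDKs you actually use.

@activity.defn
async def charge_payment(payment_method: str, amount: int, idempotency_key: str):
    # Use the idempotency key to prevent double-charging
    existing = await db.get_charge(idempotency_key)
    if existing:
        return existing
    charge = await stripe.charges.create(
        amount=amount,  # passed in by the caller; the original left it undefined
        source=payment_method,
        idempotency_key=idempotency_key
    )
    await db.save_charge(idempotency_key, charge)
    return charge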
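The activity above takes an idempotency key but does not say where it comes from. One option, sketched here as an assumption rather than a prescribed approach, is to derive it deterministically from the workflow ID and the step name, so every retry of the same step presents the same key. The idempotency_key helper and FakePaymentGateway are hypothetical stand-ins, not part of any real SDK.

```python
import hashlib
from typing import Dict


def idempotency_key(workflow_id: str, step: str) -> str:
    """Derive a stable key: the same workflow step always maps to the same key."""
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()


class FakePaymentGateway:
    """In-memory stand-in for a payment provider that honors idempotency keys."""

    def __init__(self) -> None:
        self.charges: Dict[str, dict] = {}

    def charge(self, amount_cents: int, key: str) -> dict:
        # A repeated key returns the original charge instead of creating a new one.
        if key not in self.charges:
            self.charges[key] = {
                "id": f"ch_{len(self.charges) + 1}",
                "amount": amount_cents,
            }
        return self.charges[key]


gateway = FakePaymentGateway()
key = idempotency_key("onboarding-42", "charge_payment")
first = gateway.charge(1999, key)
retry = gateway.charge(1999, key)  # simulated retry after a timeout
print(first == retry, len(gateway.charges))
```

A random key generated inside the activity would defeat the purpose, since each retry would look like a new request.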
Pattern 2: Compensation (Saga Pattern)
When a step fails, undo the steps that already succeeded by running their compensating actions in reverse order:
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class OrderSaga:
    @workflow.run
    async def run(self, order: Order):
        # Illustrative value; Temporal requires a timeout on every activity call
        timeout = timedelta(seconds=30)
        compensations = []
        try:
            await workflow.execute_activity(
                reserve_inventory, args=[order],
                start_to_close_timeout=timeout)
            compensations.append((release_inventory, [order]))
            charge = await workflow.execute_activity(
                charge_payment, args=[order],
                start_to_close_timeout=timeout)
            compensations.append((refund_payment, [charge]))
            await workflow.execute_activity(
                ship_order, args=[order],
                start_to_close_timeout=timeout)
        except Exception:
            # Compensate in reverse order
            for comp_fn, comp_args in reversed(compensations):
                await workflow.execute_activity(
                    comp_fn, args=comp_args,
                    start_to_close_timeout=timeout)
            raise
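The ordering rule in the saga is easier to check without a Temporal server. The following is a framework-free sketch of the same control flow in plain asyncio; run_saga and make_step are hypothetical helpers written for this illustration, and nothing here is durable.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Step = Callable[[], Awaitable[None]]


async def run_saga(steps: List[Tuple[Step, Step]]) -> None:
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    done: List[Step] = []
    try:
        for action, compensation in steps:
            await action()
            done.append(compensation)  # register only after the action succeeded
    except Exception:
        for compensation in reversed(done):
            await compensation()
        raise


log: List[str] = []


def make_step(name: str, fail: bool = False) -> Step:
    async def step() -> None:
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    return step


steps = [
    (make_step("reserve_inventory"), make_step("release_inventory")),
    (make_step("charge_payment"), make_step("refund_payment")),
    (make_step("ship_order", fail=True), make_step("cancel_shipment")),
]

try:
    asyncio.run(run_saga(steps))
except RuntimeError as exc:
    print(exc, log)
```

The failed step's own compensation never runs, because it is registered only after its action succeeds. The payment is refunded before the inventory is released, mirroring the reversed loop in the workflow.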
Pattern 3: Long-Running with Signals
Some workflows must pause until an external event arrives, such as a human approval. A signal delivers that event to the waiting workflow:
import asyncio
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class ApprovalWorkflow:
    def __init__(self):
        self.approved = None

    @workflow.signal
    async def approve(self, approved_by: str):
        self.approved = approved_by

    @workflow.run
    async def run(self, request: Request):
        # Illustrative value; Temporal requires a timeout on every activity call
        timeout = timedelta(seconds=30)
        await workflow.execute_activity(
            notify_approvers, args=[request],
            start_to_close_timeout=timeout)
        # Wait up to 7 days for approval signal
        try:
            await workflow.wait_condition(
                lambda: self.approved is not None,
                timeout=timedelta(days=7)
            )
            await workflow.execute_activity(
                execute_request, args=[request, self.approved],
                start_to_close_timeout=timeout)
        except asyncio.TimeoutError:
            await workflow.execute_activity(
                notify_timeout, args=[request],
                start_to_close_timeout=timeout)
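For readers new to the wait-with-timeout shape, here is a plain-asyncio analogue of the workflow above. It is only a sketch: ApprovalGate and demo are hypothetical names, and the wait lives in process memory, so unlike the Temporal version it would not survive a restart.

```python
import asyncio
from typing import Optional


class ApprovalGate:
    """In-memory analogue of a signal handler plus wait_condition."""

    def __init__(self) -> None:
        self.approved_by: Optional[str] = None
        self._event = asyncio.Event()

    def approve(self, approved_by: str) -> None:
        # Plays the role of the signal: record who approved and wake the waiter.
        self.approved_by = approved_by
        self._event.set()

    async def wait(self, timeout_seconds: float) -> Optional[str]:
        # Returns the approver, or None if the deadline passes first.
        try:
            await asyncio.wait_for(self._event.wait(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            return None
        return self.approved_by


async def demo(approve_after: Optional[float], timeout: float) -> Optional[str]:
    gate = ApprovalGate()  # created inside the running event loop
    if approve_after is not None:
        loop = asyncio.get_running_loop()
        loop.call_later(approve_after, gate.approve, "alice")
    return await gate.wait(timeout)


approved = asyncio.run(demo(approve_after=0.01, timeout=1.0))
timed_out = asyncio.run(demo(approve_after=None, timeout=0.05))
print(approved, timed_out)
```

The difference that matters is durability: a seven-day wait in this form depends on one process staying alive for seven days, while the workflow version records the timer and the signal in its history and can resume on any worker.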
Getting Started
- Identify your most fragile workflow — the one that breaks most often and requires manual intervention to fix
- Start with Temporal if you need battle-tested reliability, Restate if you want minimal operational overhead
- Rewrite one workflow — don't boil the ocean, pick the highest-pain one
- Make every activity idempotent — this is the hardest part and the most important
- Add observability — durable execution without visibility is a black box
The Bottom Line
Reliability shouldn't be something every engineer hand-codes into every workflow. It should be a platform primitive — something you get for free by using the right execution model.
Durable execution platforms have crossed the chasm from "interesting technology" to "infrastructure standard." If you're still building multi-step workflows with raw queues and retry logic, you're solving a problem that's already been solved.
Pick a platform, migrate your most painful workflow, and stop waking up at 3 AM to manually reconcile failed operations.