Building Webhook Systems That Don't Lose Messages

Design reliable webhook delivery systems with retry logic, signing, idempotency, dead letter queues, and monitoring. Never lose an outgoing webhook again.

Yash Pritwani

24 November 2025, 13 min read

Why Webhooks Fail

Webhooks are HTTP callbacks: when an event happens in your system, you send an HTTP POST to a URL your customer configured. Simple in concept, surprisingly hard in practice.

Failure modes include the customer's server being down, DNS resolution failures, expired TLS certificates, request timeouts, 5xx responses from the customer, and network partitions between your infrastructure and theirs. On average, 2-5% of webhook deliveries fail on the first attempt. Without retry logic, that is 2-5% of events your customers never receive.
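
To put that failure rate in concrete terms, here is a rough back-of-the-envelope calculation. The daily volume of 100,000 events is an assumed figure chosen for illustration, not a measurement from any real system.

```python
def events_lost_without_retries(events_per_day: int, first_attempt_failure_rate: float) -> int:
    """Events that never arrive if each delivery is attempted exactly once."""
    return round(events_per_day * first_attempt_failure_rate)


# Assumed volume: 100,000 events per day (hypothetical).
low = events_lost_without_retries(100_000, 0.02)
high = events_lost_without_retries(100_000, 0.05)
print(f"{low}-{high} events lost per day without retries")
```

At that volume, a single-attempt sender would drop a few thousand events every day.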

The Reliable Webhook Architecture

Event → Queue (persistent) → Delivery Worker → Customer Server
                                    ↓ (fail)
                              Retry Queue (exponential backoff)
                                    ↓ (all retries fail)
                              Dead Letter Queue → Alert
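
The routing in the diagram can be sketched as one small decision function. This is a minimal illustration, not a full queue implementation; the `Stage` enum and `next_stage` function are hypothetical names introduced here.

```python
from enum import Enum


class Stage(Enum):
    DELIVERED = "delivered"
    RETRY_QUEUE = "retry_queue"
    DEAD_LETTER_QUEUE = "dead_letter_queue"


def next_stage(succeeded: bool, attempts_so_far: int, max_attempts: int) -> Stage:
    """Decide where a delivery goes after an attempt finishes.

    attempts_so_far includes the attempt that just completed.
    """
    if succeeded:
        return Stage.DELIVERED
    if attempts_so_far >= max_attempts:
        # All retries exhausted: park the delivery and raise an alert.
        return Stage.DEAD_LETTER_QUEUE
    return Stage.RETRY_QUEUE


print(next_stage(succeeded=False, attempts_so_far=3, max_attempts=8))
```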

Data Model

-- Webhook endpoints (customer-configured)
CREATE TABLE webhook_endpoints (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    url TEXT NOT NULL,
    secret VARCHAR(255) NOT NULL,  -- For HMAC signing
    events TEXT[] NOT NULL,        -- ['order.created', 'payment.completed']
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Webhook delivery log
CREATE TABLE webhook_deliveries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    endpoint_id UUID REFERENCES webhook_endpoints(id),
    event_type VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    idempotency_key VARCHAR(255) UNIQUE NOT NULL,
    status VARCHAR(20) DEFAULT 'pending',  -- pending, delivered, failed, dead
    attempts INTEGER DEFAULT 0,
    last_attempt_at TIMESTAMPTZ,
    next_retry_at TIMESTAMPTZ,
    last_response_status INTEGER,
    last_response_body TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    delivered_at TIMESTAMPTZ
);

-- Index for retry worker
CREATE INDEX idx_webhook_retry ON webhook_deliveries (next_retry_at)
    WHERE status = 'pending' AND next_retry_at IS NOT NULL;
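
The UNIQUE constraint on idempotency_key also protects the producer side: enqueuing the same event twice should create only one delivery row. The sketch below shows the idea with an in-memory SQLite database standing in for PostgreSQL, using a simplified copy of the table. The `make_db` and `enqueue` helpers are hypothetical. In PostgreSQL the equivalent statement would use `ON CONFLICT (idempotency_key) DO NOTHING`.

```python
import json
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Create a simplified webhook_deliveries table in memory (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE webhook_deliveries (
            id TEXT PRIMARY KEY,
            endpoint_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            idempotency_key TEXT UNIQUE NOT NULL,
            status TEXT DEFAULT 'pending',
            attempts INTEGER DEFAULT 0
        )
    """)
    return conn


def enqueue(conn, endpoint_id: str, event_type: str, payload: dict, idempotency_key: str) -> bool:
    """Insert a pending delivery; return False if the key was already enqueued."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_deliveries "
        "(id, endpoint_id, event_type, payload, idempotency_key) "
        "VALUES (?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), endpoint_id, event_type, json.dumps(payload), idempotency_key),
    )
    conn.commit()
    return cur.rowcount == 1


conn = make_db()
first = enqueue(conn, "ep_1", "order.created", {"id": "ord_1"}, "order.created:ord_1")
second = enqueue(conn, "ep_1", "order.created", {"id": "ord_1"}, "order.created:ord_1")
print(first, second)
```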

Webhook Signing

Every webhook must be signed so the receiver can verify it came from you:

import hmac
import hashlib
import time
import json

def sign_webhook(payload: dict, secret: str) -> dict:
    """Generate webhook signature headers."""
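    timestamp = str(int(time.time()))
    # Canonical serialization. The request body sent over the wire must be
    # byte-for-byte identical to this string, or verification will fail.
    body = json.dumps(payload, separators=(',', ':'), sort_keys=True)

    # Sign: timestamp + "." + body
    signature_input = f"{timestamp}.{body}"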
    signature = hmac.new(
        secret.encode(),
        signature_input.encode(),
        hashlib.sha256
    ).hexdigest()

    return {
        "X-Webhook-Signature": f"v1={signature}",
        "X-Webhook-Timestamp": timestamp,
        "X-Webhook-ID": payload.get("id", ""),
    }

Receiver verification:

def verify_webhook(body: bytes, headers: dict, secret: str) -> bool:
    """Verify webhook signature on the receiving end.

    `body` must be the raw request bytes, not re-serialized JSON.
    """
    timestamp = headers.get("X-Webhook-Timestamp", "")
    received_sig = headers.get("X-Webhook-Signature", "").removeprefix("v1=")

    # Reject missing or malformed timestamps instead of raising
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False

    # Reject timestamps more than 5 minutes from now (replay protection)
    if age > 300:
        return False

    expected_sig = hmac.new(
        secret.encode(),
        timestamp.encode() + b"." + body,
        hashlib.sha256
    ).hexdigest()

    # Compare as bytes: compare_digest raises TypeError on non-ASCII str input
    return hmac.compare_digest(expected_sig.encode(), received_sig.encode())
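
A quick round trip shows how the scheme behaves when the body is tampered with or the request is replayed late. This is a self-contained sketch of the same `timestamp.body` HMAC-SHA256 scheme. The helper names, the secret, and the fixed timestamps are made up for the example, and `now` is passed in explicitly so the result does not depend on the clock.

```python
import hashlib
import hmac
import json


def compute_signature(secret: str, timestamp: str, body: bytes) -> str:
    """HMAC-SHA256 over b"<timestamp>.<body>", hex encoded."""
    signed = timestamp.encode() + b"." + body
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def is_valid(secret: str, timestamp: str, body: bytes, signature_header: str,
             now: int, tolerance: int = 300) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    try:
        if abs(now - int(timestamp)) > tolerance:
            return False
    except ValueError:
        return False
    expected = "v1=" + compute_signature(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature_header.encode())


secret = "whsec_example"  # hypothetical secret
body = json.dumps({"id": "evt_1", "type": "order.created"},
                  separators=(",", ":"), sort_keys=True).encode()
ts = "1700000000"
header = "v1=" + compute_signature(secret, ts, body)

print(is_valid(secret, ts, body, header, now=1700000010))          # fresh and untouched
print(is_valid(secret, ts, body + b" ", header, now=1700000010))   # body modified in transit
print(is_valid(secret, ts, body, header, now=1700000301))          # replayed after the window
```

Only the first call passes; a single extra byte in the body or a timestamp outside the five-minute window is enough to fail verification.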

The Delivery Worker

import httpx
import asyncio
import json
import random
from datetime import datetime, timedelta, timezone

RETRY_DELAYS = [10, 30, 120, 600, 1800, 3600, 7200, 14400]  # seconds
MAX_ATTEMPTS = 8  # 1 initial attempt + 7 retries, so the last delay is never reached

class WebhookDeliveryWorker:
    def __init__(self, db, timeout=30):
        self.db = db
        self.timeout = timeout

    async def deliver(self, delivery_id: str):
        delivery = await self.db.get_delivery(delivery_id)
        endpoint = await self.db.get_endpoint(delivery.endpoint_id)

        if not endpoint.active:
            await self.db.mark_dead(delivery_id, "Endpoint disabled")
            return

        # Sign the payload
        headers = sign_webhook(delivery.payload, endpoint.secret)
        headers["Content-Type"] = "application/json"
        headers["User-Agent"] = "TechSaaS-Webhooks/1.0"

        # Send exactly the bytes that sign_webhook signed (same serialization);
        # letting the HTTP client re-serialize the dict would break verification
        body = json.dumps(delivery.payload, separators=(',', ':'), sort_keys=True)

        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    endpoint.url,
                    content=body.encode(),
                    headers=headers,
                    timeout=self.timeout,
                    follow_redirects=False  # Don't follow redirects
                )

            # Only 2xx counts as delivered; redirects and errors are retried
            if 200 <= response.status_code < 300:
                await self.db.mark_delivered(delivery_id, response.status_code)
                return
            else:
                await self.handle_failure(
                    delivery, response.status_code, response.text[:500]
                )

        except httpx.TimeoutException:
            await self.handle_failure(delivery, None, "Request timed out")
        except httpx.ConnectError as e:
            await self.handle_failure(delivery, None, f"Connection failed: {e}")
        except Exception as e:
            await self.handle_failure(delivery, None, f"Unexpected error: {e}")

    async def handle_failure(self, delivery, status_code, error_message):
        attempts = delivery.attempts + 1

        if attempts >= MAX_ATTEMPTS:
            await self.db.mark_dead(delivery.id, error_message)
            await self.alert_customer(delivery, error_message)
            return

        # Exponential backoff with random jitter of up to ±5% of the delay
        delay = RETRY_DELAYS[min(attempts - 1, len(RETRY_DELAYS) - 1)]
        jitter = delay * 0.1 * (random.random() - 0.5)
        # Timezone-aware timestamp to match the TIMESTAMPTZ column
        next_retry = datetime.now(timezone.utc) + timedelta(seconds=delay + jitter)

        await self.db.schedule_retry(
            delivery.id,
            attempts=attempts,
            next_retry=next_retry,
            status_code=status_code,
            error=error_message
        )
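
It helps to know how long a delivery keeps retrying before it is declared dead. The sketch below reuses the RETRY_DELAYS and MAX_ATTEMPTS values from the worker and assumes a uniform jitter of up to ±5% of the base delay. The `retry_delay` and `retry_window_seconds` helpers are illustrative names, not part of the worker.

```python
import random

RETRY_DELAYS = [10, 30, 120, 600, 1800, 3600, 7200, 14400]  # seconds
MAX_ATTEMPTS = 8


def retry_delay(failed_attempts: int, rng: random.Random) -> float:
    """Delay before the next attempt, with up to ±5% jitter (assumed range)."""
    base = RETRY_DELAYS[min(failed_attempts - 1, len(RETRY_DELAYS) - 1)]
    return base + base * 0.1 * (rng.random() - 0.5)


def retry_window_seconds() -> int:
    """Total base delay between the first and the final attempt, ignoring jitter.

    With MAX_ATTEMPTS attempts there are MAX_ATTEMPTS - 1 waits between them.
    """
    return sum(RETRY_DELAYS[: MAX_ATTEMPTS - 1])


rng = random.Random(42)  # seeded only to make the example repeatable
schedule = [round(retry_delay(n, rng), 1) for n in range(1, MAX_ATTEMPTS)]
print(schedule)
print(f"{retry_window_seconds() / 3600:.1f} hours until the final attempt")
```

With these values the final attempt happens roughly 3.7 hours after the first one, so a customer outage longer than that ends in the dead letter queue.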

Idempotency

Customers may receive the same webhook twice (network issues, worker restarts). Include an idempotency key so they can deduplicate:

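import time
import uuid
from datetime import datetime, timezone

def create_webhook_event(event_type: str, data: dict) -> dict:
    # Generated once when the event is created and stored with the payload,
    # so every retry of this delivery carries the same key
    idempotency_key = f"{event_type}:{data.get('id')}:{int(time.time())}"
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "idempotency_key": idempotency_key,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data": data
    }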

Document this in your webhook API docs so customers know to check the idempotency key before processing.
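
On the customer side, deduplication can be as simple as remembering which keys have already been handled. The receiver below is a minimal in-memory sketch. The `WebhookReceiver` class is hypothetical, and a production receiver would keep the keys in a durable store (for example a database table with a unique constraint) rather than a Python set.

```python
class WebhookReceiver:
    """Processes each idempotency key at most once (in-memory sketch)."""

    def __init__(self) -> None:
        self.seen_keys: set[str] = set()
        self.processed: list[dict] = []

    def handle(self, event: dict) -> str:
        key = event["idempotency_key"]
        if key in self.seen_keys:
            # Already handled: acknowledge without repeating side effects.
            return "duplicate"
        self.process(event)
        # Record the key only after processing succeeds, so a failed
        # attempt can still be retried by the sender.
        self.seen_keys.add(key)
        return "processed"

    def process(self, event: dict) -> None:
        self.processed.append(event)


receiver = WebhookReceiver()
event = {"id": "evt_1", "type": "order.created",
         "idempotency_key": "order.created:ord_1:1700000000", "data": {"id": "ord_1"}}
print(receiver.handle(event))
print(receiver.handle(event))  # same event delivered a second time
```

The receiver should still return a 2xx response for the duplicate, otherwise the sender will keep retrying it.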

Retry Worker (Cron Process)

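async def retry_worker():
    """Polls every 10 seconds and picks up deliveries due for retry."""
    while True:
        # Claim due rows in one statement. The row locks are released when the
        # statement commits, so next_retry_at is pushed forward as a lease
        # (it must be longer than the HTTP timeout) to keep the claim.
        pending = await db.query("""
            UPDATE webhook_deliveries
            SET next_retry_at = NOW() + INTERVAL '5 minutes'
            WHERE id IN (
                SELECT id FROM webhook_deliveries
                WHERE status = 'pending'
                  AND next_retry_at <= NOW()
                ORDER BY next_retry_at
                LIMIT 100
                FOR UPDATE SKIP LOCKED
            )
            RETURNING id
        """)

        tasks = [worker.deliver(d.id) for d in pending]
        await asyncio.gather(*tasks, return_exceptions=True)

        await asyncio.sleep(10)

FOR UPDATE SKIP LOCKED lets concurrent workers skip rows that another worker has already locked, but the lock only lasts until the statement's transaction commits. Pushing next_retry_at forward in the same statement acts as a lease: a claimed delivery stays claimed while it is being sent, and becomes due again if the worker crashes before recording a result.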

Customer-Facing Webhook Dashboard

Give customers visibility into their webhook health:

-- Recent deliveries for customer
SELECT
    d.event_type,
    d.status,
    d.attempts,
    d.last_response_status,
    d.created_at,
    d.delivered_at
FROM webhook_deliveries d
JOIN webhook_endpoints e ON e.id = d.endpoint_id
WHERE e.customer_id = $1
ORDER BY d.created_at DESC
LIMIT 50;

-- Delivery success rate
SELECT
    -- NULLIF avoids a division-by-zero error when there are no deliveries in the window
    COUNT(*) FILTER (WHERE status = 'delivered') * 100.0 / NULLIF(COUNT(*), 0) AS success_rate,
    COUNT(*) FILTER (WHERE status = 'dead') AS permanently_failed,
    AVG(EXTRACT(EPOCH FROM (delivered_at - created_at))) AS avg_delivery_seconds
FROM webhook_deliveries d
JOIN webhook_endpoints e ON e.id = d.endpoint_id
WHERE e.customer_id = $1
  AND d.created_at >= NOW() - INTERVAL '7 days';

Endpoint Disabling

If an endpoint fails consistently, disable it to save resources:

async def check_endpoint_health():
    """Disable endpoints with >95% failure rate over 24 hours."""
    unhealthy = await db.query("""
        SELECT e.id, e.customer_id,
               COUNT(*) FILTER (WHERE d.status = 'dead') AS failures,
               COUNT(*) AS total
        FROM webhook_endpoints e
        JOIN webhook_deliveries d ON d.endpoint_id = e.id
        WHERE d.created_at >= NOW() - INTERVAL '24 hours'
          AND e.active = true
        GROUP BY e.id
        HAVING COUNT(*) > 10
           AND COUNT(*) FILTER (WHERE d.status = 'dead') * 100.0 / COUNT(*) > 95
    """)

    for endpoint in unhealthy:
        await db.disable_endpoint(endpoint.id)
        await notify_customer(endpoint.customer_id,
            "Your webhook endpoint has been disabled due to persistent failures. "
            "Please check your server and re-enable in settings."
        )
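
The HAVING clause encodes two conditions: enough traffic to judge the endpoint, and a dead rate above the threshold. The same rule as a plain function makes the edge cases easy to check. `should_disable` is a hypothetical helper, with defaults copied from the query.

```python
def should_disable(total: int, dead: int,
                   min_deliveries: int = 10, threshold_pct: float = 95.0) -> bool:
    """Mirror of the HAVING clause: more than min_deliveries in the window
    and strictly more than threshold_pct of them dead."""
    if total <= min_deliveries:
        return False  # too little traffic to judge the endpoint
    return dead * 100.0 / total > threshold_pct


print(should_disable(total=100, dead=96))
print(should_disable(total=100, dead=95))  # exactly 95% is not above the threshold
print(should_disable(total=10, dead=10))   # not enough deliveries yet
```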

Monitoring Metrics

Track in Grafana:

Delivery success rate: Target >99%
Average delivery latency: Time from event to delivery
Retry rate: How many deliveries need retries
Dead letter rate: Permanently failed deliveries
Queue depth: Pending deliveries (growing = workers behind)
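
As a rough sketch of how these numbers relate to the delivery log, the function below computes them from a list of delivery records. The record shape (status, attempts, created_at and delivered_at as epoch seconds) is a simplified assumption, and `webhook_metrics` is a hypothetical helper. In practice the values would come from SQL queries or exported counters.

```python
from collections import Counter
from typing import Optional


def pct(part: int, total: int) -> Optional[float]:
    """Percentage rounded to two decimals, or None when there is no data."""
    return round(part * 100.0 / total, 2) if total else None


def webhook_metrics(deliveries: list[dict]) -> dict:
    by_status = Counter(d["status"] for d in deliveries)
    total = len(deliveries)
    latencies = [d["delivered_at"] - d["created_at"]
                 for d in deliveries if d["status"] == "delivered"]
    return {
        "success_rate_pct": pct(by_status["delivered"], total),
        "avg_delivery_seconds": sum(latencies) / len(latencies) if latencies else None,
        "retry_rate_pct": pct(sum(1 for d in deliveries if d["attempts"] > 1), total),
        "dead_letter_rate_pct": pct(by_status["dead"], total),
        "queue_depth": by_status["pending"],
    }


sample = [
    {"status": "delivered", "attempts": 1, "created_at": 100.0, "delivered_at": 101.0},
    {"status": "delivered", "attempts": 3, "created_at": 100.0, "delivered_at": 141.0},
    {"status": "dead", "attempts": 8, "created_at": 100.0, "delivered_at": None},
    {"status": "pending", "attempts": 2, "created_at": 100.0, "delivered_at": None},
]
metrics = webhook_metrics(sample)
print(metrics)
```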

Reliable webhook delivery is a hallmark of a well-engineered platform. At TechSaaS, we build webhook infrastructure as part of our platform engineering services, ensuring your customers never miss an event.
