Designing webhooks that actually arrive
Notes from a year of payment integrations: idempotency keys you can trust, retries that don't lie, and the dead-letter queue that pays for itself.
Webhooks are one of those things you ship in a sprint and pay for over years. The first version is always the same: a route that accepts JSON, runs some business logic, and returns 200. Six months later you discover the third party retried a charge fourteen times, your worker queue is wedged on a poisoned message, and the audit log is missing two days.
Most of that pain comes from treating webhooks like HTTP requests. They aren't. They're durable messages dressed up in HTTP, and they want to be handled the way you'd handle anything from a real message queue.
Trust nothing, especially the payload
Verify the signature against the raw body before you parse it. Reject anything older than a few minutes (replay protection is nearly free when the timestamp is covered by the signature: just check the timestamp header). And treat the JSON as opaque until you've stored it; schema validation comes after you've durably captured what arrived.
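The handler below calls a `verifyHmac` helper without showing it. A minimal sketch of what it could look like, assuming the sender signs the raw body with HMAC-SHA256 and sends the signature hex-encoded (both are assumptions; the real scheme is whatever your provider documents):

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: assumes HMAC-SHA256 over the raw body, hex-encoded.
function verifyHmac(rawBody, signature, secret) {
  if (typeof signature !== "string") return false;
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return (
    received.length === expected.length &&
    crypto.timingSafeEqual(received, expected)
  );
}
```

The constant-time comparison matters more than it looks: a plain `===` on strings can leak how many leading characters matched.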
// 1. Verify the signature against the raw body, not the parsed JSON
const signature = req.headers["x-signature"];
if (!signature || !verifyHmac(rawBody, signature, secret)) {
  return res.status(401).end();
}
// 2. Reject stale or undated events (replay protection).
//    A missing header yields NaN, which must fail the check, not pass it.
const sent = Number(req.headers["x-timestamp"]) * 1000;
if (!Number.isFinite(sent) || Date.now() - sent > 5 * 60_000) {
  return res.status(401).end();
}
// 3. Persist the raw event before doing anything with it.
//    Parse only to read the id; the payload itself stays unvalidated.
const event = JSON.parse(rawBody);
await db.events.insert({ id: event.id, raw: rawBody, status: "pending" });
return res.status(200).end();

Acknowledge fast, process later
The webhook handler has one job: capture the event and tell the sender you've got it. Anything that touches business state (updating a subscription, sending a receipt, writing to the ledger) belongs in a background worker. If your 200 takes more than 200ms, you're going to see the sender retry under any kind of load.
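This split also makes idempotency natural. Store the event id when you ingest and refuse to insert it twice. The worker pulls events by id and runs the actual logic. The same event arriving four times becomes one insert and three duplicate-key errors, which is exactly what you want, as long as the handler treats a duplicate as success and still returns 200. Answer a duplicate with an error and the sender keeps retrying an event you already have.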
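A rough sketch of that ingest step, with an in-memory `Map` standing in for an events table that has a unique index on the event id. `EventStore` and `ingest` are hypothetical names for illustration; in a real database the same effect comes from a unique constraint (for example PostgreSQL's `INSERT ... ON CONFLICT DO NOTHING`).

```javascript
// In-memory stand-in for an events table with a unique index on id.
class EventStore {
  constructor() {
    this.rows = new Map();
  }
  // Returns true if the event was newly stored, false if it was already there.
  insertIfAbsent(id, raw) {
    if (this.rows.has(id)) return false;
    this.rows.set(id, { id, raw, status: "pending" });
    return true;
  }
}

// Hypothetical ingest step: a duplicate is acknowledged, never re-enqueued.
function ingest(store, id, raw) {
  const inserted = store.insertIfAbsent(id, raw);
  return { httpStatus: 200, enqueued: inserted };
}

const store = new EventStore();
// The same event delivered four times, as a retrying sender might do.
const results = [1, 2, 3, 4].map(() => ingest(store, "evt_1", '{"id":"evt_1"}'));
console.log(store.rows.size); // 1
console.log(results.filter((r) => r.enqueued).length); // 1
```

Every delivery gets a 200, but only the first one produces work for the background worker.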
Idempotency is about the side effect, not the request
An idempotent endpoint isn't one that returns the same response twice. It's one whose side effects don't accumulate when called twice. There's a difference, and it bites when you start composing logic.
The pattern that's held up best for me: assign every meaningful side effect a deterministic key derived from the event, and let the database enforce uniqueness. Charge succeeded? Insert a row keyed on (event_id, account_id) before crediting the account, in the same transaction, so a duplicate delivery fails on the insert and never reaches the credit. Refund? Same.
If you can't replay the entire event stream against an empty database and get the same final state, your handlers aren't idempotent — they're just lucky.
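To make the replay test concrete, here is a small sketch with an in-memory ledger standing in for the database. The `Ledger` class, the event shape, and the amounts are all made up for illustration; the `Set` plays the role of a unique index on (event_id, account_id).

```javascript
// In-memory stand-in for a ledger plus an "applied effects" table.
class Ledger {
  constructor() {
    this.applied = new Set(); // unique (event_id, account_id) keys
    this.balances = new Map(); // account_id -> balance in cents
  }
  // Records the effect key and credits the account as one step, the way a
  // database transaction would. Returns false if the effect already ran.
  creditOnce(eventId, accountId, cents) {
    const key = JSON.stringify([eventId, accountId]);
    if (this.applied.has(key)) return false;
    this.applied.add(key);
    this.balances.set(accountId, (this.balances.get(accountId) ?? 0) + cents);
    return true;
  }
}

const events = [
  { id: "evt_1", account: "acct_a", cents: 500 },
  { id: "evt_2", account: "acct_a", cents: 250 },
];

const ledger = new Ledger();
// Deliver the whole stream twice, as a retrying sender might.
for (const e of [...events, ...events]) {
  ledger.creditOnce(e.id, e.account, e.cents);
}
console.log(ledger.balances.get("acct_a")); // 750
```

The second pass over the stream changes nothing, which is the property the replay test checks for.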
The dead-letter queue earns its keep
Eventually a webhook will arrive that your code can't handle. Bad schema, a state machine in a corner you missed, a downstream service that's down for hours. Don't let those events vanish into a retry storm and then expire. Move them to a dead-letter queue with the original payload, the exception, and the worker version.
- Build a one-screen UI to inspect DLQ events. You will use it.
- Add a 'replay' button that pushes the event back onto the main queue.
- Log a metric per topic — DLQ depth should be a graph you check on Mondays.
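As a sketch of the worker side, assuming plain arrays in place of a real queue: a failed event is parked with its payload, the exception, and the worker version, and a replay function moves it back. `WORKER_VERSION`, `processEvent`, and `replay` are hypothetical names, not part of any particular queue library.

```javascript
const WORKER_VERSION = "example-1"; // hypothetical version tag

// Runs one event through `handler`. On failure, parks the event in the DLQ
// with the context needed to debug it later.
function processEvent(event, handler, dlq) {
  try {
    handler(event);
    return "done";
  } catch (err) {
    dlq.push({
      event, // original payload, untouched
      error: err instanceof Error ? err.message : String(err),
      workerVersion: WORKER_VERSION,
      failedAt: new Date().toISOString(),
    });
    return "dead-lettered";
  }
}

// The "replay" button: move one DLQ entry back onto the main queue.
function replay(dlq, index, mainQueue) {
  const [entry] = dlq.splice(index, 1);
  if (entry) mainQueue.push(entry.event);
  return Boolean(entry);
}
```

With this shape, DLQ depth is just `dlq.length` (or a row count in a real store), which is the number worth graphing.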
What I'd skip on the first pass
Fancy retry policies. The sender already retries. Your worker doesn't need exponential backoff for the first ninety days — it needs a clear failure path and a queue you can reason about. Add the sophistication when you have logs telling you what specifically is failing, not before.
None of this is novel. It's the kind of thing you internalize the second time a Stripe outage takes down your reconciliation worker at 3am and you spend the next morning replaying events by hand. Write it down the first time it bites you, and the version six months later will look almost the same.