Real-time · Dec 18, 2025 · 11 min read

The reconnect handshake nobody writes about

Building WebSocket clients that survive flaky networks — exponential backoff, server-side jitter, slow-start handshakes, and other things you wish you'd added on day one.

Essay

Every WebSocket tutorial ends at on('message'). The hard part starts there. Networks drop, servers restart, certificates rotate, mobile clients background, ISPs do middlebox things you'll never fully understand. A real-time system survives all of that by being deliberate about how it reconnects, not by hoping it doesn't have to.

Reconnect is a protocol, not a callback

The naive version is one line of code: socket.onclose = () => reconnect(). The first time a server restarts and a thousand clients all reconnect inside fifty milliseconds, you'll wish you'd treated reconnect as a small protocol with its own rules.

Wait a random initial delay before the first attempt — server-jittered, not client-jittered.
On failure, exponentially back off with a cap (1s, 2s, 4s, 8s, 16s, then 30s).
On success, run a slow-start handshake before resuming normal traffic.
Track reconnect attempts and abandon after a generous ceiling, surfacing the failure to the UI.
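
The backoff and ceiling rules above can be sketched as a small schedule object. This is a minimal illustration, not a definitive implementation: createBackoff, its option names, and its defaults are all hypothetical. It is deliberately deterministic, on the assumption that the random spread comes from the server.

```javascript
// Hypothetical helper: names and defaults are illustrative, not from a library.
function createBackoff({ baseMs = 1000, capMs = 30000, maxAttempts = 10 } = {}) {
  let attempt = 0;
  return {
    // Returns the next delay in ms, or null once the attempt ceiling is reached.
    next() {
      if (attempt >= maxAttempts) return null;
      const delay = Math.min(capMs, baseMs * 2 ** attempt);
      attempt += 1;
      return delay;
    },
    // Call after a successful connection so the next outage starts from baseMs.
    reset() {
      attempt = 0;
    },
  };
}

const schedule = createBackoff({ maxAttempts: 7 });
const delays = [];
for (let d = schedule.next(); d !== null; d = schedule.next()) delays.push(d);
console.log(delays); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

A null from next() is the signal to stop retrying and tell the UI that the connection is gone.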

Server-side jitter is the trick

Client-side random delays look fair but aren't: your worst case is a few thousand clients picking similar timeouts. The cleaner pattern is for the server to send a recommended reconnect window in its close frame and for the client to respect it. The server can spread that window deliberately based on how many clients it knows it just dropped.

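// Server: pick a random delay inside a fixed 250-2000 ms window.
// To bias by load, widen the window as the number of dropped clients grows.
const recommendedMs = 250 + Math.floor(Math.random() * 1750);
// A close reason is limited to 123 bytes of UTF-8, so keep the payload small.
socket.close(1012, JSON.stringify({ reason: "restarting", retryAfter: recommendedMs }));

// Client: honour the recommendation when present, else fall back to local backoff.
// backoff() and reconnect() are assumed to be defined elsewhere in the client.
const safeParse = (text) => {
  try { return JSON.parse(text); } catch { return null; }
};

socket.onclose = (e) => {
  // e.reason is empty on abnormal closes, so safeParse returns null there.
  const payload = safeParse(e.reason);
  const delay = Number.isFinite(payload?.retryAfter) ? payload.retryAfter : backoff();
  setTimeout(reconnect, delay);
};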
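
The snippet above uses a fixed window. One way to spread it by load, sketched under assumptions: size the window so that roughly a target number of clients come back each second. The function name, the option names, and the default of 200 reconnects per second are hypothetical; tune them to what your servers can actually absorb.

```javascript
// Hypothetical helper: widen the reconnect window with the number of clients
// the server knows it is about to drop. All names and defaults are assumptions.
function recommendedDelayMs(droppedClients, {
  targetPerSecond = 200, // reconnects per second the server is happy to absorb
  minWindowMs = 250,
  maxWindowMs = 30000,
  random = Math.random,  // injectable so the function can be tested
} = {}) {
  const idealWindowMs = (droppedClients / targetPerSecond) * 1000;
  const windowMs = Math.min(maxWindowMs, Math.max(minWindowMs, idealWindowMs));
  // Each client gets an independent, uniformly random delay inside the window.
  return Math.floor(random() * windowMs);
}

// 1000 dropped clients at 200 per second gives a 5000 ms window.
console.log(recommendedDelayMs(1000)); // an integer from 0 to 4999
```

Because the server picks each delay, the spread stays uniform however many clients there are, and nothing depends on every client version implementing jitter correctly.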

Slow-start: don't trust a fresh socket

When a reconnected client comes back, it has a backlog: presence to resync, missed events to replay, subscriptions to re-establish. Doing all of that in parallel turns a reconnect into a thundering herd. Instead, define a handshake.

Client sends a 'resume' frame with its last-known event id.
Server replies with a snapshot or a stream of missed events, capped at a sensible window.
Only after the snapshot completes does the client send any user-initiated traffic.
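
On the client, the three steps above amount to a gate in front of user traffic. The sketch below is one possible shape, not a standard protocol: the ResumeGate class and the frame types "resume" and "snapshot-complete" are assumptions made for illustration.

```javascript
// Hypothetical client-side gate. The frame shapes ({ type: "resume" },
// { type: "snapshot-complete" }) are assumptions, not a standard protocol.
class ResumeGate {
  constructor(send) {
    this.send = send;   // function that writes one frame to the socket
    this.ready = false; // true once the server has finished the snapshot
    this.queue = [];    // user-initiated frames held back during resume
  }

  // Call when the socket opens (first connect or reconnect).
  onOpen(lastEventId) {
    this.ready = false;
    this.send({ type: "resume", lastEventId });
  }

  // Call for every frame received from the server.
  onServerFrame(frame) {
    if (frame.type !== "snapshot-complete") return;
    this.ready = true;
    // Flush held-back user traffic in the order it was queued.
    for (const queued of this.queue.splice(0)) this.send(queued);
  }

  // Call for user-initiated traffic; it is queued until the handshake is done.
  sendUser(frame) {
    if (this.ready) this.send(frame);
    else this.queue.push(frame);
  }
}

const sent = [];
const gate = new ResumeGate((frame) => sent.push(frame.type));
gate.onOpen(41);
gate.sendUser({ type: "chat" });
gate.onServerFrame({ type: "snapshot-complete" });
console.log(sent); // ["resume", "chat"]
```

In a real client the queue would need a size limit, so that a long resume cannot buffer user input without bound.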

The metrics that catch trouble early

If you only watch one thing, watch median reconnect duration. It should be under a second for healthy users. When it crosses two, you've got a regression. When it crosses ten, you've got an incident, and the user has already noticed.
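
Those thresholds are easy to turn into an automated check. The sketch below assumes you already collect reconnect durations in milliseconds per time window. The function names and status labels are hypothetical, and the "watch" label for the one-to-two-second band is an assumption, since only the other three bands are defined above.

```javascript
// Median of a list of numbers, or null when there are no samples.
function median(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical classifier using the thresholds from the text:
// under 1s is healthy, over 2s is a regression, over 10s is an incident.
function classifyReconnects(durationsMs) {
  const m = median(durationsMs);
  if (m === null) return "no-data";
  if (m > 10000) return "incident";
  if (m > 2000) return "regression";
  if (m < 1000) return "healthy";
  return "watch"; // 1s to 2s: not defined in the text, labelled here by assumption
}

console.log(classifyReconnects([400, 650, 900]));    // "healthy"
console.log(classifyReconnects([1500, 2500, 4000])); // "regression"
```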

A real-time system is only as good as its worst minute. Optimise for the worst minute.

Build the boring infrastructure first — backoff, jitter, slow-start, metrics — and the exciting features stop feeling fragile. You stop dreading the deploy window. The 3am pages get a lot less common.