Skip to content
All work

Real-Time Collaboration Suite

Presence, chat, and video that don't fall over under load.

<1s

Reconnect (p50)

6x

Capacity factor

5%+

Packet loss tolerated

Overview

A real-time stack for a remote-first team product: presence, chat, document cursors, and selective-forwarding video — built to recover gracefully when networks misbehave.

The problem

The team's first version used a single Socket.IO server and a permissive reconnection model. As soon as one rooms list crossed a thousand users, reconnections caused thundering-herd cascades.

Approach

  1. 01

    Moved presence and pub/sub onto Redis with consistent hashing across WebSocket workers.

  2. 02

    Replaced naïve reconnect with exponential backoff, server-side jitter, and a slow-start handshake.

  3. 03

    Picked Mediasoup over a full mesh — SFU gives quality control room and survives bandwidth dips.

Outcome

  • Survived a 6x usage spike during a product launch with no operator intervention.

  • Median reconnect time after network drop fell from 8.4s to under 900ms.

  • Video bitrates adapt cleanly to packet loss above 5% without dropping participants.