Design a Chat Application
Designing a real-time chat system like WhatsApp is a common senior-level system design question. It tests your knowledge of WebSockets, message queues, distributed storage, and scalability.
Requirements Clarification
Non-Functional: 500 million users, 100 billion messages/day, <100ms message delivery latency, messages stored for 5 years, end-to-end encryption
Back-of-Envelope
Messages per second = 100B / (24 * 3600) = ~1.16M msg/s on average (peak traffic will be higher)
Storage per message = 100 bytes avg
Daily storage = 100 billion messages × 100 bytes = 10 TB/day
5-year storage = 10 TB × 365 × 5 = ~18 PB (need distributed storage)
Connections: 500M users, 20% active = 100M concurrent WebSocket connections
Servers needed: assume 1 server handles ~65K connections (a conservative planning figure; a well-tuned server can hold more)
WebSocket servers = 100M / 65K = ~1,540 servers
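The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `estimate_capacity` and its parameter names are made up for illustration, and the inputs are the same assumptions listed in the requirements.

```python
SECONDS_PER_DAY = 24 * 3600


def estimate_capacity(
    users=500_000_000,
    msgs_per_day=100_000_000_000,
    bytes_per_msg=100,
    retention_years=5,
    active_percent=20,
    conns_per_server=65_000,
):
    """Back-of-envelope numbers for the chat system (hypothetical helper)."""
    daily_bytes = msgs_per_day * bytes_per_msg
    concurrent = users * active_percent // 100
    return {
        # Average rate; real peaks are higher than this.
        "msgs_per_sec": msgs_per_day / SECONDS_PER_DAY,
        "daily_tb": daily_bytes / 10**12,
        "total_pb": daily_bytes * 365 * retention_years / 10**15,
        "concurrent_connections": concurrent,
        # Ceiling division: a partially filled server still counts.
        "ws_servers": -(-concurrent // conns_per_server),
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.2f}")
```

Changing a single input (for example, the assumed connections per server) shows how sensitive the server count is to that assumption.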
High-Level Architecture
Client (Mobile/Web)
    |
    | WebSocket (persistent connection)
    |
Load Balancer (L7, sticky sessions by user ID)
    |
Chat Servers (stateful: maintain WebSocket connections)
    |
    ├── Message Queue (Apache Kafka)
    |       |
    |       └── Message Processor Service
    |               ├── Store to Cassandra (messages)
    |               ├── Push Notification Service (FCM/APNs)
    |               └── Update delivery status
    |
    ├── Presence Service (Redis pub/sub: online/offline)
    |
    └── Media Service
            ├── S3 / CDN (store images, videos)
            └── Return pre-signed URLs to clients
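To make the Presence Service's role concrete, here is a minimal in-memory sketch of the lookup it provides: which chat server currently holds a user's connection. The class `PresenceRegistry`, its methods, and the 30-second TTL are assumptions for illustration; in the architecture above this state would live in Redis rather than in a Python dict.

```python
import time


class PresenceRegistry:
    """Maps user_id -> chat server, expiring entries that stop heartbeating."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # user_id -> (server_id, expires_at)

    def heartbeat(self, user_id, server_id):
        # Called by the chat server that owns the user's WebSocket.
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def disconnect(self, user_id):
        self._entries.pop(user_id, None)

    def get_server(self, user_id):
        # Returns the owning server, or None if the user is offline.
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() >= expires_at:
            # Missed heartbeats: treat the user as offline.
            del self._entries[user_id]
            return None
        return server_id
```

The TTL covers the case where a chat server crashes without sending a clean disconnect: stale entries disappear on their own instead of routing messages to a dead server.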
Message Delivery Flow
import json
import time
import uuid

# kafka, cassandra, presence, internal_rpc and push are placeholders for
# async client objects that are configured elsewhere.

# 1. Sender sends message via WebSocket
class ChatServer:
    async def handle_message(self, ws, message: dict):
        # Validate and enrich message. In production, take sender_id from the
        # authenticated connection instead of trusting the client payload.
        msg = {
            'id': str(uuid.uuid1()),  # time-based UUID, fits the TIMEUUID column
            'chat_id': message['chat_id'],
            'sender_id': message['sender_id'],
            'receiver_id': message['receiver_id'],
            'content': message['content'],
            'timestamp': time.time(),
            'status': 'sent',
        }
        # 2. Publish to Kafka immediately (fast, non-blocking).
        # Keying by receiver_id keeps each receiver's messages in one partition.
        await kafka.produce('messages', key=msg['receiver_id'], value=msg)
        # 3. ACK back to sender: message accepted
        await ws.send(json.dumps({'type': 'ack', 'msg_id': msg['id']}))


# 4. Message Processor (Kafka consumer)
async def process_message(msg):
    # Store in Cassandra (column names match the messages table)
    await cassandra.execute(
        'INSERT INTO messages (chat_id, message_id, sender_id, content, status) '
        'VALUES (?,?,?,?,?)',
        [uuid.UUID(msg['chat_id']), uuid.UUID(msg['id']),
         uuid.UUID(msg['sender_id']), msg['content'], msg['status']]
    )
    # 5. Deliver to receiver if online
    receiver_server = await presence.get_server(msg['receiver_id'])
    if receiver_server:
        # Forward to the server holding receiver's WebSocket
        await internal_rpc.deliver(receiver_server, msg)
    else:
        # Receiver offline: send push notification
        await push.notify(msg['receiver_id'], msg)
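The flow above depends on external services, so it cannot be run as-is. The following self-contained sketch simulates the same five steps in one process, assuming an `asyncio.Queue` in place of Kafka, a list in place of Cassandra, and a dict of online users in place of the presence lookup. All names here (`run_demo`, `online`, `pushed`, and so on) are hypothetical stand-ins, not real client APIs.

```python
import asyncio
import uuid


async def run_demo():
    queue = asyncio.Queue()   # stands in for the Kafka topic
    stored = []               # stands in for the Cassandra table
    online = {"bob": []}      # receiver_id -> messages delivered over a socket
    pushed = []               # messages sent as push notifications
    acks = []                 # message IDs acknowledged to senders

    async def handle_message(message):
        # Steps 1-3: enrich, enqueue, and ACK without waiting for delivery.
        msg = {**message, "id": str(uuid.uuid1()), "status": "sent"}
        await queue.put(msg)
        acks.append(msg["id"])

    async def processor():
        # Steps 4-5: persist first, then deliver or fall back to push.
        while True:
            msg = await queue.get()
            stored.append(msg)
            if msg["receiver_id"] in online:
                online[msg["receiver_id"]].append(msg)
            else:
                pushed.append(msg)
            queue.task_done()

    worker = asyncio.create_task(processor())
    await handle_message({"sender_id": "alice", "receiver_id": "bob", "content": "hi bob"})
    await handle_message({"sender_id": "alice", "receiver_id": "carol", "content": "hi carol"})
    await queue.join()        # wait until both messages are processed
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return {"acks": acks, "stored": stored, "delivered": online["bob"], "pushed": pushed}


if __name__ == "__main__":
    result = asyncio.run(run_demo())
    print(len(result["stored"]), "stored,", len(result["pushed"]), "pushed")
```

The point of the sketch is the ordering: the sender gets its ACK as soon as the message is enqueued, while storage and delivery happen asynchronously in the consumer.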
Database Choice
-- Cassandra: well suited to chat messages
-- Partition by chat_id, cluster by message_id (time-based, descending: latest first)
CREATE TABLE messages (
    chat_id    UUID,
    message_id TIMEUUID,  -- time-based UUID ensures ordering
    sender_id  UUID,
    content    TEXT,
    media_url  TEXT,
    status     TEXT,      -- sent/delivered/read
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy'};
-- Time-window compaction: efficient for time-series data that is written once;
-- frequent status updates to old rows reduce its benefit.
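To see why a TIMEUUID clustering column gives "latest first" reads, the sketch below builds version 1 UUIDs from known timestamps and orders a partition by the embedded time. The helpers `timeuuid_from_ms` and `fetch_latest` are hypothetical; they model the ordering in plain Python and do not talk to Cassandra. Note that the sort uses the UUID's timestamp field, not its string form, because the low bits of the timestamp come first in the textual representation.

```python
import uuid

# 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
GREGORIAN_OFFSET = 0x01B21DD213814000


def timeuuid_from_ms(unix_ms, clock_seq=0, node=0):
    """Build a version 1 UUID whose timestamp is the given Unix time in ms."""
    t = unix_ms * 10_000 + GREGORIAN_OFFSET
    fields = (
        t & 0xFFFFFFFF,                       # time_low
        (t >> 32) & 0xFFFF,                   # time_mid
        ((t >> 48) & 0x0FFF) | 0x1000,        # time_hi with version 1
        ((clock_seq >> 8) & 0x3F) | 0x80,     # clock_seq_hi with RFC 4122 variant
        clock_seq & 0xFF,                     # clock_seq_low
        node,
    )
    return uuid.UUID(fields=fields)


def fetch_latest(partition_rows, limit):
    """Model of: SELECT ... WHERE chat_id = ? LIMIT ? with DESC clustering."""
    ordered = sorted(partition_rows, key=lambda row: row[0].time, reverse=True)
    return ordered[:limit]


if __name__ == "__main__":
    rows = [
        (timeuuid_from_ms(2000), "second"),
        (timeuuid_from_ms(1000), "first"),
        (timeuuid_from_ms(3000), "third"),
    ]
    for message_id, content in fetch_latest(rows, 2):
        print(message_id, content)
```

Because rows are already stored in this order inside the partition, reading the most recent page of a conversation is a sequential read rather than a sort at query time.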
Q: Why WebSockets instead of HTTP polling?
HTTP polling sends a request every N seconds even when there are no messages, which is wasteful. Long polling holds the request open until data arrives, which is better but still pays HTTP header overhead on every response. WebSockets maintain a persistent bidirectional connection with minimal framing overhead (2 to 14 bytes per frame, versus hundreds of bytes of HTTP headers).
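The framing overhead can be computed directly from the WebSocket frame layout in RFC 6455: a 2-byte base header, an optional 2- or 8-byte extended length, and a 4-byte masking key on client-to-server frames. The function name below and the sample polling request are illustrative assumptions; real HTTP header sizes vary with cookies and user agents.

```python
def ws_frame_header_size(payload_len, masked):
    """Header bytes for one WebSocket frame carrying payload_len bytes."""
    if payload_len <= 125:
        extended = 0      # length fits in the base header
    elif payload_len <= 0xFFFF:
        extended = 2      # 16-bit extended length
    else:
        extended = 8      # 64-bit extended length
    mask = 4 if masked else 0  # clients must mask; servers must not
    return 2 + extended + mask


# An illustrative polling request; the header values are made up.
SAMPLE_POLL_REQUEST = (
    "GET /messages?since=12345 HTTP/1.1\r\n"
    "Host: chat.example.com\r\n"
    "User-Agent: ExampleChat/1.0\r\n"
    "Accept: application/json\r\n"
    "Authorization: Bearer <token>\r\n"
    "\r\n"
)

if __name__ == "__main__":
    print("client frame, 100-byte message:", ws_frame_header_size(100, masked=True))
    print("server frame, 100-byte message:", ws_frame_header_size(100, masked=False))
    print("sample HTTP poll request bytes:", len(SAMPLE_POLL_REQUEST.encode()))
```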
Q: Why Cassandra for messages instead of MySQL?
Cassandra is optimized for append-heavy, time-series write workloads, which is exactly what chat is. It scales horizontally across datacenters with no single point of failure. A single MySQL node becomes hard to operate at multi-terabyte scale, and going beyond that requires application-level sharding. Cassandra's partition key (chat_id) ensures all messages for a conversation are stored together on the same set of replica nodes, so reading a conversation's recent history touches a single partition.
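The co-location property can be illustrated with a simplified token ring: hash the partition key, find its position on the ring, and take the next few nodes as replicas. This is only a sketch under stated assumptions. Cassandra's default partitioner uses Murmur3 and virtual nodes, whereas this example uses MD5 and one token per node, and the class name `TokenRing` is hypothetical.

```python
import bisect
import hashlib


class TokenRing:
    """Toy consistent-hash ring: one token per node, replicas walk clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self._rf = min(replication_factor, len(nodes))
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value):
        # First 8 bytes of MD5 as an integer token (illustrative choice).
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def replicas(self, partition_key):
        # First node clockwise from the key's token, then its successors.
        start = bisect.bisect_right(self._tokens, self._token(partition_key))
        size = len(self._ring)
        return [self._ring[(start + i) % size][1] for i in range(self._rf)]


if __name__ == "__main__":
    ring = TokenRing([f"node-{i}" for i in range(6)])
    print("chat-42 ->", ring.replicas("chat-42"))
    print("chat-43 ->", ring.replicas("chat-43"))
```

Every message with the same chat_id hashes to the same token, so it always lands on the same replicas, while different conversations spread across the cluster.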