System Design · 25 min read · Intermediate

Scaling: Vertical, Horizontal, and Sharding

How systems grow from one box to thousands, and the strategies for splitting state across them.

Vertical scaling (scale up)

The simplest answer to "my server is overloaded" is to buy a bigger one: more CPU, more RAM, more disk. Modern cloud VMs go up to hundreds of cores and terabytes of RAM. Vertical scaling has real benefits: nothing in your code changes, single-machine semantics still hold, and there is no distributed-systems complexity.

But vertical scaling has hard limits. There's a maximum machine size. Doubling capacity often more than doubles the price. And one machine is one failure domain — when it goes down, you're down.

Horizontal scaling (scale out)

Horizontal scaling means adding more machines of similar size. Now you can keep growing with cheaper hardware added in parallel, and tolerate single-machine failures. The hard part is splitting the work: how do incoming requests find a free machine? How is data partitioned? How do the machines coordinate?

Stateless vs stateful

  • Stateless services — each request is independent. Easy to scale horizontally — just add machines and a load balancer.
  • Stateful services — hold data (databases, caches, session stores). Harder to scale because state must be partitioned and replicated.

Rule of thumb: keep your application servers stateless and push state into specialized stateful services (databases, caches). That single architectural choice is the foundation of nearly every modern web stack.
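Here is a minimal sketch of that idea, with a plain dict standing in for an external session store such as Redis (all names here are illustrative, not a real API):

```python
# Illustrative sketch only: SESSION_STORE stands in for a shared, external
# store (e.g. Redis). The handler keeps no per-user state in process memory,
# so any app server in the fleet can serve any request.

SESSION_STORE = {}  # pretend this lives in a separate stateful service

def handle_request(user_id: str, action: str) -> str:
    # Load whatever state the request needs from the shared store...
    session = SESSION_STORE.get(user_id, {"cart": []})
    if action == "add_to_cart":
        session["cart"].append("item")
    # ...and write it back. Nothing survives in this process between requests,
    # so adding or removing app servers never loses user state.
    SESSION_STORE[user_id] = session
    return f"{user_id} has {len(session['cart'])} item(s) in cart"

print(handle_request("alice", "add_to_cart"))
```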

Load balancers

A load balancer sits in front of your fleet and distributes requests across it. The simplest strategy is round-robin. Smarter ones consider machine load or response time, or hash on a request property (e.g. user id) so the same user keeps hitting the same backend. AWS ALB, Nginx, HAProxy, and Cloudflare's load balancer are all examples.

Load balancer with health checks
The LB watches each backend. Unhealthy backends are skipped; healthy ones share the load. (Interactive diagram: three healthy backends behind one load balancer, routed round-robin.)
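A toy version of that behavior, with hypothetical names and health reduced to a boolean flag, showing round-robin over healthy backends plus hash-based sticky routing:

```python
import itertools

class LoadBalancer:
    """Toy load balancer: round-robin over healthy backends, plus hash-based
    sticky routing. Illustrative only -- real LBs (ALB, Nginx, HAProxy) add
    active health checks, weighting, connection draining, and more."""

    def __init__(self, backends):
        self.backends = backends
        self.healthy = {b: True for b in backends}
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        # Pretend a health check just failed for this backend.
        self.healthy[backend] = False

    def route(self, request_id):
        # Round-robin: try each backend at most once, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if self.healthy[backend]:
                return backend
        raise RuntimeError("no healthy backends")

    def route_sticky(self, user_id):
        # Hash on a request property so the same user lands on the same
        # backend (stable only while the healthy set doesn't change).
        live = [b for b in self.backends if self.healthy[b]]
        return live[hash(user_id) % len(live)]

lb = LoadBalancer(["server-a", "server-b", "server-c"])
print(lb.route(1), lb.route(2), lb.route(3))   # server-a server-b server-c
lb.mark_down("server-b")
print(lb.route(4), lb.route(5))                # server-a server-c (b skipped)
```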

Sharding (partitioning data)

When your data doesn't fit on one machine, you split it across many. A shard is one such partition. The trick is choosing how to split — the shard key.

📚 Real-life analogy — A library split across buildings
When your library has too many books for one building, you split them across multiple buildings. By author surname (range)? By a stamped ID (hash)? Each strategy makes some questions easy and others hard. That's sharding.
Sharding by hash(key) % N
Simple, but every time you add or remove a shard, almost every key moves. (Interactive diagram: 3 shards, 4 keys — alice on shard 0, bob and cy on shard 1, dee on shard 2; each key → hash(k) % N.)
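A quick, self-contained way to see how disruptive resizing is under mod-N sharding (key names and counts made up for illustration):

```python
import hashlib

def shard_for(key: str, n: int) -> int:
    # Stable hash -> shard index. (Python's built-in hash() is salted per
    # process, so MD5 is used here for a deterministic example.)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(shard_for(k, 3) != shard_for(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 3 to 4 shards")
# Typically ~75% -- nearly every key changes shard, which means a massive
# data migration just to add one machine.
```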
  • Hash-based sharding — `shard = hash(key) % N`. Distributes evenly. But changing N rehashes everything — painful.
  • Range-based sharding — split by key ranges (A-F on shard 1, G-M on shard 2, ...). Resharding is easier: just split a busy range in two. But hot ranges are real (think of how many surnames start with S: that shard takes disproportionate load).
  • Consistent hashing — keys and shards both hash onto a ring; each key goes to the next shard clockwise. Adding a shard moves only roughly K/N of the keys, not all of them (see the sketch after this list). Used by DynamoDB, Cassandra, and Memcached clients at scale.
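A minimal consistent-hashing ring, sketched under the assumption of MD5 placement and 100 virtual nodes per shard (real systems differ in the details), showing that growing the cluster relocates only a small slice of the keys:

```python
import bisect
import hashlib

def h(value: str) -> int:
    # Deterministic hash onto a large integer "ring".
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent hashing sketch. Each node is placed on the ring at
    several points ("virtual nodes") to smooth the distribution; a key
    belongs to the first node point clockwise from its own hash."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (point, node)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (h(f"{node}#{i}"), node))

    def node_for(self, key):
        points = [p for p, _ in self._ring]
        i = bisect.bisect_right(points, h(key)) % len(self._ring)
        return self._ring[i][1]

keys = [f"user-{i}" for i in range(10_000)]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("shard-3")                     # grow the cluster by one shard
after = {k: ring.node_for(k) for k in keys}

moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")
# Only the keys now owned by shard-3 move (roughly a quarter here),
# versus ~75% under plain hash(key) % N.
```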

Replication

Replication keeps multiple copies of the same data. Replicas serve reads (so the leader isn't a bottleneck) and provide failover (when the leader dies, a replica is promoted).

  • Single-leader (primary-replica) — all writes go to the leader. Simple, common. Followers may lag.
  • Multi-leader — multiple nodes accept writes. Higher write availability, but conflict resolution is hard.
  • Leaderless (Cassandra, Dynamo) — every node accepts reads and writes. Quorum-based: write to W of N nodes, read from R, with R + W > N to guarantee overlap (sketched below).
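To make the quorum arithmetic concrete, here is a toy in-memory leaderless store (no failure handling, no read repair — purely illustrative) that writes to W replicas and reads from R:

```python
import random

class QuorumStore:
    """Toy leaderless replication: N replicas, write to any W, read from any R.
    Because R + W > N, every read set overlaps every write set in at least one
    replica, so the freshest (highest-versioned) value is always seen.
    Real systems add hinted handoff, read repair, vector clocks, etc."""

    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "quorums must overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]   # key -> (version, value)
        self.version = 0

    def write(self, key, value):
        self.version += 1
        # The write lands on only W of the N replicas (the rest may be slow or down).
        for replica in random.sample(self.replicas, self.w):
            replica[key] = (self.version, value)

    def read(self, key):
        # Read R replicas and return the freshest copy among them.
        answers = [rep[key] for rep in random.sample(self.replicas, self.r)
                   if key in rep]
        return max(answers)[1] if answers else None

store = QuorumStore(n=3, w=2, r=2)
store.write("profile:alice", "v1")
store.write("profile:alice", "v2")
print(store.read("profile:alice"))   # always "v2": R + W > N guarantees overlap
```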
⚠ Watch out
Replication ≠ backup. A bug or accidental DELETE replicates instantly to all replicas. Always keep separate, point-in-time backups.