
Advanced Distributed Systems: Consistency, Availability, and Trade-offs

CAP theorem, eventual consistency, and how to make the right trade-offs.

20 min · Advanced

Key Takeaway

The CAP theorem is not a system-wide decision — it's a per-operation trade-off. Your payment ledger and your notifications badge count have fundamentally different requirements. This article demystifies CAP, practical consistency models, consensus protocols (Raft and Paxos without the PhD), and the all-important PACELC framework. Then I share a five-step decision framework you can apply to your next architecture discussion. Exactly-once delivery is a myth. Exactly-once processing via idempotency is engineering.

I once watched a senior architect spend forty-five minutes in a design review explaining why a new payment service needed strong consistency. The room nodded along. Everyone was impressed by the depth of the explanation.

Then a staff engineer asked one question: "What happens when the network between your two data centres partitions for thirty seconds?"

The room went silent.

The architect knew the theory. He'd read the papers, could cite the CAP theorem from memory, and understood the vocabulary fluently. But he had never actually thought through what those concepts meant for the specific system he was building. He'd confused knowing the words with understanding the trade-offs.

Distributed systems are where theory meets production reality at high speed. And production reality always wins.

The CAP Theorem: What It Actually Means for You

Let me save you from the academic definition and give you the practitioner's version.

The CAP theorem says that when a network partition occurs — and it will — you must choose between two properties: consistency (every read gets the most recent write) or availability (every request gets a response, even if the data is stale).

That's it. You don't "pick two out of three." You always have to handle partitions because networks fail. The real question is: when the network breaks, do you serve potentially stale data, or do you refuse to serve at all?

Scenario 1: Bank transfer. A customer moves $5,000 between accounts. A network partition occurs mid-transfer. Do you show a stale balance that says the money is still in both accounts? You do not. You block the request until the partition heals. You choose consistency over availability.

Scenario 2: Social media feed. A user posts a photo. A partition occurs. Do you show a feed that's missing the latest post, or do you refuse to load the feed entirely? You show the slightly stale feed. Nobody is harmed. You choose availability over consistency.

The mistake I see engineers make constantly: treating CAP as a system-wide decision. It isn't. Different operations within the same system can and should make different trade-offs. Your payment ledger needs strong consistency. Your notification badge count does not. Design accordingly — operation by operation, data type by data type.
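One way to make the per-operation choice concrete is to write it down as data. The sketch below is illustrative only: the operation names, the `POLICY` table, and both helper functions are hypothetical, not part of any library.

```python
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"      # block or fail rather than serve stale data
    CAUSAL = "causal"      # preserve the order of related events
    EVENTUAL = "eventual"  # serve possibly stale data, reconcile later


# Hypothetical operation names; the point is that the choice is per operation.
POLICY = {
    "ledger.transfer": Consistency.STRONG,
    "inventory.reserve": Consistency.STRONG,
    "chat.send_message": Consistency.CAUSAL,
    "feed.read": Consistency.EVENTUAL,
    "notifications.badge_count": Consistency.EVENTUAL,
}


def consistency_for(operation: str) -> Consistency:
    # Unknown operations default to the safest (and slowest) option.
    return POLICY.get(operation, Consistency.STRONG)


def serve_during_partition(operation: str) -> bool:
    # In this sketch, only strongly consistent operations refuse to answer.
    return consistency_for(operation) is not Consistency.STRONG
```

Defaulting unknown operations to strong consistency is a design choice in this sketch: a forgotten entry costs latency rather than correctness.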

Consistency Models: A Practical Guide

Consistency models are the promises your system makes about data freshness. Understanding them saves you from over-engineering things that don't need guarantees, and under-engineering things that do.

Strong Consistency

Every read returns the most recent write. If I write a value and you read it one millisecond later from a different node, you see my write.

When it makes sense: Financial transactions, limited inventory counts, access control changes, any operation where showing stale data causes real-world damage.

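The cost: Latency and availability. Every write waits for acknowledgement from a quorum of nodes before it commits, so each write pays at least one network round trip to the slowest node in that quorum. Under partition, any node that cannot reach a quorum blocks or rejects requests instead of serving stale data. Eventual consistency avoids both costs by acknowledging writes locally and replicating afterwards.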

Real example: When you transfer money in a banking app and immediately check your balance, you expect to see the updated number. If the app showed you the old balance and you initiated the transfer again, that's not a minor UX issue — it's a critical product failure. Banks pay the latency cost because the alternative is unacceptable.
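The quorum idea behind that latency cost can be sketched in a few lines. This is a toy model, assuming three replicas that each store a `(version, value)` pair and quorum sizes chosen so that R + W > N. It ignores concurrent writers and failure handling, which real systems must deal with.

```python
from itertools import combinations

N, W, R = 3, 2, 2  # assumed cluster size and quorum sizes; note R + W > N


def write(replicas, targets, version, value):
    # A write commits once W replicas have stored it.
    assert len(targets) >= W
    for i in targets:
        replicas[i] = (version, value)


def read(replicas, targets):
    # A read asks R replicas and keeps the highest-versioned answer.
    assert len(targets) >= R
    return max(replicas[i] for i in targets)[1]


replicas = [(0, "old")] * N
write(replicas, targets=(0, 1), version=1, value="new")  # replica 2 missed it

# Every possible read quorum overlaps the write quorum, so none returns "old".
results = {read(replicas, q) for q in combinations(range(N), R)}
print(results)  # {'new'}
```

Replica 2 is still stale, but no read quorum of size two can consist of replica 2 alone, which is why the stale value never surfaces.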

Eventual Consistency

If no new writes occur, all replicas will eventually converge to the same value. In practice, "eventually" usually means milliseconds to seconds, but the model itself guarantees no upper bound on the delay.

When it makes sense: Social feeds, product reviews, analytics dashboards, DNS propagation, any operation where "a few seconds behind" is acceptable.

The cost: You must design your application to handle temporary inconsistency gracefully. A user might post a comment and not see it immediately on refresh. Your UI needs to account for this — optimistic updates, "posting..." states, eventual confirmation.

Real example: When you tweet something and a follower in a different region doesn't see it for two seconds, nobody files a support ticket. If every read required global consensus, the feed would be unusably slow. Twitter runs on eventual consistency for feeds because the latency of strong consistency would kill the product.
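Convergence is easier to trust once you have watched it happen. The sketch below assumes last-write-wins merging on a logical timestamp and a simple ring-shaped exchange between replicas. Both are simplifying assumptions for illustration, not a description of how any particular product replicates.

```python
def merge(a, b):
    # Last-write-wins: keep the (timestamp, value) pair with the later timestamp.
    return max(a, b)


def anti_entropy_round(replicas):
    # Each replica exchanges state with its neighbour and both keep the winner.
    for i in range(len(replicas)):
        j = (i + 1) % len(replicas)
        winner = merge(replicas[i], replicas[j])
        replicas[i] = replicas[j] = winner


# Three replicas; only the first has seen the newest write (timestamp 2).
replicas = [(2, "photo posted"), (1, "no photo"), (1, "no photo")]

stale_read = replicas[2][1]   # a reader on replica 2 still sees the old state
anti_entropy_round(replicas)  # background sync, no new writes in the meantime
converged = len(set(replicas)) == 1
```

Between the write and the sync round, a reader on the third replica gets the stale value. After the round, every replica holds the same pair.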

Causal Consistency

A middle ground that's often underused. Causal consistency guarantees that causally related operations are seen in the same order by all nodes. Operations with no causal relationship can be seen in different orders.

When it makes sense: Messaging systems, collaborative editing, comment threads — anywhere the order of related events matters, but global ordering of all events doesn't.

Real example: In a messaging app, if I send "Are you free tonight?" followed by "Let's grab dinner," every observer must see those messages in that order. But if two unrelated users are having separate conversations, the system doesn't need to globally order their messages relative to each other. This allows correctness where it matters without paying for global consensus everywhere.
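Vector clocks are one common way to track "causally related" (they are not the only one). In this minimal sketch each event carries a dict of per-sender counters. The user names and the two helper functions are hypothetical.

```python
def happened_before(a: dict, b: dict) -> bool:
    # a precedes b if every counter in a is <= its counterpart in b,
    # and at least one counter is strictly smaller.
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a: dict, b: dict) -> bool:
    # Neither event causally precedes the other, so any delivery order is fine.
    return a != b and not happened_before(a, b) and not happened_before(b, a)


msg1 = {"alice": 1}   # "Are you free tonight?"
msg2 = {"alice": 2}   # "Let's grab dinner"
other = {"carol": 1}  # a message in an unrelated conversation
```

Here `happened_before(msg1, msg2)` is true, so every observer must deliver them in that order, while `concurrent(msg2, other)` is true, so the system is free to order those two however it likes.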

Consensus Protocols: Raft and Paxos Without the PhD

You don't need to implement Raft or Paxos. But you need to understand what they do conceptually, because they underpin every distributed database, configuration store, and leader-election mechanism you'll use.

The problem they solve: How do multiple nodes agree on a single value (or a sequence of values) even when some nodes are unreachable?

Paxos was the original algorithm, published by Leslie Lamport in 1998. It is famously difficult to understand and even harder to implement correctly. The running joke in distributed systems circles is that nobody truly understands Paxos — they only think they do until they try to implement it.

Raft was designed explicitly to be understandable. It breaks the consensus problem into three clean sub-problems:

Leader election — One node is elected leader. If it fails, a new election happens. While there's no leader, no writes are accepted.
Log replication — The leader accepts writes and replicates them to followers before acknowledging.
Safety — Once a log entry is committed, it will not be lost, even under failures.

The mental model I use: Raft is a team with a designated decision-maker. The lead makes calls and informs the rest of the team. If the lead goes on leave unexpectedly, the team votes for a new one. During the voting period, no decisions get made. Once a new lead is chosen, they get everyone caught up.
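The "lead informs the team" step has a precise meaning: an entry counts as committed once a majority of nodes have stored it. A minimal sketch of that rule follows, with one deliberate simplification: real Raft also requires the entry to come from the leader's current term, which is omitted here. The function names are my own.

```python
def majority(cluster_size: int) -> int:
    return cluster_size // 2 + 1


def commit_index(match_index: list) -> int:
    # match_index[i] is the highest log index known to be stored on node i
    # (the leader counts itself). The commit index is the largest log index
    # that at least a majority of nodes have reached.
    ranked = sorted(match_index, reverse=True)
    return ranked[majority(len(match_index)) - 1]
```

With five nodes at log indexes 7, 7, 5, 3, and 2, `commit_index` returns 5: three nodes hold entry 5, but only two hold entries 6 and 7, so those are not yet safe to acknowledge.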

What this means for you operationally: When you use etcd, ZooKeeper, or CockroachDB, they run consensus protocols underneath. Knowing this helps you reason about why writes are slower than reads, why leader failover causes brief unavailability, and why an odd number of nodes (3, 5, 7) is recommended: you need a majority to form a quorum, and an even-sized cluster adds a node without tolerating any more failures (four nodes survive one failure, the same as three).
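The arithmetic behind the odd-node recommendation is short enough to check yourself. The function names below are illustrative.

```python
def quorum(n: int) -> int:
    # Smallest group that is a strict majority of n nodes.
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    # Nodes that can be lost while a quorum is still reachable.
    return n - quorum(n)


for n in range(3, 8):
    print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Going from three nodes to four raises the quorum from two to three but still tolerates only one failure. The fifth node is the one that buys you a second failure.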

Partition Tolerance in Practice

The textbooks give you the clean "two nodes can't reach each other" scenario. Production partitions are messier.

Full partition: Two groups of nodes can't communicate. Each side must independently decide: keep serving, or block until the partition heals?

Partial partition: Node A can reach Node B, Node B can reach Node C, but Node A cannot reach Node C. These are particularly nasty because they create asymmetric views of the cluster. Some nodes think everything is healthy. Others don't.

Asymmetric partition: Node A can send to Node B but can't receive responses. This breaks request-response protocols in ways that timeout-based detection doesn't catch quickly.

What to actually do when a partition occurs:

Detect it fast. Heartbeats, health checks, gossip protocols. The faster you detect a partition, the faster you can react. Most production systems combine timeout-based detection with active probing.
Honour your pre-decided trade-off. This should be decided at design time, not during an incident. If you're building a CP system, block the minority partition. If you're building an AP system, both sides keep serving with the understanding that reconciliation will be needed.
Plan for reconciliation. This is the part most teams skip. When the partition heals, you have two divergent states. How do you merge them? Last-write-wins? Application-level conflict resolution? CRDTs (Conflict-free Replicated Data Types)? If you haven't designed the reconciliation strategy, you will be writing it under pressure during the incident. I've seen teams handle the partition correctly and then corrupt data during reconciliation because nobody had thought about it.
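To make the reconciliation options less abstract, here is a minimal sketch of the simplest CRDT, a grow-only counter. It assumes the value being reconciled is something like a like count that only ever increases. The class and the node names are illustrative, and most real data needs a richer type than this.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        # A node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent, so
        # replicas converge regardless of merge order or repeated merges.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Both sides keep counting during a partition...
east, west = GCounter("east"), GCounter("west")
east.increment(3)
west.increment(2)

# ...and merge in both directions once it heals.
east.merge(west)
west.merge(east)
```

After the merge both sides report 5, and merging again changes nothing. Last-write-wins on a plain integer would have kept either 3 or 2 and silently dropped the other side's increments.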

A hard rule I follow: Always test partition scenarios before you claim a system is production-ready. Use Jepsen or Toxiproxy to inject network faults, and Chaos Monkey to kill instances at random. If you haven't tested partition behaviour, you only know what you hope the system does, not what it actually does.

PACELC: The Framework CAP Should Have Been

CAP has a blind spot: it only describes behaviour during partitions. But most of the time, your system is not partitioned. What trade-offs are you making during normal operation?

PACELC extends CAP: if there's a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.

This is far more useful because it captures the trade-offs you're making every day, not just during failures.

PA/EL systems (Cassandra, DynamoDB): During partition, choose availability. During normal operation, choose low latency. Always available and fast, but data may be temporarily stale.
PC/EC systems (traditional RDBMS with synchronous replication): Choose consistency in both scenarios. Always correct, but higher latency.
PA/EC systems (some MongoDB configurations): Availability during partitions, consistency during normal operation. A pragmatic middle ground.

Why this matters for daily decisions: Every time you choose eventual consistency over strong consistency for a specific operation, you're choosing lower latency over perfect freshness. PACELC makes you articulate that trade-off explicitly instead of accepting whatever the database does by default.

Idempotency and Exactly-Once Delivery

I'll state this plainly: exactly-once delivery in a distributed system is impossible.

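You can have at-most-once delivery (send and forget, so a message might get lost), or at-least-once delivery (keep retrying until acknowledged, so a message might be delivered multiple times). Exactly-once requires the sender to confirm the receiver processed the message. That confirmation is itself a message that can fail, and when it does, the sender cannot tell whether the message was processed or lost. It must either retry and risk a duplicate, or give up and risk a loss, which brings you back to the partition problem.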

What you can achieve is exactly-once processing via idempotency.

An operation is idempotent if performing it multiple times has the same effect as performing it once. If the same payment request arrives three times because of retries, an idempotent payment system charges the customer exactly once.

How to implement idempotency in practice:

Idempotency keys. The client generates a unique key for each logical operation and includes it with every request (including retries). The server checks: Have I seen this key? If yes, return the cached response. If no, process it and cache the response durably.
Database constraints. Use unique constraints to prevent duplicate inserts. A retry that tries to insert the same record gets a constraint violation, which you catch and treat as confirmation that the original write succeeded rather than as an error.
State machines. Model operations as allowable state transitions. An order can transition from "pending" to "confirmed" but never from "confirmed" to "confirmed." The transition itself is idempotent because the system checks current state before applying the change.
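The state-machine approach can be sketched as follows. The set of allowed transitions and the `apply_transition` helper are hypothetical, and the order is a plain dict for brevity. In a real service the check and the update would need to happen atomically, for example in a single conditional database update.

```python
# Hypothetical transition table for an order.
ALLOWED = {
    ("pending", "confirmed"),
    ("pending", "cancelled"),
    ("confirmed", "shipped"),
}


def apply_transition(order: dict, target: str) -> bool:
    """Return True if the state changed, False if the request was a no-op."""
    current = order["status"]
    if current == target:
        # Duplicate delivery of a transition that already happened: ignore it.
        return False
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition {current} -> {target}")
    order["status"] = target
    return True
```

A retried "confirm" on an already confirmed order returns False and changes nothing. One limitation of this sketch: a late duplicate that arrives after the order has moved on to "shipped" is rejected as illegal, and you would have to decide whether that should be an error or a silent no-op.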

The subtlety most engineers miss: Idempotency is not just about the happy path. What if the server processes the request, writes to the database, then crashes before sending the response? The client retries. Now you need the server to recognise this as a retry of a completed operation. This is why idempotency keys must be stored durably — not in memory, not in a cache with a 5-minute TTL.
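A minimal sketch of durable idempotency keys, using Python's built-in `sqlite3` module as a stand-in for whatever database you actually run. The table names, column names, and the `charge` function are assumptions for illustration. The important property is that the business write and the key are committed in the same transaction.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS idempotency ("
        " idem_key TEXT PRIMARY KEY,"
        " response TEXT NOT NULL)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS charges ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " amount_cents INTEGER NOT NULL)"
    )
    db.commit()
    return db


def charge(db: sqlite3.Connection, key: str, amount_cents: int) -> str:
    row = db.execute(
        "SELECT response FROM idempotency WHERE idem_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # retry of a completed operation: replay the response

    # The charge and the key are written in one transaction, so a crash
    # leaves either both or neither on disk.
    with db:
        cur = db.execute(
            "INSERT INTO charges (amount_cents) VALUES (?)", (amount_cents,)
        )
        response = f"charge:{cur.lastrowid}"
        db.execute(
            "INSERT INTO idempotency (idem_key, response) VALUES (?, ?)",
            (key, response),
        )
    return response
```

Opened on a file path rather than in memory, this survives the crash-before-response scenario: the retry finds the key and gets the stored response back. The primary key on `idem_key` also covers two concurrent requests with the same key, because the second transaction fails with an integrity error and its charge is rolled back.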

A Practical Decision Framework

After years of making these trade-offs in production, here's the framework I use. It's not perfect, but it stops teams from over-thinking and under-deciding.

Step 1: Classify the operation.

Is a stale read dangerous (financial, inventory, access control)? Strong consistency.
Is a stale read annoying but harmless (feed, notifications, analytics)? Eventual consistency is fine.
Is the ordering of related operations important but global ordering isn't? Causal consistency.

Step 2: Check your latency budget. Strong consistency adds round-trip latency to writes (and sometimes reads). If your SLA requires sub-50ms responses and replicas are geographically distributed, strong consistency might be physically impossible under that constraint. Do the maths before making commitments.
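"Doing the maths" can be as simple as the sketch below. It assumes light in optical fibre covers roughly 200 km per millisecond (about two-thirds of its speed in a vacuum) and uses an approximate London to New York great-circle distance. Both figures are round numbers for illustration, and real paths are longer than the great circle.

```python
FIBRE_KM_PER_MS = 200.0  # assumed: light in fibre covers roughly 200 km per ms


def min_round_trip_ms(distance_km: float) -> float:
    # Lower bound only: ignores routing detours, queueing, and processing time.
    return 2 * distance_km / FIBRE_KM_PER_MS


def fits_budget(distance_km: float, budget_ms: float, round_trips: int = 1) -> bool:
    return round_trips * min_round_trip_ms(distance_km) <= budget_ms


LONDON_NEW_YORK_KM = 5_570  # approximate great-circle distance
```

Under these assumptions a single round trip between the two cities takes at least about 56 ms, so `fits_budget(LONDON_NEW_YORK_KM, 50)` is false before the database has done any work at all. A write that needs acknowledgement from the far replica cannot meet a 50 ms SLA.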

Step 3: Design for partition from day one. Don't treat partition handling as a future concern. Decide now: when the network breaks, does this service block or serve stale data? Document it. Make sure your on-call engineers know the expected behaviour so they don't "fix" the correct handling during an incident.

Step 4: Make idempotency non-negotiable. Every write operation that crosses a network boundary should be idempotent. The cost upfront is small. The cost of debugging duplicate charges or phantom records at 2am is enormous.

Step 5: Choose the simplest model that meets the requirements. I've seen teams adopt Cassandra's tunable consistency for a system that processes fifty writes per second. A single PostgreSQL instance would have handled that workload for years. Don't architect for problems you don't have. Start with strong consistency (it's easier to reason about), and relax to eventual consistency only when you hit a measurable wall.

Key Takeaways

CAP is a per-operation decision, not a system-wide one. Your payment ledger and your notifications counter need completely different consistency guarantees. Design accordingly.
Eventual consistency is not "broken consistency." It's a deliberate trade-off valid for the majority of operations in most systems. Learn to recognise which operations genuinely require stronger guarantees.
PACELC is more useful than CAP for daily decisions. CAP only describes failure scenarios. PACELC captures the latency-vs-consistency trade-off you're making on every single request in normal operation.
Exactly-once delivery is a myth; exactly-once processing is engineering. Build idempotency into every write path from the start. Idempotency keys, database constraints, and state machines are your tools.
Consensus protocols underpin every distributed database you use — understand what they do conceptually so you can reason about leader failover, quorum requirements, and why odd node counts matter.
Always plan for partition reconciliation, not just partition detection. The partition will heal. When it does, you have two divergent states. Design the merge strategy before you need it, not during the incident.
Start with the simplest model that works. Strong consistency is easier to reason about and more forgiving of design errors. Relax to eventual consistency when the load data justifies it — not when the architecture diagram looks more sophisticated.
