Article 2 of 6
Scaling Data: From Single Database to Distributed Architecture
The progression from a single database to a scalable data architecture — what each step costs and when to take it.
Every successful product starts with a single relational database, and this is the right call. The question is how to evolve that architecture as data volume and query complexity outgrow what a single node can handle cleanly. The stages — read replicas, caching, sharding, CQRS, domain-owned databases — each solve specific problems and each introduce their own complexity. Knowing which stage you actually need, and when to take the next step, saves years of premature architectural complexity.
The single Postgres instance is one of the best decisions you can make when starting a product. I say this without reservation. It's simple. It's consistent. It's transactional. The operational overhead is minimal, the tooling is excellent, and for the vast majority of early-stage products, it will never become a bottleneck in any way that matters.
The problem comes later. Not with the original decision — that was correct. The problem comes when a team mistakes the architecture that was right for 10k users as the architecture that should serve 1 million users. Or, equally costly, when a team mistakes early signals for a need to scale when the actual problem is a bad query.
After watching Indian fintech and e-commerce teams navigate data scaling across ranges from a few hundred thousand users to tens of millions, I've formed strong opinions about the progression. Not every team needs to go through every stage. Most need to go through the first two or three. The ones that jump to the advanced stages too early pay a painful complexity tax for years.
Let me walk through the stages honestly — what each solves, what it costs, and when you should actually consider it.
Stage 1: Single Primary Database (the correct starting point)
Your single relational database — Postgres, MySQL, whatever — handles reads, writes, analytics queries, and background jobs. This sounds dangerous. It isn't, until it is.
The first symptom that Stage 1 is starting to show strain: analytics queries blocking application queries. Your data team runs a report against the production database at 9 AM, and your product slows to a crawl for 15 minutes. This isn't an architecture problem yet. It's a missing read replica for a specific workload.
The second symptom: read query load exceeding your primary's comfortable capacity. Read-heavy workloads — product catalog pages, user profile queries, feed generation — can overwhelm a primary even when write volume is modest.
When you see either of these, you're ready for Stage 2.
Stage 2: Read Replicas
Adding a read replica means your primary database streams changes to one or more replica instances that can serve read queries. Writes still go to the primary. Reads can go to either.
In practice, this looks like:
Primary DB
├── Handles all writes (INSERT, UPDATE, DELETE)
├── Handles reads that require "read your own writes" consistency
└── Streams WAL (write-ahead log) to...
Read Replica 1
└── Handles analytics, reporting, dashboard queries
Read Replica 2
└── Handles application read traffic during peaksThis is enough to solve the analytics-vs-application conflict and to distribute read load. Most teams need this somewhere between 20k-100k DAU depending on their read/write ratio and query complexity.
The critical operational detail with read replicas is replication lag. The replica is slightly behind the primary — usually milliseconds, sometimes seconds under heavy write load, occasionally more during maintenance events. For most read workloads, this is fine. For anything where a user expects to see their own recent write immediately — confirming an order, seeing their profile update — you need to route those reads to the primary.
I've seen teams introduce subtle bugs by routing all reads to a replica without considering lag. The pattern I recommend: default to replica for reads, but mark specific queries as "primary reads" when the business logic requires consistency. Most application frameworks have a way to specify this at the connection level.
When read replicas aren't enough
Read replicas help with read capacity. They don't help with:
- Write throughput (all writes still go to one primary)
- Very large tables where even the replica can't serve queries fast enough
- Workloads where you need to query data in a shape very different from how you store it
When you hit those limits, you're looking at caching or query optimization before you look at anything architectural.
Stage 2.5: The Optimization Checkpoint
Before moving to caching or sharding, ruthlessly optimize your queries.
I've seen teams design elaborate caching layers to avoid a database query that, with a properly placed composite index, would have run in 2ms instead of 400ms. The caching layer introduced cache invalidation bugs, stale data issues, and significant operational overhead — all to work around a missing index.
The questions to answer before touching architecture:
- Are all your slow queries analyzed with
EXPLAIN ANALYZE? Is every join condition and WHERE clause predicate indexed appropriately? - Are you fetching more data than you need?
SELECT *on a table with 40 columns when you need 3 is expensive. - Are N+1 query patterns present in your ORM usage? Loading a list of 200 orders and then making a separate query per order to load the customer is a common hidden load that has nothing to do with scale.
- Are background jobs and analytics queries running during peak traffic hours?
Fix these first. The architectural changes below are expensive. Premature architecture is technical debt that costs more than premature optimization.
Stage 3: Caching Layers
When specific, expensive, frequently-read data can tolerate some staleness, a caching layer dramatically reduces database load. Redis is the standard tool. It's fast, it's reliable at scale, and the data structures it supports (strings, hashes, sorted sets, pub/sub) map well to a wide range of caching patterns.
The patterns I actually use, and when:
Simple key-value cache for computed values: Product catalog prices with promotions applied, user permission sets, configuration values that rarely change. Cache on first read, set a TTL of 5-60 minutes, accept eventual consistency.
Cache key: user:12345:permissions
Cache value: {"can_edit": true, "can_publish": false, ...}
TTL: 300 secondsWrite-through cache for user-facing data that must be consistent: When a user updates their profile, write to the database and invalidate (or update) the cache simultaneously. The next read from any server gets fresh data.
Sorted sets for feed/ranking use cases: If you're generating a ranked feed of content — most popular posts, recent activity — computing it from the database on every request is expensive. A sorted set in Redis with scores representing rank lets you maintain a pre-computed, constantly-updated ranking that reads in microseconds.
When caching creates more problems than it solves
Caching is genuinely one of the hardest problems in distributed systems, for a simple reason: you now have two copies of data, and they can disagree.
Cache invalidation bugs are subtle. The user updates their email address. Your application updates the database. Two servers out of ten are still serving the cached old email because cache invalidation is lazy and those servers haven't seen a new request yet. The user contacts support. Your support team sees the correct email in the database. The user is seeing the old one in the UI. Nobody can reproduce it.
This is the class of problem that caching introduces. Before adding a cache to any given data set, ask:
- What's the worst case if this data is stale for its full TTL?
- What events must trigger an explicit cache invalidation?
- What happens if the cache is completely empty (cold start)?
- What happens if Redis goes down?
If you can't answer all four clearly, your cache will create bugs that are harder to debug than the performance problem it solved.
Stage 4: Sharding
Sharding is horizontal partitioning of your data. Instead of one large table, you split the data across multiple database instances based on a partition key. Each shard holds a subset of the data. The application knows how to route a query to the right shard.
This is serious architectural complexity. I will be direct: most teams should not shard their own databases. The operational overhead is substantial, the failure modes are nasty, and managed databases or purpose-built distributed databases often provide a better trade-off.
But if you're at the scale where you need it, here's the taxonomy of sharding strategies:
Range sharding partitions by a range of values — for example, users with IDs 1-1,000,000 on shard 1, users with IDs 1,000,001-2,000,000 on shard 2. Simple to understand, but creates "hotspot" problems if recent data is more active than old data (all new users land on the same shard).
Hash sharding applies a hash function to the partition key and distributes data by hash bucket. More even distribution, but adding shards requires rehashing and migrating data — a painful operation.
Directory-based sharding maintains a lookup table that maps partition keys to shard locations. Maximum flexibility (you can move data between shards), but the directory becomes a bottleneck and a single point of failure if not handled carefully.
The operational complexity sharding introduces:
- Cross-shard queries (JOINs across shards) become application-level logic or don't happen at all
- Transactions that span shards require distributed transaction protocols
- Schema migrations must be applied to every shard
- The "correct" shard for a given piece of data must be deterministic and consistent
Before sharding, seriously evaluate: managed distributed SQL (CockroachDB, Google Spanner, PlanetScale) or NoSQL with native horizontal scaling (Cassandra, MongoDB with sharding). For most workloads, these provide horizontal scale without requiring you to write your own sharding logic.
The CQRS Pattern
Command Query Responsibility Segregation is an architectural pattern that separates the model used to handle writes (commands) from the model used to handle reads (queries). At the data layer, this means your write model is normalized for transactional integrity, and your read model is denormalized for query performance.
A concrete example from a payment processing system:
Write model (Command side):
payments table: id, user_id, amount, currency, status, created_at
payment_items table: id, payment_id, product_id, quantity, unit_priceNormalized, with foreign keys, designed for integrity during writes.
Read model (Query side):
payment_summaries materialized view or separate table:
id, user_id, total_amount, item_count, status, created_at,
user_name, user_email (denormalized from users table)Denormalized, designed for the dashboard that shows a user their payment history with zero joins.
CQRS at this level is simpler than the full event-sourcing version. You're just maintaining two representations of the same data — one optimized for writing, one for reading. The sync mechanism can be a database trigger, a background job, or an event system.
This pattern is worth implementing when your read and write requirements have diverged enough that a single model serves neither well.
Domain-Owned Databases
This is the prerequisite for microservices that most "microservices migration" guides gloss over. If multiple services share a single database, you don't have microservices — you have a distributed monolith.
The transition from shared database to domain-owned databases is the point where data architecture decisions become organizational decisions. It requires:
- Identifying domain boundaries — which tables belong to which domain? (Users? Payments? Catalog? Orders?)
- Eliminating cross-domain joins — all the places where domain A queries domain B's tables directly must be replaced with API calls or event-based data propagation
- Handling referential integrity across boundaries — you can no longer use foreign keys across domain databases; consistency is eventual, managed through events
This is genuinely hard work that takes months in a non-trivial system. I've seen teams attempt to do it in a rewrite weekend and create catastrophic data integrity issues that took another quarter to clean up.
The right approach is incremental: extract one domain at a time, maintaining dual-write and dual-read during transition, validating consistency between old and new stores before cutting over fully.
The Analytics/OLAP Split
Running analytics queries on your operational database is a scaling timebomb. I've seen it detonate more times than I can count.
The problem is that analytical queries (aggregations over large datasets, cross-table joins, historical trend calculations) are fundamentally different in character from operational queries (looking up a specific user, inserting a transaction, updating an order status). Analytical queries hold locks longer, consume more CPU, read more data, and benefit from different indexing strategies.
When your analytics team grows and starts running longer queries, or when your business starts requiring real-time dashboards, the operational database starts suffering. Peak query times for analytics align with peak usage hours for users. This isn't a coincidence — it's the same data being queried.
The solution is an OLAP store — a data warehouse designed specifically for analytical queries. Options range from managed services (BigQuery, Redshift, Snowflake) to self-hosted (ClickHouse). The operational database streams changes (via CDC — Change Data Capture — or batch export) to the OLAP store. Analytics queries hit the OLAP store. The operational database is protected.
When to make this split? When either:
- Your analytics queries are measurably impacting operational performance during peak hours, or
- Your analytics requirements are growing to the point where you're writing increasingly complex queries directly against production data
The second signal often comes before the first becomes a crisis.
Data Governance as Architecture Evolves
As data moves from one centralized database to a distributed architecture, the question of who owns what data becomes existential.
In a monolith with a shared database, the implicit answer is "everyone and no one." This becomes untenable at scale because nobody knows who to ask when a data quality issue appears, nobody knows who to notify when a schema changes, and nobody can assess the blast radius of a migration.
The data governance practices that scale:
- Explicit ownership per domain/table: every dataset has a named team owner responsible for its schema, quality, and change notifications
- Schema change protocols: breaking schema changes require advance notice to consumers (same as breaking API changes)
- Data catalog: at 20+ services or 50+ tables, a catalog that documents what data exists, where it lives, who owns it, and how to access it is worth its maintenance cost
This sounds bureaucratic. Done poorly, it is. Done well, it's the difference between a data architecture that supports growth and one that produces increasingly mysterious bugs at the worst possible times.
The Common Mistakes, Stated Plainly
Caching without an invalidation strategy. This is the most common. Teams add Redis to their application to make it faster, cache data that's returned frequently, and then spend the next year debugging subtle inconsistency bugs because the cache is never properly invalidated when the underlying data changes.
Sharding prematurely. I've worked with teams running 100k DAU on a system that was sharded at 10k DAU because a senior engineer read about sharding and decided they needed it. The operational complexity — six months to implement, constant maintenance burden, inability to run certain queries at all — was entirely unnecessary for their scale.
Building a data warehouse before having analytics requirements. Standing up Redshift or BigQuery, building ETL pipelines, and hiring a data engineer costs real money and real engineering time. Before you do any of it, validate that there is a specific set of analytical questions your business needs to answer that you cannot currently answer with your existing data.
Misusing the read replica for consistency-sensitive reads. Replica lag is real and variable. Users seeing stale data after their own write is a bad experience that erodes trust in the product.
Ignoring the analytics/operational split until it's causing incidents. By the time your analytics queries are impacting production, you're in a mode where the data warehouse needs to be built under pressure. Build it before the crisis.
The architecture progression from single database to distributed is not a linear path everyone must walk end-to-end. It's a menu of options, each appropriate to a specific class of problem at a specific scale. Your job is to know which option solves your actual current problem — not the problem you expect to have in three years.
Start simple. Stay simple until the data shows you that simple is no longer enough. Then take the smallest architectural step that buys you the headroom you need. Repeat.