Article 8 of 8
The System Design Mastery Checklist
A practical checklist to audit any system design — yours or someone else's.
After hundreds of design reviews, the failures I've seen came less from lack of knowledge than from inconsistent application of that knowledge under pressure. This checklist covers eight domains: requirements clarity, data model, API design, scalability, failure handling, observability, security, and operational readiness. Use it before any significant system goes to production, and you'll catch most of the problems that would otherwise find you at 3am. Print it. Apply it. Make it a habit.
In 2009, Atul Gawande published The Checklist Manifesto and fundamentally changed how I think about complex work. His argument was precise: even the most skilled professionals — surgeons, pilots, engineers — make avoidable mistakes not because they lack knowledge, but because they lack a systematic way to apply that knowledge under pressure.
I've been in hundreds of design reviews over my career. The reviews that consistently catch real problems share one property: they use a structured approach. Not rigid process that stifles thinking, but a reliable scaffold that ensures nothing critical gets skipped when you're moving fast and your mental energy is focused on the interesting design decisions.
This checklist is that scaffold. It brings together every major theme from this pathway (requirements, data design, API design, scalability, failure handling, observability, security, and operational readiness) into a single practical tool you can use to audit any system design. Yours, or someone else's.
Work through it systematically. The items you skip are almost always the items that cause your 3am incident.
Requirements Clarity
Before anything else, make sure the problem is actually defined. I've watched teams spend weeks building elaborate designs for the wrong problem.
- Functional requirements are explicit and prioritised. You know what the system must do versus what would be nice to have.
- Non-functional requirements use numbers, not adjectives. "Fast" is meaningless. "p99 latency under 200ms for the /checkout endpoint" is a requirement.
- Scale expectations are documented with rough estimates. Expected concurrent users, requests per second, data growth rate per month. Rough estimates are far better than unstated assumptions.
- Edge cases have been discussed: empty inputs, duplicate requests, concurrent modifications, zero-state initialisation.
- Scope boundary is explicit. You know what this system is not responsible for, and everyone agrees on that boundary.
Requirements clarity isn't bureaucracy. It's the foundation every subsequent decision rests on. Get this wrong and even a technically excellent design can deliver the wrong outcome.
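To make "numbers, not adjectives" concrete, here is a minimal sketch of checking the p99 requirement from the list above against measured latencies. The function names and sample data are hypothetical, and the nearest-rank method used here is only one of several common percentile definitions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Ceiling division without floats: smallest rank covering pct% of samples.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(samples_ms, pct=99, limit_ms=200):
    """True when the chosen percentile is strictly under the limit."""
    return percentile(samples_ms, pct) < limit_ms


# Hypothetical measurements: 98 fast requests and two slow outliers.
checkout_latencies_ms = [120] * 98 + [450, 900]
print(percentile(checkout_latencies_ms, 99))        # 450
print(meets_latency_target(checkout_latencies_ms))  # False
```

A mean over the same data would look healthy (about 131ms), which is why the requirement names a percentile rather than an average.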
Data Model & Storage
This is where most designs either succeed or quietly plant the seeds of future pain.
- Data ownership is clear. Every piece of data has exactly one authoritative source of truth. No ambiguity, no two-systems-with-the-same-data-in-conflict scenarios.
- Access patterns are understood before choosing technology. You know which queries are hot, which are rare, and which will grow over time.
- Consistency requirements are explicit per data type. Not everything needs strong consistency — but you've made a deliberate decision about what does and why.
- Storage technology matches the access pattern. Don't force a relational model on time-series data. Don't use a document store for data with complex relational integrity requirements.
- Data lifecycle is defined. How long is data retained? What triggers archival or deletion? Who verifies that expired data is actually deleted?
- Schema evolution has a plan. Your data model will change. Migrations without downtime require thinking about backwards compatibility and rollout strategy before you're mid-migration.
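One way to keep a migration backwards compatible is to have readers accept both the old and the new record shape while the rollout is in progress. The sketch below assumes a hypothetical user record whose single "name" field (v1) is being split into "first_name" and "last_name" (v2); the field names, the "schema_version" marker, and the naive name split are all illustrative assumptions.

```python
def read_display_name(record):
    """Return a display name from either schema version of a user record.

    Records without a "schema_version" field are treated as v1.
    """
    version = record.get("schema_version", 1)
    if version == 1:
        return record["name"]
    if version == 2:
        return f'{record["first_name"]} {record["last_name"]}'.strip()
    raise ValueError(f"unsupported schema_version: {version}")


def upgrade_to_v2(record):
    """Return a v2 copy of the record; v2 input is copied unchanged."""
    if record.get("schema_version", 1) == 2:
        return dict(record)
    # Naive split on the first space, purely for illustration.
    first, _, last = record["name"].partition(" ")
    upgraded = {k: v for k, v in record.items() if k != "name"}
    upgraded.update(schema_version=2, first_name=first, last_name=last)
    return upgraded


old = {"id": 7, "name": "Ada Lovelace"}
new = upgrade_to_v2(old)
print(read_display_name(old) == read_display_name(new))  # True
```

Because readers tolerate both versions, records can be upgraded lazily or in a background job, and the v1 branch can be removed once no v1 records remain.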
API Design
Your APIs are contracts with your consumers. Get them wrong and you pay the cost for years — you either break consumers or you maintain two versions of the same API simultaneously.
- API contracts are documented before implementation. Request/response shapes, status codes, error formats. Implementation without documentation produces undocumented behaviour that becomes a de facto spec.
- Versioning strategy is defined. How will the API evolve without breaking existing consumers? URL versioning, header versioning, API gateway routing — pick one and implement it from the start.
- Error handling is consistent and actionable. Clients should be able to programmatically distinguish between "retry this request" (503) and "fix your request" (400). Generic 500 errors with no detail are not helpful.
- Idempotency is built in where it matters. Any operation that modifies state should be safe to retry. Payment endpoints without idempotency keys are accidents waiting to happen.
- Pagination is designed for large result sets. Offset-based pagination skips or repeats rows under concurrent modification and performs poorly at large offsets. Cursor-based pagination is usually the better pattern for large or frequently changing datasets.
- Rate limiting protects the system from abuse and noisy neighbours. Every public-facing API should have it.
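The idempotency item above can be sketched as follows. This is a toy in-memory version under stated assumptions: `PaymentService`, its method, and the response shape are hypothetical, and a real service would store keys durably, make the check-and-store step atomic, and usually verify that a repeated key carries the same request payload.

```python
class PaymentService:
    """Toy in-memory service that deduplicates charges by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.charges = []     # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # A repeated key returns the stored response without charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"status": "charged", "charge_number": len(self.charges)}
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("key-1", "acct-42", 500)
retry = service.charge("key-1", "acct-42", 500)  # e.g. client timed out and retried
print(first == retry, len(service.charges))      # True 1
```

The client generates the key once per logical operation and reuses it on every retry, so a timeout followed by a retry cannot charge the account twice.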
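For the pagination item, a minimal sketch of the cursor approach, assuming rows are sorted by a unique ascending "id". The function name and row shape are hypothetical; the linear scan stands in for an indexed database query.

```python
def fetch_page(rows, after_id=None, limit=2):
    """Return (page, next_cursor) for rows sorted by ascending unique "id".

    The cursor is the last id seen, so the next page starts strictly after
    it even if rows were inserted or deleted in between.
    """
    start = 0
    if after_id is not None:
        # Linear scan for clarity; a database would use an index on id,
        # e.g. WHERE id > :after_id ORDER BY id LIMIT :limit.
        start = next(
            (i for i, row in enumerate(rows) if row["id"] > after_id),
            len(rows),
        )
    page = rows[start:start + limit]
    has_more = start + limit < len(rows)
    next_cursor = page[-1]["id"] if page and has_more else None
    return page, next_cursor


rows = [{"id": i} for i in (1, 2, 3, 5, 8)]
page, cursor = fetch_page(rows)                    # ids 1, 2; cursor 2
rows = rows[1:]                                    # id 1 deleted concurrently
page, cursor = fetch_page(rows, after_id=cursor)
print([row["id"] for row in page], cursor)         # [3, 5] 5
```

With offset pagination, the same deletion would shift every row up by one and the second page (offset 2) would silently skip id 3.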
Scalability
Designing for scale isn't about handling Netflix traffic on day one. It's about knowing which knobs to turn when growth arrives, rather than discovering you need to redesign the core architecture.
- Read/write ratio is understood. This single number drives more architectural decisions than almost anything else you'll identify.
- Caching strategy is defined: what gets cached, where (application layer, CDN, database result cache), for how long, and — critically — how it gets invalidated.
- Horizontal scaling path is clear. Can you add more instances to handle more load, or is there a stateful bottleneck that forces vertical scaling?
- Database scaling approach is identified before you hit the wall: read replicas, sharding, a different storage model, or partitioning. "We'll figure it out when we need to" is not a plan.
- Async processing is used for work that doesn't need to happen in the request path. If a task takes more than ~200ms and the caller doesn't need the result immediately, it should not block a synchronous response.
- Backpressure mechanisms exist. When load exceeds capacity, the system should degrade gracefully — queue saturation alerting, shedding non-critical work, serving cached results — not cascade into complete failure.
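As a rough illustration of the backpressure item, the sketch below uses a bounded queue that sheds non-critical work before it is completely full, keeping headroom for critical work. `BoundedWorkQueue`, its `soft_limit`, and the `critical` flag are hypothetical names; the version is single-threaded in spirit, since the depth check and the enqueue are not one atomic step.

```python
import queue


class BoundedWorkQueue:
    """Bounded queue that rejects work instead of growing without limit."""

    def __init__(self, capacity, soft_limit):
        self._queue = queue.Queue(maxsize=capacity)
        self._soft_limit = soft_limit  # depth at which non-critical work is shed
        self.shed_count = 0            # worth exporting as a metric and alerting on

    def submit(self, task, critical=False):
        """Enqueue a task; return False if it was shed."""
        if not critical and self._queue.qsize() >= self._soft_limit:
            self.shed_count += 1
            return False
        try:
            self._queue.put_nowait(task)
        except queue.Full:
            # Even critical work is rejected once the hard capacity is reached.
            self.shed_count += 1
            return False
        return True

    def take(self):
        """Remove and return the oldest task; raises queue.Empty if none."""
        return self._queue.get_nowait()
```

A caller that receives `False` can respond with a fast failure, such as an HTTP 503, or fall back to a cached result, rather than letting requests pile up until the whole service stalls.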
Failure Handling
This is where good designs become great designs. Every experienced engineer I respect treats failure as a first-class design concern, not an afterthought.
- Timeouts are set on every external call. No call should wait indefinitely. Ever. Pick a timeout. Justify it with your SLA math.
- Retries use exponential backoff with jitter. Immediate or fixed-interval retries synchronise clients into thundering herds that make outages significantly worse.
- Circuit breakers protect against cascading failures. When a dependency is clearly degraded, stop calling it — fail fast and spare your own thread pool.
- Graceful degradation paths exist. When a non-critical dependency fails, the core experience should remain functional, even if diminished.
- Dead letter queues capture failed async messages. Silent data loss is the worst kind of production bug — you don't know it's happening until a customer tells you weeks later.
- Idempotent processing ensures retries don't produce duplicate side effects. This is especially critical for payment processing and notification delivery.
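The retry item above can be sketched like this. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The function names and default values are assumptions, and the sketch retries on any exception, whereas real code should retry only errors known to be transient.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


def call_with_retries(operation, attempts=4, sleep=time.sleep, **backoff):
    """Call operation(); on failure, wait and retry up to `attempts` more times."""
    delays = backoff_delays(attempts, **backoff)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

The randomness is what breaks up the herd: clients that failed at the same moment no longer retry at the same moment. Passing `sleep` and `rng` as parameters is a design choice that lets tests run without real waiting.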
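A circuit breaker, as described in the list above, can be reduced to a small state machine. The sketch below is deliberately minimal and single-threaded: `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical, and production libraries typically add rolling failure windows and concurrency control.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call a dependency it considers unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None          # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of each holding a thread for the full timeout, which is what protects the caller's own capacity.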
Observability
You cannot operate what you cannot see. A system without observability is a system you're running on hope.
- Key metrics are defined and instrumented. At minimum: request rate, error rate, and request duration, i.e. latency (the RED method). These should be dashboarded before you go to production, not after the first incident.
- Structured logging captures enough context to debug issues without reproducing them locally. Correlation IDs should thread through the entire request lifecycle.
- Distributed tracing connects requests across service boundaries. Without it, debugging in a distributed system requires comparing timestamps across log files and guessing.
- Alerting is tied to SLOs, not arbitrary thresholds. Alert on what matters to users. Don't generate alert fatigue with thresholds that trigger on healthy variation.
- SLOs are defined for the service. You know what "good" looks like and can measure whether you're achieving it.
- Dashboards serve both real-time operations (what's happening right now?) and trend analysis (how are we trending over the past 30 days?). These are different dashboards for different audiences.
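For the structured logging item, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the field names are assumptions for illustration; the correlation ID is attached through the standard `extra` argument of the logging call.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated from the `extra` argument on the logging call.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(stream, name="checkout"):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
# In a service, the ID would come from the incoming request headers if present.
log.info("payment authorised", extra={"correlation_id": str(uuid.uuid4())})
print(stream.getvalue(), end="")
```

If every service logs the same `correlation_id` for a given request, a single search on that value reconstructs the request's path across services.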
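One common way to tie alerting to an SLO is to alert on how fast the error budget is being consumed, often called the burn rate. The sketch below assumes an availability SLO expressed as a fraction of successful requests; the function names and the paging threshold of 10 are illustrative assumptions, not recommended values.

```python
def burn_rate(errors, requests, slo_target):
    """How many times faster than sustainable the error budget is being spent.

    A value of 1.0 means the budget would be exactly used up at the end of
    the SLO period if the current error rate continued.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(errors, requests, slo_target=0.999, threshold=10.0):
    """Page only when the budget is burning much faster than sustainable."""
    return burn_rate(errors, requests, slo_target) >= threshold


# 50 failures in 10,000 requests against a 99.9% SLO: roughly 5x burn.
print(round(burn_rate(50, 10_000, 0.999), 2))  # 5.0
print(should_page(50, 10_000))                 # False
print(should_page(200, 10_000))                # True
```

Unlike a fixed threshold on raw error counts, this scales with traffic and stays quiet during variation that does not threaten the SLO.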
Security
Security is not a feature you bolt on after the system is working. It's a property of the design.
- Authentication is handled consistently at every entry point. No unauthenticated paths to sensitive resources.
- Authorisation is enforced at the service level, not just in the UI. Never trust that the client has already checked permissions. Always verify at the service boundary.
- Data in transit is encrypted. TLS everywhere. No exceptions, no internal network exemptions.
- Data at rest is encrypted where sensitivity warrants it. PII, credentials, financial data — all encrypted at rest.
- Secrets management uses a dedicated vault or secrets manager. No credentials in environment variables, config files, or source code.
- Input validation happens at every system boundary. Never assume upstream systems have already sanitised their data. Validate at every entry point independently.
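The authorisation and input validation items can be illustrated together. In this sketch the service function re-checks both, regardless of what the UI or an upstream gateway already did. The `Caller` type, the "support" role, the order shape, and the 200-character limit are all hypothetical.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Caller is authenticated but not allowed to perform the action."""


class InvalidInput(Exception):
    """Request data failed validation at this boundary."""


@dataclass(frozen=True)
class Caller:
    user_id: str
    roles: frozenset = frozenset()


def cancel_order(caller, order, reason):
    # Validate input here, regardless of what the client or gateway checked.
    if not isinstance(reason, str) or not (1 <= len(reason.strip()) <= 200):
        raise InvalidInput("reason must be 1 to 200 characters after trimming")
    # Authorise at the service boundary: the owner or support staff only.
    if caller.user_id != order["owner_id"] and "support" not in caller.roles:
        raise Forbidden("caller may not cancel this order")
    return {"order_id": order["id"], "status": "cancelled", "reason": reason.strip()}
```

A UI that hides the cancel button from other users is a convenience, not a control: anyone can call the endpoint directly, so the check has to live in the service.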
Operational Readiness
The best design in the world means nothing if you can't deploy, operate, and recover.
- Deployment strategy supports zero-downtime releases. Blue-green, canary, rolling deployments — pick one and implement it. "Stop the service, deploy, restart" is not a zero-downtime deployment.
- Rollback plan is tested and fast. If a deployment goes wrong, you need to be back to the previous version in minutes, not hours. A rollback plan that has never been tested is not a rollback plan.
- Runbooks exist for common failure scenarios. When the pager goes off at 3am, the on-call engineer needs a runbook, not a debugging session. What are the top five failure modes? Document them now, before they happen.
- Capacity planning is based on measured data, not hope. You know when you'll hit current limits and what the plan is when you do.
- Dependency inventory is maintained. You know every external service, database, third-party API, and shared resource your system depends on — including whether each one has SLA guarantees and what happens when they're breached.
- Disaster recovery is considered, tested, and documented. Backups exist, restoration has been tested end-to-end, and RTO/RPO are defined and agreed with stakeholders.
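For the capacity planning item, a back-of-the-envelope runway estimate can be computed from measured usage and growth. The sketch assumes steady compound monthly growth, which real systems rarely follow exactly, so treat the result as a planning signal rather than a forecast. The function name and figures are hypothetical.

```python
import math


def months_until_limit(current, limit, monthly_growth_rate):
    """Whole months until compound growth pushes `current` to `limit` or beyond.

    Returns 0 if the limit is already reached and None if there is no growth.
    """
    if current >= limit:
        return 0
    if monthly_growth_rate <= 0:
        return None  # the limit is never reached under this model
    return math.ceil(math.log(limit / current) / math.log(1 + monthly_growth_rate))


# Hypothetical: 400 GB used of a 1,000 GB volume, growing 10% per month.
print(months_until_limit(400, 1000, 0.10))  # 10
```

Knowing the runway is about ten months turns "we'll figure it out when we need to" into a dated item on the roadmap.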
Using This Checklist Effectively
This checklist is most valuable when used as a conversation tool, not a compliance exercise.
In design reviews: Walk through the sections before the detailed discussion. Missing items in the requirements section often reveal fundamental misalignments before significant design work is done.
When inheriting a system: Use it to audit what exists. Every unchecked item is a risk you now own. Prioritise the highest-impact gaps and address them systematically.
For your own designs: Go through it when you think you're done, not when you're starting. You'll catch the things you assumed rather than designed.
For career growth: Every senior engineer should be able to defend every checked item. "We checked the idempotency box — here's why and how" is a materially different answer than "we checked it because the checklist said to." Use this as a way to develop both your designs and your reasoning.
What Comes Next
You've completed the System Design Mastery pathway.
You started with mental models. Applied them to a real problem. Learned which patterns earn their complexity and which anti-patterns pretend to. Walked through a production-grade case study. Confronted the fundamental trade-offs of distributed systems. And understood how the design challenges change when you're leading the teams that build the systems.
This checklist is the synthesis — a practical tool that brings all of it together.
The next step is applying it. Pick a system you work with daily and audit it against this checklist. You'll find gaps. Some will be acceptable trade-offs. Others will be risks worth addressing. Either way, you're now thinking like a system designer who has seen what goes wrong — which is the most important shift there is.
Good luck. Build systems that survive production.