Article 7 of 8

System Design for Engineering Leaders

How system design thinking changes when you lead teams rather than write code.

12 min · Advanced
Key Takeaway

When you become a leader, your design surface shifts from technical systems to the organisations that build them. Conway's Law isn't a suggestion: your team structure produces your architecture whether you plan it or not. This article covers the leadership lens for build-vs-buy decisions, why technical debt is an organisational symptom rather than a code problem, when to create platform teams, how to run architecture reviews that teach rather than gate, and what to look for when evaluating design proposals from your team.


The first system I designed as a staff engineer was a real-time analytics pipeline.

I obsessed over Kafka partition strategies, exactly-once semantics, and the perfect schema evolution plan. It was technically elegant. I was proud of every decision in it.

It failed anyway — not because the architecture was wrong, but because three different teams had overlapping ownership of the pipeline stages. Nobody knew who was on-call for the ingestion layer. Two teams had independently duplicated transformation logic because they didn't know the other team was building the same thing. The system was sound. The organisation around it was not.

That was the moment I understood: when you become a leader, you stop designing systems. You start designing the organisations that build systems.


The Shift: From Systems to Organisations

As an individual contributor, your design surface is technical. You think about data flow, latency, consistency, fault tolerance. These skills don't disappear when you become a leader — but they become insufficient.

Your new design surface is people, teams, and incentive structures. The questions change:

IC question | Leader question
How should data flow between these services? | How should responsibility flow between these teams?
What happens when this service fails? | What happens when this team is overwhelmed, understaffed, or misaligned?
Where should we put the cache? | Who owns the cache, and what happens when the owner leaves?
How do we handle schema migrations? | Which team is accountable when a migration goes wrong?

This is not a demotion from technical work. It's a higher-order design problem. You're designing the system that produces systems.


Conway's Law Is Not a Suggestion

You've heard it before: "Any organisation that designs a system will produce a design whose structure is a copy of the organisation's communication structure."

Most engineers treat Conway's Law as an interesting observation. Engineering leaders need to treat it as a predictive force.

If you have three frontend teams and one backend team, you will get three frontend apps talking to one monolithic API — regardless of what your architecture diagram says. If your platform team is in a different timezone from your product teams, the platform will evolve slowly and product teams will build their own workarounds. If two teams need to collaborate on a shared service but report to different VPs who don't talk regularly, that service will become a coordination bottleneck.

The practical implication: before you draw an architecture diagram, draw your org chart. If the architecture you want doesn't match the team structure you have, you have two options: change the architecture to match the teams, or change the teams to match the desired architecture. Hoping that skilled engineers will somehow overcome structural misalignment is not a strategy — it's wishful thinking, and I've watched it fail repeatedly.
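One way to make that comparison concrete is to overlay team ownership on the service dependency graph and list every dependency that crosses a team boundary. The following is a minimal sketch, not a prescribed tool: the service names, team names, and the helpers `cross_team_edges` and `inbound_load` are all hypothetical, chosen to mirror the "three frontend teams, one backend team" example.

```python
from collections import Counter

# Hypothetical ownership map: service -> owning team.
OWNERS = {
    "web-app": "frontend-a",
    "mobile-app": "frontend-b",
    "admin-app": "frontend-c",
    "api": "backend",
}

# Hypothetical dependency edges: (caller, callee).
DEPENDENCIES = [
    ("web-app", "api"),
    ("mobile-app", "api"),
    ("admin-app", "api"),
]


def cross_team_edges(owners, dependencies):
    """Return the dependencies whose caller and callee belong to different teams."""
    return [
        (caller, callee)
        for caller, callee in dependencies
        if owners[caller] != owners[callee]
    ]


def inbound_load(owners, dependencies):
    """Count how many cross-team dependencies land on each owning team."""
    crossing = cross_team_edges(owners, dependencies)
    return Counter(owners[callee] for _, callee in crossing)


if __name__ == "__main__":
    for team, count in inbound_load(OWNERS, DEPENDENCIES).most_common():
        print(f"{team}: {count} inbound cross-team dependencies")
```

A team that collects most of the inbound edges is a likely coordination bottleneck, which is the structural signal to look at before debating the diagram.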

A company I worked with wanted a unified data platform but had data engineers scattered across four product teams with no shared reporting line. Each team built their own pipelines, their own schemas, their own quality checks. Two years and millions of dollars in redundancy later, they created a centralised data team. That's what the desired architecture had required from the beginning. The two years of distributed effort was Conway's Law operating exactly as described.


Build vs Buy: The Strategic Lens

Engineers default to building. We're trained to do it, and honestly, building is more intellectually satisfying than evaluating vendors. But leaders need a different framework.

The engineer's lens: Can we build this? Is the third-party offering good enough? Will it handle our edge cases?

The leader's lens: Should our team's finite engineering hours go toward this problem? Is this a differentiator for the business, or is it commodity infrastructure?

My rule of thumb: build what makes you unique, buy what makes you normal.

Authentication, payment processing, email delivery, observability tooling, feature flagging — unless these are your actual product, buy them. Your competitive advantage is not a slightly better internal auth system. The hidden cost of building is never just the initial development — it's the ongoing maintenance, the on-call rotation, the knowledge concentration (when the one person who understands it leaves), and the opportunity cost of what that team wasn't building instead.

A useful forcing function: when your team pitches building something, ask "If we had to hire a dedicated engineer to maintain this for five years, would it still be worth building?" In my experience, that question redirects about 60% of build proposals toward better alternatives, and it should.
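The five-year question can be turned into rough arithmetic. The sketch below is only an illustration: every figure in it is invented, the function names are hypothetical, and it deliberately leaves out opportunity cost and knowledge concentration, which are harder to price and usually push further toward buying.

```python
def five_year_build_cost(initial_build, annual_maintainer_cost, years=5):
    """Initial development plus one dedicated maintainer for the whole period."""
    return initial_build + annual_maintainer_cost * years


def five_year_buy_cost(integration_cost, annual_licence, years=5):
    """One-off integration work plus the vendor licence for the whole period."""
    return integration_cost + annual_licence * years


if __name__ == "__main__":
    # All amounts are made-up placeholders; substitute your own estimates.
    build = five_year_build_cost(initial_build=150_000, annual_maintainer_cost=180_000)
    buy = five_year_buy_cost(integration_cost=30_000, annual_licence=60_000)
    print(f"build: {build:,}  buy: {buy:,}  difference: {build - buy:,}")
```

With these placeholder numbers the maintainer, not the initial build, dominates the total, which is the point of asking the question in five-year terms.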


Technical Debt Is an Organisational Problem

Unpopular opinion: most technical debt isn't caused by bad engineering. It's caused by bad organisational decisions.

Rushed deadlines that skip design reviews. Reorgs that leave services without clear owners. Teams that grow faster than their onboarding can handle, leading to inconsistent patterns across services. Knowledge silos where one engineer understands the critical path and everyone else works around it, afraid to touch it.

When leaders treat technical debt as a code problem, they allocate "20% time for refactoring" and wonder why nothing changes. The code is a symptom. The root causes are usually:

  • Missing ownership: No team clearly owns the component, so everyone patches it and nobody fundamentally fixes it.
  • Misaligned incentives: Teams are measured on feature delivery velocity, not system health or code quality.
  • Knowledge loss: The original designer left, and nobody understands the constraints that forced the current design.
  • Deferred hard decisions: Leadership avoided a difficult conversation about deprecating a system, so teams built on top of something that should have been replaced two years ago.

Fix the organisational problem. The code fixes follow naturally when the right team owns the right system with the right incentives.

I've seen companies spend entire sprints on technical debt "cleanup weeks" and emerge with marginally tidier code and all the same structural problems intact. The organisational root cause was never addressed.


Designing for Team Autonomy

The most scalable architecture is one where teams can ship independently. This isn't just a technical statement — it's an organisational design principle.

Service ownership means each service has exactly one team responsible for its development, deployment, and operations. Not "shared ownership." Not "the platform team handles deploys for everyone." One team, end to end. When something breaks at 3am, there's no ambiguity about who's responsible.
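The "exactly one team" rule is easy to check mechanically if you keep an ownership registry. This is a minimal sketch assuming such a registry exists as a mapping from service name to the set of teams that claim it; the registry contents and the `ownership_problems` helper are hypothetical.

```python
def ownership_problems(service_owners):
    """Return {service: problem} for every service without exactly one owner."""
    problems = {}
    for service, teams in service_owners.items():
        if len(teams) == 0:
            problems[service] = "no owner"
        elif len(teams) > 1:
            problems[service] = "shared ownership: " + ", ".join(sorted(teams))
    return problems


if __name__ == "__main__":
    # Hypothetical registry for illustration only.
    registry = {
        "ingestion": {"team-a", "team-b", "team-c"},
        "billing": {"payments"},
        "legacy-reports": set(),
    }
    for service, problem in ownership_problems(registry).items():
        print(f"{service}: {problem}")
```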

Clear interfaces between teams matter more than clear interfaces between services. If Team A needs to file a ticket and wait three sprints for Team B to expose a new API field, your system architecture is bottlenecked regardless of how cleanly the underlying services are designed.

The questions I ask when evaluating team autonomy:

  • Can a team deploy to production without coordinating with another team?
  • Can a team change their internal data model without breaking another team's service?
  • Can a team run an experiment or A/B test without a cross-team planning meeting?
  • Does each team own their own data store, or do multiple teams share databases?

If the answers are mostly "no," you don't have a microservices architecture. You have a distributed monolith with all the operational complexity and none of the autonomy benefits.
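The four questions above can be recorded per team and tallied. The sketch below is one possible encoding, under two assumptions of mine: each question is phrased so that `True` is the healthy answer, and "mostly no" means more than half of the answers are `False`. The class and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class AutonomyAnswers:
    """One team's answers; True is the healthy answer for every field."""
    deploys_without_coordination: bool
    changes_data_model_without_breaking_others: bool
    experiments_without_cross_team_meeting: bool
    owns_its_data_store: bool


def looks_like_distributed_monolith(answers):
    """Assumption: 'mostly no' means strictly more than half the answers are False."""
    values = [getattr(answers, f.name) for f in fields(answers)]
    return values.count(False) > len(values) / 2


if __name__ == "__main__":
    team = AutonomyAnswers(False, False, True, False)
    print(looks_like_distributed_monolith(team))
```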


Architecture Reviews as Teaching Tools

Many organisations treat architecture reviews as gates — a senior engineer or architecture board reviews proposals and either approves or rejects them. This is the least valuable use of architecture reviews.

Architecture reviews should be teaching moments, not checkpoints. The goal isn't to catch bad designs. The goal is to develop engineers who make better design decisions independently.

Here's how I run them:

  1. The proposer presents the design with explicit trade-offs — not just the solution, but what was traded away. If a proposal only lists advantages, the engineer hasn't gone deep enough. Ask: "What's the worst thing about this approach?"

  2. Reviewers ask questions rather than give answers. "What happens when this queue backs up?" is better coaching than "You need a dead letter queue here." The former develops thinking; the latter provides an answer without developing judgment.

  3. Focus on the reasoning, not the diagram. A mediocre architecture with clear reasoning can be improved incrementally. A beautiful architecture with no articulated reasoning will collapse when requirements change.

  4. Document the decision context, not just the decision. Why did we choose PostgreSQL over DynamoDB? What assumptions would invalidate this design in 18 months? This context is worth more than the diagram at the top of the document.
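Step 4 is easier to enforce when the decision record has explicit fields for context and assumptions. The following is a sketch of one possible shape, not a standard format: the `DecisionRecord` class, its field names, and the 18-month default review interval (borrowed from the question above) are my assumptions, and the example values are placeholders.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class DecisionRecord:
    title: str
    decision: str
    context: str                         # constraints that shaped the choice
    alternatives_rejected: dict          # option -> why it was not chosen
    invalidating_assumptions: list       # what would make us revisit this
    decided_on: date
    review_after_months: int = 18

    def review_due(self):
        """Date on which the assumptions should be re-checked."""
        total = self.decided_on.year * 12 + (self.decided_on.month - 1)
        total += self.review_after_months
        year, month_index = divmod(total, 12)
        month = month_index + 1
        # Clamp the day so that e.g. the 31st maps onto shorter months.
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, min(self.decided_on.day, last_day))


if __name__ == "__main__":
    record = DecisionRecord(
        title="Primary data store",
        decision="PostgreSQL",
        context="(placeholder) constraints the team was working within",
        alternatives_rejected={"DynamoDB": "(placeholder) reason recorded by the team"},
        invalidating_assumptions=["(placeholder) assumption that would change the answer"],
        decided_on=date(2024, 8, 31),
    )
    print(record.review_due())
```

A record with an empty `context` or an empty `invalidating_assumptions` list is a useful prompt for the reviewer's next question.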

When architecture reviews are used as teaching tools, you build a team that makes better decisions autonomously. When they're gatekeeping mechanisms, you build a team that waits for approval and stops developing judgment.


Platform Teams: When and Why

The question "should we have a platform team?" arises at every growing engineering organisation. The answer depends entirely on timing.

You need a platform team when: multiple product teams are independently solving the same infrastructure problems — deployment pipelines, observability, service discovery, feature flagging — and the divergence in their solutions is creating more operational burden than it's saving. The pattern of duplicated infrastructure is the signal.

You don't need a platform team when: you have fewer than four or five product teams, or the infrastructure problems are genuinely different across teams.

The biggest mistake: creating a platform team too early, before you understand what the platform should actually do. A premature platform team will build what they think product teams need rather than what product teams actually need. Let the patterns emerge from real pain first.

The litmus test: if product engineers are spending more than 20% of their time on infrastructure work that doesn't differentiate the product, it's time for a platform team. If they're spending around 5%, let them own it.
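As a rough sketch, the litmus test reduces to a ratio and two thresholds. The 20% and 5% figures come from the text; the function name, the hour-based inputs, and the handling of the range in between (which the rule of thumb does not cover) are my assumptions.

```python
def platform_team_signal(infra_hours, total_hours):
    """Classify the share of engineering time spent on undifferentiated infrastructure."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    share = infra_hours / total_hours
    if share > 0.20:
        return "consider a platform team"
    if share <= 0.05:
        return "let product teams own it"
    # Assumption: the rule of thumb is silent here, so treat it as a watch zone.
    return "no clear signal yet"


if __name__ == "__main__":
    # Hypothetical week: 10 of 40 hours spent on pipelines and tooling.
    print(platform_team_signal(infra_hours=10, total_hours=40))
```

The hard part in practice is measuring `infra_hours` honestly; a survey or a few weeks of tagged tickets is usually enough for a first estimate.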

And critically: a platform team's customer is the product engineering team. If your platform team measures success by the uptime of their own tools rather than the productivity of product teams, they've lost the plot.


Evaluating Design Proposals From Your Team

As a leader, you'll spend more time reviewing designs than creating them. Here's what I look for:

Constraints before solutions. Has the designer clearly stated the constraints they're working within? Traffic volume, latency requirements, team size, timeline. A design without explicit constraints is fiction — it can be made to work under any conditions on paper.

Trade-offs, not just benefits. Every design decision trades something. If the proposal only lists advantages, the designer hasn't gone deep enough. Ask what you're giving up.

Operational reality. Who deploys this? Who gets paged at 2am? How do we know it's working? How do we debug it when it isn't? Designs that ignore operational concerns are designs that create incidents.

Reversibility. How hard is it to change this decision in 18 months? Prefer reversible decisions and make them quickly. Invest analysis time only in the irreversible ones — data model choices, public API contracts, vendor lock-in.

Incremental delivery. Can a simpler version be shipped first and iterated on? Any proposal that requires six months before delivering value is a proposal that will be cancelled at month four when priorities shift.
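The five criteria above can double as a lightweight pre-review checklist. This is a minimal sketch assuming proposals are captured as a mapping from section name to free text; the section keys and the `missing_sections` helper are hypothetical, and an empty result only means the sections exist, not that they are good.

```python
# Section keys mirror the five review criteria; the prompts are paraphrased.
REVIEW_CRITERIA = {
    "constraints": "Traffic volume, latency requirements, team size, timeline",
    "trade_offs": "What is being given up",
    "operations": "Who deploys, who gets paged, how it is debugged",
    "reversibility": "How hard this is to change in 18 months",
    "incremental_delivery": "What simpler version can ship first",
}


def missing_sections(proposal):
    """Return the criteria the proposal leaves blank, in checklist order."""
    return [
        name
        for name in REVIEW_CRITERIA
        if not proposal.get(name, "").strip()
    ]


if __name__ == "__main__":
    # Hypothetical draft that covers only two of the five criteria.
    draft = {"constraints": "p99 under 200 ms, team of four", "trade_offs": "   "}
    for name in missing_sections(draft):
        print(f"missing: {name} ({REVIEW_CRITERIA[name]})")
```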


Key Takeaways

  • When you become a leader, your design surface shifts from technical systems to the organisations that build them. The quality of your org design determines the quality of your system design.
  • Conway's Law is a predictive force. If your team structure doesn't match your desired architecture, change one of them. Engineers cannot overcome structural misalignment through individual effort alone.
  • Build what differentiates you; buy everything else. Every custom system you build is a system you maintain, staff, and on-call forever.
  • Technical debt is usually an organisational symptom — missing ownership, misaligned incentives, or deferred hard decisions. Fix the org problem first and the code will follow.
  • Design for team autonomy by ensuring teams can deploy, change, and experiment independently. Shared databases and cross-team deploy coordination are architectural red flags.
  • Architecture reviews are teaching tools, not gates. Focus on developing judgment, not on catching failures. Invest in the reasoning, not the diagram.
  • Create platform teams when the pattern of duplicated infrastructure pain is clear — not before. Premature platform teams build the wrong thing because the right thing hasn't revealed itself yet.
  • When evaluating proposals, look for explicit constraints, honest trade-offs, operational plans, and incrementally deliverable value.