The Engineering Manager's Guide to Cloud Cost Optimisation

Cloud cost sprawl is a management problem, not a technical one. The engineers on your team know how to right-size instances, use spot fleets, and leverage Reserved Instances. The harder question is: why hasn’t it happened yet?

In my experience managing infrastructure teams at AWS and leading engineering organisations through cost reduction programmes, the answer is almost always the same: no one owns it.

The Ownership Problem

Cloud costs in most engineering organisations are tracked at the company level (“our AWS bill is $2M/month”), reported to finance, and vaguely attributed to teams through tagging — which is usually incomplete. Individual engineers don’t see the bill. Engineering managers don’t have budget accountability.

This is structurally broken. Without direct accountability, no rational engineer will trade feature velocity for cost savings. The incentive system works against you.

Fix it first. Before any technical optimisation effort:

Implement proper cost allocation tags at the team/service level
Give engineering managers visibility into their team’s spend
Create a lightweight “efficiency metric” alongside velocity metrics in your engineering scorecard
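
As a rough illustration of the first two steps, the sketch below rolls up billing line items by a team tag and reports how much spend nobody owns. The record shape, the `team` tag key, and the sample figures are all assumptions made for illustration; they do not reflect any provider's export format.

```python
from collections import defaultdict

# Hypothetical billing line items. Real billing exports carry many more
# columns; the "team" tag key is an assumed tagging convention.
LINE_ITEMS = [
    {"service": "compute", "cost": 1200.0, "tags": {"team": "payments"}},
    {"service": "storage", "cost": 300.0, "tags": {"team": "search"}},
    {"service": "compute", "cost": 500.0, "tags": {}},
    {"service": "database", "cost": 800.0, "tags": {"team": "payments"}},
]


def spend_by_team(line_items, tag_key="team"):
    """Sum cost per team tag; items without the tag land under 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


def untagged_share(totals):
    """Fraction of total spend that no team owns."""
    total = sum(totals.values())
    return totals.get("untagged", 0.0) / total if total else 0.0


totals = spend_by_team(LINE_ITEMS)
for team, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{team:<10} ${cost:,.2f}")
print(f"untagged share: {untagged_share(totals):.1%}")
```

The untagged share is worth watching on its own: until it is small, per-team numbers understate what each team really spends.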

The Three Levers

Once you have ownership, the actual optimisation falls into three categories, in descending order of ROI:

1. Right-sizing (highest ROI, lowest effort)

Most cloud workloads are severely over-provisioned. The conservative instinct (“I’d rather have too much than too little”) is rational for individual engineers but catastrophic at scale.

How to find candidates:

CPU utilisation < 20% average over 30 days
Memory utilisation < 30% average over 30 days
Network < 10% of provisioned bandwidth

AWS Compute Optimizer and the equivalent tools on GCP/Azure generate these reports automatically. The work is in the review process and approval, not the tooling.
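
If you want to sanity-check a vendor report, or apply the same thresholds to your own monitoring data, the filter is simple to express. This is a minimal sketch: the `InstanceStats` fields and instance names are hypothetical, and it assumes an instance must fall under all three thresholds to count as a candidate, which is the more conservative reading of the list above.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    """30-day averages for one instance (field names are illustrative)."""
    instance_id: str
    cpu_avg_pct: float
    mem_avg_pct: float
    net_pct_of_provisioned: float


# Thresholds from the candidate criteria in the text.
CPU_MAX_PCT = 20.0
MEM_MAX_PCT = 30.0
NET_MAX_PCT = 10.0


def rightsizing_candidates(fleet):
    """Return instances that sit under every threshold (assumed: all three)."""
    return [
        s for s in fleet
        if s.cpu_avg_pct < CPU_MAX_PCT
        and s.mem_avg_pct < MEM_MAX_PCT
        and s.net_pct_of_provisioned < NET_MAX_PCT
    ]


# Made-up sample data for illustration only.
fleet = [
    InstanceStats("web-1", cpu_avg_pct=8.0, mem_avg_pct=22.0, net_pct_of_provisioned=3.0),
    InstanceStats("cache-1", cpu_avg_pct=12.0, mem_avg_pct=71.0, net_pct_of_provisioned=4.0),
    InstanceStats("batch-1", cpu_avg_pct=55.0, mem_avg_pct=60.0, net_pct_of_provisioned=35.0),
]

candidates = rightsizing_candidates(fleet)
for s in candidates:
    print(f"{s.instance_id}: review for a smaller instance size")
```

Note that `cache-1` is excluded despite its low CPU: it is memory-bound, which is exactly the kind of case the human review step exists to catch.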

Cadence: Monthly. Block 2 hours in your team’s sprint retrospective.

2. Commitment discounts (medium ROI, one-time effort)

Reserved Instances and Savings Plans (AWS), Committed Use Discounts (GCP), and Reserved VM Instances (Azure) typically deliver 30-60% savings on baseline compute in exchange for a one- or three-year commitment.

The mistake most teams make is purchasing commitments for current usage. The better approach:

Analyse your stable baseline (the load that runs 24/7 regardless of traffic)
Purchase commitments for that baseline only
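Handle the rest with on-demand capacity, or with Spot Instances (AWS), Spot/Preemptible VMs (GCP), or Spot VMs (Azure) where the workload tolerates interruption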

Key insight: Your cloud provider’s cost tooling will recommend commitment purchases based on current spending. It’s generally more aggressive than you should be — start at 70% of what they recommend.
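
One way to turn this into a number is sketched below. It assumes you have hourly usage samples in a single unit (vCPUs here), treats a low percentile of those samples as the "stable baseline", and then takes the smaller of that baseline and 70% of the provider's recommendation. The percentile choice, the function names, and the sample data are my assumptions for illustration, not a provider formula.

```python
import math


def stable_baseline(hourly_usage, percentile=5):
    """Nearest-rank low percentile of hourly usage.

    A low percentile rather than the strict minimum is an assumption:
    it ignores a handful of outage or maintenance hours.
    """
    if not hourly_usage:
        raise ValueError("need at least one usage sample")
    ordered = sorted(hourly_usage)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def commitment_target(hourly_usage, provider_recommendation, haircut=0.70):
    """Commit to the smaller of the baseline and 70% of the provider's figure."""
    return min(stable_baseline(hourly_usage), haircut * provider_recommendation)


# Made-up day of usage in vCPUs: quiet overnight, busy daytime, moderate evening.
HOURLY_VCPUS = [40] * 8 + [100] * 12 + [60] * 4

target = commitment_target(HOURLY_VCPUS, provider_recommendation=70)
print(f"commit to about {target} vCPUs")
```

In practice you would feed this several weeks of data rather than one day, and revisit the figure whenever a commitment comes up for renewal.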

3. Architectural changes (highest impact, most effort)

Once you’ve captured the easy wins, architectural changes are where the real gains live:

Serverless for intermittent workloads: Lambda/Cloud Functions for jobs that run infrequently have near-zero idle cost
Spot for fault-tolerant batch jobs: ML training, data processing, media encoding
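S3 Intelligent-Tiering / GCS Autoclass: Data that hasn’t been accessed for 30 days moves automatically to a cheaper access tier, where it costs up to roughly 40% less to store
CDN cache hit rate: Every cache miss is a compute cost. Improving the cache hit rate from 85% to 95% cuts the miss rate from 15% to 5%, so two-thirds of origin requests disappear, which can cut origin costs by 40%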
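
The cache arithmetic in the last item is easy to check. The sketch below computes how much origin traffic disappears when the hit rate improves, then scales that by the share of the origin bill that actually varies with request volume. That `variable_share` parameter is a hypothetical modelling assumption: part of most origin bills is fixed, so cost usually falls by less than traffic does.

```python
def origin_request_reduction(hit_rate_before, hit_rate_after):
    """Fractional drop in requests reaching the origin when the hit rate improves."""
    miss_before = 1.0 - hit_rate_before
    miss_after = 1.0 - hit_rate_after
    if miss_before <= 0:
        raise ValueError("hit rate before must be below 100%")
    return 1.0 - miss_after / miss_before


def origin_cost_reduction(request_reduction, variable_share):
    """Cost saving if only `variable_share` of origin cost scales with requests.

    `variable_share` is an assumed input; measure it for your own system.
    """
    return request_reduction * variable_share


reduction = origin_request_reduction(0.85, 0.95)
print(f"origin requests fall by {reduction:.0%}")  # prints: origin requests fall by 67%
print(f"origin cost falls by {origin_cost_reduction(reduction, 0.6):.0%}")
```

With an assumed 60% of origin cost scaling with requests, a two-thirds drop in origin traffic works out to a 40% saving.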

The Velocity Trap

The common objection to cost optimisation work is “we don’t have time — we’re shipping product.”

This is a false trade-off, but it feels real because optimisation work doesn’t appear in the sprint backlog. The fix is simple: allocate a fixed budget.

I recommend allocating 10-15% of engineering capacity to infrastructure excellence work, which includes cost optimisation, reliability improvements, and technical debt reduction. This isn’t negotiable sprint by sprint; it’s a structural allocation.

At AWS, we called this “undifferentiated heavy lifting” and treated it as a first-class engineering investment. Teams that protected this budget consistently outperformed those that didn’t, because they weren’t constantly firefighting.

Metrics That Matter

Track these monthly:

Metric	Target
Cost per unit of business value (requests served, users, transactions)	Trending down
Commitment coverage (% of spend on reserved capacity)	60-80%
Average CPU utilisation	> 40%
Spot/preemptible usage (% of total compute)	> 20% for applicable workloads
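
For teams that want to automate the monthly check, here is a minimal sketch of a scorecard built from these four metrics. The `MonthlySnapshot` fields and sample figures are hypothetical, requests served stands in for the unit of business value, and coverage and spot share are both measured against compute spend, which is my assumption about the denominators.

```python
from dataclasses import dataclass


@dataclass
class MonthlySnapshot:
    """One month of inputs (field names are illustrative)."""
    total_cost: float
    requests_served: int
    compute_spend: float
    committed_spend: float
    spot_spend: float
    avg_cpu_pct: float


def cost_per_million_requests(snapshot):
    """Unit cost, using requests as the assumed unit of business value."""
    return snapshot.total_cost / (snapshot.requests_served / 1_000_000)


def scorecard(current, previous):
    """Compare one month against the targets in the table above."""
    coverage = current.committed_spend / current.compute_spend
    spot_share = current.spot_spend / current.compute_spend
    return {
        "unit_cost_trending_down": (
            cost_per_million_requests(current) < cost_per_million_requests(previous)
        ),
        "commitment_coverage_ok": 0.60 <= coverage <= 0.80,
        "cpu_utilisation_ok": current.avg_cpu_pct > 40,
        "spot_usage_ok": spot_share > 0.20,
    }


# Made-up figures for two consecutive months.
previous = MonthlySnapshot(100_000, 400_000_000, 68_000, 45_000, 9_000, 38.0)
current = MonthlySnapshot(105_000, 500_000_000, 70_000, 49_000, 10_500, 43.0)

result = scorecard(current, previous)
for metric, ok in result.items():
    print(f"{metric:<26} {'on target' if ok else 'needs attention'}")
```

In this example the bill went up while unit cost went down, which is the outcome the first metric is designed to surface.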

Cloud cost optimisation is an engineering culture problem wearing a finance hat. Get the incentives right, measure the right things, and the technical work will follow.