Why Your CI/CD Pipeline Is Lying to You
A fast green build doesn't mean your software is production-ready. Here's what I learned building deployment systems at AWS scale — and how to build pipelines that actually tell the truth.
A green CI build means your tests passed. It does not mean your software is ready to deploy. This distinction seems obvious, but almost every engineering organisation I’ve worked with conflates the two — and pays for it in incidents.
At AWS, we ran deployment pipelines for systems that affected millions of customers. The cost of a bad deploy was measured in minutes of customer impact at global scale. Those stakes clarified something important: a good pipeline tells you when NOT to deploy, not just when your tests pass.
What Most Pipelines Actually Test
A typical CI/CD pipeline runs:
- Unit tests (fast, isolated, high coverage)
- Integration tests (slower, test component interactions)
- A build step
- Maybe a lint/static analysis step
This is necessary but insufficient. Here’s what it misses:
1. The difference between “compiles and runs” and “behaves correctly under production conditions”
Unit tests run against mocked dependencies. Integration tests usually run against test databases with synthetic data. Neither tests what happens when your database has 50 million rows, your cache is cold, and 10 concurrent deploys are happening across your fleet.
2. The blast radius of a bad deploy
Your pipeline knows if the build passed. It doesn’t know if a subtle regression will cause a 2% increase in error rate at 3x traffic — which would only show up an hour after deployment.
3. Whether the rollback path works
Most pipelines test the deployment path. Almost none test the rollback path with the same rigour. This is backwards — rollback is the most important path, and you want to have run it recently enough that you trust it.
What a Truthful Pipeline Looks Like
Pre-deploy: The gates that matter
Canary analysis is the single highest-leverage investment in deployment safety. Deploy to 1-5% of your fleet first, monitor for 10-30 minutes, and compare error rates and latency distributions against the hosts still running the previous deployment. Only proceed if the canary is healthy.
This catches a class of bugs that no amount of pre-deploy testing will find: the bugs that only emerge from production traffic patterns, real user data, and the combination of your software with live dependencies.
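As a rough illustration of the comparison step, here is a minimal Python sketch. The `FleetSample` type, the `canary_is_healthy` function, and every threshold value are assumptions made up for this example, not values from any particular tool; in practice the numbers would come from your own metrics system and be tuned per service.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import quantiles


@dataclass
class FleetSample:
    """Metrics from one group of hosts over the observation window (hypothetical shape)."""
    requests: int
    errors: int
    latencies_ms: list[float]

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def p99_ms(self) -> float:
        # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
        return quantiles(self.latencies_ms, n=100)[-1]


def canary_is_healthy(
    baseline: FleetSample,
    canary: FleetSample,
    max_error_rate_delta: float = 0.005,  # assumed: up to +0.5 percentage points allowed
    max_p99_ratio: float = 1.2,           # assumed: canary P99 may be at most 20% worse
    min_requests: int = 1000,             # assumed: with less traffic, refuse to judge
) -> bool:
    """Return True only if the canary looks no worse than the baseline fleet."""
    if canary.requests < min_requests:
        return False  # too little traffic to trust the comparison
    error_ok = canary.error_rate - baseline.error_rate <= max_error_rate_delta
    latency_ok = canary.p99_ms <= baseline.p99_ms * max_p99_ratio
    return error_ok and latency_ok
```

Treating "not enough traffic" as unhealthy is a deliberate choice in this sketch: a gate that passes by default when it has no data is exactly the kind of gate that lies.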
Rollback smoke tests: Before every production deploy, run a quick smoke test of the rollback procedure in a staging environment. “Does rolling back actually work right now?” is a question you want to answer before you need to.
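One way such a smoke test might be structured is sketched below in Python. The `deploy`, `current_version`, and `health_check` callables are hypothetical hooks into your own staging tooling; nothing here corresponds to a real deployment API.

```python
from typing import Callable


def rollback_smoke_test(
    deploy: Callable[[str], None],        # hypothetical: deploys a version to staging
    current_version: Callable[[], str],   # hypothetical: reports what staging is running
    health_check: Callable[[], bool],     # hypothetical: basic "is the service up" probe
    candidate: str,
) -> bool:
    """Roll forward to the candidate, roll back, and confirm staging is restored."""
    previous = current_version()
    deploy(candidate)   # roll forward to the version about to ship
    deploy(previous)    # the rollback path under test
    return current_version() == previous and health_check()
```

The test passes only if staging ends up on the version it started on and still answers its health check, which is the question you want answered before production needs it.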
Deployment blast radius control: Deploy serially across regions, not simultaneously. Your pipeline should know which region is “canary” (lowest traffic, first to get changes), which is “production” (full traffic), and what the approval gates between them are.
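The serial rollout described above could be modelled roughly like this. The region names, bake times, and the `deploy`, `is_healthy`, and `approved` callables are all illustrative assumptions; the point is only that the order and the gates are explicit data the pipeline can reason about.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    region: str                        # hypothetical region name
    bake_minutes: int                  # how long to observe before moving on
    needs_manual_approval: bool = False


# Lowest-traffic region first, highest-traffic last (illustrative values).
ROLLOUT_PLAN = [
    Stage("canary-region", bake_minutes=30),
    Stage("mid-traffic-region", bake_minutes=20),
    Stage("high-traffic-region", bake_minutes=20, needs_manual_approval=True),
]


def run_rollout(
    plan: list[Stage],
    deploy: Callable[[Stage], None],
    is_healthy: Callable[[Stage], bool],  # expected to observe for stage.bake_minutes
    approved: Callable[[Stage], bool],
) -> list[str]:
    """Deploy one region at a time and stop at the first failed gate.

    Returns the regions that actually received the change.
    """
    deployed: list[str] = []
    for stage in plan:
        if stage.needs_manual_approval and not approved(stage):
            break  # approval gate not passed; later regions are untouched
        deploy(stage)
        deployed.append(stage.region)
        if not is_healthy(stage):
            break  # contain the damage to the regions deployed so far
    return deployed
```

Because the function returns the list of regions touched, the caller knows exactly what needs rolling back if a gate fails partway through.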
Post-deploy: The gates almost everyone skips
Slow degradation detection: A deploy that looks fine for 5 minutes but degrades over 2 hours (a common pattern with memory leaks or connection pool exhaustion) should trigger automatic rollback. Your pipeline needs to keep watching after the deploy completes.
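A very simple version of this kind of check compares the earliest post-deploy readings of a metric with the most recent ones. The sketch below is one possible heuristic, not a recommended detector; the function name, window size, and growth ratio are assumptions chosen for illustration.

```python
from statistics import fmean
from typing import Sequence


def should_roll_back(
    samples: Sequence[float],       # readings at regular intervals since deploy, oldest first
    window: int = 5,                # assumed: number of readings averaged at each end
    max_growth_ratio: float = 1.5,  # assumed: recent average may be at most 1.5x the early one
    min_baseline: float = 1e-6,     # guards against dividing by a zero early average
) -> bool:
    """Flag a deploy whose metric (e.g. memory in MB) has crept up since release."""
    if len(samples) < 2 * window:
        return False  # not enough data yet to compare early and recent behaviour
    early = max(fmean(samples[:window]), min_baseline)
    recent = fmean(samples[-window:])
    return recent / early > max_growth_ratio
```

For example, with one reading every five minutes, a process that starts near 500 MB and gains 20 MB per reading would be flagged within two hours, while a flat series would not.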
Business metric anomaly detection: Beyond technical metrics (error rate, latency), track business metrics (checkout conversion rate, API call success rate, user session duration). A deploy that passes all technical gates but causes a 5% drop in conversions is still a bad deploy.
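To make the conversion example concrete, here is a hedged Python sketch that flags a drop only when it is both large enough to matter and unlikely to be noise. It uses a standard one-sided two-proportion z-test; the function name, the 5% relative-drop threshold, and the 1.645 cutoff (roughly 95% one-sided confidence) are assumptions for this example.

```python
import math


def conversion_dropped(
    before_sessions: int,
    before_orders: int,
    after_sessions: int,
    after_orders: int,
    max_relative_drop: float = 0.05,  # assumed: a 5% relative drop is the alarm level
    z_threshold: float = 1.645,       # assumed: one-sided test at roughly 95% confidence
) -> bool:
    """Return True if conversion fell by at least max_relative_drop and the
    difference is statistically distinguishable from noise."""
    p_before = before_orders / before_sessions
    p_after = after_orders / after_sessions
    if p_before == 0:
        return False  # nothing to drop from
    relative_drop = (p_before - p_after) / p_before
    if relative_drop < max_relative_drop:
        return False
    # Two-proportion z-test with a pooled conversion rate.
    pooled = (before_orders + after_orders) / (before_sessions + after_sessions)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / before_sessions + 1 / after_sessions))
    if std_err == 0:
        return False
    return (p_before - p_after) / std_err > z_threshold
```

The significance check matters because business metrics are noisy: a drop from 30 to 27 orders in a thousand sessions looks like 10% but is well within normal variation, whereas the same rates over a hundred thousand sessions are not.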
The Organisational Problem
The real reason pipelines lie isn’t technical — it’s that pipeline quality is never measured. Teams measure:
- Deploy frequency ✓
- Mean time to recover (MTTR) ✓
- Change failure rate ✓
But rarely:
- False positive rate of the pipeline (changes that were blocked unnecessarily)
- False negative rate (deploys that were approved but caused incidents)
- Time to detect a bad deploy
- Rollback success rate
If you don’t measure pipeline quality, you won’t invest in improving it. And if your pipeline has a high false-positive rate (it blocks good changes frequently), your engineers will start ignoring the gates — which is worse than having no gates at all.
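These four measurements can be computed from a plain log of pipeline decisions. The sketch below assumes a hypothetical `ChangeRecord` shape and uses descriptive metric names (bad deploys that escaped, good changes that were blocked) so that the output is unambiguous; how you decide after the fact whether a change was "actually bad" is left to your incident review process.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRecord:
    """One pipeline decision, annotated after the fact (hypothetical schema)."""
    approved: bool                              # did the pipeline let the change through?
    was_actually_bad: bool                      # judged later, e.g. in incident review
    minutes_to_detect: Optional[float] = None   # set for bad deploys that shipped
    rollback_succeeded: Optional[bool] = None   # set only when a rollback was attempted


def _ratio(numerator: float, denominator: int) -> Optional[float]:
    return numerator / denominator if denominator else None


def pipeline_quality(records: list[ChangeRecord]) -> dict[str, Optional[float]]:
    approved = [r for r in records if r.approved]
    blocked = [r for r in records if not r.approved]
    escaped = [r for r in approved if r.was_actually_bad]
    needless = [r for r in blocked if not r.was_actually_bad]
    detect = [r.minutes_to_detect for r in escaped if r.minutes_to_detect is not None]
    rollbacks = [r.rollback_succeeded for r in records if r.rollback_succeeded is not None]
    return {
        # Share of approved deploys that went on to cause an incident.
        "escaped_bad_deploy_rate": _ratio(len(escaped), len(approved)),
        # Share of blocked changes that were in fact fine.
        "unnecessary_block_rate": _ratio(len(needless), len(blocked)),
        "mean_minutes_to_detect": _ratio(sum(detect), len(detect)),
        "rollback_success_rate": _ratio(sum(rollbacks), len(rollbacks)),
    }
```

Each metric is `None` when there is no data for it, which keeps "we have never attempted a rollback" distinct from "our rollbacks always fail".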
Practical Steps
This sprint:
- Add canary analysis to your next critical deploy, even if that just means manually comparing error rates before and after.
- Run a rollback drill. Deploy to staging, roll back, confirm it worked.
This quarter:
- Implement automated canary analysis with defined success criteria (error rate < X, P99 latency < Y)
- Add at least one post-deploy monitoring window with defined rollback triggers
- Start measuring pipeline false-positive and false-negative rates
This half:
- Full progressive delivery across your fleet with automated gates
- Business metric integration into deployment health checks
- Deployment playbooks that document what to watch for each service
The goal is a pipeline that your engineers trust enough to deploy on a Friday afternoon. That’s not bravado; it’s the benchmark for a truthful pipeline. If your team is nervous about Friday deploys, the pipeline is lying to you.