AI & Engineering · 3 min read

Building LLM-Powered Infrastructure Discovery at AWS Scale

How we designed and deployed an AI agent using Amazon Bedrock to automate EC2 instance selection, achieving 95%+ recommendation accuracy across thousands of instance type, region, and pricing combinations.

As AWS launched hundreds of new EC2 instance types over the past decade, the question of “which instance should I use?” became genuinely hard, not just for customers but for the systems trying to help them. This is the story of how we built an LLM-powered discovery agent to answer it.

The Problem: Too Much Choice

The AWS EC2 catalogue is enormous: thousands of instance type, region, and pricing combinations. A customer migrating from on-premises hardware to AWS faces a combinatorial explosion of options. The existing tooling (AWS Compute Optimizer, instance type comparison tables) helps, but it requires users to already understand the dimensions they’re optimising for.

We needed something that could take a natural-language description — “I need something that handles 10K concurrent WebSocket connections with low latency and predictable performance” — and return a ranked, reasoned recommendation.

Architecture: Retrieval-Augmented Generation on Instance Metadata

The core insight was treating EC2 instance selection as a retrieval problem first and a generation problem second.

Phase 1: Structured Knowledge Base

We built a pipeline that:

  1. Ingested all EC2 instance metadata (vCPU, memory, network bandwidth, EBS throughput, GPU specs, pricing) into a vector store
  2. Enriched each entry with human-readable characteristics generated from structured data (e.g., “This is a memory-optimised instance with an 8:1 memory-to-vCPU ratio (GiB per vCPU), suited for in-memory databases”)
  3. Tagged instances with workload archetypes derived from AWS documentation and customer usage patterns
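The enrichment step in the pipeline above can be sketched as a small function that turns a structured record into an embeddable sentence. This is a minimal illustration, not our production code: the `InstanceSpec` fields, the `describe` helper, and the ratio thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical record for one instance type; field names are assumptions."""
    name: str
    vcpus: int
    memory_gib: float
    network_gbps: float


def describe(spec: InstanceSpec) -> str:
    """Turn structured specs into a sentence that can be embedded and retrieved."""
    ratio = spec.memory_gib / spec.vcpus
    # Illustrative thresholds only, not official AWS family definitions.
    if ratio >= 8:
        family, use = "memory-optimised", "in-memory databases and caches"
    elif ratio <= 2:
        family, use = "compute-optimised", "CPU-heavy batch and encoding jobs"
    else:
        family, use = "general-purpose", "balanced web and application servers"
    return (
        f"{spec.name} is a {family} instance with {spec.vcpus} vCPUs, "
        f"{spec.memory_gib:g} GiB of memory ({ratio:g}:1 GiB per vCPU) and up to "
        f"{spec.network_gbps:g} Gbps networking, suited for {use}."
    )


example = InstanceSpec(name="r5.large", vcpus=2, memory_gib=16, network_gbps=10)
print(describe(example))
```

Generating the sentence from the structured fields, rather than asking a model to write it freely, keeps every number in the description traceable to the source record.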

Phase 2: Intent Classification

Before hitting the retrieval layer, we classified the incoming query into workload dimensions:

  • Compute-bound vs memory-bound vs IO-bound
  • Latency-sensitive vs throughput-optimised
  • Cost-priority vs performance-priority
  • Burstable vs sustained load

This classification — itself an LLM task — dramatically improved retrieval precision.
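One way to picture why classification helps retrieval: the classified intent acts as a filter before similarity ranking, so a high-scoring but wrong-shaped candidate never reaches the model. The sketch below assumes the similarity scores were already computed by a vector search; the `WorkloadIntent` and `Candidate` types, the archetype tags, and the scores are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Bound(Enum):
    COMPUTE = "compute"
    MEMORY = "memory"
    IO = "io"


@dataclass(frozen=True)
class WorkloadIntent:
    """Classifier output covering the four dimensions listed above."""
    bound: Bound
    latency_sensitive: bool
    cost_priority: bool
    burstable: bool


@dataclass(frozen=True)
class Candidate:
    name: str
    bound: Bound        # workload archetype tag (illustrative)
    burstable: bool
    similarity: float   # vector-search score, assumed precomputed


def retrieve(intent: WorkloadIntent, candidates: list, k: int = 3) -> list:
    """Filter by classified intent first, then rank the survivors by similarity."""
    matching = [
        c for c in candidates
        if c.bound is intent.bound and c.burstable == intent.burstable
    ]
    return sorted(matching, key=lambda c: c.similarity, reverse=True)[:k]


intent = WorkloadIntent(
    bound=Bound.COMPUTE, latency_sensitive=True, cost_priority=False, burstable=False
)
candidates = [
    Candidate("t3.medium", Bound.COMPUTE, True, 0.91),   # burstable, filtered out
    Candidate("r5.large", Bound.MEMORY, False, 0.88),    # wrong archetype, filtered out
    Candidate("c5.large", Bound.COMPUTE, False, 0.78),
    Candidate("c6i.large", Bound.COMPUTE, False, 0.74),
]
print([c.name for c in retrieve(intent, candidates)])
```

Note that the two highest similarity scores are discarded here. Without the intent filter, they would have ranked first.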

Phase 3: Generation with Constraint Satisfaction

The final step used Claude (via Bedrock) with a structured prompt that included:

  • The top-k retrieved instance candidates with their enriched metadata
  • The classified workload intent
  • Hard constraints (region availability, pricing limits, compliance requirements)
  • A chain-of-thought reasoning requirement

The model returned ranked recommendations with explicit reasoning — something customers could actually audit and argue with.
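A rough sketch of how the prompt inputs listed above might be assembled. It assumes hard constraints are applied in code before the prompt is built, so the model only ever sees candidates that already satisfy them. The `Offer` type, both helper functions, the prompt wording, and the prices are placeholders, and the call to the model itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Hypothetical retrieved candidate with the fields the constraints need."""
    instance_type: str
    regions: frozenset
    hourly_usd: float   # placeholder prices, not real AWS pricing
    description: str


def apply_hard_constraints(offers: list, region: str, max_hourly_usd: float) -> list:
    """Drop candidates that violate region or price limits before prompting."""
    return [o for o in offers if region in o.regions and o.hourly_usd <= max_hourly_usd]


def build_prompt(query: str, intent_summary: str, offers: list) -> str:
    """Assemble the structured prompt: request, intent, candidates, reasoning rule."""
    lines = [
        "You are recommending EC2 instance types.",
        "Only recommend instance types from the candidate list below.",
        f"User request: {query}",
        f"Classified workload intent: {intent_summary}",
        "Candidates:",
    ]
    lines += [f"- {o.instance_type} (${o.hourly_usd:.3f}/h): {o.description}" for o in offers]
    lines.append("Think step by step, then rank the candidates and explain each ranking.")
    return "\n".join(lines)


offers = [
    Offer("r5.large", frozenset({"eu-west-1", "us-east-1"}), 0.10, "memory-optimised, 2 vCPUs"),
    Offer("r5.xlarge", frozenset({"us-east-1"}), 0.20, "memory-optimised, 4 vCPUs"),
    Offer("x1e.xlarge", frozenset({"eu-west-1"}), 0.90, "memory-optimised, 4 vCPUs"),
]
allowed = apply_hard_constraints(offers, region="eu-west-1", max_hourly_usd=0.50)
print(build_prompt("in-memory cache for session data", "memory-bound, sustained load", allowed))
```

Filtering in code rather than asking the model to respect the limits means a constraint violation cannot appear in the output unless the model invents an instance type that was never in the list.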

Results

After a 3-month production trial across internal AWS tooling:

  • 95%+ accuracy on known-answer benchmark workloads
  • ~40% reduction in support tickets related to instance selection
  • P50 response time of 1.2 seconds — fast enough for interactive use
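For readers who want to run a similar evaluation, here is one plausible way to score a known-answer benchmark. It assumes accuracy means the top-ranked recommendation matches the expected instance type, which is only one possible definition. The benchmark cases and the stub recommender are invented for illustration and are not our data.

```python
def top1_accuracy(cases: list, recommend) -> float:
    """Fraction of cases where the first recommendation equals the expected answer."""
    if not cases:
        raise ValueError("benchmark must contain at least one case")
    hits = 0
    for query, expected in cases:
        ranked = recommend(query)
        if ranked and ranked[0] == expected:
            hits += 1
    return hits / len(cases)


# Invented benchmark cases: (workload description, expected instance type).
cases = [
    ("in-memory cache", "r5.large"),
    ("video encoding", "c5.large"),
    ("bursty low-traffic website", "t3.medium"),
    ("high-IOPS local storage", "i3.large"),
]


def stub_recommend(query: str) -> list:
    """Stand-in for the real agent; deliberately gets one case wrong."""
    canned = {
        "in-memory cache": ["r5.large", "r5.xlarge"],
        "video encoding": ["c5.large"],
        "bursty low-traffic website": ["c5.large", "t3.medium"],
        "high-IOPS local storage": ["i3.large"],
    }
    return canned.get(query, [])


print(top1_accuracy(cases, stub_recommend))
```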

Lessons Learned

Don’t skip the retrieval layer. Pure generation (prompting the LLM with all instance metadata) failed for two reasons: context window limits and hallucination. The LLM would confidently describe instance specifications that didn’t exist.
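The hallucination failure also suggests a cheap guard worth considering: compare any specification the model states against the catalogue and flag mismatches. This is an illustrative safeguard under assumed data shapes, not a description of what we shipped; `find_ungrounded_claims` and the dictionary layout are hypothetical, and `r9z.mega` is a deliberately made-up instance type.

```python
def find_ungrounded_claims(claims: dict, catalogue: dict) -> list:
    """Return a message for every claimed spec the catalogue does not confirm."""
    problems = []
    for name, claimed in claims.items():
        actual = catalogue.get(name)
        if actual is None:
            problems.append(f"{name}: not in catalogue")
            continue
        for field, value in claimed.items():
            if actual.get(field) != value:
                problems.append(
                    f"{name}.{field}: claimed {value}, catalogue has {actual.get(field)}"
                )
    return problems


catalogue = {"r5.large": {"vcpus": 2, "memory_gib": 16}}
claims = {
    "r5.large": {"vcpus": 2, "memory_gib": 32},   # wrong memory figure
    "r9z.mega": {"vcpus": 64},                    # invented instance type
}
for problem in find_ungrounded_claims(claims, catalogue):
    print(problem)
```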

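Classification before retrieval compounds gains. Precision improvements at each layer multiply through the pipeline: a cleaner intent yields a cleaner candidate set, which gives the model less room to go wrong. Getting the workload intent right first was worth the extra latency.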

Chain-of-thought is non-negotiable for trust. Customers didn’t just want recommendations — they wanted to understand why. An answer without reasoning was treated with the same scepticism as a magic 8-ball.

Structured output parsing beats free-form. We eventually moved to JSON-mode outputs with a defined schema, which made downstream processing and UI rendering far more reliable.
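To make the benefit concrete, here is a minimal sketch of parsing a schema-constrained response. The schema (a `recommendations` array with `instance_type`, `rank`, and `reasoning`) is an assumption chosen for the example, not the schema we actually used.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    instance_type: str
    rank: int
    reasoning: str


def parse_recommendations(raw: str) -> list:
    """Parse model output into typed records, failing loudly on malformed input.

    Raises json.JSONDecodeError for invalid JSON, KeyError for missing fields,
    and ValueError for a non-integer rank or empty reasoning.
    """
    payload = json.loads(raw)
    recs = []
    for item in payload["recommendations"]:
        rec = Recommendation(
            instance_type=str(item["instance_type"]),
            rank=int(item["rank"]),
            reasoning=str(item["reasoning"]),
        )
        if not rec.reasoning.strip():
            raise ValueError(f"{rec.instance_type}: reasoning must not be empty")
        recs.append(rec)
    return sorted(recs, key=lambda r: r.rank)


raw = """
{"recommendations": [
  {"instance_type": "r5.xlarge", "rank": 2, "reasoning": "More headroom, higher cost."},
  {"instance_type": "r5.large", "rank": 1, "reasoning": "Fits the stated memory need."}
]}
"""
for rec in parse_recommendations(raw):
    print(rec.rank, rec.instance_type, rec.reasoning)
```

Rejecting an empty `reasoning` field at parse time keeps the explanation requirement enforceable in code rather than only in the prompt.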


This pattern — classify, retrieve, generate, reason — has become a standard approach in our internal AI tooling. I’ll write more about applying it to other AWS infrastructure problems in future posts.
