Everyone tells you EKS is “just Kubernetes.” Nobody tells you about the production surprises waiting on the other side.

Running Kubernetes in a tutorial is one thing. Running it in production on AWS is a completely different game.

I’ve spent months building a real-world DevOps course with 55+ hands-on demos, and along the way I hit every production gotcha you can imagine. Here are the five lessons I wish someone had told me upfront. Each one could save you hours of debugging and hundreds of dollars in wasted cloud spend.

1. Cluster Autoscaler Doesn't Consolidate Nodes

This one surprised me the most.

Cluster Autoscaler (the default scaling solution most teams start with) only removes completely empty nodes. If a node is running at 10% utilization? It stays. If you have four nodes each running a single small pod? All four stay running. You keep paying for all of them.

Here's what happened in my demo:

I deployed 10 pods across 4 nodes, then scaled the deployment down to 2 pods. With Cluster Autoscaler, all 4 nodes stayed running — because each still had at least one pod on it.

Then I switched to Karpenter.

Within 30 seconds, Karpenter analyzed the remaining pods, consolidated them onto a single right-sized node, and terminated the other three. Four nodes became one. That's a 75% cost reduction just from intelligent bin-packing.

Why Karpenter is different: Karpenter doesn't just remove empty nodes — it actively looks for opportunities to consolidate workloads onto fewer, better-sized instances. It considers pod resource requests, topology constraints, and instance type availability to find the optimal placement.

What this means for you: If you're running EKS with Cluster Autoscaler, you're likely paying for nodes that are barely utilized. Karpenter's consolidation feature alone can cut your compute costs dramatically.
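
To make that concrete, here's a minimal sketch of a Karpenter NodePool (v1 API) with consolidation enabled. The name, requirements, and timings are illustrative, not the exact manifests from the course:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                      # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # assumes an EC2NodeClass named "default" exists
  disruption:
    # Repack pods onto fewer, better-sized nodes instead of only removing empty ones
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```

The WhenEmptyOrUnderutilized policy is what allows Karpenter to evict and reschedule running pods in order to shrink the fleet, which is exactly what produced the four-nodes-to-one result above.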

The course covers both On-Demand and Spot NodePools with live consolidation demos in Section 17.

2. Spot Instances Will Interrupt Without Warning (Unless You Handle It)

Spot instances offer up to 70% savings over On-Demand pricing. That's a massive cost reduction. But there's a catch that most tutorials gloss over: AWS can reclaim your Spot instances with just a 2-minute warning.

Without proper handling, here's what happens:

AWS decides it needs the capacity back. Your instance gets terminated. Your pods die immediately. Requests in progress fail. Users see errors. Your PagerDuty goes off at 3 AM.

With proper handling (Karpenter + EventBridge + SQS), here's the flow:

  1. AWS sends a Spot interruption notice to EventBridge

  2. EventBridge routes it to an SQS queue

  3. Karpenter picks up the message and gets a 2-minute heads-up

  4. Karpenter immediately provisions a new replacement node

  5. The old node is cordoned (no new pods scheduled)

  6. Pods are gracefully rescheduled to the new node

  7. The old node terminates — and nobody notices

The result: 70% cost savings with zero downtime. Your users never know it happened.
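
The cluster-side wiring is surprisingly small. Here's a rough sketch of the relevant Karpenter Helm values, with placeholder names (the SQS queue and EventBridge rules themselves live outside the cluster and are typically created by Terraform or CloudFormation):

```yaml
# Karpenter Helm chart values (sketch; names are placeholders)
settings:
  clusterName: my-eks-cluster
  interruptionQueue: Karpenter-my-eks-cluster   # SQS queue that receives Spot interruption events via EventBridge
```

A Spot NodePool then just adds karpenter.sh/capacity-type: spot to its requirements; once Karpenter sees the interruption message, it handles the replacement, cordon, and drain on its own.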

I built a complete demo where I simulate Spot interruptions and show the entire flow in real time. It's one of the most eye-opening sections in the course (Section 17), because you can actually watch Karpenter orchestrate the replacement before the old node dies.

3. Hardcoding Secrets Will Come Back to Haunt You

We've all done it. A quick env variable in a YAML manifest. A database password in a ConfigMap. "I'll fix it later."

Later never comes. And then:

  • Someone commits the YAML to a public repo

  • A credential rotation requires redeploying every service

  • An audit reveals secrets are visible in kubectl describe pod

The production-grade approach:

AWS Secrets Manager + Secrets Store CSI Driver. Here's how it works:

  1. Secrets live in AWS Secrets Manager (encrypted, access-controlled, audited)

  2. The Secrets Store CSI Driver mounts them as volumes in your pods

  3. When a secret rotates in AWS, the mounted volume updates automatically

  4. Your application picks up the new secret — no restart needed

What this means in practice: Your security team is happy (secrets are centralized and audited). Your ops team is happy (no restarts during rotation). Your developers are happy (they just read from a file path).
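
As a rough sketch of the wiring (secret names, paths, and the image are placeholders, not the course's actual manifests):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-db-creds                        # placeholder name
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/app/db-credentials"   # placeholder Secrets Manager secret
        objectType: "secretsmanager"
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                            # placeholder
spec:
  serviceAccountName: app-sa                # needs IAM access (IRSA) to read the secret
  containers:
    - name: app
      image: nginx:stable                   # placeholder image
      volumeMounts:
        - name: secrets
          mountPath: /mnt/secrets           # the app reads the secret from a file here
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: app-db-creds
```

Nothing sensitive lives in the manifest itself; the pod just reads a file under /mnt/secrets. (The in-place refresh on rotation assumes the driver's secret-rotation feature is enabled.)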

Section 9 in the course walks through the complete setup with AWS Secrets Manager integration, Secrets Store CSI Driver, and Kubernetes manifests that reference them.

4. Observability Isn't Just "Install Prometheus"

When people say "add monitoring," they usually mean install Prometheus and Grafana and call it done. But production observability has three distinct pillars, and each requires its own pipeline:

Traces — Follow a single request across multiple microservices
→ Instrumented with OpenTelemetry → Collected by ADOT Collector → Sent to AWS X-Ray

Logs — Capture application output and errors
→ Collected by a dedicated ADOT Collector → Sent to Amazon CloudWatch Logs

Metrics — Track resource usage, request rates, error rates
→ Collected by a third ADOT Collector → Sent to Amazon Managed Prometheus → Visualized in Amazon Managed Grafana

Three pillars. Three ADOT collectors. Three AWS destinations.
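
For a feel of what one of those pipelines looks like, here's a rough sketch of the metrics leg of an ADOT Collector config: scrape Prometheus endpoints, sign the requests, and remote-write to Amazon Managed Prometheus. The region and workspace URL are placeholders:

```yaml
extensions:
  sigv4auth:
    region: us-east-1                        # placeholder region

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod

exporters:
  prometheusremotewrite:
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    auth:
      authenticator: sigv4auth               # SigV4-sign writes to Amazon Managed Prometheus

service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```

The traces and logs collectors follow the same shape, just with different receivers and exporters (OTLP into X-Ray, container logs into CloudWatch Logs).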

The cost trap nobody warns you about:

Kubernetes runs health check probes (liveness and readiness) every 10 seconds on every pod, and each probe can generate a trace. With 20 pods, a single probe every 10 seconds already adds up to 120 traces per minute, or 172,800 traces per day, just from health checks. Your X-Ray bill explodes with completely useless data.

The fix: Configure the OpenTelemetry Collector to filter out health check traces (/health, /ready endpoints) before they reach X-Ray.
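
A rough sketch of what that looks like in the Collector config (the exact attribute to match, http.route here, depends on how your services are instrumented):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  batch: {}
  # Drop health-check spans before they are exported to X-Ray
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'

exporters:
  awsxray: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/healthchecks, batch]
      exporters: [awsxray]
```

Spans that match either condition are dropped inside the Collector, so they never count against your X-Ray bill.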

I reduced my tracing costs by 85% with this single configuration change. Same visibility into real user requests. 85% lower cost.

Section 20 covers the complete ADOT setup with Amazon Managed Prometheus (AMP), Amazon Managed Grafana (AMG), and X-Ray integration.

5. Tutorial Databases Are Nothing Like Production Databases

Every Kubernetes tutorial uses SQLite or an in-memory database. It works great for demos. Then you get to production and realize:

  • Your catalog service needs MySQL for relational product data

  • Your cart service needs DynamoDB for fast NoSQL access

  • Your order service needs PostgreSQL for complex order queries

  • Your checkout service needs ElastiCache Redis for session caching

  • Your order processing needs SQS for async message queuing

Five microservices. Five different AWS managed services. One application.

In the course, students build a complete retail store application with this real-world data plane:

  • AWS RDS MySQL — Product catalog storage

  • AWS RDS PostgreSQL — Order management

  • AWS DynamoDB — Shopping cart (NoSQL)

  • AWS ElastiCache — Checkout session caching

  • AWS SQS — Async order processing

Each service has its own connection pooling, retry logic, and secrets management. Everything is provisioned with Terraform — one terraform apply creates the entire data layer.
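
To show the pattern on the Kubernetes side, here's a rough sketch of how one service might consume a managed database; the names, keys, and image are hypothetical, not the course's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog                              # hypothetical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: catalog
  template:
    metadata:
      labels:
        app: catalog
    spec:
      containers:
        - name: catalog
          image: catalog:latest              # placeholder image
          env:
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: catalog-db-config    # holds the RDS MySQL endpoint, e.g. from Terraform outputs
                  key: endpoint
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: catalog-db-secret    # synced from AWS Secrets Manager (see lesson 3)
                  key: password
```

The database itself lives in RDS; the cluster only ever sees an endpoint and a credential.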

Why managed services over running databases in Kubernetes?

You can run databases in K8s. But that means you handle backups, failover, storage scaling, and the 2 AM alerts. With AWS managed services, AWS handles all of that. For most teams, that trade-off is worth it.

The Pattern Across All Five Lessons

Notice something? Every lesson follows the same pattern:

Tutorial approach → works in dev → breaks in production → costs money

Production approach → requires more setup → saves money → actually works at scale

This course exists to close that gap. Not with theory — with working Terraform configurations, Kubernetes manifests, and live demos you can follow along with.

Explore the Course GitHub Repo

All the code referenced in this newsletter is publicly available:

Star the repo if you find it useful!

Join 2,821+ Students Building Production-Ready Skills

Which of these five lessons did you learn the hard way? Hit reply and share your story — I read every response.

Kalyan Reddy Daida | StackSimplify
