# Kubernetes Cost Optimization: 5 kubectl Commands That Reveal Your Cluster Waste
## The $14,000/Month Discovery
Last quarter, a Series B fintech company asked us to look at their AWS bill. They were running a 24-node Kubernetes cluster on EKS, spending roughly $35,000 a month on compute alone. Their CTO's exact words: "We know we're overspending, but we don't know where."
We sat down, opened a terminal, and ran five kubectl commands. Took about twenty minutes. What we found was $14,200 per month in pure waste — pods requesting resources they never used, dev namespaces running around the clock, storage volumes allocated and forgotten, and nodes sitting at under 25% utilization.
No fancy tooling. No $50,000 FinOps platform license. Just five commands and the knowledge of what to look for.
Here's exactly what we ran and what it revealed.
---
## Command 1: Find Pods Requesting Way More Than They Use
This is where the money hides. Kubernetes resource requests are promises — when a pod says it needs 4 CPU cores, the scheduler reserves 4 cores on a node, whether the pod actually uses them or not. Those reserved-but-unused resources are invisible unless you look.
The command:
```
kubectl top pods --all-namespaces --sort-by=cpu
```

Sample output:

```
NAMESPACE    NAME                                CPU(cores)   MEMORY(bytes)
production   payment-api-7d4f8b6c9-x2k4l         312m         486Mi
production   order-service-5c8d9e7f1-m9n3p       289m         512Mi
production   analytics-worker-6b3a2c8d4-q7w1r    47m          1.2Gi
staging      frontend-dev-8f2e1a5b7-j4k6h        12m          128Mi
default      test-deployment-9c7d3e6f2-p8r5t     3m           64Mi
```

Now compare that to what those pods are actually requesting:
```
kubectl get pods --all-namespaces -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory"
```

Sample output:
```
NAMESPACE    NAME                                CPU_REQ   MEM_REQ
production   payment-api-7d4f8b6c9-x2k4l         4         8Gi
production   order-service-5c8d9e7f1-m9n3p       2         4Gi
production   analytics-worker-6b3a2c8d4-q7w1r    2         4Gi
staging      frontend-dev-8f2e1a5b7-j4k6h        1         2Gi
default      test-deployment-9c7d3e6f2-p8r5t     500m      1Gi
```

See the gap? The payment API requested 4 CPU cores but was using 312 millicores — that's 7.8% utilization. The analytics worker asked for 2 cores and was using 47 millicores. On a per-pod basis, that might not sound like much. But multiply it across 180 pods, and those phantom reservations were locking up 14 nodes' worth of capacity that nobody was using.
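Rather than eyeballing two outputs side by side, you can join them into a single request-vs-usage view. A minimal sketch (CPU only; it assumes metrics-server is installed, and pods with multiple containers will show comma-joined requests):

```bash
# Join per-pod CPU usage (kubectl top) with per-pod CPU requests.
# Rough sketch: keys are "namespace/pod", both streams sorted for join.
join \
  <(kubectl top pods --all-namespaces --no-headers \
      | awk '{print $1"/"$2, $3}' | sort) \
  <(kubectl get pods --all-namespaces --no-headers \
      -o custom-columns="NS:.metadata.namespace,NAME:.metadata.name,REQ:.spec.containers[*].resources.requests.cpu" \
      | awk '{print $1"/"$2, $3}' | sort) \
  | awk '{printf "%-55s usage=%-8s requested=%s\n", $1, $2, $3}'
```

Any pod where `requested` dwarfs `usage` is a right-sizing candidate.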
Dollar impact: In this client's case, over-provisioned resource requests accounted for roughly $6,800/month in wasted node capacity. Those nodes were running, being billed by AWS, and doing almost nothing.
---
## Command 2: Find Unscheduled or Pending Pods Eating Reservations
Pods stuck in Pending state are a silent budget killer. They often trigger the cluster autoscaler to spin up new nodes — nodes that then sit idle because the pod can't actually schedule (maybe due to affinity rules, taints, or impossible resource requests).
The command:
```
kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o wide
```

Sample output:
```
NAMESPACE   NAME                                   READY   STATUS    AGE   NODE
ml-jobs     training-pipeline-large-2f8a1b-k4m7q   0/1     Pending   12d   <none>
analytics   spark-driver-batch-9c3d7e-p2n5r        0/1     Pending   8d    <none>
staging     load-test-runner-7b4e2a-x8j3l          0/1     Pending   23d   <none>
```

That load-test-runner had been pending for 23 days. Nobody noticed. But the cluster autoscaler kept trying to provision a node large enough to fit its request of 16 CPU cores and 64Gi of memory. It would spin up an m5.4xlarge, fail to schedule (the pod had a node affinity rule for a label that didn't exist), then the node would sit idle for the scale-down cooldown period before being terminated. Then the cycle repeated.
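To find out why a pod is stuck, check its scheduling events. Two quick ways (the pod name here comes from the output above):

```bash
# Show the scheduler's stated reason for refusing this pod
kubectl describe pod -n staging load-test-runner-7b4e2a-x8j3l | grep -A 10 Events

# Or list every FailedScheduling event across the cluster
kubectl get events --all-namespaces --field-selector reason=FailedScheduling
```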
Dollar impact: This single zombie pod cost $340/month in autoscaler churn — nodes spinning up and down for nothing. Across three similar stuck pods, the total was $1,100/month.
---
## Command 3: Identify Idle Namespaces
This one is embarrassing in hindsight, but it's incredibly common. Dev teams spin up namespaces for feature branches, load tests, demos, and sprint reviews. They rarely clean up after themselves.
The command:
```
kubectl get pods --all-namespaces --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn
```

Sample output:
```
 47 production
 31 kube-system
 22 monitoring
 18 staging
 14 dev-team-alpha
 11 dev-team-beta
  9 qa-regression
  8 demo-client-acme
  6 sprint-42-review
  4 load-test-march
  2 sandbox-intern
```

Now check which of those namespaces have actually received traffic in the last 30 days. sprint-42-review was from two months ago. load-test-march was from March. demo-client-acme was for a sales demo that happened six weeks prior. sandbox-intern belonged to an intern who had left the company.
Those four namespaces were running a combined 20 pods across 6 nodes, none of which served any purpose.
Dollar impact: $2,400/month. Just sitting there. Doing absolutely nothing.
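If you don't have traffic metrics handy, the age of the newest pod in each namespace is a crude but useful idleness proxy. A minimal sketch (the negative-index jsonpath slice needs a reasonably recent kubectl):

```bash
# Print each namespace alongside the creation time of its most recent pod
for ns in $(kubectl get ns --no-headers | awk '{print $1}'); do
  newest=$(kubectl get pods -n "$ns" --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{.items[-1:].metadata.creationTimestamp}' 2>/dev/null)
  echo "$ns ${newest:-no-pods}"
done
```

Any namespace whose newest pod is weeks old deserves a hard look.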
Bonus — get a quick resource summary per namespace:
```
kubectl describe resourcequota --all-namespaces 2>/dev/null || \
kubectl get resourcequotas --all-namespaces -o custom-columns="NAMESPACE:.metadata.namespace,CPU_REQ:.status.used.requests\.cpu,MEM_REQ:.status.used.requests\.memory"
```

---
## Command 4: Spot Oversized PVCs (Persistent Volume Claims)
Storage is the cost everyone forgets about. Teams request 100Gi "just in case" and then use 3Gi. Cloud providers charge for allocated storage, not used storage.
The command:
```
kubectl get pvc --all-namespaces -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,CAPACITY:.status.capacity.storage,STORAGECLASS:.spec.storageClassName"
```

Sample output:
```
NAMESPACE        NAME               CAPACITY   STORAGECLASS
production       postgres-data      500Gi      gp3
production       redis-backup       200Gi      gp3
analytics        clickhouse-data    1Ti        gp3
staging          postgres-staging   500Gi      gp3
dev-team-alpha   mongo-dev          100Gi      gp2
```

That staging Postgres had a 500Gi volume — identical to production. But staging held 8Gi of data. The ClickHouse volume was 1Ti, but actual usage was 140Gi. And the dev MongoDB was on the more expensive gp2 storage class.
To check actual usage, exec into pods or use monitoring:
```
kubectl exec -n production postgres-0 -- df -h /var/lib/postgresql/data
```

```
Filesystem     Size   Used   Avail   Use%
/dev/nvme1n1   500G   89G    411G    18%
```

Production Postgres was at 18% utilization. Staging was at 1.6%.
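While you're in there, sweep for volumes still on older storage classes. A one-liner that flags anything on gp2 (the class names are AWS-specific):

```bash
# List PVCs still on gp2 (candidates for a cheaper gp3 migration)
kubectl get pvc --all-namespaces \
  -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.storageClassName" \
  | grep 'gp2$'
```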
Dollar impact: Oversized PVCs were costing $1,900/month. The staging volume alone — which could have been 50Gi — was wasting $36/month on storage that held a test dataset you could recreate in five minutes.
---
## Command 5: Node Utilization Audit
This is the big picture command. It shows you whether your nodes are actually earning their keep.
The command:
```
kubectl describe nodes | grep -A 5 "Allocated resources"
```

Sample output (per node):
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource   Requests       Limits
  --------   --------       ------
  cpu        1200m (15%)    4000m (50%)
  memory     2.1Gi (13%)    8Gi (50%)
```

When we ran this across all 24 nodes, the average CPU request utilization was 23%. That means 77% of the CPU capacity the client was paying for was reserved but idle. Twelve of the 24 nodes were below 20% request utilization.
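Reading that block 24 times by hand gets old fast. A rough aggregation sketch (it parses human-readable describe output, so field positions may shift between kubectl versions):

```bash
# Average CPU request utilization across all nodes
kubectl describe nodes | awk '
  $1 == "cpu" && $3 ~ /%/ { gsub(/[()%]/, "", $3); sum += $3; n++ }
  END { if (n) printf "nodes: %d, avg CPU requested: %.0f%%\n", n, sum / n }'
```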
A more structured view:
```
kubectl top nodes
```

```
NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-1-47.ec2.internal   890m         11%    3.2Gi           20%
ip-10-0-1-98.ec2.internal   1240m        15%    4.1Gi           25%
ip-10-0-2-33.ec2.internal   2100m        26%    6.8Gi           42%
ip-10-0-2-71.ec2.internal   340m         4%     1.1Gi           7%
ip-10-0-3-15.ec2.internal   780m         9%     2.9Gi           18%
```

Node ip-10-0-2-71 was at 4% CPU utilization. It was an m5.2xlarge costing $0.384/hour — roughly $280/month to run at 4% load.
Dollar impact: By right-sizing the cluster from 24 nodes to 15 nodes (and switching to a mix of instance types), the projected savings were $7,200/month on compute alone.
---
## The Optimization Playbook
Finding the waste is step one. Here's the playbook we implemented:
Vertical Pod Autoscaler (VPA): We deployed VPA in recommendation mode first (a sample manifest follows this list). It analyzed actual resource usage over two weeks and suggested right-sized requests. The payment API went from 4 CPU / 8Gi to 500m CPU / 1Gi — and performance didn't change at all.
Cluster Autoscaler tuning: We adjusted the scale-down delay from the default 10 minutes to 5 minutes and set the utilization threshold to 50%. This eliminated the zombie node problem.
Spot instances for non-production: All staging, dev, and QA workloads moved to spot instances, saving 60-70% on those nodes.
Namespace quotas and lifecycle policies: Every non-production namespace now has a TTL. After 7 days of inactivity, an automated job scales it to zero. After 30 days, it gets deleted.
Scheduled scaling: Dev and staging namespaces scale to zero at 8 PM and back up at 8 AM, Monday through Friday. Weekends are off entirely.
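Here's roughly what the recommendation-mode VPA from the first step looked like. A minimal sketch: the deployment name comes from the earlier example, and it assumes the VPA CRDs and controller are already installed in the cluster:

```bash
# Recommendation-only VPA (updateMode "Off"): it records suggestions, never evicts pods
cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"
EOF

# After a week or two, read the recommendations:
kubectl describe vpa payment-api-vpa -n production
```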
---
## Cost Comparison: Before and After
Here's what the numbers looked like for this cluster (24 nodes before optimization, consolidated to 15 after):

| Metric | Before | After | Savings |
|---|---|---|---|
| Node count | 24 | 15 | 9 nodes |
| Monthly compute spend | ~$35,000 | ~$20,700 | ~$14,300/month |
That's a 41% reduction. Annualized, this client saved $171,600 per year — without changing a single line of application code.
---
## 4 Mistakes Teams Make With Kubernetes Costs
1. Setting resource requests once and never revisiting them. Traffic patterns change. That service that needed 4 cores at launch might need 500m now that you've optimized the hot path. Treat resource requests like any other config — review them quarterly.
2. Running identical infrastructure for staging and production. Your staging environment does not need the same node count, storage capacity, or instance types as production. It needs enough to validate deployments. A 3-node staging cluster with spot instances is usually plenty.
3. Ignoring the cluster autoscaler's defaults. The default settings are conservative — long scale-down delays, low utilization thresholds. For most workloads, you can be significantly more aggressive without impacting reliability.
4. Not setting namespace resource quotas. Without quotas, any team can request unlimited resources. One runaway deployment can trigger the autoscaler to add 10 nodes before anyone notices. Quotas are guardrails, not bureaucracy, and creating one takes a single command, as shown below.
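A sketch of that single command; the limits are illustrative, and the namespace is the dev one from earlier:

```bash
# Cap the total resources the dev-team-alpha namespace can request
kubectl create quota team-alpha-quota \
  --hard=requests.cpu=20,requests.memory=40Gi,pods=50 \
  -n dev-team-alpha
```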
---
## Frequently Asked Questions
Q: Won't right-sizing resource requests hurt application performance?
Not if you do it based on data. The VPA recommendation mode watches actual usage for weeks before suggesting changes. We typically set requests at the P95 usage level plus a 20% buffer. In practice, we've never seen a performance regression from data-driven right-sizing.
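If you run Prometheus, the P95-plus-buffer number can be pulled with a query along these lines. A sketch: the pod regex and PROM_URL are illustrative, and it assumes cAdvisor metrics are being scraped:

```bash
# P95 of the 5-minute CPU rate over 14 days, plus a 20% buffer (result in cores)
QUERY='quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{pod=~"payment-api-.*"}[5m])[14d:5m]) * 1.2'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
```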
Q: How often should we run these commands?
Weekly for active clusters. Set up a cron job that dumps the output to a Slack channel or dashboard. Better yet, deploy a tool like Kubecost (free tier is solid) or Prometheus with custom recording rules that track the request-to-usage ratio over time.
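A minimal version of that cron job, assuming a Slack incoming-webhook URL in SLACK_WEBHOOK_URL and jq installed:

```bash
# Post a weekly node-utilization snapshot to Slack
kubectl top nodes \
  | jq -Rs '{text: ("Weekly node utilization:\n" + .)}' \
  | curl -s -X POST -H 'Content-type: application/json' -d @- "$SLACK_WEBHOOK_URL"
```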
Q: What about managed Kubernetes services — do they add hidden costs?
Yes. EKS charges $0.10/hour per cluster ($73/month). GKE charges a comparable cluster management fee. AKS is free for the control plane but charges for an uptime SLA. These are small relative to compute costs, but they add up if you're running multiple clusters. For a deeper look at cloud pricing traps, check out our analysis of [multi-cloud hidden costs and pitfalls](https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls).
---
## Related Reading

If you found this useful, this post goes deeper on a related topic:

- [Multi-cloud hidden costs and pitfalls](https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls)
---
## Stop Guessing, Start Measuring
Kubernetes cost optimization isn't a one-time project. It's a discipline. The clusters we manage for clients get a monthly cost review — we run these commands (and more), compare against the previous month, and flag any drift.
The five commands in this post take twenty minutes to run. If you find even 10% waste on a $20,000/month cluster, that's $24,000/year back in your budget. For most teams, the actual waste is closer to 30-40%.
Want us to run a Kubernetes cost audit on your clusters? [Book a free 30-minute assessment](https://www.techsaas.cloud/services/) and we'll show you exactly where your money is going — no strings attached, no sales pitch. Just data.
Need help with Kubernetes cost optimization?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.