Yash Pritwani · 8 min read

# Kubernetes Cost Optimization: 5 kubectl Commands That Reveal Your Cluster Waste

## The $14,000/Month Discovery

Last quarter, a Series B fintech company asked us to look at their AWS bill. They were running a 24-node Kubernetes cluster on EKS, spending roughly $35,000 a month on compute alone. Their CTO's exact words: "We know we're overspending, but we don't know where."

We sat down, opened a terminal, and ran five kubectl commands. Took about twenty minutes. What we found was $14,200 per month in pure waste — pods requesting resources they never used, dev namespaces running around the clock, storage volumes allocated and forgotten, and nodes sitting at under 25% utilization.

No fancy tooling. No $50,000 FinOps platform license. Just five commands and the knowledge of what to look for.

Here's exactly what we ran and what it revealed.

---

## Command 1: Find Pods Requesting Way More Than They Use

This is where the money hides. Kubernetes resource requests are promises — when a pod says it needs 4 CPU cores, the scheduler reserves 4 cores on a node, whether the pod actually uses them or not. Those reserved-but-unused resources are invisible unless you look.

The command:

kubectl top pods --all-namespaces --sort-by=cpu

Sample output:

NAMESPACE     NAME                              CPU(cores)   MEMORY(bytes)
production    payment-api-7d4f8b6c9-x2k4l       312m         486Mi
production    order-service-5c8d9e7f1-m9n3p      289m         512Mi
production    analytics-worker-6b3a2c8d4-q7w1r   47m          1.2Gi
staging       frontend-dev-8f2e1a5b7-j4k6h       12m          128Mi
default       test-deployment-9c7d3e6f2-p8r5t     3m           64Mi

Now compare that to what those pods are actually requesting:

kubectl get pods --all-namespaces -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory"

Sample output:

NAMESPACE     NAME                              CPU_REQ   MEM_REQ
production    payment-api-7d4f8b6c9-x2k4l       4         8Gi
production    order-service-5c8d9e7f1-m9n3p      2         4Gi
production    analytics-worker-6b3a2c8d4-q7w1r   2         4Gi
staging       frontend-dev-8f2e1a5b7-j4k6h       1         2Gi
default       test-deployment-9c7d3e6f2-p8r5t     500m      1Gi

See the gap? The payment API requested 4 CPU cores but was using 312 millicores — that's 7.8% utilization. The analytics worker asked for 2 cores and was using 47 millicores. On a per-pod basis, that might not sound like much. But multiply it across 180 pods, and those phantom reservations were locking up 14 nodes' worth of capacity that nobody was using.
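Eyeballing two outputs side by side gets old fast. Here's a minimal sketch that lines up live usage and requests in one pass — it assumes metrics-server is installed (it feeds `kubectl top`) and only reads the first container's request, so treat it as a triage aid rather than a report:

```bash
# Print live CPU usage next to the first container's CPU request, pod by pod.
# Slow on large clusters (one API call per pod) -- good enough for a quick audit.
kubectl top pods --all-namespaces --no-headers | while read -r ns name cpu mem; do
  req=$(kubectl get pod "$name" -n "$ns" \
        -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
  echo "$ns/$name  using=$cpu  requested=${req:-<none>}"
done
```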

Dollar impact: In this client's case, over-provisioned resource requests accounted for roughly $6,800/month in wasted node capacity. Those nodes were running, being billed by AWS, and doing almost nothing.

---

## Command 2: Find Unscheduled or Pending Pods Eating Reservations

Pods stuck in Pending state are a silent budget killer. They often trigger the cluster autoscaler to spin up new nodes — nodes that then sit idle because the pod can't actually schedule (maybe due to affinity rules, taints, or impossible resource requests).

The command:

kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o wide

Sample output:

NAMESPACE    NAME                                   READY   STATUS    AGE    NODE
ml-jobs      training-pipeline-large-2f8a1b-k4m7q   0/1     Pending   12d    <none>
analytics    spark-driver-batch-9c3d7e-p2n5r         0/1     Pending   8d     <none>
staging      load-test-runner-7b4e2a-x8j3l           0/1     Pending   23d    <none>

That load-test-runner had been pending for 23 days. Nobody noticed. But the cluster autoscaler kept trying to provision a node large enough to fit its request of 16 CPU cores and 64Gi of memory. It would spin up an m5.4xlarge, fail to schedule (the pod had a node affinity rule for a label that didn't exist), then the node would sit idle for the scale-down cooldown period before being terminated. Then the cycle repeated.
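To find out why any one of these is stuck, ask the scheduler — its events name the reason. The pod and namespace below come from the sample output above; substitute your own:

```bash
# The Events section at the bottom lists FailedScheduling reasons
# (unsatisfied affinity, taints, insufficient cpu/memory, and so on).
kubectl describe pod load-test-runner-7b4e2a-x8j3l -n staging

# Or pull scheduling failures for a whole namespace in one shot:
kubectl get events -n staging --field-selector reason=FailedScheduling
```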

Dollar impact: This single zombie pod cost $340/month in autoscaler churn — nodes spinning up and down for nothing. Across three similar stuck pods, the total was $1,100/month.

---

## Command 3: Identify Idle Namespaces

This one is embarrassing in hindsight, but it's incredibly common. Dev teams spin up namespaces for feature branches, load tests, demos, and sprint reviews. They rarely clean up after themselves.

The command:

kubectl get pods --all-namespaces --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn

Sample output:

   47 production
   31 kube-system
   22 monitoring
   18 staging
   14 dev-team-alpha
   11 dev-team-beta
    9 qa-regression
    8 demo-client-acme
    6 sprint-42-review
    4 load-test-march
    2 sandbox-intern

Now check which of those namespaces have actually received traffic in the last 30 days. sprint-42-review was from two months ago. load-test-march was from March. demo-client-acme was for a sales demo that happened six weeks prior. sandbox-intern was from an intern who left the company.
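There's no single command that proves a namespace is idle, but the age of its newest pod is a decent first proxy. A sketch — creation time isn't traffic, so cross-check with your ingress or monitoring data before deleting anything:

```bash
# Print each namespace with the creation time of its most recently created pod.
# "<no pods>" means the namespace is empty.
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  newest=$(kubectl get pods -n "$ns" --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{.items[-1:].metadata.creationTimestamp}' 2>/dev/null)
  printf '%-25s newest pod: %s\n' "$ns" "${newest:-<no pods>}"
done
```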

Those four namespaces were running a combined 20 pods across 6 nodes, none of which served any purpose.

Dollar impact: $2,400/month. Just sitting there. Doing absolutely nothing.

Bonus — get a quick resource summary per namespace:

kubectl get resourcequotas --all-namespaces -o custom-columns="NAMESPACE:.metadata.namespace,CPU_REQ:.status.used.requests\.cpu,MEM_REQ:.status.used.requests\.memory"

(Note: this only reports namespaces that actually have a ResourceQuota object defined.)

---

## Command 4: Spot Oversized PVCs (Persistent Volume Claims)

Storage is the cost everyone forgets about. Teams request 100Gi "just in case" and then use 3Gi. Cloud providers charge for allocated storage, not used storage.

The command:

kubectl get pvc --all-namespaces -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,CAPACITY:.status.capacity.storage,STORAGECLASS:.spec.storageClassName"

Sample output:

NAMESPACE     NAME                    CAPACITY   STORAGECLASS
production    postgres-data           500Gi      gp3
production    redis-backup            200Gi      gp3
analytics     clickhouse-data         1Ti        gp3
staging       postgres-staging        500Gi      gp3
dev-team-alpha  mongo-dev             100Gi      gp2

That staging Postgres had a 500Gi volume — identical to production. But staging held 8Gi of data. The ClickHouse volume was 1Ti, but actual usage was 140Gi. And the dev MongoDB was on the more expensive gp2 storage class.

To check actual usage, exec into pods or use monitoring:

kubectl exec -n production postgres-0 -- df -h /var/lib/postgresql/data
Filesystem      Size  Used  Avail Use%
/dev/nvme1n1    500G   89G   411G  18%

Production Postgres was at 18% utilization. Staging was at 1.6%.

Dollar impact: Oversized PVCs were costing $1,900/month. The staging volume alone — which could have been 50Gi — was wasting $36/month on storage that held a test dataset you could recreate in five minutes.
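Oversizing isn't the only storage leak — the intro mentioned volumes allocated and forgotten. This sketch lists PVCs that no running pod mounts (it assumes `jq` is installed; a claim can still be legitimately unused, for example by a scaled-down StatefulSet, so review before deleting):

```bash
# All PVCs in the cluster
kubectl get pvc -A -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"' | sort > /tmp/pvc-all

# PVCs actually referenced by pods
kubectl get pods -A -o json \
  | jq -r '.items[] | .metadata.namespace as $ns
           | .spec.volumes[]? | select(.persistentVolumeClaim)
           | "\($ns)/\(.persistentVolumeClaim.claimName)"' | sort -u > /tmp/pvc-used

# Claims with no pod attached
comm -23 /tmp/pvc-all /tmp/pvc-used
```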

---

## Command 5: Node Utilization Audit

This is the big picture command. It shows you whether your nodes are actually earning their keep.

The command:

kubectl describe nodes | grep -A 5 "Allocated resources"

Sample output (per node):

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1200m (15%)  4000m (50%)
  memory             2.1Gi (13%) 8Gi (50%)

When we ran this across all 24 nodes, the average CPU request utilization was 23%. That means 77% of the CPU capacity the client was paying for wasn't even requested by any workload — it simply sat idle. Twelve of the 24 nodes were below 20% request utilization.
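If you'd rather not read 24 blocks of describe output to get that average, a quick-and-dirty aggregate over the same text works — it parses the exact format shown above, so treat it as a sketch rather than a reporting tool:

```bash
# Average CPU request utilization across all nodes, parsed from `kubectl describe nodes`
kubectl describe nodes \
  | grep -E '^ *cpu ' \
  | awk '{gsub(/[(%)]/, "", $3); sum += $3; n++} END {printf "avg CPU requested: %.0f%% of allocatable\n", sum/n}'
```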

A more structured view:

kubectl top nodes
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-1-47.ec2.internal     890m         11%    3.2Gi           20%
ip-10-0-1-98.ec2.internal     1240m        15%    4.1Gi           25%
ip-10-0-2-33.ec2.internal     2100m        26%    6.8Gi           42%
ip-10-0-2-71.ec2.internal     340m          4%    1.1Gi            7%
ip-10-0-3-15.ec2.internal     780m          9%    2.9Gi           18%

Node ip-10-0-2-71 was at 4% CPU utilization. It was an m5.2xlarge costing $0.384/hour — roughly $280/month to run at 4% load.

Dollar impact: By right-sizing the cluster from 24 nodes to 15 nodes (and switching to a mix of instance types), the projected savings were $7,200/month on compute alone.

---

## The Optimization Playbook

Finding the waste is step one. Here's the playbook we implemented:

Vertical Pod Autoscaler (VPA): We deployed VPA in recommendation mode first. It analyzed actual resource usage over two weeks and suggested right-sized requests. The payment API went from 4 CPU / 8Gi to 500m CPU / 1Gi — and performance didn't change at all.
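For reference, a recommendation-only VPA looks roughly like this — it assumes the VPA CRDs and controllers are already installed, and the Deployment name is purely illustrative:

```bash
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods
EOF

# Read the recommendations once it has observed real traffic
kubectl describe vpa payment-api-vpa -n production
```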

Cluster Autoscaler tuning: We adjusted the scale-down delay from the default 10 minutes to 5 minutes and set the utilization threshold to 50%. This eliminated the zombie node problem.
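The relevant cluster-autoscaler flags, with values mirroring the tuning described above (the flag names are real; whether 5 minutes and 50% suit your workloads is something to validate yourself):

```bash
# Passed as container args to the cluster-autoscaler deployment
--scale-down-unneeded-time=5m           # how long a node must be underused before removal
--scale-down-delay-after-add=5m         # cooldown after a scale-up before scale-down resumes
--scale-down-utilization-threshold=0.5  # node is a scale-down candidate below 50% requested
```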

Spot instances for non-production: All staging, dev, and QA workloads moved to spot instances, saving 60-70% on those nodes.

Namespace quotas and lifecycle policies: Every non-production namespace now has a TTL. After 7 days of inactivity, an automated job scales it to zero. After 30 days, it gets deleted.

Scheduled scaling: Dev and staging namespaces scale to zero at 8 PM and back up at 8 AM, Monday through Friday. Weekends are off entirely.
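The mechanics can be as simple as two scheduled commands per environment, run from cron or a Kubernetes CronJob with RBAC that allows scaling deployments. The namespace and replica counts here are illustrative — restoring the original counts needs a record of them, for example an annotation written before scale-down:

```bash
# 8 PM: park the environment
kubectl scale deployment --all --replicas=0 -n dev-team-alpha

# 8 AM: bring it back (assumes a uniform replica count; real setups
# usually restore per-deployment counts saved earlier)
kubectl scale deployment --all --replicas=2 -n dev-team-alpha
```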

---

## Cost Comparison: Before and After

Here's what the numbers looked like for this production cluster (24 nodes before optimization, consolidated to 15 after):

| Category | Before (Monthly) | After (Monthly) | Savings |
|---|---|---|---|
| Compute (EC2 nodes) | $26,400 | $14,800 | $11,600 |
| Storage (EBS volumes) | $4,200 | $2,100 | $2,100 |
| Data transfer | $2,800 | $2,600 | $200 |
| Load balancers | $1,600 | $1,200 | $400 |
| Total | $35,000 | $20,700 | $14,300 |

That's a 41% reduction. Annualized, this client saved $171,600 per year — without changing a single line of application code.

---

## 4 Mistakes Teams Make With Kubernetes Costs

1. Setting resource requests once and never revisiting them. Traffic patterns change. That service that needed 4 cores at launch might need 500m now that you've optimized the hot path. Treat resource requests like any other config — review them quarterly.

2. Running identical infrastructure for staging and production. Your staging environment does not need the same node count, storage capacity, or instance types as production. It needs enough to validate deployments. A 3-node staging cluster with spot instances is usually plenty.

3. Ignoring the cluster autoscaler's defaults. The default settings are conservative — long scale-down delays, low utilization thresholds. For most workloads, you can be significantly more aggressive without impacting reliability.

4. Not setting namespace resource quotas. Without quotas, any team can request unlimited resources. One runaway deployment can trigger the autoscaler to add 10 nodes before anyone notices. Quotas are guardrails, not bureaucracy.
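A starting-point quota for a team namespace looks like this — the limits are placeholders, so size them to the team's actual usage rather than to production:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: dev-team-alpha
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    persistentvolumeclaims: "10"
EOF
```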

---

## Frequently Asked Questions

Q: Won't right-sizing resource requests hurt application performance?

Not if you do it based on data. The VPA recommendation mode watches actual usage for weeks before suggesting changes. We typically set requests at the P95 usage level plus a 20% buffer. In practice, we've never seen a performance regression from data-driven right-sizing.

Q: How often should we run these commands?

Weekly for active clusters. Set up a cron job that dumps the output to a Slack channel or dashboard. Better yet, deploy a tool like Kubecost (free tier is solid) or Prometheus with custom recording rules that track the request-to-usage ratio over time.
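A bare-bones version of that habit is a crontab entry that snapshots usage into dated files — the path and schedule are placeholders, the machine needs a kubeconfig with read access, and posting to Slack needs a webhook of your own:

```bash
# Every Monday at 09:00: snapshot node usage for the week's cost review.
# (% is special in crontab and must be escaped as \%)
0 9 * * 1  kubectl top nodes > /var/log/k8s-cost/nodes-$(date +\%F).txt 2>&1
```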

Q: What about managed Kubernetes services — do they add hidden costs?

Yes. EKS charges $0.10/hour per cluster ($73/month). GKE has a management fee for Autopilot. AKS is free for the control plane but charges for uptime SLA. These are small relative to compute costs, but they add up if you're running multiple clusters. For a deeper look at cloud pricing traps, check out our analysis of [multi-cloud hidden costs and pitfalls](https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls).

---

## Related Reading

If you found this useful, these posts go deeper on related topics:

- [Multi-Cloud Hidden Costs and Pitfalls](https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls) — The pricing traps that don't show up until your bill arrives. Egress fees, cross-region charges, and the true cost of multi-cloud.
- [Build vs Buy: A Framework for Engineering Leaders](https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders) — When to build your own cost tooling vs adopting a platform. Includes a decision matrix we use with clients.
- [CI/CD Pipeline Optimization: From 20 Minutes to 3 Minutes](https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min) — Faster pipelines mean fewer long-running build pods, which means lower cluster costs. The connection between CI/CD speed and infrastructure spend is real.

---

## Stop Guessing, Start Measuring

Kubernetes cost optimization isn't a one-time project. It's a discipline. The clusters we manage for clients get a monthly cost review — we run these commands (and more), compare against the previous month, and flag any drift.

The five commands in this post take twenty minutes to run. If you find even 10% waste on a $20,000/month cluster, that's $24,000/year back in your budget. For most teams, the actual waste is closer to 30-40%.

Want us to run a Kubernetes cost audit on your clusters? [Book a free 30-minute assessment](https://www.techsaas.cloud/services/) and we'll show you exactly where your money is going — no strings attached, no sales pitch. Just data.
