AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference
Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.
The AI Cost Crisis
Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers, so keeping one running 24/7 (about 720 hours a month) costs roughly $1,500-3,000/month just for compute, and a 70B parameter model at 16-bit precision needs more than one of them. For most companies, this is unsustainable.
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 200" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="200" rx="12" fill="#1a1a2e"/><path d="M100,30 L500,30 L460,65 L140,65 Z" fill="#3b82f6" opacity="0.8"/><text x="300" y="53" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">Unoptimized Code — 2000ms</text><path d="M140,70 L460,70 L420,105 L180,105 Z" fill="#6366f1" opacity="0.8"/><text x="300" y="93" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">+ Caching — 800ms</text><path d="M180,110 L420,110 L380,145 L220,145 Z" fill="#a855f7" opacity="0.8"/><text x="300" y="133" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">+ CDN — 200ms</text><path d="M220,150 L380,150 L350,175 L250,175 Z" fill="#2dd4bf" opacity="0.9"/><text x="300" y="168" text-anchor="middle" fill="#1a1a2e" font-size="11" font-family="system-ui" font-weight="bold">Optimized — 50ms</text><text x="530" y="53" text-anchor="start" fill="#94a3b8" font-size="10" font-family="system-ui">Baseline</text><text x="445" y="93" text-anchor="start" fill="#2dd4bf" font-size="10" font-family="system-ui">-60%</text><text x="405" y="133" text-anchor="start" fill="#2dd4bf" font-size="10" font-family="system-ui">-90%</text><text x="365" y="168" text-anchor="start" fill="#2dd4bf" font-size="10" font-family="system-ui" font-weight="bold">-97.5%</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">Performance optimization funnel: each layer of optimization compounds to dramatically reduce response times.</p></div>
The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.
Technique 1: Model Quantization
Quantization reduces the precision of model weights from 16- or 32-bit floats to 8-bit or 4-bit integers. Relative to FP16 the weights shrink by roughly 2-4x (4-8x relative to FP32), and inference is often 2-4x faster.
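As a rough check on those ratios, the size of the weights can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, the 8B parameter count is rounded, and the estimate covers weights only (no KV cache or activations).

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # Llama 3.1 8B, rounded (assumption for illustration)

for label, bits in (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Real quantized files come out somewhat larger than the pure 4-bit estimate, because the formats also store per-group scales and keep some tensors at higher precision.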
GGUF Quantization (for Ollama / llama.cpp)
# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB
# In Ollama, quantized models are the default
ollama pull llama3.1:8b          # Already Q4_K_M
ollama pull llama3.1:8b-q5_K_M   # Slightly better quality
AWQ Quantization (for vLLM / production)
# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights, quantized in groups of 128 with a per-group zero point
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
# Save the tokenizer next to the weights so quant_path can be loaded on its own
tokenizer.save_pretrained(quant_path)
Quality impact: a 4-bit quantization such as Q4_K_M typically retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.
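To make the `w_bit`, `q_group_size`, and `zero_point` settings concrete, here is a minimal pure-Python sketch of group-wise asymmetric quantization. It is plain round-to-nearest quantization, not AWQ itself (AWQ additionally rescales weight channels using activation statistics before quantizing), and the function names are hypothetical. The per-group offset `lo` plays the role of the zero point.

```python
import math


def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to integers in [0, 2**w_bit - 1]."""
    qmax = (1 << w_bit) - 1                        # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0   # avoid division by zero
    q = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo


def quantize(weights, group_size=128, w_bit=4):
    """Split the weights into groups, each with its own scale and offset."""
    return [
        quantize_group(weights[i:i + group_size], w_bit)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate floats from the quantized groups."""
    restored = []
    for q, scale, lo in groups:
        restored.extend(lo + v * scale for v in q)
    return restored


# Illustrative weights only; real layers hold millions of values.
weights = [math.sin(i) for i in range(256)]
groups = quantize(weights, group_size=128, w_bit=4)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(groups)}  max reconstruction error={max_error:.4f}")
```

Smaller groups mean more scales to store but a tighter range per group, which is the trade-off behind choosing a group size such as 128.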
Technique 2: GPU Sharing with MIG and Time-Slicing
A single GPU can serve multiple models or users simultaneously.
NVIDIA MIG (Multi-Instance GPU)
Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:
# Enable MIG on an A100
sudo nvidia-smi -mig 1
# Create GPU instances (7 x 5GB slices from a 40GB A100; profile 19 is 1g.5gb)
# -C also creates the matching compute instances so the slices are usable
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# List instances
nvidia-smi mig -lgi
Each instance is fully isolated: separate memory, separate compute, separate failure domains. This suits serving multiple small models on one expensive GPU.
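Once the instances exist, a process can be pinned to one of them by putting that instance's UUID in `CUDA_VISIBLE_DEVICES`. The sketch below parses MIG UUIDs out of `nvidia-smi -L` output. The helper names and the sample listing are illustrative assumptions, and the exact line layout can differ between driver versions, so only the `(UUID: MIG-...)` part is matched.

```python
import re

# Matches the "(UUID: MIG-...)" part of a MIG device line in `nvidia-smi -L` output.
MIG_UUID_RE = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def parse_mig_uuids(listing: str) -> list[str]:
    """Return every MIG device UUID found in the listing, in order."""
    return MIG_UUID_RE.findall(listing)


def env_for_instance(uuid: str) -> dict[str, str]:
    """Environment override that restricts a CUDA process to one MIG instance."""
    return {"CUDA_VISIBLE_DEVICES": uuid}


# Illustrative sample, not captured from a real machine.
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000000)
  MIG 1g.5gb      Device  0: (UUID: MIG-bbbbbbbb-1111-1111-1111-111111111111)
  MIG 1g.5gb      Device  1: (UUID: MIG-cccccccc-2222-2222-2222-222222222222)
"""

for uuid in parse_mig_uuids(SAMPLE):
    print(env_for_instance(uuid))
```

On a real host the listing would come from `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True).stdout`, and each model server would be started with its own environment override.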
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 200" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="200" rx="12" fill="#1a1a2e"/><text x="80" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Input</text><circle cx="80" cy="50" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><circle cx="80" cy="100" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><circle cx="80" cy="150" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><text x="230" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Hidden</text><circle cx="230" cy="45" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="85" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="125" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="165" r="14" fill="#6366f1" opacity="0.8"/><text x="380" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Hidden</text><circle cx="380" cy="55" r="14" fill="#a855f7" opacity="0.8"/><circle cx="380" cy="100" r="14" fill="#a855f7" opacity="0.8"/><circle cx="380" cy="145" r="14" fill="#a855f7" opacity="0.8"/><text x="520" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Output</text><circle cx="520" cy="80" r="14" fill="none" stroke="#2dd4bf" stroke-width="2"/><circle cx="520" cy="130" r="14" fill="none" stroke="#2dd4bf" stroke-width="2"/><line x1="94" y1="50" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="85" 
stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="55" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="55" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="100" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" 
y1="100" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="145" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="145" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">Neural network architecture: data flows through input, hidden, and output layers.</p></div>
Time-Slicing (for consumer GPUs)
For GPUs without MIG (like the GTX 1650 in our TechSaaS server), use NVIDIA's time-slicing:
# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs
Not truly isolated like MIG: the replicas share the GPU's memory and fault domain. It still allows multiple containers to share one GPU for inference.
Technique 3: Batch Inference with vLLM
Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.
# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-3.1-8B-AWQ",  # replace with the repo ID or local path of your AWQ model
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]
outputs = llm.generate(prompts, params)  # Batched automatically
vLLM uses continuous batching and PagedAttention to maximize throughput: latency barely increases while throughput jumps 5-8x.
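One way to see why batching pays off is a toy cost model: each decode step has a fixed cost (largely reading the weights from GPU memory) plus a small cost per sequence in the batch. The sketch below uses hypothetical timings chosen only for illustration. It is not a measurement of vLLM or of any particular GPU.

```python
def tokens_per_second(batch_size: int, fixed_ms: float, per_seq_ms: float) -> float:
    """Tokens generated per second when one decode step serves `batch_size` sequences."""
    step_ms = fixed_ms + per_seq_ms * batch_size
    return batch_size * 1000.0 / step_ms


# Hypothetical timings (assumptions, not benchmarks).
FIXED_MS = 20.0    # cost of a decode step regardless of batch size
PER_SEQ_MS = 0.5   # extra cost per sequence in the batch

for batch in (1, 8, 32):
    step = FIXED_MS + PER_SEQ_MS * batch
    rate = tokens_per_second(batch, FIXED_MS, PER_SEQ_MS)
    print(f"batch={batch:>2}  step={step:.1f} ms  throughput={rate:.0f} tok/s")
```

Under these assumptions the per-step latency grows slowly while throughput grows almost linearly with batch size, until the per-sequence term starts to dominate.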
Technique 4: Smart Request Routing
Not every query needs the largest model. Route simple queries to smaller, cheaper models:
class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"  # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]
        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)
        # Ties (including no matches at all) fall through to the large model
        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model
In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
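The 50%+ figure follows from simple arithmetic on the numbers above. The sketch below computes the blended cost, assuming the small model costs one tenth of the large one per query and that simple and complex queries are otherwise equally expensive. `routing_savings` is a hypothetical helper.

```python
def routing_savings(small_share: float, small_cost_ratio: float) -> float:
    """Fraction of spend saved versus sending every query to the large model.

    small_share: fraction of queries routed to the small model (0..1)
    small_cost_ratio: small-model cost per query divided by large-model cost
    """
    blended = small_share * small_cost_ratio + (1.0 - small_share)
    return 1.0 - blended


for share in (0.60, 0.65, 0.70):
    print(f"{share:.0%} routed to small model: {routing_savings(share, 0.1):.1%} saved")
```

With 60-70% of traffic on the small model, the saving lands between about 54% and 63%. Misrouted queries that have to be retried on the large model would reduce it.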
Technique 5: Caching Repeated Queries
Many AI applications receive the same or similar queries repeatedly. Cache the results:
import hashlib
import json

import redis

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        # Key on everything that affects the output: model, prompt, sampling params
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached is not None:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(f"llm:{key}", self.ttl, json.dumps(response))
Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer requests reaching the GPU.
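The cache only saves money if every inference call goes through it. A minimal get-or-compute wrapper might look like the sketch below. `InMemoryCache` is a dict-backed stand-in with the same `get`/`set` shape as the Redis class, so the example runs without a Redis server. Both it and `cached_generate` are hypothetical names, and `generate_fn` stands for whatever function actually calls the model.

```python
import hashlib
import json


class InMemoryCache:
    """Dict-backed stand-in with the same get/set shape as the Redis cache."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        raw = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, model: str, prompt: str, params: dict):
        hit = self._store.get(self._key(model, prompt, params))
        return None if hit is None else json.loads(hit)

    def set(self, model: str, prompt: str, params: dict, response) -> None:
        self._store[self._key(model, prompt, params)] = json.dumps(response)


def cached_generate(cache, model, prompt, params, generate_fn):
    """Return (response, was_cache_hit); call generate_fn only on a miss."""
    cached = cache.get(model, prompt, params)
    if cached is not None:
        return cached, True
    response = generate_fn(model, prompt, params)
    cache.set(model, prompt, params, response)
    return response, False
```

Because the sampling parameters are part of the key, the same prompt at a different temperature is treated as a separate entry.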
Technique 6: Scheduled Inference
If your workload is not real-time, batch process during off-peak hours:
# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100
import schedule
import time

def process_batch_queue():
    """Process all queued inference requests."""
    # db and vllm_batch_generate are placeholders for your own queue and inference code
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)

# schedule only fires jobs when run_pending() is called, so keep a loop running
while True:
    schedule.run_pending()
    time.sleep(60)
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 180" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="180" rx="12" fill="#1a1a2e"/><rect x="30" y="60" width="80" height="50" rx="25" fill="#3b82f6" opacity="0.85"/><text x="70" y="90" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">Prompt</text><rect x="145" y="50" width="90" height="70" rx="8" fill="#6366f1" opacity="0.85"/><text x="190" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Embed</text><text x="190" y="95" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">[0.2, 0.8...]</text><rect x="270" y="50" width="90" height="70" rx="8" fill="#a855f7" opacity="0.85"/><text x="315" y="75" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Vector</text><text x="315" y="90" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Search</text><text x="315" y="105" text-anchor="middle" fill="#ffffff" font-size="9" font-family="system-ui" opacity="0.7">top-k=5</text><rect x="395" y="50" width="90" height="70" rx="8" fill="#2dd4bf" opacity="0.85"/><text x="440" y="80" text-anchor="middle" fill="#1a1a2e" font-size="11" font-family="system-ui" font-weight="bold">LLM</text><text x="440" y="95" text-anchor="middle" fill="#1a1a2e" font-size="9" font-family="system-ui">+ context</text><rect x="520" y="60" width="55" height="50" rx="25" fill="#f59e0b" opacity="0.85"/><text x="547" y="90" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Reply</text><defs><marker id="arrow4" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#e2e8f0"/></marker></defs><line x1="112" y1="85" x2="143" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="237" y1="85" x2="268" 
y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="362" y1="85" x2="393" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="487" y1="85" x2="518" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><text x="300" y="155" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Retrieval-Augmented Generation (RAG) Flow</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.</p></div>
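For the scheduled-inference technique, the effect of the spot discount on a batch job can be estimated directly from the two hourly rates quoted in the snippet. The request volume and throughput below are hypothetical placeholders, and `batch_job_cost` is an illustrative helper that ignores spot interruptions and startup time.

```python
def batch_job_cost(n_requests: int, requests_per_hour: float, hourly_rate: float) -> float:
    """GPU cost in dollars of draining a queue at a given throughput."""
    hours = n_requests / requests_per_hour
    return hours * hourly_rate


SPOT_RATE = 0.90            # $/hr, spot rate quoted in the article
ON_DEMAND_RATE = 3.10       # $/hr, on-demand rate quoted in the article
N_REQUESTS = 100_000        # hypothetical nightly queue size
REQUESTS_PER_HOUR = 10_000  # hypothetical batch throughput

spot = batch_job_cost(N_REQUESTS, REQUESTS_PER_HOUR, SPOT_RATE)
on_demand = batch_job_cost(N_REQUESTS, REQUESTS_PER_HOUR, ON_DEMAND_RATE)
savings = 1.0 - spot / on_demand
print(f"spot=${spot:.2f}  on-demand=${on_demand:.2f}  saved={savings:.0%}")
```

The percentage saved depends only on the ratio of the two rates (about 71% here), while the queue size and throughput set the absolute dollar amounts.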
Real Cost Savings Example
A TechSaaS client was spending $4,200/month on AI inference before these optimizations were applied.
These are not theoretical numbers. This is what proper AI cost optimization delivers in practice. Every AI deployment should start with these optimizations from day one.