AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference
Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.
The AI Cost Crisis
Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers. Running a 70B parameter model 24/7 costs $1,500-3,000/month just for compute. For most companies, this is unsustainable.
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 200" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="200" rx="12" fill="#1a1a2e"/><path d="M100,30 L500,30 L460,65 L140,65 Z" fill="#3b82f6" opacity="0.8"/><text x="300" y="53" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">Unoptimized Code — 2000ms</text><path d="M140,70 L460,70 L420,105 L180,105 Z" fill="#6366f1" opacity="0.8"/><text x="300" y="93" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">+ Caching — 800ms</text><path d="M180,110 L420,110 L380,145 L220,145 Z" fill="#a855f7" opacity="0.8"/><text x="300" y="133" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">+ CDN — 200ms</text><path d="M220,150 L380,150 L350,175 L250,175 Z" fill="#2dd4bf" opacity="0.9"/><text x="300" y="168" text-anchor="middle" fill="#1a1a2e" font-size="11" font-family="system-ui" font-weight="bold">Optimized — 50ms</text><text x="530" y="53" text-anchor="start" fill="#94a3b8" font-size="10" font-family="system-ui">Baseline</text><text x="445" y="93" text-anchor="start" fill="#2dd4bf" font-size="10" font-family="system-ui">-60%</text><text x="405" y="133" text-anchor="start" fill="#2dd4bf" font-size="10" font-family="system-ui">-90%</text><text x="365" y="168" text-anchor="start" fill="#2dd4bf" font-size="10" font-family="system-ui" font-weight="bold">-97.5%</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">Performance optimization funnel: each layer of optimization compounds to dramatically reduce response times.</p></div>
The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.
Technique 1: Model Quantization
Quantization reduces model precision from 32-bit floats to 4-bit or 8-bit integers. The model gets 4-8x smaller and 2-4x faster.
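The arithmetic is easy to sanity-check: weights-only size is parameters × bits per weight. A minimal sketch (the 4.85 bits/weight figure for Q4_K_M is an approximation of its effective bit width including quantization metadata; overhead like the KV cache is ignored):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weights-only memory footprint in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Llama 3.1 8B at different precisions (weights only, approximate)
for name, bits in [("FP16", 16), ("INT8", 8), ("Q4_K_M ~", 4.85)]:
    print(f"{name}: {model_size_gb(8e9, bits):.1f} GB")
```

This reproduces the rough numbers quoted below: ~16 GB at FP16 versus ~4.9 GB after 4-bit quantization.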
GGUF Quantization (for Ollama / llama.cpp)
# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB
# In Ollama, quantized models are the default
ollama pull llama3.1:8b # Already Q4_K_M
ollama pull llama3.1:8b-q5_K_M  # Slightly better quality

AWQ Quantization (for vLLM / production)
# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

Quality impact: Q4_K_M retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.
Technique 2: GPU Sharing with MIG and Time-Slicing
A single GPU can serve multiple models or users simultaneously.
NVIDIA MIG (Multi-Instance GPU)
Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:
# Enable MIG on an A100
sudo nvidia-smi -mig 1
# Create GPU instances (7 x 5GB slices from a 40GB A100)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19
# List instances
nvidia-smi mig -lgi

Each instance is fully isolated — separate memory, separate compute, separate failure domains. Perfect for serving multiple small models on one expensive GPU.
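The economics follow directly: when all seven slices are serving traffic, the GPU's hourly rate is divided across them. A quick illustration (the $3/hr figure is an example from the price range quoted earlier, not a quote for any specific provider):

```python
def cost_per_instance(gpu_hourly_usd: float, n_instances: int) -> float:
    """Effective hourly cost of one MIG slice when the GPU is fully packed."""
    return gpu_hourly_usd / n_instances

# Seven 1g.5gb slices on a $3/hr A100 (illustrative price)
print(f"${cost_per_instance(3.0, 7):.2f}/hr per model")  # → $0.43/hr per model
```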
Time-Slicing (for consumer GPUs)
For GPUs without MIG (like the GTX 1650 in our TechSaaS server), use NVIDIA's time-slicing:
# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs

Not truly isolated like MIG, but time-slicing allows multiple containers to share one GPU for inference.
Technique 3: Batch Inference with vLLM
Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.
# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-3.1-8B-AWQ",
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]
outputs = llm.generate(prompts, params)  # Batched automatically

vLLM uses continuous batching and PagedAttention to maximize throughput: latency barely increases while throughput jumps 5-8x.
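vLLM's continuous batching schedules work at the token level, but the core request-level idea is simple to sketch: hold incoming requests briefly, then run them as one batch. A minimal illustration (all names here are hypothetical; `batch_fn` stands in for a call like `llm.generate`):

```python
import threading
import queue
import time

class MicroBatcher:
    """Collects requests and flushes them as one batch (illustrative sketch)."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.05):
        self.batch_fn = batch_fn      # takes a list of prompts, returns a list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, prompt: str):
        """Enqueue one prompt and block until its result is ready."""
        done = threading.Event()
        slot = {"prompt": prompt, "done": done, "result": None}
        self.q.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self.q.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_wait_s
            # Keep collecting until the batch fills or the deadline passes
            while len(batch) < self.max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=timeout))
                except queue.Empty:
                    break
            results = self.batch_fn([s["prompt"] for s in batch])
            for slot, result in zip(batch, results):
                slot["result"] = result
                slot["done"].set()
```

The `max_wait_s` knob is the classic latency/throughput trade: a longer wait fills bigger batches at the cost of a few extra milliseconds per request.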
Technique 4: Smart Request Routing
Not every query needs the largest model. Route simple queries to smaller, cheaper models:
class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"  # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]
        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)
        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model

In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
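That 50%+ claim is easy to sanity-check with the 10x cost ratio from the comment above. Using illustrative per-query prices (hypothetical numbers, chosen only to match the 10x ratio):

```python
def blended_cost(small_fraction: float, small_cost: float, large_cost: float) -> float:
    """Average cost per query when a fraction of traffic goes to the small model."""
    return small_fraction * small_cost + (1 - small_fraction) * large_cost

large_only = blended_cost(0.0, 0.001, 0.01)   # everything on the large model
routed = blended_cost(0.65, 0.001, 0.01)      # 65% routed to the small model
print(f"savings: {1 - routed / large_only:.0%}")
```

At 65% small-model traffic the blended cost drops from $0.01 to about $0.0042 per query, i.e. savings in the high 50s of percent.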
Technique 5: Caching Repeated Queries
Many AI applications receive the same or similar queries repeatedly. Cache the results:
import hashlib
import redis
import json

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            json.dumps(response)
        )

Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer GPU cycles.
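Wiring the cache in front of a model call is a one-function pattern. A minimal sketch, where `generate_fn` is a hypothetical stand-in for whatever client actually hits the model:

```python
def cached_generate(cache, model, prompt, params, generate_fn):
    """Return a cached response when available; otherwise generate and store it."""
    hit = cache.get(model, prompt, params)
    if hit is not None:
        return hit
    response = generate_fn(model, prompt, params)
    cache.set(model, prompt, params, response)
    return response
```

Note this only catches exact repeats; semantic caching (matching similar queries via embeddings) can push hit rates higher, at the cost of occasional stale or mismatched answers.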
Technique 6: Scheduled Inference
If your workload is not real-time, batch process during off-peak hours:
# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100
import schedule
import time

def process_batch_queue():
    """Process all queued inference requests."""
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)

while True:
    schedule.run_pending()
    time.sleep(60)
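Using the spot vs. on-demand figures above, the overnight-batch arithmetic is straightforward (4 GPU-hours per night is an illustrative workload, not a measurement):

```python
def monthly_gpu_cost(hours_per_day: float, hourly_rate: float, days: int = 30) -> float:
    """Monthly compute cost for a fixed daily GPU budget."""
    return hours_per_day * hourly_rate * days

on_demand = monthly_gpu_cost(4, 3.10)  # $3.10/hr on-demand A100
spot = monthly_gpu_cost(4, 0.90)       # $0.90/hr spot
print(f"${on_demand:.0f} vs ${spot:.0f} per month")  # → $372 vs $108 per month
```

The trade-off: spot instances can be reclaimed mid-job, so the batch queue needs to tolerate retries.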
Real Cost Savings Example
A TechSaaS client was spending $4,200/month on AI inference before these optimizations were applied. The results are not theoretical: this is what proper AI cost optimization delivers in practice, and every AI deployment should start with these optimizations from day one.
Need help with AI & machine learning?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.