AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference
Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.
The AI Cost Crisis
Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers. Running a 70B parameter model 24/7 costs $1,500-3,000/month just for compute. For most companies, this is unsustainable.
The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.
Technique 1: Model Quantization
Quantization reduces model weights from 16- or 32-bit floats to 4-bit or 8-bit integers. The model gets 4-8x smaller and runs 2-4x faster.
GGUF Quantization (for Ollama / llama.cpp)
# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB
# In Ollama, quantized models are the default
ollama pull llama3.1:8b # Already Q4_K_M
ollama pull llama3.1:8b-q5_K_M # Slightly better quality
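The size figures above follow from simple arithmetic. A rough sketch (the ~4.85 bits/weight figure for Q4_K_M is an approximation of its mixed quantization scheme, and real GGUF files add metadata overhead):

```python
def model_size_gb(params_billions, bits_per_weight):
    """Rough model size: parameter count x bits per weight, ignoring metadata."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

print(model_size_gb(8, 16))    # FP16:   ~16 GB
print(model_size_gb(8, 4.85))  # Q4_K_M: ~4.85 GB, close to the ~4.9 GB above
```

The same arithmetic tells you whether a quantized model fits in a given GPU's VRAM before you download it.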
AWQ Quantization (for vLLM / production)
# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Calibrate, quantize to 4-bit, and save
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # keep the tokenizer with the weights
Quality impact: Q4_K_M retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.
Technique 2: GPU Sharing with MIG and Time-Slicing
A single GPU can serve multiple models or users simultaneously.
NVIDIA MIG (Multi-Instance GPU)
Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:
# Enable MIG on an A100
sudo nvidia-smi -mig 1
# Create GPU instances (7 x 1g.5gb slices from a 40GB A100; profile 19 = 1g.5gb)
# -C also creates the matching compute instances
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# List instances
nvidia-smi mig -lgi
Each instance is fully isolated — separate memory, separate compute, separate failure domains. Perfect for serving multiple small models on one expensive GPU.
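A process consumes a slice by naming it in `CUDA_VISIBLE_DEVICES` before any CUDA library loads. A minimal sketch; the UUID below is a placeholder, and you would list the real ones with `nvidia-smi -L`:

```python
import os

# Pin this process to a single MIG slice. Must happen before importing
# torch/CUDA libraries, which read this variable at initialization.
# "MIG-..." UUID is a placeholder; get real values from `nvidia-smi -L`.
mig_uuid = "MIG-00000000-0000-0000-0000-000000000000"
os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Each model server then sees exactly one 5GB "GPU" and cannot touch its neighbors' memory.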
Time-Slicing (for consumer GPUs)
For GPUs without MIG (like the GTX 1650 in our TechSaaS server), use NVIDIA's time-slicing:
# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs
Unlike MIG, time-slicing provides no memory or fault isolation, but it lets multiple containers share one GPU for inference.
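A workload then requests a virtual GPU like any other Kubernetes resource; with `replicas: 4`, up to four such pods can land on the same physical card. A sketch, assuming the NVIDIA Kubernetes device plugin is deployed with the config above (names are illustrative):

```yaml
# Hypothetical pod requesting one time-sliced GPU replica
apiVersion: v1
kind: Pod
metadata:
  name: ollama-inference
spec:
  containers:
    - name: ollama
      image: ollama/ollama
      resources:
        limits:
          nvidia.com/gpu: 1  # one of the 4 virtual replicas
```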
Technique 3: Batch Inference with vLLM
Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.
# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-3.1-8B-AWQ",
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)

params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]

outputs = llm.generate(prompts, params)  # Batched automatically
vLLM uses continuous batching and PagedAttention to maximize throughput:
| Setup | Throughput | Latency (p50) |
|---|---|---|
| Naive (one at a time) | 15 tok/s | 200ms |
| vLLM (batched) | 80 tok/s | 250ms |
| vLLM + AWQ quantized | 120 tok/s | 180ms |
The latency barely increases while throughput jumps 5-8x.
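Throughput translates directly into cost per token. A back-of-envelope sketch using the table above and a $3.10/hr on-demand A100 rate (the figure quoted in Technique 6; your cloud pricing will vary):

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Dollars per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

for label, tps in [("naive", 15), ("vLLM", 80), ("vLLM + AWQ", 120)]:
    print(f"{label:>11}: ${cost_per_million_tokens(3.10, tps):.2f} per 1M tokens")
```

At these throughputs, the naive setup costs roughly $57 per million tokens versus about $7 when batched and quantized: an 8x difference on identical hardware.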
Technique 4: Smart Request Routing
Not every query needs the largest model. Route simple queries to smaller, cheaper models:
class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"  # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]
        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)
        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model
In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
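The 50%+ claim checks out with simple arithmetic. A sketch, assuming the small model is 10x cheaper per query (consistent with the comment in the router above) and absorbs 65% of traffic:

```python
def blended_cost(small_fraction, small_cost_ratio=0.1):
    """Cost relative to routing every query to the large model."""
    return small_fraction * small_cost_ratio + (1 - small_fraction)

savings = 1 - blended_cost(0.65)
print(f"{savings:.0%} saved")  # ~58% at a 65% small-model hit rate
```

Even at the low end of the range (60% routed small), the blended cost drops by more than half.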
Technique 5: Caching Repeated Queries
Many AI applications receive the same or similar queries repeatedly. Cache the results:
import hashlib
import json

import redis

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            json.dumps(response)
        )
Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer GPU cycles.
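To exercise the cache-aside pattern without a running Redis, the same get/set interface can be backed by a plain dict. `InMemoryLLMCache` and `cached_generate` are illustrative names for this sketch, not part of any library:

```python
import hashlib
import json

class InMemoryLLMCache:
    """Dict-backed stand-in with the same get/set interface as a Redis cache."""
    def __init__(self):
        self.store = {}

    def _key(self, model, prompt, params):
        raw = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, model, prompt, params):
        return self.store.get(self._key(model, prompt, params))

    def set(self, model, prompt, params, response):
        self.store[self._key(model, prompt, params)] = response

def cached_generate(cache, model, prompt, params, generate_fn):
    """Return a cached response if present; otherwise call the model once."""
    hit = cache.get(model, prompt, params)
    if hit is not None:
        return hit
    response = generate_fn(prompt)
    cache.set(model, prompt, params, response)
    return response
```

The second identical request never touches the model, which is exactly where the 20-40% of GPU cycles are saved.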
Technique 6: Scheduled Inference
If your workload is not real-time, batch process during off-peak hours:
# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100
import schedule
import time

def process_batch_queue():
    """Process all queued inference requests."""
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)

# The schedule library needs a running loop to actually fire jobs
while True:
    schedule.run_pending()
    time.sleep(60)
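The spot-versus-on-demand gap above is worth quantifying. Spot prices vary by region and fluctuate over time; the figures below simply restate the example rates from the comment above:

```python
on_demand = 3.10  # $/hr, A100 on-demand (example figure from above)
spot = 0.90       # $/hr, A100 spot (example figure from above)

savings = 1 - spot / on_demand
print(f"Spot discount: {savings:.0%}")  # ~71%
```

For any workload that tolerates a few hours of delay, that discount applies to every GPU-hour consumed, with the usual caveat that spot instances can be reclaimed mid-job.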
Real Cost Savings Example
A TechSaaS client was spending $4,200/month on AI inference:
| Optimization | Savings |
|---|---|
| Q4 Quantization | -30% model size, same GPU handles more |
| vLLM batching | -40% GPU time |
| Small model routing | -25% (simple queries use 8B not 70B) |
| Response caching | -20% fewer requests |
| Combined | -72% (down to $1,176/month from $4,200) |
These are not theoretical numbers. This is what proper AI cost optimization delivers in practice. Every AI deployment should start with these optimizations from day one.