
AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference

Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.

Yash Pritwani
13 min read

The AI Cost Crisis

Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers. Running a 70B parameter model 24/7 costs $1,500-3,000/month just for compute. For most companies, this is unsustainable.
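That monthly figure is just rate times hours. A quick back-of-envelope sketch, assuming roughly 720 hours per month and the on-demand rates above:

```python
# Back-of-envelope monthly compute cost at on-demand cloud rates
HOURS_PER_MONTH = 24 * 30  # ~720

def monthly_cost(hourly_rate: float, gpus: int = 1) -> float:
    """On-demand cost of running `gpus` GPUs 24/7 for a month."""
    return hourly_rate * gpus * HOURS_PER_MONTH

# A single A100 at $2-4/hour:
low, high = monthly_cost(2.0), monthly_cost(4.0)
print(f"${low:,.0f} - ${high:,.0f} per month")  # $1,440 - $2,880 per month
```

Add a second GPU for a larger model and you are past $5,000/month before serving a single extra user, which is why the techniques below matter.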


The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.

Technique 1: Model Quantization

Quantization reduces the precision of model weights from 16- or 32-bit floating point to 8-bit or 4-bit integers. The model becomes 4-8x smaller and 2-4x faster.

GGUF Quantization (for Ollama / llama.cpp)

# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB

# In Ollama, quantized models are the default
ollama pull llama3.1:8b                  # Default tag is already Q4_K_M
ollama pull llama3.1:8b-instruct-q5_K_M  # Slightly better quality

AWQ Quantization (for vLLM / production)

# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # serving engines expect the tokenizer alongside the weights

Quality impact: Q4_K_M retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.
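The memory savings follow directly from bits per weight. A rough sizing sketch (the ~4.5 bits/weight average for Q4_K_M and the 10% overhead factor are approximations, not exact format specs):

```python
def model_size_gb(n_params: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Approximate weight size in GB; `overhead` loosely covers
    embeddings, quantization scales, and format metadata."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

fp16 = model_size_gb(8.0e9, 16)   # ~17.6 GB
q4   = model_size_gb(8.0e9, 4.5)  # Q4_K_M averages ~4.5 bits/weight -> ~5 GB
```

Those estimates line up with the observed 16 GB FP16 vs ~4.9 GB Q4_K_M figures above: a model that needed a 24GB card now fits comfortably on an 8GB one.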

Technique 2: GPU Sharing with MIG and Time-Slicing

A single GPU can serve multiple models or users simultaneously.

NVIDIA MIG (Multi-Instance GPU)

Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:

# Enable MIG on an A100
sudo nvidia-smi -mig 1

# Create GPU instances (7 x 5GB slices from a 40GB A100, profile 19);
# -C also creates the matching compute instances
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# List instances
nvidia-smi mig -lgi

Each instance is fully isolated — separate memory, separate compute, separate failure domains. Perfect for serving multiple small models on one expensive GPU.
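In Kubernetes, each slice then appears as its own schedulable resource. A sketch of a pod requesting one 5GB slice, assuming the NVIDIA device plugin is installed with the `mixed` MIG strategy (which exposes per-profile resource names like the one below):

```yaml
# Pod requesting a single 1g.5gb MIG slice instead of a whole A100
apiVersion: v1
kind: Pod
metadata:
  name: small-model-server
spec:
  containers:
    - name: inference
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```

Seven such pods can share one A100 without ever contending for each other's memory.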


Time-Slicing (for consumer GPUs)

For GPUs without MIG support (like the GTX 1650 in our TechSaaS server), use the time-slicing feature of NVIDIA's Kubernetes device plugin:

# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs

Not truly isolated like MIG, but allows multiple containers to share one GPU for inference.

Technique 3: Batch Inference with vLLM

Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.

# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="llama-3.1-8b-awq",  # the local AWQ checkpoint saved in Technique 1
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)

params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]

outputs = llm.generate(prompts, params)  # Batched automatically

vLLM uses continuous batching and PagedAttention to maximize throughput:

| Setup | Throughput | Latency (p50) |
|-------|------------|---------------|
| Naive (one request at a time) | 15 tok/s | 200ms |
| vLLM (batched) | 80 tok/s | 250ms |
| vLLM + AWQ quantization | 120 tok/s | 180ms |

The latency barely increases while throughput jumps 5-8x.
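Throughput translates directly into cost per token. A rough sketch, using the $3.10/hr on-demand A100 rate quoted in the scheduled-inference section and treating the table's numbers as sustained throughput:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at a given GPU rate and throughput."""
    return hourly_rate / (tokens_per_sec * 3600) * 1_000_000

naive   = cost_per_million_tokens(3.10, 15)   # ~$57 per 1M tokens
batched = cost_per_million_tokens(3.10, 120)  # ~$7 per 1M tokens
```

The same GPU, the same model, an 8x drop in cost per token: batching is usually the single highest-leverage change.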

Technique 4: Smart Request Routing

Not every query needs the largest model. Route simple queries to smaller, cheaper models:

class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"   # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]

        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)

        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model

In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
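The arithmetic behind that claim, assuming the "10x cheaper" ratio from the router code and 65% of traffic routed small:

```python
def blended_cost(frac_small: float, small_cost: float, large_cost: float) -> float:
    """Average per-query cost when `frac_small` of traffic hits the small model."""
    return frac_small * small_cost + (1 - frac_small) * large_cost

large = 1.0          # normalize the large-model cost to 1
small = large / 10   # "10x cheaper", per the routing code above
savings = 1 - blended_cost(0.65, small, large) / large
print(f"{savings:.1%}")  # 58.5%
```

Even a crude keyword router captures most of the benefit; a small classifier model can push the routed fraction higher still.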

Technique 5: Caching Repeated Queries

Many AI applications receive the same or similar queries repeatedly. Cache the results:

import hashlib
import redis
import json

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            json.dumps(response)
        )

Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer GPU cycles.
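Wiring the cache in is a standard cache-aside pattern. A hypothetical wrapper (the `generate_fn` callable stands in for whatever client actually invokes the model; it is not part of any library API):

```python
def cached_generate(cache, model: str, prompt: str, params: dict, generate_fn):
    """Cache-aside wrapper around LLMCache: return a cached response
    if present, otherwise call the model and store the result."""
    hit = cache.get(model, prompt, params)
    if hit is not None:
        return hit  # served for free, no GPU involved
    response = generate_fn(model, prompt, params)
    cache.set(model, prompt, params, response)
    return response
```

For higher hit rates, some teams layer a semantic cache (embedding similarity over past prompts) on top of this exact-match cache, at the price of occasional stale answers.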

Technique 6: Scheduled Inference

If your workload is not real-time, batch process during off-peak hours:

# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100

import schedule
import time

def process_batch_queue():
    """Process all queued inference requests."""
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)
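The spot-versus-on-demand gap alone is worth quantifying, using the example rates above:

```python
SPOT, ON_DEMAND = 0.90, 3.10  # example A100 $/hr figures from the comment above

def spot_savings(spot: float, on_demand: float) -> float:
    """Fraction saved by running the same job on spot capacity."""
    return 1 - spot / on_demand

print(f"{spot_savings(SPOT, ON_DEMAND):.0%}")  # 71%
```

Spot instances can be reclaimed mid-run, so this only suits interruptible batch jobs with checkpointing, which is exactly what a queued overnight workload is.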


Real Cost Savings Example

A TechSaaS client was spending $4,200/month on AI inference:

| Optimization | Savings |
|--------------|---------|
| Q4 quantization | -30% model size; the same GPU handles more load |
| vLLM batching | -40% GPU time |
| Small-model routing | -25% (simple queries use the 8B, not the 70B) |
| Response caching | -20% fewer requests |
| Combined | -72% ($1,176/month, down from $4,200) |

These are not theoretical numbers; they are what systematic cost optimization delivers in practice. Every AI deployment should apply these optimizations from day one.

#ai-costs #gpu #quantization #optimization #inference #vllm

Need help with AI & machine learning?

TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.