
AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference

Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.

Yash Pritwani
13 min read

The AI Cost Crisis

Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers. Running a 70B parameter model 24/7 costs $1,500-3,000/month just for compute. For most companies, this is unsustainable.
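That monthly figure is just rate times hours. A quick back-of-envelope sketch, assuming roughly 720 hours per month and the on-demand rates above:

```python
# Back-of-envelope monthly compute cost at on-demand cloud rates
HOURS_PER_MONTH = 24 * 30  # ~720

def monthly_cost(hourly_rate: float, gpus: int = 1) -> float:
    """On-demand cost of running `gpus` GPUs 24/7 for a month."""
    return hourly_rate * gpus * HOURS_PER_MONTH

# A single A100 at $2-4/hour:
low, high = monthly_cost(2.0), monthly_cost(4.0)
print(f"${low:,.0f} - ${high:,.0f} per month")  # $1,440 - $2,880 per month
```

Add a second GPU for a larger model and you are past $5,000/month before serving a single extra user, which is why the techniques below matter.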


The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.

Technique 1: Model Quantization

Quantization reduces the precision of model weights from 16- or 32-bit floating point to 8-bit or 4-bit integers. The model becomes 4-8x smaller and 2-4x faster.

GGUF Quantization (for Ollama / llama.cpp)

# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB

# In Ollama, quantized models are the default
ollama pull llama3.1:8b                  # Default tag is already Q4_K_M
ollama pull llama3.1:8b-instruct-q5_K_M  # Slightly better quality

AWQ Quantization (for vLLM / production)

# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # serving engines expect the tokenizer alongside the weights

Quality impact: Q4_K_M retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.
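The memory savings follow directly from bits per weight. A rough sizing sketch (the ~4.5 bits/weight average for Q4_K_M and the 10% overhead factor are approximations, not exact format specs):

```python
def model_size_gb(n_params: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Approximate weight size in GB; `overhead` loosely covers
    embeddings, quantization scales, and format metadata."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

fp16 = model_size_gb(8.0e9, 16)   # ~17.6 GB
q4   = model_size_gb(8.0e9, 4.5)  # Q4_K_M averages ~4.5 bits/weight -> ~5 GB
```

Those estimates line up with the observed 16 GB FP16 vs ~4.9 GB Q4_K_M figures above: a model that needed a 24GB card now fits comfortably on an 8GB one.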

Technique 2: GPU Sharing with MIG and Time-Slicing

A single GPU can serve multiple models or users simultaneously.

NVIDIA MIG (Multi-Instance GPU)

Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:

# Enable MIG on an A100
sudo nvidia-smi -mig 1

# Create GPU instances (7 x 5GB slices from a 40GB A100, profile 19);
# -C also creates the matching compute instances
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# List instances
nvidia-smi mig -lgi

Each instance is fully isolated — separate memory, separate compute, separate failure domains. Perfect for serving multiple small models on one expensive GPU.
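In Kubernetes, each slice then appears as its own schedulable resource. A sketch of a pod requesting one 5GB slice, assuming the NVIDIA device plugin is installed with the `mixed` MIG strategy (which exposes per-profile resource names like the one below):

```yaml
# Pod requesting a single 1g.5gb MIG slice instead of a whole A100
apiVersion: v1
kind: Pod
metadata:
  name: small-model-server
spec:
  containers:
    - name: inference
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```

Seven such pods can share one A100 without ever contending for each other's memory.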


Time-Slicing (for consumer GPUs)

For GPUs without MIG support (like the GTX 1650 in our TechSaaS server), use the time-slicing feature of NVIDIA's Kubernetes device plugin:

# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs

Not truly isolated like MIG, but allows multiple containers to share one GPU for inference.

Technique 3: Batch Inference with vLLM

Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.

# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="llama-3.1-8b-awq",  # the local AWQ checkpoint saved in Technique 1
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)

params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]

outputs = llm.generate(prompts, params)  # Batched automatically

vLLM uses continuous batching and PagedAttention to maximize throughput:

| Setup | Throughput | Latency (p50) |
|-------|------------|---------------|
| Naive (one request at a time) | 15 tok/s | 200ms |
| vLLM (batched) | 80 tok/s | 250ms |
| vLLM + AWQ quantization | 120 tok/s | 180ms |

The latency barely increases while throughput jumps 5-8x.
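Throughput translates directly into cost per token. A rough sketch, using the $3.10/hr on-demand A100 rate quoted in the scheduled-inference section and treating the table's numbers as sustained throughput:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at a given GPU rate and throughput."""
    return hourly_rate / (tokens_per_sec * 3600) * 1_000_000

naive   = cost_per_million_tokens(3.10, 15)   # ~$57 per 1M tokens
batched = cost_per_million_tokens(3.10, 120)  # ~$7 per 1M tokens
```

The same GPU, the same model, an 8x drop in cost per token: batching is usually the single highest-leverage change.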

Technique 4: Smart Request Routing

Not every query needs the largest model. Route simple queries to smaller, cheaper models:

class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"   # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]

        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)

        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model

In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
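The arithmetic behind that claim, assuming the "10x cheaper" ratio from the router code and 65% of traffic routed small:

```python
def blended_cost(frac_small: float, small_cost: float, large_cost: float) -> float:
    """Average per-query cost when `frac_small` of traffic hits the small model."""
    return frac_small * small_cost + (1 - frac_small) * large_cost

large = 1.0          # normalize the large-model cost to 1
small = large / 10   # "10x cheaper", per the routing code above
savings = 1 - blended_cost(0.65, small, large) / large
print(f"{savings:.1%}")  # 58.5%
```

Even a crude keyword router captures most of the benefit; a small classifier model can push the routed fraction higher still.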

Technique 5: Caching Repeated Queries

Many AI applications receive the same or similar queries repeatedly. Cache the results:

import hashlib
import redis
import json

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            json.dumps(response)
        )

Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer GPU cycles.
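Wiring the cache in is a standard cache-aside pattern. A hypothetical wrapper (the `generate_fn` callable stands in for whatever client actually invokes the model; it is not part of any library API):

```python
def cached_generate(cache, model: str, prompt: str, params: dict, generate_fn):
    """Cache-aside wrapper around LLMCache: return a cached response
    if present, otherwise call the model and store the result."""
    hit = cache.get(model, prompt, params)
    if hit is not None:
        return hit  # served for free, no GPU involved
    response = generate_fn(model, prompt, params)
    cache.set(model, prompt, params, response)
    return response
```

For higher hit rates, some teams layer a semantic cache (embedding similarity over past prompts) on top of this exact-match cache, at the price of occasional stale answers.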

Technique 6: Scheduled Inference

If your workload is not real-time, batch process during off-peak hours:

# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100

import schedule
import time

def process_batch_queue():
    """Process all queued inference requests."""
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)
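The spot-versus-on-demand gap alone is worth quantifying, using the example rates above:

```python
SPOT, ON_DEMAND = 0.90, 3.10  # example A100 $/hr figures from the comment above

def spot_savings(spot: float, on_demand: float) -> float:
    """Fraction saved by running the same job on spot capacity."""
    return 1 - spot / on_demand

print(f"{spot_savings(SPOT, ON_DEMAND):.0%}")  # 71%
```

Spot instances can be reclaimed mid-run, so this only suits interruptible batch jobs with checkpointing, which is exactly what a queued overnight workload is.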


Real Cost Savings Example

A TechSaaS client was spending $4,200/month on AI inference:

| Optimization | Savings |
|--------------|---------|
| Q4 quantization | -30% model size; the same GPU handles more load |
| vLLM batching | -40% GPU time |
| Small-model routing | -25% (simple queries use the 8B, not the 70B) |
| Response caching | -20% fewer requests |
| Combined | -72% ($1,176/month, down from $4,200) |

These are not theoretical numbers; they are what systematic cost optimization delivers in practice. Every AI deployment should apply these optimizations from day one.

#ai-costs #gpu #quantization #optimization #inference #vllm

Need help with AI & machine learning?

TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.