
AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference

Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.

Yash Pritwani
13 min read

The AI Cost Crisis

Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers. Running a 70B parameter model 24/7 costs $1,500-3,000/month just for compute. For most companies, this is unsustainable.

Performance optimization funnel: each layer of optimization compounds to dramatically reduce response times.

The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.

Technique 1: Model Quantization

Quantization reduces model precision from 32-bit floats to 4-bit or 8-bit integers. The model gets 4-8x smaller and 2-4x faster.
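The size savings follow directly from parameter count times bits per weight. A back-of-envelope sketch (the function is illustrative; real GGUF files add quantization metadata and keep some tensors at higher precision, which is why Q4_K_M lands at ~4.9GB rather than 4.5GB):

```python
def est_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-storage estimate: parameters x bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

print(est_model_size_gb(8e9, 16))   # FP16 baseline: 16.0 GB
print(est_model_size_gb(8e9, 4.5))  # ~Q4_K_M average bits/weight: 4.5 GB
```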

GGUF Quantization (for Ollama / llama.cpp)

# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB

# In Ollama, quantized models are the default
ollama pull llama3.1:8b         # Already Q4_K_M
ollama pull llama3.1:8b-q5_K_M  # Slightly better quality

AWQ Quantization (for vLLM / production)

# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

Quality impact: Q4_K_M retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.


Technique 2: GPU Sharing with MIG and Time-Slicing

A single GPU can serve multiple models or users simultaneously.

NVIDIA MIG (Multi-Instance GPU)

Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:

# Enable MIG on an A100
sudo nvidia-smi -mig 1

# Create GPU instances (7 x 5GB slices from a 40GB A100)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19

# List instances
nvidia-smi mig -lgi

Each instance is fully isolated — separate memory, separate compute, separate failure domains. Perfect for serving multiple small models on one expensive GPU.


Time-Slicing (for consumer GPUs)

For GPUs without MIG (like the GTX 1650 in our TechSaaS server), use NVIDIA's time-slicing:

# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs

Not truly isolated like MIG, but allows multiple containers to share one GPU for inference.

Technique 3: Batch Inference with vLLM

Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.

# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-3.1-8B-AWQ",
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)

params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]

outputs = llm.generate(prompts, params)  # Batched automatically

vLLM uses continuous batching and PagedAttention to maximize throughput:

| Setup | Throughput | Latency (p50) |
| --- | --- | --- |
| Naive (one at a time) | 15 tok/s | 200ms |
| vLLM (batched) | 80 tok/s | 250ms |
| vLLM + AWQ quantized | 120 tok/s | 180ms |

The latency barely increases while throughput jumps 5-8x.
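To translate those throughput numbers into dollars, divide the GPU's hourly price by the tokens it generates per hour. A sketch, assuming an illustrative $3/hour GPU (the function and price are for illustration, not from the benchmark):

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    """GPU cost attributed to each million generated tokens at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6

print(round(cost_per_million_tokens(3.0, 15), 2))   # naive serving: 55.56
print(round(cost_per_million_tokens(3.0, 120), 2))  # vLLM + AWQ: 6.94
```

An 8x drop in cost per token, mirroring the throughput gain.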

Technique 4: Smart Request Routing

Not every query needs the largest model. Route simple queries to smaller, cheaper models:

class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"   # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]

        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)

        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model

In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
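That savings claim is easy to sanity-check. Assuming the 8B model costs roughly a tenth of the 70B per query (illustrative units, not real prices) and 65% of traffic routes to it:

```python
def blended_cost(simple_frac: float, small_cost: float, large_cost: float) -> float:
    """Average per-query cost when a fraction of traffic goes to the small model."""
    return simple_frac * small_cost + (1 - simple_frac) * large_cost

baseline = blended_cost(0.0, 1.0, 10.0)   # all queries on the 70B
routed = blended_cost(0.65, 1.0, 10.0)    # 65% handled by the 8B
print(f"saved: {1 - routed / baseline:.1%}")  # saved: 58.5%
```

Close to the 50%+ figure above; the exact number depends on the routing split and the true price ratio between the two models.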


Technique 5: Caching Repeated Queries

Many AI applications receive the same or similar queries repeatedly. Cache the results:

import hashlib
import redis
import json

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            json.dumps(response)
        )

Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer GPU cycles.
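LLMCache above needs a running Redis server, but the read-through pattern itself is independent of the store. A minimal in-memory sketch (`fake_generate` is a stand-in for a real model call) shows how hits skip the model entirely:

```python
def make_cached_generate(generate):
    """Wrap a generate(model, prompt) callable with a read-through dict cache."""
    cache = {}
    def cached(model: str, prompt: str) -> str:
        key = (model, prompt)
        if key not in cache:       # miss: run the model and remember the result
            cache[key] = generate(model, prompt)
        return cache[key]          # hit: no GPU work at all
    return cached

calls = 0
def fake_generate(model, prompt):
    global calls
    calls += 1                     # count how often the "model" actually runs
    return f"response to {prompt!r}"

gen = make_cached_generate(fake_generate)
gen("llama3.1:8b", "What is the CAP theorem?")
gen("llama3.1:8b", "What is the CAP theorem?")  # identical query: served from cache
print(calls)  # 1 -- the model ran once for two requests
```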

Technique 6: Scheduled Inference

If your workload is not real-time, batch process during off-peak hours:

# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100

import schedule
import time

def process_batch_queue():
    """Process all queued inference requests. db and vllm_batch_generate
    are placeholders for your own queue storage and vLLM batch helper."""
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)

# Keep the scheduler running
while True:
    schedule.run_pending()
    time.sleep(60)
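As a rough sketch of what the spot discount is worth, assuming a nightly four-hour batch window (the prices mirror the AWS figures above; your region and instance type will vary):

```python
def monthly_gpu_cost(hourly_usd: float, hours_per_day: float, days: int = 30) -> float:
    """Simple GPU spend estimate for a recurring batch window."""
    return hourly_usd * hours_per_day * days

on_demand = monthly_gpu_cost(3.10, 4)  # on-demand A100, 4 hrs/night
spot = monthly_gpu_cost(0.90, 4)       # same window on overnight spot capacity
print(on_demand, spot)                 # 372.0 vs 108.0 per month
print(f"spot saves {1 - spot / on_demand:.0%}")  # ~71%
```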

Real Cost Savings Example

A TechSaaS client was spending $4,200/month on AI inference:

| Optimization | Savings |
| --- | --- |
| Q4 quantization | -30% model size; the same GPU handles more load |
| vLLM batching | -40% GPU time |
| Small-model routing | -25% (simple queries use the 8B, not the 70B) |
| Response caching | -20% fewer requests |
| Combined | -72% (from $4,200 down to $1,176/month) |
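As a sanity check, if the four savings are assumed to compound multiplicatively (an approximation; in practice the techniques interact):

```python
from functools import reduce

savings = [0.30, 0.40, 0.25, 0.20]  # quantization, batching, routing, caching
remaining = reduce(lambda acc, s: acc * (1 - s), savings, 1.0)
print(f"combined savings: {1 - remaining:.0%}")  # 75%, in the ballpark of the observed 72%
```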

These are not theoretical numbers. This is what proper AI cost optimization delivers in practice. Every AI deployment should start with these optimizations from day one.

#ai-costs#gpu#quantization#optimization#inference#vllm
