AI Cost Optimization: GPU Sharing, Quantization, and Batch Inference
Cut AI infrastructure costs by 60-80% with GPU sharing, model quantization, batch inference, and smart scheduling. Practical techniques with benchmarks.
The AI Cost Crisis
Running AI models in production is expensive. A single NVIDIA A100 GPU costs $2-4/hour on cloud providers. Running a 70B parameter model 24/7 costs $1,500-3,000/month just for compute. For most companies, this is unsustainable.
The good news: there are proven techniques to cut these costs by 60-80% without sacrificing output quality. At TechSaaS, we have deployed these optimizations across dozens of AI workloads.
Technique 1: Model Quantization
Quantization reduces model weights from 16- or 32-bit floats to 4-bit or 8-bit integers. The model gets 4-8x smaller and runs 2-4x faster.
GGUF Quantization (for Ollama / llama.cpp)
# Original Llama 3.1 8B: ~16GB (FP16)
# After Q4_K_M quantization: ~4.9GB
# After Q5_K_M quantization: ~5.7GB
# In Ollama, quantized models are the default
ollama pull llama3.1:8b # Already Q4_K_M
ollama pull llama3.1:8b-q5_K_M # Slightly better quality
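The size figures above follow from simple arithmetic. A rough sketch (the ~4.85 bits/weight figure for Q4_K_M is an approximation of its mixed quantization scheme, and real GGUF files add metadata overhead):

```python
def model_size_gb(params_billions, bits_per_weight):
    """Rough model size: parameter count x bits per weight, ignoring metadata."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

print(model_size_gb(8, 16))    # FP16:   ~16 GB
print(model_size_gb(8, 4.85))  # Q4_K_M: ~4.85 GB, close to the ~4.9 GB above
```

The same arithmetic tells you whether a quantized model fits in a given GPU's VRAM before you download it.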
AWQ Quantization (for vLLM / production)
# Using AutoAWQ for activation-aware quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Calibrate, quantize to 4-bit, and save
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # keep the tokenizer with the weights
Quality impact: Q4_K_M retains 95-98% of the original model quality on most benchmarks. For most production applications, users cannot tell the difference.
Technique 2: GPU Sharing with MIG and Time-Slicing
A single GPU can serve multiple models or users simultaneously.
NVIDIA MIG (Multi-Instance GPU)
Available on A30, A100, and H100 GPUs. Splits one GPU into isolated instances:
# Enable MIG on an A100
sudo nvidia-smi -mig 1
# Create GPU instances (7 x 1g.5gb slices from a 40GB A100; profile 19 = 1g.5gb)
# -C also creates the matching compute instances
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# List instances
nvidia-smi mig -lgi
Each instance is fully isolated — separate memory, separate compute, separate failure domains. Perfect for serving multiple small models on one expensive GPU.
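A process consumes a slice by naming it in `CUDA_VISIBLE_DEVICES` before any CUDA library loads. A minimal sketch; the UUID below is a placeholder, and you would list the real ones with `nvidia-smi -L`:

```python
import os

# Pin this process to a single MIG slice. Must happen before importing
# torch/CUDA libraries, which read this variable at initialization.
# "MIG-..." UUID is a placeholder; get real values from `nvidia-smi -L`.
mig_uuid = "MIG-00000000-0000-0000-0000-000000000000"
os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Each model server then sees exactly one 5GB "GPU" and cannot touch its neighbors' memory.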
Time-Slicing (for consumer GPUs)
For GPUs without MIG (like the GTX 1650 in our TechSaaS server), use NVIDIA's time-slicing:
# /etc/nvidia/time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Share 1 GPU as 4 virtual GPUs
Unlike MIG, time-slicing provides no memory or fault isolation, but it lets multiple containers share one GPU for inference.
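A workload then requests a virtual GPU like any other Kubernetes resource; with `replicas: 4`, up to four such pods can land on the same physical card. A sketch, assuming the NVIDIA Kubernetes device plugin is deployed with the config above (names are illustrative):

```yaml
# Hypothetical pod requesting one time-sliced GPU replica
apiVersion: v1
kind: Pod
metadata:
  name: ollama-inference
spec:
  containers:
    - name: ollama
      image: ollama/ollama
      resources:
        limits:
          nvidia.com/gpu: 1  # one of the 4 virtual replicas
```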
Technique 3: Batch Inference with vLLM
Processing requests one at a time wastes GPU cycles. Batching multiple requests together is 3-5x more efficient.
# vLLM handles batching automatically
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-3.1-8B-AWQ",
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.85
)

params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch of requests processed together
prompts = [
    "Explain Kubernetes in simple terms",
    "Write a Python function to parse JSON",
    "What is the CAP theorem?",
    # ... hundreds more
]

outputs = llm.generate(prompts, params)  # Batched automatically
vLLM uses continuous batching and PagedAttention to maximize throughput:
| Setup | Throughput | Latency (p50) |
|---|---|---|
| Naive (one at a time) | 15 tok/s | 200ms |
| vLLM (batched) | 80 tok/s | 250ms |
| vLLM + AWQ quantized | 120 tok/s | 180ms |
The latency barely increases while throughput jumps 5-8x.
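Throughput translates directly into cost per token. A back-of-envelope sketch using the table above and a $3.10/hr on-demand A100 rate (the figure quoted in Technique 6; your cloud pricing will vary):

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Dollars per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

for label, tps in [("naive", 15), ("vLLM", 80), ("vLLM + AWQ", 120)]:
    print(f"{label:>11}: ${cost_per_million_tokens(3.10, tps):.2f} per 1M tokens")
```

At these throughputs, the naive setup costs roughly $57 per million tokens versus about $7 when batched and quantized: an 8x difference on identical hardware.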
Technique 4: Smart Request Routing
Not every query needs the largest model. Route simple queries to smaller, cheaper models:
class ModelRouter:
    def __init__(self):
        self.small_model = "llama3.1:8b"   # Fast, cheap
        self.large_model = "llama3.1:70b"  # Slow, expensive

    def classify_complexity(self, query: str) -> str:
        """Simple heuristic to route queries."""
        simple_indicators = [
            "what is", "define", "list", "how many",
            "yes or no", "true or false"
        ]
        complex_indicators = [
            "explain why", "compare", "analyze", "design",
            "write code for", "debug", "architecture"
        ]
        query_lower = query.lower()
        simple_score = sum(1 for i in simple_indicators if i in query_lower)
        complex_score = sum(1 for i in complex_indicators if i in query_lower)
        return "simple" if simple_score > complex_score else "complex"

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        if complexity == "simple":
            return self.small_model  # 10x cheaper
        return self.large_model
In practice, 60-70% of queries can be handled by small models. This alone cuts costs by 50%+.
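The 50%+ claim checks out with simple arithmetic. A sketch, assuming the small model is 10x cheaper per query (consistent with the comment in the router above) and absorbs 65% of traffic:

```python
def blended_cost(small_fraction, small_cost_ratio=0.1):
    """Cost relative to routing every query to the large model."""
    return small_fraction * small_cost_ratio + (1 - small_fraction)

savings = 1 - blended_cost(0.65)
print(f"{savings:.0%} saved")  # ~58% at a 65% small-model hit rate
```

Even at the low end of the range (60% routed small), the blended cost drops by more than half.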
Technique 5: Caching Repeated Queries
Many AI applications receive the same or similar queries repeatedly. Cache the results:
import hashlib
import json

import redis

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def _hash_query(self, model: str, prompt: str, params: dict) -> str:
        key_data = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model, prompt, params):
        key = self._hash_query(model, prompt, params)
        cached = self.redis.get(f"llm:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set(self, model, prompt, params, response):
        key = self._hash_query(model, prompt, params)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            json.dumps(response)
        )
Cache hit rates of 20-40% are typical for customer-facing applications. That is 20-40% fewer GPU cycles.
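To exercise the cache-aside pattern without a running Redis, the same get/set interface can be backed by a plain dict. `InMemoryLLMCache` and `cached_generate` are illustrative names for this sketch, not part of any library:

```python
import hashlib
import json

class InMemoryLLMCache:
    """Dict-backed stand-in with the same get/set interface as a Redis cache."""
    def __init__(self):
        self.store = {}

    def _key(self, model, prompt, params):
        raw = f"{model}:{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, model, prompt, params):
        return self.store.get(self._key(model, prompt, params))

    def set(self, model, prompt, params, response):
        self.store[self._key(model, prompt, params)] = response

def cached_generate(cache, model, prompt, params, generate_fn):
    """Return a cached response if present; otherwise call the model once."""
    hit = cache.get(model, prompt, params)
    if hit is not None:
        return hit
    response = generate_fn(prompt)
    cache.set(model, prompt, params, response)
    return response
```

The second identical request never touches the model, which is exactly where the 20-40% of GPU cycles are saved.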
Technique 6: Scheduled Inference
If your workload is not real-time, batch process during off-peak hours:
# Process overnight when GPU spot instances are cheapest
# AWS spot pricing: $0.90/hr vs $3.10/hr on-demand for A100
import schedule
import time

def process_batch_queue():
    """Process all queued inference requests."""
    pending = db.get_pending_requests()
    results = vllm_batch_generate(pending)
    db.store_results(results)

# Run batch processing at 2 AM when prices are lowest
schedule.every().day.at("02:00").do(process_batch_queue)

# The schedule library needs a running loop to actually fire jobs
while True:
    schedule.run_pending()
    time.sleep(60)
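The spot-versus-on-demand gap above is worth quantifying. Spot prices vary by region and fluctuate over time; the figures below simply restate the example rates from the comment above:

```python
on_demand = 3.10  # $/hr, A100 on-demand (example figure from above)
spot = 0.90       # $/hr, A100 spot (example figure from above)

savings = 1 - spot / on_demand
print(f"Spot discount: {savings:.0%}")  # ~71%
```

For any workload that tolerates a few hours of delay, that discount applies to every GPU-hour consumed, with the usual caveat that spot instances can be reclaimed mid-job.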
Real Cost Savings Example
A TechSaaS client was spending $4,200/month on AI inference:
| Optimization | Savings |
|---|---|
| Q4 Quantization | -30% model size, same GPU handles more |
| vLLM batching | -40% GPU time |
| Small model routing | -25% (simple queries use 8B not 70B) |
| Response caching | -20% fewer requests |
| Combined | -72% (down to $1,176/month from $4,200) |
These are not theoretical numbers. This is what proper AI cost optimization delivers in practice. Every AI deployment should start with these optimizations from day one.