# LLM Inference Optimization: Cut Costs 80% Without Cutting Quality
If you're serving LLM inference in production, you're probably paying 5-10x more than you need to. The default configurations of most serving frameworks optimize for simplicity, not efficiency.
Three techniques — continuous batching, quantization, and speculative decoding — can cut your inference costs by 80% and latency by 60%. Here's how each works and when to use them.
## Technique 1: Continuous Batching

### The Problem with Naive Batching
Traditional batching waits for N requests to arrive, then processes them together. This creates a latency-throughput tradeoff: small batches waste GPU cycles, large batches add waiting time.
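To make the tradeoff concrete, here is a minimal sketch of a static batching loop. The `request_queue` and `run_batch` names are illustrative placeholders, not any particular framework's API:

```python
# Hypothetical static batching loop (sketch, not a real serving framework).
import queue

def serve_static_batches(request_queue: queue.Queue, batch_size: int, run_batch) -> None:
    while True:
        # Early arrivals sit here until batch_size requests have accumulated.
        batch = [request_queue.get() for _ in range(batch_size)]
        # The GPU is tied up until the longest sequence in the batch finishes,
        # even if most requests produced their final token long before that.
        run_batch(batch)
```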
### Continuous Batching (Iteration-Level Scheduling)
Instead of batching at the request level, continuous batching schedules at the token level. New requests can join a running batch between token generations, and completed requests leave immediately.
```python
# vLLM handles this automatically
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B",
    tensor_parallel_size=4,
    max_num_batched_tokens=32768,  # total tokens across all requests in a batch
    max_num_seqs=256,              # max concurrent sequences
)
```

Impact: 3-5x throughput improvement over naive batching. Latency for individual requests stays low because they don't wait for a full batch to form.
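Submitting work is then a single `llm.generate` call; the quick usage sketch below (continuing from the `llm` configured above, with illustrative prompts) lets vLLM's scheduler interleave requests of different lengths:

```python
# Usage sketch: continues from the `llm` and SamplingParams imported above.
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize this support ticket: ...", "Translate to French: Hello, world."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```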
## Technique 2: Quantization
Quantization reduces the precision of model weights from FP16 (16-bit) to INT8 or INT4, dramatically reducing memory usage and increasing inference speed.
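A rough rule of thumb makes the savings obvious: weight memory is roughly parameter count times bytes per parameter. The back-of-the-envelope sketch below ignores the KV cache, activations, and quantization scale/zero-point overhead:

```python
# Back-of-the-envelope weight memory: params x bits / 8
# (ignores KV cache, activations, and quantization metadata overhead).
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    return num_params * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {name}: ~{weight_memory_gb(70e9, bits):.0f} GB")
# 70B @ FP16: ~140 GB  -> multiple GPUs just to hold the weights
# 70B @ INT8: ~70 GB   -> borderline on a single A100 80GB
# 70B @ INT4: ~35 GB   -> fits a single A100 80GB with room for KV cache
```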
### The Tradeoff

Lower precision saves memory and speeds up inference, but it can degrade output quality; how much depends on the quantization method you choose.
AWQ (Activation-aware Weight Quantization) is our recommendation for production. It preserves quality better than naive INT4 by identifying and protecting salient weight channels.
```python
from vllm import LLM

# Serve a 70B model on a single A100 80GB (impossible with FP16)
llm = LLM(
    model="TheBloke/Llama-3-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # single GPU!
    gpu_memory_utilization=0.9,
)
```

### When NOT to Quantize
## Technique 3: Speculative Decoding
The insight: use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model. If the draft model is right (which it often is for common patterns), you get the speed of the small model with the quality of the large one.
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B",
    speculative_model="meta-llama/Llama-3-8B",  # draft model
    num_speculative_tokens=5,                   # generate 5 draft tokens per step
)
```

Impact: 1.5-2.5x speedup for generation-heavy workloads. The speedup is highest when the output is predictable (common language patterns, structured data) and lowest for creative/novel outputs.
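The acceptance logic is easier to see stripped of framework details. The sketch below is a simplified greedy version; `draft_next` and `target_next` are stand-in callables, not vLLM internals, and real implementations verify all draft positions in a single batched forward pass and handle sampling rather than pure argmax:

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # greedy next token from the small model
    target_next: Callable[[List[int]], int],  # greedy next token from the large model
    k: int = 5,
) -> List[int]:
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Verify: keep the longest prefix of draft tokens the large model agrees with.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)

    # 3. Always emit one token from the large model, so each step makes progress
    #    even when the very first draft token is rejected.
    accepted.append(target_next(ctx))
    return accepted
```

When most of the k draft tokens are accepted, each expensive large-model step yields several output tokens, which is where the 1.5-2.5x figure comes from; when acceptance is low, you pay for drafting with little benefit.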
## Combining All Three
The techniques stack. Here's the configuration we use for a production chatbot serving 10K requests/hour:
```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-3-70B-AWQ",             # INT4 (AWQ) quantization
    quantization="awq",
    speculative_model="TheBloke/Llama-3-8B-AWQ",  # quantized draft model
    num_speculative_tokens=5,
    tensor_parallel_size=2,                       # 2x A100 40GB
    max_num_batched_tokens=32768,                 # continuous batching
    max_num_seqs=256,
)
```

Against naive FP16 serving, this configuration delivers the roughly 80% cost reduction and 60% latency reduction cited at the top of this post.
## Common Mistakes
Before diving into infrastructure recommendations, avoid these pitfalls we've seen repeatedly:
1. Quantizing without benchmarking on YOUR data. Generic benchmarks (MMLU, HumanEval) don't reflect your use case. A model that scores well on academic benchmarks might hallucinate on your domain-specific queries after quantization. Always evaluate on a test set from your actual production traffic.
2. Using speculative decoding for creative tasks. Speculative decoding works best when the output is predictable — structured data, common language patterns, templated responses. For creative writing or novel reasoning, the draft model's predictions are wrong more often, reducing the speedup to near zero.
3. Ignoring cold start latency. vLLM's first request after loading a model takes 5-10x longer than subsequent requests due to CUDA kernel compilation. If your traffic is bursty, keep models warm with synthetic heartbeat requests; a minimal warm-up loop is sketched after this list.
4. Over-optimizing throughput at the expense of latency. Increasing batch size improves throughput but hurts tail latency. For interactive applications (chatbots, autocomplete), optimize for P95 latency first, then tune throughput.
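For pitfall 3, keeping a model warm can be as simple as a background heartbeat. The sketch below assumes a vLLM OpenAI-compatible server running locally; the endpoint URL, model name, and interval are illustrative, not prescriptive:

```python
# Heartbeat sketch: periodically send a tiny request so kernels and caches stay warm.
import time
import requests

COMPLETIONS_URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server

def keep_warm(interval_s: float = 30.0) -> None:
    while True:
        try:
            requests.post(
                COMPLETIONS_URL,
                json={"model": "meta-llama/Llama-3-70B", "prompt": "ping", "max_tokens": 1},
                timeout=10,
            )
        except requests.RequestException:
            pass  # a failed heartbeat shouldn't crash the loop
        time.sleep(interval_s)

if __name__ == "__main__":
    keep_warm()
```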
## Infrastructure Recommendations

### For Startups (< $5K/month inference budget)

### For Mid-Market ($5K-$50K/month)

### For Enterprise ($50K+/month)
---
Need help optimizing your LLM inference costs? We've deployed inference stacks that serve millions of requests at a fraction of the typical cost. [Book a consultation](https://techsaas.cloud/contact) or explore our [AI infrastructure services](https://techsaas.cloud/services).
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.