# Kong vs Envoy vs Traefik: Choosing Your API Gateway Without Regret
Most teams pick their API gateway based on a blog post they read three years ago. Then they spend the next two years fighting it. We've deployed all three — Kong, Envoy, and Traefik — across different production environments, and the right choice depends on exactly three things: your team's operational capacity, your traffic patterns, and whether you need a gateway or a service mesh. Everything else is noise.
Here's the decision framework we actually use, backed by real benchmark data and battle scars.
## The 30-Second Decision Matrix
| Criterion | Kong | Envoy | Traefik |
|--------|------|-------|---------|
| Best fit | API management layer | Raw performance, service mesh | Docker/Kubernetes-native routing |
| Added latency | Low single-digit ms | Sub-millisecond | Low single-digit ms |
| Config model | Declarative YAML plus plugins | Verbose static/xDS YAML | Container labels and annotations |
| Operational overhead | Moderate; Enterprise for the management plane | High; expects a control plane | Low |
*Benchmarks: 4 vCPU, 8GB RAM, 1KB payload, wrk2 with 100 connections, 10 threads, measured at steady state.*
## Kong: The API Management Platform
Kong shines when your gateway isn't just routing traffic — it's your API management layer. Authentication, rate limiting, request transformation, analytics — Kong has a plugin for it, and most of them work well out of the box.
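As a taste of that plugin model, here's a minimal sketch of what consumer-scoped features look like in the same DB-less declarative format used below; the consumer name and key are placeholders, not values from our deployments.

```yaml
# Sketch: per-consumer auth and rate limits in Kong's declarative config.
# "mobile-app" and the key value are illustrative placeholders.
consumers:
  - username: mobile-app
    keyauth_credentials:
      - key: replace-with-a-generated-key

plugins:
  - name: key-auth            # require an API key on every request
  - name: rate-limiting
    consumer: mobile-app      # this override applies to one consumer only
    config:
      minute: 120
      policy: redis
      redis_host: redis.internal
```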
### When Kong Wins

- You need API management, not just routing: per-consumer authentication, rate limiting, request transformation, and analytics.
- Your requirements map onto existing plugins instead of custom middleware.
- You have the budget or engineering capacity for the management plane (see "Kong's Hidden Cost" below).

### Production Configuration
```yaml
# kong.yml — DB-less declarative config
_format_version: "3.0"

services:
  - name: user-service
    url: http://user-svc.internal:8080
    connect_timeout: 5000
    write_timeout: 10000
    read_timeout: 15000
    retries: 3
    routes:
      - name: user-api
        paths:
          - /api/v1/users
        strip_path: false
        protocols:
          - https
        plugins:
          - name: rate-limiting
            config:
              minute: 60
              policy: redis
              redis_host: redis.internal
              redis_port: 6379
          - name: jwt
            config:
              claims_to_verify:
                - exp
          - name: correlation-id
            config:
              header_name: X-Request-ID
              generator: uuid

  - name: order-service
    url: http://order-svc.internal:8080
    routes:
      - name: order-api
        paths:
          - /api/v1/orders
        plugins:
          - name: rate-limiting
            config:
              minute: 30
              policy: redis
              redis_host: redis.internal
          - name: request-transformer
            config:
              add:
                headers:
                  - "X-Gateway: kong"
                  - "X-Forwarded-Prefix: /api/v1"

# Global plugins applied to all routes
plugins:
  - name: prometheus
    config:
      per_consumer: true
  - name: zipkin
    config:
      http_endpoint: http://jaeger.internal:9411/api/v2/spans
      sample_ratio: 0.1
```

### Kong's Hidden Cost
The enterprise features — RBAC, developer portal, advanced analytics — require Kong Enterprise at $35K+/year. The open-source version is powerful but lacks the management plane. We've seen teams start with OSS Kong, build custom tooling around the Admin API, and end up spending more engineering time than the enterprise license would have cost.
## Envoy: The Performance King
Envoy was built at Lyft to handle millions of requests per second across thousands of services. It adds less than a millisecond of latency in most configurations. But that performance comes with complexity — Envoy's configuration model assumes you have a control plane feeding it updates via the xDS protocol.
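To make the control-plane point concrete, here's a minimal bootstrap sketch that delegates listeners and clusters to an xDS server over ADS; the `xds_cluster` name and the `control-plane.internal` address are assumptions for illustration.

```yaml
# Sketch: bootstrap that fetches all routing config from a control plane.
# The cluster name and address below are illustrative assumptions.
node:
  id: edge-1
  cluster: edge-gateway
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
  lds_config:           # listeners arrive over ADS
    ads: {}
    resource_api_version: V3
  cds_config:           # clusters arrive over ADS
    ads: {}
    resource_api_version: V3
static_resources:
  clusters:
    - name: xds_cluster   # the only statically defined cluster
      type: STRICT_DNS
      connect_timeout: 1s
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}   # xDS runs over gRPC/HTTP2
      load_assignment:
        cluster_name: xds_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: control-plane.internal
                      port_value: 18000
```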
### When Envoy Wins

- You need the lowest possible added latency, sub-millisecond in most configurations.
- You run (or plan to run) a service mesh with a control plane speaking xDS.
- You want every timeout, retry, and circuit-breaker threshold under explicit control.

### Production Configuration
```yaml
# envoy.yaml — front proxy config
static_resources:
  listeners:
    - name: main_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8443
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                      log_format:
                        json_format:
                          timestamp: "%START_TIME%"
                          method: "%REQ(:METHOD)%"
                          path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
                          status: "%RESPONSE_CODE%"
                          duration_ms: "%DURATION%"
                          upstream: "%UPSTREAM_HOST%"
                route_config:
                  name: local_routes
                  virtual_hosts:
                    - name: api
                      domains: ["api.example.com"]
                      routes:
                        - match:
                            prefix: "/api/v1/users"
                          route:
                            cluster: user_service
                            timeout: 15s
                            retry_policy:
                              retry_on: "5xx,reset,connect-failure"
                              num_retries: 2
                              per_try_timeout: 5s
                        - match:
                            prefix: "/api/v1/orders"
                          route:
                            cluster: order_service
                            timeout: 10s
                http_filters:
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: http_local_rate_limiter
                      token_bucket:
                        max_tokens: 1000
                        tokens_per_fill: 100
                        fill_interval: 1s
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: user_service
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST
      load_assignment:
        cluster_name: user_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: user-svc.internal
                      port_value: 8080
      health_checks:
        - timeout: 2s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          http_health_check:
            path: /health
      circuit_breakers:
        thresholds:
          - max_connections: 512
            max_pending_requests: 128
            max_requests: 1024
            max_retries: 3
    # order_service cluster: required by the /api/v1/orders route above,
    # or Envoy will reject the config; it mirrors user_service.
    - name: order_service
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST
      load_assignment:
        cluster_name: order_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: order-svc.internal
                      port_value: 8080
```

### The Envoy Reality Check
That config above is nearly a hundred lines for two routes. Kong does the same in 40 lines. Traefik does it in 15. Envoy's verbosity is the tradeoff for precision — every timeout, every retry, every circuit breaker parameter is explicit. For teams that need that control, it's a feature. For teams that don't, it's a maintenance burden.
## Traefik: The Docker-Native Choice
Traefik discovers services automatically via Docker labels or Kubernetes Ingress annotations. No Admin API calls, no xDS protocol, no separate config files — your service definition IS your routing configuration.
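On Kubernetes the same flow works through standard Ingress objects, which Traefik watches natively. A minimal sketch; the host, service name, and port echo the examples below, and the single annotation shown is Traefik-specific:

```yaml
# Sketch: the users route expressed as a plain Kubernetes Ingress.
# Host, service name, and port are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: users
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/v1/users
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080
```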
### When Traefik Wins

- Your services run in Docker or Kubernetes and you want routing to follow deployments automatically.
- Your platform team is small and time-to-production matters more than raw throughput.
- You expect to stay under roughly 200 services and 40K RPS per node.

### Production Configuration
```yaml
# docker-compose.yml — Traefik with service discovery
services:
  traefik:
    image: traefik:v3.2
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"  # use your real contact address
      - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
      - "--metrics.prometheus=true"
      - "--accesslog=true"
      - "--accesslog.format=json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - traefik-data:/data

  user-service:
    image: myapp/user-service:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.users.rule=Host(`api.example.com`) && PathPrefix(`/api/v1/users`)"
      - "traefik.http.routers.users.tls.certresolver=letsencrypt"
      - "traefik.http.services.users.loadbalancer.server.port=8080"
      - "traefik.http.routers.users.middlewares=rate-limit,retry"
      - "traefik.http.middlewares.rate-limit.ratelimit.average=60"
      - "traefik.http.middlewares.rate-limit.ratelimit.burst=20"
      - "traefik.http.middlewares.retry.retry.attempts=3"
      - "traefik.http.services.users.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.users.loadbalancer.healthcheck.interval=10s"

volumes:
  traefik-data:
```

That's the entire routing configuration. Deploy a new container with the right labels, and Traefik picks it up in seconds. No config reload, no API call.
### Traefik's Ceiling
Traefik starts to strain at very high scale. Above 200 services or 40K+ RPS on a single node, you'll want to consider Envoy. Traefik also lacks Envoy's advanced traffic management — no traffic shadowing, limited canary support, and circuit breakers are basic compared to Envoy's per-host ejection.
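To illustrate the canary limitation: traffic splitting is possible, but only through the file provider's weighted round-robin, not through container labels. A sketch with hypothetical users-v1/users-v2 services:

```yaml
# Sketch: canary via Traefik's file provider (weighted round-robin).
# Service names and backend URLs are illustrative.
http:
  services:
    users-canary:
      weighted:
        services:
          - name: users-v1
            weight: 90   # 90% of traffic stays on the stable version
          - name: users-v2
            weight: 10   # 10% goes to the canary
    users-v1:
      loadBalancer:
        servers:
          - url: "http://user-svc-v1.internal:8080"
    users-v2:
      loadBalancer:
        servers:
          - url: "http://user-svc-v2.internal:8080"
```

A router pointed at `users-canary` then shifts traffic between versions as you edit the weights, which works but is a long way from Envoy's per-request shadowing and ejection controls.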
## The Decision Framework
Answer these five questions:
1. How many services are you routing to? (Dozens favor Traefik; hundreds favor Envoy.)
2. Do you need API management (auth, rate limiting, analytics per consumer)? (Yes favors Kong.)
3. What's your acceptable added latency? (Sub-millisecond favors Envoy.)
4. How large is your platform/DevOps team? (Small favors Traefik.)
5. Are you building a service mesh? (Yes favors Envoy; no favors a plain gateway.)
If three or more answers point to Traefik, start with Traefik. Same logic for the others. When two are tied, default to the simpler option — you can always migrate up.
## Performance Comparison: Real Numbers
We benchmarked all three under identical conditions: 4 vCPU, 8GB RAM, Ubuntu 24.04, proxying to a backend that returns a static 1KB JSON response.
Envoy's performance advantage is real but matters less than you think. At 5K RPS, the difference between 0.4ms and 1.8ms added latency is invisible to users. It only becomes material above 20K RPS where cumulative resource usage diverges.
## Our Recommendation
For 80% of teams reading this: start with Traefik. It has the lowest operational overhead, the fastest time-to-production, and handles more traffic than most teams will ever need. When you outgrow it — and you'll know because you'll start hitting specific limitations, not because a blog told you to — migrate to Kong (if you need API management) or Envoy (if you need raw performance).
We've helped teams at all three stages of this journey. Whether you're setting up your first gateway or migrating from one to another, our infrastructure team can audit your current setup and recommend the right path forward. [Let's talk at techsaas.cloud/services](https://techsaas.cloud/services).