OpenTelemetry in Production: The Definitive Guide to Modern Observability
The observability wars are over. OpenTelemetry won.
For years, every monitoring vendor shipped its own agent, its own SDK, its own proprietary format. You picked Datadog, you got the Datadog agent. You picked New Relic, you got the New Relic SDK. Switching vendors meant re-instrumenting your entire stack. That era is ending.
OpenTelemetry (OTel) is the CNCF's vendor-neutral observability framework: a single set of APIs, SDKs, and tools that generate, collect, and export telemetry data. It is among the most active CNCF projects, second only to Kubernetes, and every major observability vendor — Datadog, Grafana Labs, Honeycomb, Splunk, Dynatrace — now supports OTel natively. The specification covers all three pillars, traces, metrics, and logs, under one unified standard.
This is a production guide, not theory: real instrumentation code, real collector configs, and the deployment patterns we use to monitor 80+ services running on our infrastructure at TechSaaS.
The Three Signals of Observability
OpenTelemetry unifies three telemetry signals that were historically handled by separate tools with separate data models. Understanding how they connect is the foundation of effective observability with OTel, and the reason OTel has replaced bespoke monitoring stacks at companies ranging from startups to the Fortune 500.
Traces (Distributed Request Paths)
A trace follows a single request as it moves through your distributed system. Each unit of work is a span — a named, timed operation with metadata. Spans form a tree: the root span represents the entire request, and child spans represent downstream calls.
When a user hits your API gateway, the trace captures the gateway processing (50ms), the auth service call (12ms), the database query (8ms), and the cache miss that triggered a slow path (200ms). You see exactly where time was spent and which service caused the bottleneck. Each span carries a trace_id that ties the entire tree together, even across process and network boundaries.
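To make the structure concrete, here is a toy model of a span tree (plain Python standing in for the OTel SDK; real spans carry timing, status, and much more metadata):

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span: named, and carrying the trace_id shared by the whole tree."""
    name: str
    trace_id: str                                      # same for every span in the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: str | None = None

def start_trace(name: str) -> Span:
    # The root span mints the 128-bit trace_id; children inherit it.
    return Span(name, trace_id=secrets.token_hex(16))

def child_of(parent: Span, name: str) -> Span:
    return Span(name, trace_id=parent.trace_id, parent_id=parent.span_id)

root = start_trace("GET /checkout")     # API gateway span
auth = child_of(root, "auth-service")   # downstream call
db = child_of(auth, "db.query")         # call made by auth-service

# One trace_id ties the whole tree together...
assert root.trace_id == auth.trace_id == db.trace_id
# ...while parent/child links are expressed through span_ids.
assert db.parent_id == auth.span_id and auth.parent_id == root.span_id
```

In a real system the SDK generates and propagates these IDs for you; the point is that crossing a process boundary only requires forwarding two IDs.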
Metrics (Counters, Gauges, Histograms)
Metrics are aggregated numerical measurements. OTel defines three core instrument types:
- Counters — monotonically increasing values (total requests served, total errors encountered)
- Gauges — point-in-time values (current memory usage, active connections, queue depth)
- Histograms — distribution of values (request duration percentiles, payload sizes, batch processing times)
Unlike traces, metrics are pre-aggregated. They are cheap to store and fast to query, making them ideal for dashboards and alerting. When your P99 latency spikes, a metric tells you when — a trace tells you why.
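The three instrument types behave differently under the hood. A miniature sketch (plain Python, not the OTel metrics API) of what each one stores:

```python
class Counter:
    """Monotonic: the value only goes up (total requests, total errors)."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        assert n >= 0, "counters never decrease"
        self.value += n

class Gauge:
    """Point-in-time: last write wins (memory in use, queue depth)."""
    def __init__(self):
        self.value = 0

    def set(self, v):
        self.value = v

class Histogram:
    """Distribution: keeps observations so percentiles can be derived.
    (Real OTel histograms aggregate into buckets instead of raw samples.)"""
    def __init__(self):
        self.samples = []

    def record(self, v):
        self.samples.append(v)

    def percentile(self, p):
        s = sorted(self.samples)  # naive percentile, fine for a demo
        return s[min(len(s) - 1, int(p / 100 * len(s)))]

requests = Counter(); requests.add(3)
queue_depth = Gauge(); queue_depth.set(7); queue_depth.set(2)
latency = Histogram()
for ms in [12, 15, 14, 980, 13]:
    latency.record(ms)

assert requests.value == 3
assert queue_depth.value == 2          # only the latest gauge value survives
assert latency.percentile(99) == 980   # the slow outlier surfaces at P99
```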
Logs (Structured and Correlated)
OTel's log signal bridges your existing logging infrastructure into the observability stack. The key differentiator: OTel logs carry trace context. When a log line includes a trace_id and span_id, you can jump from a log entry directly to the distributed trace that generated it — and back.
How the three signals connect: A Grafana dashboard shows a spike in error rate (metric). You click through to see which traces had errors (trace). You open a failing trace, find the broken span, and jump to the exact log lines from that span (log). Three signals, one correlated view. That is what a unified tracing, metrics, and logs pipeline delivers and what siloed tools cannot.
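The glue for that last hop is trace context embedded in log lines. A minimal illustration with stdlib logging (a real setup would pull the ID from the active OTel span rather than a hard-coded value):

```python
import io
import logging

# In production this would come from the current span's context.
ACTIVE_TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

class TraceContextFilter(logging.Filter):
    """Stamp every record with the active trace_id so log lines
    can be joined back to the distributed trace that produced them."""
    def filter(self, record):
        record.trace_id = ACTIVE_TRACE_ID
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error("payment gateway timeout")
line = buf.getvalue().strip()
assert "trace_id=4bf92f3577b34da6a3ce929d0e0e4736" in line
```

Any log pipeline that preserves that field, Loki included, can then filter by trace_id and jump straight to the trace.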
Architecture Overview
Every OTel deployment follows the same data flow:
┌──────────────┐    ┌───────────────┐    ┌──────────────────────────┐    ┌──────────────┐
│ Application  │───>│ OTel SDK      │───>│ OTel Collector           │───>│ Backend      │
│ (your code)  │    │ + Exporter    │    │                          │    │              │
│              │    │               │    │ receivers ─> processors  │    │ Tempo        │
│              │    │ OTLP gRPC     │    │           ─> exporters   │    │ Prometheus   │
│              │    │ or HTTP       │    │                          │    │ Loki         │
└──────────────┘    └───────────────┘    └──────────────────────────┘    └──────────────┘
SDK: Instruments your code. Generates spans, records metrics, bridges logs. Exports via OTLP (OpenTelemetry Protocol) over gRPC or HTTP.
Collector: A standalone binary that receives, processes, and exports telemetry. This is the workhorse of any OTel deployment. It decouples your applications from backends — your apps send OTLP to the collector, and the collector routes to whatever backends you use.
Two deployment modes:
- Agent mode: Collector runs as a sidecar or DaemonSet on every node. Low latency, local processing. Each agent handles telemetry from the pods on its node.
- Gateway mode: Centralized collector cluster that all agents forward to. Enables cross-service tail sampling, centralized filtering, and routing decisions that require visibility across multiple services.
Most production deployments use both: agents on every node for collection, plus a gateway cluster for aggregation and sampling. This two-tier architecture is the recommended pattern in the official OTel documentation and the one we advise for any serious production rollout.
Instrumenting a Node.js Application
Auto-instrumentation is the fastest path to distributed tracing in Node.js. OTel provides instrumentation packages for Express, Fastify, HTTP, gRPC, PostgreSQL, Redis, and dozens more.
Setup
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc \
@opentelemetry/exporter-metrics-otlp-grpc \
@opentelemetry/sdk-metrics
Create tracing.ts — this file must be loaded before your application code:
// tracing.ts — load BEFORE app code via --require or import
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
ATTR_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'order-service',
[ATTR_SERVICE_VERSION]: '2.4.1',
[ATTR_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Filter out noisy health check spans. Recent versions of
// @opentelemetry/instrumentation-http take ignoreIncomingRequestHook
// (the older ignoreIncomingPaths option was removed).
'@opentelemetry/instrumentation-http': {
  ignoreIncomingRequestHook: (req) =>
    ['/healthz', '/readyz'].includes(req.url ?? ''),
},
// Disable filesystem instrumentation — too noisy, no actionable insight
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
});
sdk.start();
// Ensure spans are flushed before the process exits
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => process.exit(0))
.catch(() => process.exit(1));
});
Run your application with the instrumentation loaded first:
node --require ./dist/tracing.js ./dist/server.js
Custom Spans and Metrics
Auto-instrumentation covers HTTP and database calls. For business logic, create custom spans:
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
const meter = metrics.getMeter('order-service');
// Custom metrics
const orderCounter = meter.createCounter('orders.created', {
description: 'Total orders created',
});
const orderValueHistogram = meter.createHistogram('orders.value', {
description: 'Order value distribution in cents',
unit: 'cents',
});
async function processOrder(order: Order) {
return tracer.startActiveSpan('processOrder', async (span) => {
try {
span.setAttribute('order.id', order.id);
span.setAttribute('order.item_count', order.items.length);
span.setAttribute('order.customer_tier', order.customerTier);
// Nested span for payment processing
const result = await tracer.startActiveSpan(
  'chargePayment',
  async (paymentSpan) => {
    try {
      paymentSpan.setAttribute('payment.method', order.paymentMethod);
      paymentSpan.setAttribute('payment.amount_cents', order.totalCents);
      const chargeResult = await paymentGateway.charge(order);
      paymentSpan.setAttribute('payment.transaction_id', chargeResult.txId);
      return chargeResult;
    } finally {
      // End the span even when charge() throws, so it is never leaked
      paymentSpan.end();
    }
  }
);
// Record metrics with relevant attributes
orderCounter.add(1, { 'customer.tier': order.customerTier });
orderValueHistogram.record(order.totalCents, {
'payment.method': order.paymentMethod,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
With this setup, every HTTP request automatically generates a trace with spans for the request handler, database queries, and your custom business logic — all correlated by trace_id.
Instrumenting a Python Application
Python has equally mature OTel support. Here is a complete setup for FastAPI.
Setup
pip install opentelemetry-api \
opentelemetry-sdk \
opentelemetry-exporter-otlp-proto-grpc \
opentelemetry-instrumentation-fastapi \
opentelemetry-instrumentation-httpx \
opentelemetry-instrumentation-sqlalchemy \
opentelemetry-instrumentation-redis
# otel_setup.py
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
def setup_otel(app):
"""Initialize OpenTelemetry with traces and metrics for FastAPI."""
resource = Resource.create({
SERVICE_NAME: "user-service",
SERVICE_VERSION: "1.8.0",
"deployment.environment": os.getenv("ENV", "development"),
})
# Traces
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
BatchSpanProcessor(
OTLPSpanExporter(
endpoint=os.getenv(
"OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"
),
insecure=True,
),
max_queue_size=2048,
max_export_batch_size=512,
schedule_delay_millis=5000,
)
)
trace.set_tracer_provider(trace_provider)
# Metrics
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(
endpoint=os.getenv(
"OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"
),
insecure=True,
),
export_interval_millis=15000,
)
metrics.set_meter_provider(
MeterProvider(resource=resource, metric_readers=[metric_reader])
)
# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app, excluded_urls="healthz,readyz")
SQLAlchemyInstrumentor().instrument()
RedisInstrumentor().instrument()
# main.py
from fastapi import FastAPI
from opentelemetry import trace, metrics
from otel_setup import setup_otel
app = FastAPI()
setup_otel(app)
tracer = trace.get_tracer("user-service")
meter = metrics.get_meter("user-service")
signup_counter = meter.create_counter(
"users.signups",
description="Total user signups",
)
auth_latency = meter.create_histogram(
"auth.latency_ms",
description="Authentication latency in milliseconds",
unit="ms",
)
@app.post("/users")
async def create_user(payload: UserCreate):
with tracer.start_as_current_span("validate_user_input") as span:
span.set_attribute("user.email_domain", payload.email.split("@")[1])
validated = validate(payload)
with tracer.start_as_current_span("persist_user") as span:
user = await db.create_user(validated)
span.set_attribute("user.id", str(user.id))
# Span events are lightweight annotations within a span
span.add_event("user_persisted", {"user.id": str(user.id)})
signup_counter.add(1, {"plan": payload.plan, "source": payload.referral_source})
return {"id": user.id, "status": "created"}
Every incoming FastAPI request automatically gets a root span. Database queries via SQLAlchemy and Redis calls are captured as child spans. The custom spans for validation and persistence give you visibility into your business logic — the part that auto-instrumentation cannot cover.
The OpenTelemetry Collector
The Collector is where the real power of OTel lives. It receives telemetry from your applications, processes it (batching, filtering, sampling, enriching), and exports it to one or more backends. This is the central nervous system of your observability stack.
Here is a production-grade Collector configuration:
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
max_recv_msg_size_mib: 16
http:
endpoint: 0.0.0.0:4318
# Scrape Prometheus metrics from services that already expose /metrics
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Rewrite the scrape address to use the annotated port
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
# Collect host-level metrics from the node
hostmetrics:
collection_interval: 30s
scrapers:
cpu:
memory:
disk:
network:
filesystem:
processors:
# Batch telemetry to reduce export overhead
batch:
send_batch_size: 8192
send_batch_max_size: 16384
timeout: 5s
# Add Kubernetes metadata to all telemetry
k8sattributes:
extract:
metadata:
- k8s.namespace.name
- k8s.deployment.name
- k8s.pod.name
- k8s.node.name
pod_association:
- sources:
- from: resource_attribute
name: k8s.pod.ip
# Memory limiter — MUST be the first processor in every pipeline
memory_limiter:
check_interval: 5s
limit_mib: 1536
spike_limit_mib: 384
# Filter out noisy health check spans before they reach sampling
filter:
error_mode: ignore
traces:
span:
- 'attributes["http.route"] == "/healthz"'
- 'attributes["http.route"] == "/readyz"'
- 'attributes["http.route"] == "/metrics"'
metrics:
metric:
- 'name == "rpc.server.duration" and resource.attributes["service.name"] == "debug-svc"'
# Tail-based sampling — only on the gateway collector
tail_sampling:
decision_wait: 10s
num_traces: 100000
expected_new_traces_per_sec: 500
policies:
# Always keep error traces — never miss a failure
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
# Always keep slow traces (> 2 seconds)
- name: slow-traces
type: latency
latency:
threshold_ms: 2000
# Sample 10% of normal, successful traces
- name: probabilistic-sample
type: probabilistic
probabilistic:
sampling_percentage: 10
# Transform attributes to prevent cardinality explosions
transform:
trace_statements:
- context: span
statements:
- truncate_all(attributes, 256)
- limit(attributes, 64)
exporters:
# Traces to Grafana Tempo via OTLP
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
# Metrics to Prometheus via remote write
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
resource_to_telemetry_conversion:
enabled: true
# Logs to Grafana Loki
loki:
endpoint: http://loki:3100/loki/api/v1/push
default_labels_enabled:
exporter: false
job: true
# Debug exporter for development — remove in production
debug:
verbosity: basic
sampling_initial: 5
sampling_thereafter: 200
extensions:
health_check:
endpoint: 0.0.0.0:13133
pprof:
endpoint: 0.0.0.0:1777
zpages:
endpoint: 0.0.0.0:55679
service:
extensions: [health_check, pprof, zpages]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, filter, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp, prometheus, hostmetrics]
processors: [memory_limiter, k8sattributes, batch]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, batch]
exporters: [loki]
Key design decisions in this config:
- memory_limiter is the first processor in every pipeline. If the collector runs out of memory, it drops data instead of crashing. This is non-negotiable for production.
- filter removes health check noise before it hits the sampling decision. No point in sampling health checks.
- tail_sampling runs on the gateway — it needs to see all spans of a trace before deciding. It always keeps errors and slow traces and probabilistically samples the rest.
- transform prevents cardinality explosions by truncating attribute values and limiting attribute count per span.
- prometheusremotewrite with resource_to_telemetry_conversion turns OTel resource attributes into Prometheus labels, preserving service identity.
Deploying in Kubernetes
The OpenTelemetry Operator is the recommended way to deploy OTel in Kubernetes. It manages collectors and can auto-inject instrumentation into pods.
DaemonSet Collector (Agent Mode)
Deploy an agent on every node using a DaemonSet:
# otel-agent-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-agent
namespace: observability
labels:
app.kubernetes.io/name: otel-agent
app.kubernetes.io/component: telemetry-collector
spec:
selector:
matchLabels:
app: otel-agent
template:
metadata:
labels:
app: otel-agent
spec:
serviceAccountName: otel-agent
containers:
- name: otel-agent
image: otel/opentelemetry-collector-contrib:0.102.0
args: ["--config=/etc/otel/config.yaml"]
ports:
- containerPort: 4317 # OTLP gRPC
hostPort: 4317
protocol: TCP
- containerPort: 4318 # OTLP HTTP
hostPort: 4318
protocol: TCP
- containerPort: 13133 # Health check
protocol: TCP
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 15
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 5
periodSeconds: 5
volumeMounts:
- name: config
mountPath: /etc/otel
volumes:
- name: config
configMap:
name: otel-agent-config
tolerations:
- operator: Exists # Run on all nodes including tainted ones
Helm Chart Deployment
For most teams, the OpenTelemetry Collector Helm chart is the simplest path to production:
# values-otel-collector.yaml
mode: daemonset
image:
repository: otel/opentelemetry-collector-contrib
tag: "0.102.0"
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1
memory: 1Gi
presets:
kubernetesAttributes:
enabled: true
hostMetrics:
enabled: true
kubeletMetrics:
enabled: true
config:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
memory_limiter:
check_interval: 5s
limit_mib: 768
spike_limit_mib: 192
batch:
send_batch_size: 4096
timeout: 5s
exporters:
otlp/gateway:
endpoint: otel-gateway.observability.svc:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/gateway]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/gateway]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/gateway]
ports:
otlp:
enabled: true
containerPort: 4317
hostPort: 4317
protocol: TCP
otlp-http:
enabled: true
containerPort: 4318
hostPort: 4318
protocol: TCP
serviceAccount:
create: true
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-agent open-telemetry/opentelemetry-collector \
-f values-otel-collector.yaml \
-n observability --create-namespace
Resource considerations: Start with 256Mi per agent. Monitor actual usage for a week before setting final limits. Agents with many instrumentation libraries or high-cardinality metrics will need more. The gateway typically needs 2-4x the memory of agents because tail sampling holds traces in memory until the decision window expires. For 500 traces/sec with a 10-second window, budget at least 2Gi for the gateway.
Connecting to the Grafana Stack
The Grafana stack is the most common open-source backend for OpenTelemetry and the natural home for your traces, metrics, and logs:
- Tempo receives traces via OTLP
- Prometheus receives metrics via remote write
- Loki receives logs via the Loki exporter
The real power is correlation. When OTel injects trace_id into your structured logs, Grafana can link directly from a Tempo trace view to the corresponding Loki log lines, and from Prometheus metric alerts to the traces that caused them.
Configure data source correlations in Grafana:
- In the Tempo data source settings, add a "Trace to logs" correlation pointing to Loki with the filter {service_name="${__span.tags.service.name}"} | trace_id="${__span.traceId}"
- In the Loki data source settings, add a "Derived field" that extracts trace_id from log lines using the regex trace_id=(\w+) and links to Tempo
- In Prometheus, enable exemplar support. When recording rules fire, exemplars carry the trace_id that triggered them — click through directly to the trace
This gives your on-call engineers a single pane: see the alert (Prometheus), jump to the trace (Tempo), read the logs (Loki). No context switching, no guessing, no grep across five terminal windows. That is OTel observability at its best.
A practical example: At 3 AM, PagerDuty fires because http_request_duration_seconds P99 exceeded 5 seconds. The Prometheus alert links to a Grafana panel. The panel shows exemplars — individual request data points that exceeded the threshold. Clicking an exemplar opens the full trace in Tempo: you see the root span took 6.2 seconds, the inventory-service span took 5.8 seconds, and within that, a PostgreSQL query took 5.6 seconds. Click the span's logs tab: Loki shows slow query: SELECT * FROM inventory WHERE sku IN (...) with 15,000 SKUs in the IN clause. Root cause identified in under two minutes — no code grep, no SSH, no guesswork.
Sampling Strategies
At scale, you cannot store 100% of your traces. A service handling 10,000 requests per second generates terabytes of trace data per day. Sampling is how you keep costs manageable while preserving visibility into the data that matters.
Head-Based Sampling
The decision happens at the start of the trace, before any spans are generated.
TraceIdRatioBased(0.1) → keep 10% of traces, decided at root span
Pros: Zero overhead on rejected traces — they never generate spans. Simple to configure. No buffering requirements.
Cons: Blind. You are equally likely to drop a 5-second error trace as a 10ms success trace. Critical failures might never be sampled.
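Head sampling is usually a deterministic function of the trace ID, so every service in the path reaches the same keep/drop decision without coordination. A sketch of the idea (not the SDK's exact algorithm):

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace iff the low 64 bits of its ID fall below ratio * 2^64.
    Deterministic: every service computes the same answer for the same ID."""
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound

tid = "4bf92f3577b34da6a3ce929d0e0e4736"
assert head_sample(tid, 1.0) is True    # ratio 1.0 keeps everything
assert head_sample(tid, 0.0) is False   # ratio 0.0 drops everything
# The decision is a pure function of the ID: no cross-service coordination.
assert head_sample(tid, 0.1) == head_sample(tid, 0.1)
```

Notice that nothing about the request's outcome enters the function, which is exactly why head sampling is blind to errors and latency.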
Tail-Based Sampling
The decision happens after the trace is complete, so you can inspect duration, status, and attributes.
This is the tail_sampling processor in the Collector config above. It keeps all errors, all slow traces, and samples 10% of the rest. This is the correct approach for production because you never lose visibility into failures.
Cons: The gateway must buffer all spans until the decision window expires (the decision_wait parameter). This requires memory proportional to your throughput. For 500 new traces per second with a 10-second window, that is 5,000 traces held in memory simultaneously — each potentially containing dozens of spans.
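That buffer is straightforward to size on the back of an envelope. The spans-per-trace and bytes-per-span figures below are illustrative assumptions; measure your own workload:

```python
def tail_sampling_buffer_mib(traces_per_sec, decision_wait_s,
                             spans_per_trace=30, bytes_per_span=1500):
    """Rough memory held by the tail_sampling decision buffer, in MiB.
    spans_per_trace and bytes_per_span are workload-dependent guesses."""
    buffered_traces = traces_per_sec * decision_wait_s
    return buffered_traces * spans_per_trace * bytes_per_span / (1024 * 1024)

# 500 traces/sec with a 10 s decision window -> 5,000 traces buffered
mib = tail_sampling_buffer_mib(500, 10)
print(f"~{mib:.0f} MiB of raw span data in flight")
```

That is raw payload only; runtime overhead, queues, and headroom on top of it are why the 2Gi gateway budget mentioned earlier is not as generous as it sounds.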
Rate-Limiting
Cap the number of traces per second regardless of other sampling decisions:
In the Collector this is a tail_sampling policy type, not a standalone processor:

processors:
  tail_sampling:
    policies:
      - name: rate-limit
        type: rate_limiting
        rate_limiting:
          spans_per_second: 5000   # hard cap on sampled throughput
Use rate-limiting as a safety valve on top of other strategies. It prevents runaway costs if a service suddenly emits 100x normal traffic during a retry storm.
The practical approach: Use head-based sampling in development (sample everything or nothing). Use tail-based sampling in production with error- and latency-aware policies. Layer rate-limiting on top as a circuit breaker. That is the mature production pattern, and it is what we run.
Production Gotchas
These are the issues that bite you after the initial deployment looks fine. Every one of these has cost us hours of debugging.
Cardinality Explosions
Every unique combination of metric labels creates a new time series. If you add user_id as a metric attribute, you create a time series per user. With 100,000 users, that is 100,000 time series for a single metric. Prometheus will OOM, Cortex/Mimir will reject the writes, and your observability stack becomes the outage.
Rule: Never use unbounded values (user IDs, request IDs, email addresses, UUIDs) as metric attributes. Use bounded categories: customer_tier (free/pro/enterprise), region (us-east/eu-west), status_code (200/400/500).
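The damage is multiplicative: a metric's series count is the product of the cardinalities of its labels. A quick check, with illustrative numbers:

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Worst-case time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values()) if label_cardinalities else 1

bounded = {"customer_tier": 3, "region": 4, "status_code": 5}
unbounded = {**bounded, "user_id": 100_000}   # one unbounded label

assert series_count(bounded) == 60            # 3 * 4 * 5: harmless
assert series_count(unbounded) == 6_000_000   # a 100,000x blowup
```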
The transform processor in the Collector config limits attribute count per span and truncates values — this is your last line of defense against cardinality bombs in traces.
Collector Resource Limits
The Collector can consume significant resources, especially with tail sampling enabled. Always set memory_limiter as the first processor in every pipeline. Without it, a traffic spike will OOM-kill your collector, and you lose all buffered telemetry.
Sizing guideline: For every 1,000 spans per second, allocate roughly 256Mi of memory to the gateway collector if using tail sampling with a 10-second decision window. Monitor otelcol_processor_tail_sampling_count_traces_sampled to verify your policies are working as expected.
SDK Overhead
Auto-instrumentation adds latency. For most applications, it is 1-3% overhead — acceptable. But if you instrument every filesystem operation, every DNS lookup, every timer tick, the overhead compounds quickly.
Disable what you do not need: In the Node.js example above, we disabled @opentelemetry/instrumentation-fs because file system spans generate noise without actionable insight. Review your auto-instrumentation config and explicitly disable noisy instrumentations.
Missing Context Propagation
The most common distributed tracing failure: traces that stop at a service boundary. This happens when an intermediate service (a proxy, a queue consumer, a scheduled job) does not propagate the W3C traceparent header.
Every HTTP client, message producer, and background worker must propagate context. OTel SDKs handle this automatically for instrumented HTTP clients, but custom transports, message queues (Kafka, RabbitMQ, SQS), and cron jobs need manual propagation:
from opentelemetry.propagate import inject
# Inject trace context into outgoing message headers
headers = {}
inject(headers)
queue.publish(message, headers=headers)
If you see traces that end abruptly at a service, check whether that service is forwarding the traceparent and tracestate headers on its outbound calls.
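When debugging, it helps to read the header by hand. The traceparent value is four dash-separated hex fields: version, trace-id, parent span-id, and flags. Here is a small parser (a sketch, not a full W3C validator):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields.
    Format: version(2 hex)-trace_id(32 hex)-span_id(16 hex)-flags(2 hex)."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16, "malformed traceparent"
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": (int(flags, 16) & 0x01) == 1,  # low bit is the sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
assert ctx["trace_id"] == "4bf92f3577b34da6a3ce929d0e0e4736"
assert ctx["sampled"] is True
```

If the flags byte ends in 0, an upstream service decided not to sample, and downstream services will usually honor that decision.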
Migration Path: From Legacy to OTel
You do not have to rip and replace. OTel is designed for gradual adoption. Dual-shipping is the key pattern.
From Prometheus
Keep Prometheus running. The OTel Collector's prometheus receiver scrapes your existing /metrics endpoints. The prometheusremotewrite exporter writes back to Prometheus. You can run both OTel metrics and Prometheus scraping simultaneously.
Step 1: Deploy the Collector with the prometheus receiver pointed at your existing scrape targets.
Step 2: Gradually add OTel SDK instrumentation to services, emitting metrics via OTLP.
Step 3: Once a service is fully instrumented with OTel, remove its Prometheus scrape config. The service now emits metrics through OTLP, the Collector writes to Prometheus. Same backend, better instrumentation.
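A minimal Collector fragment for that transition period might look like this (the scrape target and Prometheus endpoint are placeholders for your environment):

```yaml
receivers:
  otlp:                     # new path: OTel-instrumented services
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:               # old path: existing /metrics endpoints
    config:
      scrape_configs:
        - job_name: legacy-services
          static_configs:
            - targets: ["legacy-svc:9100"]   # placeholder target

processors:
  batch:
    timeout: 5s

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]   # both paths feed one backend
      processors: [batch]
      exporters: [prometheusremotewrite]
```

As services complete Step 3, their scrape targets simply drop out of the prometheus receiver config.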
From Jaeger/Zipkin
The Collector has native jaeger and zipkin receivers. Point your existing Jaeger clients at the Collector instead of the Jaeger backend. The Collector exports to Tempo (or any OTLP backend). Migrate services to the OTel SDK one at a time. Old services and new services produce traces that correlate seamlessly because the wire protocol is compatible.
From the ELK Stack
The hardest migration because OTel's log signal is the youngest of the three. The practical approach:
Step 1: Keep Elasticsearch/Logstash running for existing logs.
Step 2: Add OTel SDK to new services for traces and metrics first.
Step 3: Configure OTel log bridge to inject trace IDs into structured logs.
Step 4: Ship logs to both Loki (via the Collector) and Elasticsearch (via existing Logstash) simultaneously.
Step 5: Once teams are comfortable with Loki + Tempo correlation and have migrated their dashboards, decommission the ELK stack.
Dual-shipping is the safety net across all migrations. Send telemetry to both old and new backends during the transition. Compare data to validate completeness. Cut over only when you have full confidence. This incremental approach is what makes adopting a unified telemetry standard feasible even in legacy-heavy environments.
Conclusion
OpenTelemetry is not another monitoring tool. It is the standard that every monitoring tool now speaks. By instrumenting with OTel, you decouple your application code from your observability backend — permanently. Switch from Prometheus to Thanos, from Jaeger to Tempo, from Datadog to Grafana Cloud — your application code does not change. Only the Collector config does.
Whether you are greenfield or migrating from legacy tools, the path to production-grade observability with OTel follows the same steps:
- Add the OTel SDK to one service (Node.js or Python, using the examples above)
- Deploy the Collector with the OTLP receiver and your backend exporters
- Get traces flowing into Tempo and metrics into Prometheus
- Add tail-based sampling when volume grows beyond what you can afford to store
- Correlate logs with trace IDs for the full three-signal picture
- Migrate incrementally — dual-ship from legacy tools, cut over when confident
A unified tracing, metrics, and logs pipeline takes a day to set up and pays dividends for years. Every production incident you investigate with correlated telemetry instead of grep and guesswork is time saved and downtime reduced.
At TechSaaS, we build and operate production observability stacks — from initial OTel instrumentation to Grafana dashboards to on-call runbooks. Whether you need a production OTel rollout tailored to your stack or hands-on help migrating from legacy monitoring, get in touch. We have done this across dozens of services and can help you skip the pitfalls.