OpenTelemetry in Production: The Definitive Guide to Modern Observability

Production guide to OpenTelemetry: instrument your apps, collect traces, metrics, and logs with a single standard. Real examples and deployment patterns.

Yash Pritwani
12 min read

The observability wars are over. OpenTelemetry won.

For years, every monitoring vendor shipped its own agent, its own SDK, its own proprietary format. You picked Datadog, you got the Datadog agent. You picked New Relic, you got the New Relic SDK. Switching vendors meant re-instrumenting your entire stack. That era is ending.

OpenTelemetry (OTel) is the CNCF's vendor-neutral observability framework — a single set of APIs, SDKs, and tools that generate, collect, and export telemetry data. It graduated from CNCF incubation in 2024, and as of 2026, it is the second most active CNCF project after Kubernetes. Every major observability vendor — Datadog, Grafana Labs, Honeycomb, Splunk, Dynatrace — now supports OTel natively. The specification covers all three pillars: traces, metrics, and logs under one unified standard.

This is a production guide to OpenTelemetry. Not theory: real instrumentation code, real collector configs, and real deployment patterns we use to monitor 80+ services running on our infrastructure at TechSaaS.

The Three Signals of Observability

OpenTelemetry unifies three telemetry signals that were historically handled by separate tools with separate data models. Understanding how they connect is the foundation of effective OTel observability, and the reason OTel has replaced bespoke monitoring stacks at companies ranging from startups to Fortune 500s.

Traces (Distributed Request Paths)

A trace follows a single request as it moves through your distributed system. Each unit of work is a span — a named, timed operation with metadata. Spans form a tree: the root span represents the entire request, and child spans represent downstream calls.

When a user hits your API gateway, the trace captures the gateway processing (50ms), the auth service call (12ms), the database query (8ms), and the cache miss that triggered a slow path (200ms). You see exactly where time was spent and which service caused the bottleneck. Each span carries a trace_id that ties the entire tree together, even across process and network boundaries.
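To make the span tree concrete, here is a toy Python model of the example above. This is not the OTel SDK, just the data shape: every span in the tree shares one trace_id, and finding the bottleneck is a walk over the children.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span: a named, timed operation. Not the OTel SDK type."""
    name: str
    duration_ms: float
    trace_id: str                     # shared by every span in the trace
    children: list = field(default_factory=list)

trace_id = uuid.uuid4().hex           # one id ties the whole tree together
root = Span("api-gateway", 270.0, trace_id, children=[
    Span("auth-service", 12.0, trace_id),
    Span("db-query", 8.0, trace_id),
    Span("cache-miss-slow-path", 200.0, trace_id),
])

bottleneck = max(root.children, key=lambda s: s.duration_ms)
print(bottleneck.name)  # cache-miss-slow-path
```

The real SDK adds timestamps, parent span ids, and attributes, but the tree-of-timed-work structure is exactly this.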

Metrics (Counters, Gauges, Histograms)

Metrics are aggregated numerical measurements. OTel defines three core instrument types:

  • Counters — monotonically increasing values (total requests served, total errors encountered)
  • Gauges — point-in-time values (current memory usage, active connections, queue depth)
  • Histograms — distribution of values (request duration percentiles, payload sizes, batch processing times)

Unlike traces, metrics are pre-aggregated. They are cheap to store and fast to query, making them ideal for dashboards and alerting. When your P99 latency spikes, a metric tells you when — a trace tells you why.
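A sketch of what two of the instrument types mean semantically (a toy model, not the OTel metrics API): counters only ever go up, and histograms keep enough information to answer percentile questions at query time.

```python
import math

class Counter:
    """Monotonic counter: additive only, never decreases."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        assert n >= 0, "counters are monotonic"
        self.value += n

class Histogram:
    """Records raw values; percentiles are computed on read."""
    def __init__(self):
        self.samples = []
    def record(self, v):
        self.samples.append(v)
    def percentile(self, p):
        xs = sorted(self.samples)
        return xs[max(math.ceil(p / 100 * len(xs)) - 1, 0)]

requests = Counter()
latency_ms = Histogram()
for d in [12, 8, 15, 220, 9, 11, 14, 10, 13, 2100]:  # one slow outlier
    requests.add(1)
    latency_ms.record(d)

print(requests.value, latency_ms.percentile(50), latency_ms.percentile(99))
```

Note how the P50 stays at 12 ms while the P99 exposes the 2.1-second outlier; that is why alerting on high percentiles catches problems that averages hide.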

Logs (Structured and Correlated)

OTel's log signal bridges your existing logging infrastructure into the observability stack. The key differentiator: OTel logs carry trace context. When a log line includes a trace_id and span_id, you can jump from a log entry directly to the distributed trace that generated it — and back.

How the three signals connect: A Grafana dashboard shows a spike in error rate (metric). You click through to see which traces had errors (trace). You open a failing trace, find the broken span, and jump to the exact log lines from that span (log). Three signals, one correlated view. That is what a unified pipeline for traces, metrics, and logs delivers and siloed tools cannot.
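The correlation workflow above boils down to joins on trace_id and span_id. A toy sketch with hypothetical records:

```python
# Toy records with hypothetical values: a failing trace and its logs,
# joined by the ids that OTel stamps on both signals.
traces = [
    {"trace_id": "abc123", "status": "ERROR", "slowest_span": "s2"},
    {"trace_id": "def456", "status": "OK",    "slowest_span": "s9"},
]
logs = [
    {"trace_id": "abc123", "span_id": "s2", "msg": "payment declined"},
    {"trace_id": "def456", "span_id": "s9", "msg": "order created"},
]

failing = next(t for t in traces if t["status"] == "ERROR")
evidence = [l["msg"] for l in logs
            if l["trace_id"] == failing["trace_id"]
            and l["span_id"] == failing["slowest_span"]]
print(evidence)  # ['payment declined']
```

Grafana performs exactly this join for you when the ids are present; without them, you are back to grepping timestamps.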

Architecture Overview

Every OTel deployment follows the same data flow:

┌──────────────┐    ┌───────────────┐    ┌──────────────────────────┐    ┌───────────────┐
│  Application │───>│   OTel SDK    │───>│      OTel Collector      │───>│    Backend    │
│  (your code) │    │  + Exporter   │    │                          │    │               │
│              │    │               │    │  receivers ─> processors │    │  Tempo        │
│              │    │  OTLP gRPC    │    │     ─> exporters         │    │  Prometheus   │
│              │    │  or HTTP      │    │                          │    │  Loki         │
└──────────────┘    └───────────────┘    └──────────────────────────┘    └───────────────┘

SDK: Instruments your code. Generates spans, records metrics, bridges logs. Exports via OTLP (OpenTelemetry Protocol) over gRPC or HTTP.

Collector: A standalone binary that receives, processes, and exports telemetry. This is the workhorse of any OTel deployment. It decouples your applications from backends — your apps send OTLP to the collector, and the collector routes to whatever backends you use.

Two deployment modes:

  • Agent mode: Collector runs as a sidecar or DaemonSet on every node. Low latency, local processing. Each agent handles telemetry from the pods on its node.
  • Gateway mode: Centralized collector cluster that all agents forward to. Enables cross-service tail sampling, centralized filtering, and routing decisions that require visibility across multiple services.

Most production deployments use both: agents on every node for collection, a gateway cluster for aggregation and sampling. This two-tier architecture is the recommended pattern in the official OTel documentation and what serious production deployments converge on.

Instrumenting a Node.js Application

Auto-instrumentation is the fastest path to distributed tracing in Node.js. OTel provides instrumentation packages for Express, Fastify, HTTP, gRPC, PostgreSQL, Redis, and dozens more.

Setup

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/sdk-metrics

Create tracing.ts — this file must be loaded before your application code:

// tracing.ts — load BEFORE app code via --require or import
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
  ATTR_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '2.4.1',
    [ATTR_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Filter out noisy health check spans (the old ignoreIncomingPaths
      // option was removed; use the request hook instead)
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingRequestHook: (req) =>
          ['/healthz', '/readyz'].includes(req.url ?? ''),
      },
      // Disable filesystem instrumentation — too noisy, no actionable insight
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

// Ensure spans are flushed before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => process.exit(0))
    .catch(() => process.exit(1));
});

Run your application with the instrumentation loaded first:

node --require ./tracing.js ./dist/server.js

Custom Spans and Metrics

Auto-instrumentation covers HTTP and database calls. For business logic, create custom spans:

import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');
const meter = metrics.getMeter('order-service');

// Custom metrics
const orderCounter = meter.createCounter('orders.created', {
  description: 'Total orders created',
});
const orderValueHistogram = meter.createHistogram('orders.value', {
  description: 'Order value distribution in cents',
  unit: 'cents',
});

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.item_count', order.items.length);
      span.setAttribute('order.customer_tier', order.customerTier);

      // Nested span for payment processing
      const result = await tracer.startActiveSpan(
        'chargePayment',
        async (paymentSpan) => {
          try {
            paymentSpan.setAttribute('payment.method', order.paymentMethod);
            paymentSpan.setAttribute('payment.amount_cents', order.totalCents);

            const chargeResult = await paymentGateway.charge(order);

            paymentSpan.setAttribute('payment.transaction_id', chargeResult.txId);
            return chargeResult;
          } finally {
            // End the payment span even if charge() throws
            paymentSpan.end();
          }
        }
      );

      // Record metrics with relevant attributes
      orderCounter.add(1, { 'customer.tier': order.customerTier });
      orderValueHistogram.record(order.totalCents, {
        'payment.method': order.paymentMethod,
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message });
      span.recordException(error instanceof Error ? error : new Error(message));
      throw error;
    } finally {
      span.end();
    }
  });
}

With this setup, every HTTP request automatically generates a trace with spans for the request handler, database queries, and your custom business logic — all correlated by trace_id.

Instrumenting a Python Application

Python has equally mature OTel support. Here is a complete setup for FastAPI.

Setup

pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy \
  opentelemetry-instrumentation-redis

# otel_setup.py
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor


def setup_otel(app):
    """Initialize OpenTelemetry with traces and metrics for FastAPI."""
    resource = Resource.create({
        SERVICE_NAME: "user-service",
        SERVICE_VERSION: "1.8.0",
        "deployment.environment": os.getenv("ENV", "development"),
    })

    # Traces
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint=os.getenv(
                    "OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"
                ),
                insecure=True,
            ),
            max_queue_size=2048,
            max_export_batch_size=512,
            schedule_delay_millis=5000,
        )
    )
    trace.set_tracer_provider(trace_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(
            endpoint=os.getenv(
                "OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"
            ),
            insecure=True,
        ),
        export_interval_millis=15000,
    )
    metrics.set_meter_provider(
        MeterProvider(resource=resource, metric_readers=[metric_reader])
    )

    # Auto-instrument frameworks
    FastAPIInstrumentor.instrument_app(app, excluded_urls="healthz,readyz")
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()

# main.py
from fastapi import FastAPI
from opentelemetry import trace, metrics
from otel_setup import setup_otel

app = FastAPI()
setup_otel(app)

tracer = trace.get_tracer("user-service")
meter = metrics.get_meter("user-service")

signup_counter = meter.create_counter(
    "users.signups",
    description="Total user signups",
)
auth_latency = meter.create_histogram(
    "auth.latency_ms",
    description="Authentication latency in milliseconds",
    unit="ms",
)


@app.post("/users")
async def create_user(payload: UserCreate):
    with tracer.start_as_current_span("validate_user_input") as span:
        span.set_attribute("user.email_domain", payload.email.split("@")[1])
        validated = validate(payload)

    with tracer.start_as_current_span("persist_user") as span:
        user = await db.create_user(validated)
        span.set_attribute("user.id", str(user.id))
        # Span events are lightweight annotations within a span
        span.add_event("user_persisted", {"user.id": str(user.id)})

    signup_counter.add(1, {"plan": payload.plan, "source": payload.referral_source})
    return {"id": user.id, "status": "created"}

Every incoming FastAPI request automatically gets a root span. Database queries via SQLAlchemy and Redis calls are captured as child spans. The custom spans for validation and persistence give you visibility into your business logic — the part that auto-instrumentation cannot cover.

The OpenTelemetry Collector

The Collector is where the real power of OTel lives. It receives telemetry from your applications, processes it (batching, filtering, sampling, enriching), and exports it to one or more backends. This is the central nervous system of your observability stack.

Here is a production-grade Collector configuration:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus metrics from services that already expose /metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels:
                [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels:
                [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2

  # Collect host-level metrics from the node
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:
      filesystem:

processors:
  # Batch telemetry to reduce export overhead
  batch:
    send_batch_size: 8192
    send_batch_max_size: 16384
    timeout: 5s

  # Add Kubernetes metadata to all telemetry
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip

  # Memory limiter — MUST be the first processor in every pipeline
  memory_limiter:
    check_interval: 5s
    limit_mib: 1536
    spike_limit_mib: 384

  # Filter out noisy health check spans before they reach sampling
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
        - 'attributes["http.route"] == "/metrics"'
    metrics:
      metric:
        - 'name == "rpc.server.duration" and resource.attributes["service.name"] == "debug-svc"'

  # Tail-based sampling — only on the gateway collector
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 500
    policies:
      # Always keep error traces — never miss a failure
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (> 2 seconds)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      # Sample 10% of normal, successful traces
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Transform attributes to prevent cardinality explosions
  transform:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 256)
          - limit(attributes, 64)

exporters:
  # Traces to Grafana Tempo via OTLP
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Metrics to Prometheus via remote write
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true

  # Logs to Grafana Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true

  # Debug exporter for development — remove in production
  debug:
    verbosity: basic
    sampling_initial: 5
    sampling_thereafter: 200

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, filter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [loki]

Key design decisions in this config:

  • memory_limiter is the first processor in every pipeline. If the collector runs out of memory, it drops data instead of crashing. This is non-negotiable for production.
  • filter removes health check noise before it hits the sampling decision. No point in sampling health checks.
  • tail_sampling runs on the gateway — it needs to see all spans of a trace before deciding. Always keeps errors and slow traces; probabilistically samples the rest.
  • transform prevents cardinality explosions by truncating attribute values and limiting attribute count per span.
  • prometheusremotewrite with resource_to_telemetry_conversion turns OTel resource attributes into Prometheus labels, preserving service identity.

Deploying in Kubernetes

The OpenTelemetry Operator is the recommended way to deploy OTel in Kubernetes. It manages collectors and can auto-inject instrumentation into pods.

DaemonSet Collector (Agent Mode)

Deploy an agent on every node using a DaemonSet:

# otel-agent-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
  labels:
    app.kubernetes.io/name: otel-agent
    app.kubernetes.io/component: telemetry-collector
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      serviceAccountName: otel-agent
      containers:
        - name: otel-agent
          image: otel/opentelemetry-collector-contrib:0.102.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317  # OTLP gRPC
              hostPort: 4317
              protocol: TCP
            - containerPort: 4318  # OTLP HTTP
              hostPort: 4318
              protocol: TCP
            - containerPort: 13133 # Health check
              protocol: TCP
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /
              port: 13133
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 13133
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-agent-config
      tolerations:
        - operator: Exists  # Run on all nodes including tainted ones

Helm Chart Deployment

For most teams, the OpenTelemetry Collector Helm chart is the simplest path to production:

# values-otel-collector.yaml
mode: daemonset

image:
  repository: otel/opentelemetry-collector-contrib
  tag: "0.102.0"

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1
    memory: 1Gi

presets:
  kubernetesAttributes:
    enabled: true
  hostMetrics:
    enabled: true
  kubeletMetrics:
    enabled: true

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  processors:
    memory_limiter:
      check_interval: 5s
      limit_mib: 768
      spike_limit_mib: 192
    batch:
      send_batch_size: 4096
      timeout: 5s
  exporters:
    otlp/gateway:
      endpoint: otel-gateway.observability.svc:4317
      tls:
        insecure: true
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [otlp/gateway]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [otlp/gateway]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [otlp/gateway]

ports:
  otlp:
    enabled: true
    containerPort: 4317
    hostPort: 4317
    protocol: TCP
  otlp-http:
    enabled: true
    containerPort: 4318
    hostPort: 4318
    protocol: TCP

serviceAccount:
  create: true

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-agent open-telemetry/opentelemetry-collector \
  -f values-otel-collector.yaml \
  -n observability --create-namespace

Resource considerations: Start with 256Mi per agent. Monitor actual usage for a week before setting final limits. Agents with many instrumentation libraries or high-cardinality metrics will need more. The gateway typically needs 2-4x the memory of agents because tail sampling holds traces in memory until the decision window expires. For 500 traces/sec with a 10-second window, budget at least 2Gi for the gateway.

Connecting to the Grafana Stack

The Grafana stack is the most common open-source backend for OpenTelemetry and the natural home for your traces, metrics, and logs:

  • Tempo receives traces via OTLP
  • Prometheus receives metrics via remote write
  • Loki receives logs via the Loki exporter

The real power is correlation. When OTel injects trace_id into your structured logs, Grafana can link directly from a Tempo trace view to the corresponding Loki log lines, and from Prometheus metric alerts to the traces that caused them.

Configure data source correlations in Grafana:

  1. In Tempo data source settings, add a "Trace to logs" correlation pointing to Loki with a filter such as {service_name="${__span.tags.service.name}"} |= "${__span.traceId}"
  2. In Loki data source settings, add a "Derived field" that extracts trace_id from log lines using the regex trace_id=(\w+) and links to Tempo
  3. In Prometheus, enable exemplar support. When recording rules fire, exemplars carry the trace_id that triggered them — click through directly to the trace

This gives your on-call engineers a single pane: see the alert (Prometheus), jump to the trace (Tempo), read the logs (Loki). No context switching, no guessing, no grep across five terminal windows. This is OTel observability at its best.

A practical example: At 3 AM, PagerDuty fires because http_request_duration_seconds P99 exceeded 5 seconds. The Prometheus alert links to a Grafana panel. The panel shows exemplars — individual request data points that exceeded the threshold. Clicking an exemplar opens the full trace in Tempo: you see the root span took 6.2 seconds, the inventory-service span took 5.8 seconds, and within that, a PostgreSQL query took 5.6 seconds. Click the span's logs tab: Loki shows slow query: SELECT * FROM inventory WHERE sku IN (...) with 15,000 SKUs in the IN clause. Root cause identified in under two minutes — no code grep, no SSH, no guesswork.
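The derived-field extraction in step 2 above is just a regex over the log line. A quick sketch (the link template is hypothetical; Grafana builds the actual URL):

```python
import re

# Same pattern as the Loki derived field: trace_id=(\w+)
TRACE_ID_RE = re.compile(r"trace_id=(\w+)")

line = 'level=error msg="slow query" trace_id=4bf92f3577b34da6a3ce929d0e0e4736'
match = TRACE_ID_RE.search(line)
trace_id = match.group(1)
tempo_link = f"/explore?traceId={trace_id}"  # hypothetical link template
print(trace_id)
```

This is why structured (logfmt or JSON) logging matters: the extraction only works when the trace_id appears in a predictable key=value form.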

Sampling Strategies

At scale, you cannot store 100% of your traces. A service handling 10,000 requests per second generates terabytes of trace data per day. Sampling is how you keep costs manageable while preserving visibility into the data that matters.

Head-Based Sampling

The decision happens at the start of the trace, before any spans are generated.

TraceIdRatioBased(0.1)  →  keep 10% of traces, decided at root span

Pros: Zero overhead on rejected traces — they never generate spans. Simple to configure. No buffering requirements.

Cons: Blind. You are equally likely to drop a 5-second error trace as a 10ms success trace. Critical failures might never be sampled.
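The mechanics are simple enough to sketch: derive a uniform value from the trace id and compare it against the ratio, so every service in the trace makes the same decision without coordination. This mirrors the spirit of TraceIdRatioBased; real SDK samplers differ in detail.

```python
import random

def should_sample(trace_id_hex: str, ratio: float) -> bool:
    # Interpret the low 8 bytes of the trace id as an unsigned int and
    # compare to a threshold. Deterministic: same trace id, same answer
    # in every service, with no coordination needed.
    return int(trace_id_hex[-16:], 16) < ratio * 2**64

random.seed(42)
ids = [f"{random.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in ids)
print(f"kept {kept} of 10000")  # roughly 10%
```

Because the decision is a pure function of the trace id, a downstream service never needs to ask upstream whether the trace was kept.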

Tail-Based Sampling

The decision happens after the trace is complete, so you can inspect duration, status, and attributes.

This is the tail_sampling processor in the Collector config above. It keeps all errors, all slow traces, and samples 10% of the rest. This is the correct approach for production because you never lose visibility into failures.

Cons: The gateway must buffer all spans until the decision window expires (the decision_wait parameter). This requires memory proportional to your throughput. For 500 new traces per second with a 10-second window, that is 5,000 traces held in memory simultaneously — each potentially containing dozens of spans.
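Back-of-the-envelope sizing for that buffer (spans-per-trace and bytes-per-span are assumptions for illustration; measure your own workload):

```python
new_traces_per_sec = 500          # from the example above
decision_wait_s = 10              # tail_sampling decision window
avg_spans_per_trace = 30          # assumption for illustration
approx_bytes_per_span = 2_000     # rough assumption: ids, attributes, events

traces_in_flight = new_traces_per_sec * decision_wait_s   # held simultaneously
buffered_spans = traces_in_flight * avg_spans_per_trace
est_gib = buffered_spans * approx_bytes_per_span / 2**30

print(traces_in_flight, buffered_spans, round(est_gib, 2))
```

The raw estimate here lands well under the 2Gi budget mentioned later in this guide; the headroom covers runtime overhead, serialization copies, and traffic spikes.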

Rate-Limiting

Cap the volume of kept telemetry regardless of other sampling decisions. The tail_sampling processor supports a rate_limiting policy for exactly this:

processors:
  tail_sampling:
    policies:
      # Combine with the error/latency/probabilistic policies shown earlier
      - name: rate-cap
        type: rate_limiting
        rate_limiting:
          spans_per_second: 5000
Use rate-limiting as a safety valve on top of other strategies. It prevents runaway costs if a service suddenly emits 100x normal traffic during a retry storm.

The practical approach: Use head-based sampling in development (sample everything or nothing). Use tail-based sampling in production with error/latency-aware policies. Layer rate-limiting as a circuit breaker. This is what mature production setups converge on, and it is what we run.

Production Gotchas

These are the issues that bite you after the initial deployment looks fine. Every one of these has cost us hours of debugging.

Cardinality Explosions

Every unique combination of metric labels creates a new time series. If you add user_id as a metric attribute, you create a time series per user. With 100,000 users, that is 100,000 time series for a single metric. Prometheus will OOM, Cortex/Mimir will reject the writes, and your observability stack becomes the outage.

Rule: Never use unbounded values (user IDs, request IDs, email addresses, UUIDs) as metric attributes. Use bounded categories: customer_tier (free/pro/enterprise), region (us-east/eu-west), status_code (200/400/500).
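Why the rule matters is plain multiplication: the series count is the product of each label's cardinality.

```python
def series_count(label_cardinalities: dict) -> int:
    """Each unique combination of label values is its own time series."""
    n = 1
    for cardinality in label_cardinalities.values():
        n *= cardinality
    return n

# Unbounded label: one series per user, per status code
bad = series_count({"user_id": 100_000, "status_code": 3})
# Bounded categories only
good = series_count({"customer_tier": 3, "region": 4, "status_code": 5})
print(bad, good)  # 300000 60
```

Adding one more unbounded label multiplies again: user_id plus session_id does not add series, it squares them.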

The transform processor in the Collector config limits attribute count per span and truncates values — this is your last line of defense against cardinality bombs in traces.

Collector Resource Limits

The Collector can consume significant resources, especially with tail sampling enabled. Always set memory_limiter as the first processor in every pipeline. Without it, a traffic spike will OOM-kill your collector, and you lose all buffered telemetry.

Sizing guideline: For every 1,000 spans per second, allocate roughly 256Mi of memory to the gateway collector if using tail sampling with a 10-second decision window. Monitor otelcol_processor_tail_sampling_count_traces_sampled to verify your policies are working as expected.

SDK Overhead

Auto-instrumentation adds latency. For most applications, it is 1-3% overhead — acceptable. But if you instrument every filesystem operation, every DNS lookup, every timer tick, the overhead compounds quickly.

Disable what you do not need: In the Node.js example above, we disabled @opentelemetry/instrumentation-fs because file system spans generate noise without actionable insight. Review your auto-instrumentation config and explicitly disable noisy instrumentations.

Missing Context Propagation

The most common distributed tracing failure: traces that stop at a service boundary. This happens when an intermediate service (a proxy, a queue consumer, a scheduled job) does not propagate the W3C traceparent header.

Every HTTP client, message producer, and background worker must propagate context. OTel SDKs handle this automatically for instrumented HTTP clients, but custom transports, message queues (Kafka, RabbitMQ, SQS), and cron jobs need manual propagation:

from opentelemetry.propagate import inject

# Inject trace context into outgoing message headers
headers = {}
inject(headers)
queue.publish(message, headers=headers)

If you see traces that end abruptly at a service, check whether that service is forwarding the traceparent and tracestate headers on its outbound calls.
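The traceparent header itself is easy to inspect when debugging a broken trace. A pure-Python sketch of the W3C format (a version byte, 16-byte trace id, 8-byte parent span id, and flags, all lowercase hex); in real code, use opentelemetry.propagate.extract on the consumer side rather than parsing by hand:

```python
import re

# W3C traceparent: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def extract_context(headers: dict):
    """Consumer-side counterpart to inject(): pull trace context off a message."""
    m = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    return m.groupdict() if m else None

ctx = extract_context({
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
})
print(ctx["trace_id"])
```

If extract_context returns None for the messages a consumer receives, you have found the service boundary where the trace breaks.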

Migration Path: From Legacy to OTel

You do not have to rip and replace. OTel is designed for gradual adoption. Dual-shipping is the key pattern.

From Prometheus

Keep Prometheus running. The OTel Collector's prometheus receiver scrapes your existing /metrics endpoints. The prometheusremotewrite exporter writes back to Prometheus. You can run both OTel metrics and Prometheus scraping simultaneously.

Step 1: Deploy the Collector with the prometheus receiver pointed at your existing scrape targets.
Step 2: Gradually add OTel SDK instrumentation to services, emitting metrics via OTLP.
Step 3: Once a service is fully instrumented with OTel, remove its Prometheus scrape config. The service now emits metrics through OTLP, the Collector writes to Prometheus. Same backend, better instrumentation.

From Jaeger/Zipkin

The Collector has native jaeger and zipkin receivers. Point your existing Jaeger clients at the Collector instead of the Jaeger backend. The Collector exports to Tempo (or any OTLP backend). Migrate services to the OTel SDK one at a time. Old services and new services produce traces that correlate seamlessly because the wire protocol is compatible.

From the ELK Stack

The hardest migration because OTel's log signal is the youngest of the three. The practical approach:

Step 1: Keep Elasticsearch/Logstash running for existing logs.
Step 2: Add OTel SDK to new services for traces and metrics first.
Step 3: Configure OTel log bridge to inject trace IDs into structured logs.
Step 4: Ship logs to both Loki (via the Collector) and Elasticsearch (via existing Logstash) simultaneously.
Step 5: Once teams are comfortable with Loki + Tempo correlation and have migrated their dashboards, decommission the ELK stack.

Dual-shipping is the safety net across all migrations. Send telemetry to both old and new backends during the transition. Compare data to validate completeness. Cut over only when you have full confidence. This incremental approach is what makes adopting a unified telemetry standard feasible even in legacy-heavy environments.

Conclusion

OpenTelemetry is not another monitoring tool. It is the standard that every monitoring tool now speaks. By instrumenting with OTel, you decouple your application code from your observability backend — permanently. Switch from Prometheus to Thanos, from Jaeger to Tempo, from Datadog to Grafana Cloud — your application code does not change. Only the Collector config does.

Whether you are greenfield or migrating from legacy tools, the path to production-grade OTel observability follows the same steps:

  1. Add the OTel SDK to one service (Node.js or Python, using the examples above)
  2. Deploy the Collector with the OTLP receiver and your backend exporters
  3. Get traces flowing into Tempo and metrics into Prometheus
  4. Add tail-based sampling when volume grows beyond what you can afford to store
  5. Correlate logs with trace IDs for the full three-signal picture
  6. Migrate incrementally — dual-ship from legacy tools, cut over when confident

A unified pipeline for traces, metrics, and logs takes a day to set up and pays dividends for years. Every production incident you investigate with correlated telemetry instead of grep and guesswork is time saved and downtime reduced.


At TechSaaS, we build and operate production observability stacks, from initial OTel instrumentation to Grafana dashboards to on-call runbooks. Whether you need a production OpenTelemetry rollout tailored to your stack or hands-on help migrating from legacy monitoring, get in touch. We have done this across dozens of services and can help you skip the pitfalls.

#OpenTelemetry #Observability #Monitoring #DistributedTracing #Metrics #Logging #DevOps
