# MLflow vs DVC vs Weights & Biases: Model Versioning That Actually Works in Production

Here's a stat that should make every ML team uncomfortable: according to a 2025 Gartner survey, only 54% of ML models that pass evaluation ever make it to production. The number one blocker isn't model quality — it's operational: teams can't reproduce the training run, can't trace which data produced which model, and can't roll back when a deployed model starts drifting.

We've deployed model registries for teams ranging from 3-person startups to 50-person ML orgs. The tooling choice matters less than you think. What matters is the versioning discipline you build around it. But since you need to pick one, here's our honest comparison.

## The Core Problem: Model Lineage

Every production ML model needs an answer to five questions:

1. What code produced this model? (Git commit)
2. What data was it trained on? (Dataset version + hash)
3. What hyperparameters were used? (Config snapshot)
4. What metrics did it achieve? (Evaluation results)
5. What environment ran the training? (Dependencies, GPU, framework version)

If you can't answer all five for every model in production, you have a versioning problem. Let's look at how each tool solves it.
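
To make this concrete, here's a minimal sketch of capturing those answers at training time, independent of any tool. The `capture_lineage` helper and the file paths are hypothetical stand-ins, not part of MLflow, DVC, or W&B:

```python
import hashlib
import json
import platform
import subprocess
from pathlib import Path

def capture_lineage(dataset_path: str, config_path: str) -> dict:
    """Collect the five lineage facts for a training run (illustrative sketch)."""
    # 1. Code: the exact Git commit that produced the model
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    # 2. Data: a content hash of the training dataset
    data_hash = hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest()
    # 3. Hyperparameters: a snapshot of the config file
    config = Path(config_path).read_text()
    # 5. Environment: interpreter version and frozen dependencies
    deps = subprocess.check_output(["pip", "freeze"], text=True).splitlines()
    return {
        "git_commit": commit,
        "dataset_sha256": data_hash,
        "config_snapshot": config,
        "python_version": platform.python_version(),
        "dependencies": deps,
        "metrics": None,  # 4. Metrics get filled in after evaluation
    }

lineage = capture_lineage("data/processed/train.parquet", "params.yaml")
Path("lineage.json").write_text(json.dumps(lineage, indent=2))
```

Every tool below is, at its core, a more ergonomic and queryable version of that JSON file.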

## MLflow: The Open-Source Standard

MLflow is the most widely adopted ML experiment tracking and model registry. It's framework-agnostic, supports local or remote deployment, and the API is straightforward enough that data scientists actually use it.

### What MLflow Does Well

- Experiment tracking with minimal code changes
- Model registry with staging/production lifecycle
- Artifact storage on S3, GCS, Azure Blob, or local filesystem
- REST API for CI/CD integration (see the gate sketch after this list)
- Self-hosted — your data stays on your infrastructure
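
To make the CI/CD point concrete, here's a sketch of a deployment gate hitting MLflow's REST API. The tracking URL is the placeholder from the setup below, and the exact response shape can vary slightly across MLflow versions:

```python
import requests

MLFLOW_URL = "https://mlflow.internal.company.com"  # placeholder tracking server

# Fetch the registered model and inspect its latest versions per stage
resp = requests.get(
    f"{MLFLOW_URL}/api/2.0/mlflow/registered-models/get",
    params={"name": "recommendation-engine"},
    timeout=10,
)
resp.raise_for_status()
model = resp.json()["registered_model"]

# Fail the CI job if nothing has been promoted to Production yet
prod = [v for v in model.get("latest_versions", []) if v["current_stage"] == "Production"]
if not prod:
    raise SystemExit("No Production version registered — blocking deploy.")
print(f"Deploying version {prod[0]['version']} (run {prod[0]['run_id']})")
```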

### Production Setup

```python
import mlflow
from mlflow.tracking import MlflowClient

# Configure remote tracking server
mlflow.set_tracking_uri("https://mlflow.internal.company.com")
mlflow.set_experiment("recommendation-engine-v3")

params = {
    "model_type": "xgboost",
    "n_estimators": 500,
    "max_depth": 8,
    "learning_rate": 0.05,
    "dataset_version": "v2.3.1",
    "dataset_hash": "sha256:a4f2e8...",
    "train_samples": 1_245_000,
    "feature_count": 128,
}

# Training run with full lineage
with mlflow.start_run(run_name="xgboost-tuned-2026-05") as run:
    # Log parameters
    mlflow.log_params(params)

    # Train model (train_xgboost is your own training helper;
    # X_train/X_test/y_train/y_test come from your data-loading code)
    model = train_xgboost(X_train, y_train, params)

    # Log metrics (evaluate_model is your own evaluation helper)
    metrics = evaluate_model(model, X_test, y_test)
    mlflow.log_metrics({
        "auc_roc": metrics["auc_roc"],
        "precision_at_10": metrics["precision_at_10"],
        "recall_at_10": metrics["recall_at_10"],
        "inference_latency_p99_ms": metrics["latency_p99"],
    })

    # Log model with signature for input validation
    from mlflow.models import infer_signature
    signature = infer_signature(X_test[:5], model.predict(X_test[:5]))

    mlflow.xgboost.log_model(
        model,
        artifact_path="model",
        signature=signature,
        registered_model_name="recommendation-engine",
        pip_requirements=["xgboost==2.1.0", "scikit-learn==1.5.2"],
    )

    # Log training data sample for debugging
    mlflow.log_artifact("data/train_sample.parquet", "data")

# Promote to production
client = MlflowClient()
latest = client.get_latest_versions("recommendation-engine", stages=["None"])[0]
client.transition_model_version_stage(
    name="recommendation-engine",
    version=latest.version,
    stage="Production",
    archive_existing_versions=True,
)
```
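
Once a version holds the Production stage, a serving process can resolve it by stage instead of a pinned version number, using MLflow's standard model URI scheme:

```python
import mlflow.pyfunc

# Resolves to whichever version currently holds the Production stage
model = mlflow.pyfunc.load_model("models:/recommendation-engine/Production")
predictions = model.predict(batch_features)  # batch_features: your inference input
```

Rolling back is then a one-line stage transition on the registry, with no redeploy of serving code.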

### MLflow Deployment Config

```yaml
# docker-compose.yml for MLflow tracking server
# NB: the stock MLflow image may need psycopg2-binary and boto3 added
# for Postgres and S3 support.
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:2.18.0
    command: >
      mlflow server
        --backend-store-uri postgresql://mlflow:${MLFLOW_DB_PASSWORD}@postgres:5432/mlflow
        --default-artifact-root s3://mlflow-artifacts/
        --host 0.0.0.0
        --port 5000
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    ports:
      - "5000:5000"
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: ${MLFLOW_DB_PASSWORD}
    volumes:
      - mlflow-db:/var/lib/postgresql/data

volumes:
  mlflow-db:
```

### MLflow's Weak Spots

- No native data versioning — you log dataset hashes, but MLflow doesn't track the data itself
- UI is functional, not beautiful — data scientists tolerate it, they don't love it
- Collaboration features are basic — no commenting, tagging team members, or shared annotations
- Scale ceiling — the PostgreSQL backend gets slow above 100K runs without careful partitioning

## DVC: Git for Data and Models

DVC takes a fundamentally different approach: it extends Git itself. Your model files, datasets, and pipeline definitions are tracked in Git via DVC metafiles, while the actual large files live in remote storage (S3, GCS, etc.). This means your existing Git workflow — branches, PRs, code review — now covers data and models too.

### What DVC Does Well

- Data versioning as a first-class citizen
- Pipeline reproducibility via dvc.yaml DAGs
- Git-native — works with branches, tags, diffs
- Storage agnostic — S3, GCS, Azure, SSH, HDFS
- No server needed — pure CLI + remote storage

### Production Pipeline

```yaml
# dvc.yaml — defines the reproducible ML pipeline
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    params:
      - preprocess.min_samples
      - preprocess.feature_selection
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.parquet
    params:
      - train.model_type
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/recommendation.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
    plots:
      - plots/roc_curve.json:
          x: fpr
          y: tpr

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/recommendation.pkl
      - data/processed/test.parquet
    metrics:
      - metrics/eval_metrics.json:
          cache: false
```

```yaml
# params.yaml — hyperparameters tracked in Git
preprocess:
  min_samples: 100
  feature_selection: "mutual_info"

train:
  model_type: "xgboost"
  n_estimators: 500
  max_depth: 8
  learning_rate: 0.05
```

```bash
# Reproduce the entire pipeline
dvc repro

# Compare metrics between two revisions
dvc metrics diff main experiment/new-features

# Push artifacts to remote storage
dvc push

# Pull a specific model version
git checkout v2.3.1
dvc pull models/recommendation.pkl
```
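
Versioned data is also consumable straight from Python, without a local checkout. A minimal sketch using DVC's `dvc.api`, assuming the repo layout and `v2.3.1` tag from the examples above (the repo URL is a placeholder):

```python
import dvc.api
import pandas as pd

# Stream the test set exactly as it existed at tag v2.3.1,
# without checking out that revision locally
with dvc.api.open(
    "data/processed/test.parquet",
    repo="https://github.com/your-org/recommendation-engine",  # placeholder
    rev="v2.3.1",
    mode="rb",
) as f:
    test_df = pd.read_parquet(f)
```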

### DVC's Weak Spots

- No experiment tracking UI unless you pair it with DVC Studio (formerly Iterative Studio)
- Learning curve for teams not comfortable with Git
- No model registry — you use Git tags, which works but lacks lifecycle management
- Pipeline execution is local — no built-in distributed training orchestration

## Weights & Biases: The Collaboration Platform

W&B is the most polished option in terms of user experience. The experiment tracking dashboard is genuinely excellent — real-time metric visualization, interactive hyperparameter sweep analysis, and team collaboration features that actually get used.

### What W&B Does Well

- Best-in-class experiment visualization — interactive plots, metric comparison
- Team collaboration — comments, tags, shared reports
- Hyperparameter sweep orchestration built in (see the sketch after this list)
- Model registry with lineage and deployment tracking
- Artifacts with automatic deduplication and lineage graphs
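
For the sweep point, here's a minimal sketch of what that orchestration looks like. The `train` function body is a stand-in for your own objective; only the `wandb.sweep`/`wandb.agent` calls are the W&B API:

```python
import wandb

# Sweep definition: Bayesian search over two hyperparameters
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.2},
        "max_depth": {"values": [4, 6, 8, 10]},
    },
}

def train():
    # Each agent invocation gets hyperparameters injected into wandb.config
    run = wandb.init()
    lr = wandb.config.learning_rate
    depth = wandb.config.max_depth
    # ... train your model with lr/depth and evaluate it here ...
    wandb.log({"val_auc": 0.0})  # replace with the real evaluation result

sweep_id = wandb.sweep(sweep_config, project="recommendation-engine")
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials
```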

### Production Integration

```python
import wandb

# Initialize run with full config tracking
run = wandb.init(
    project="recommendation-engine",
    config={
        "model_type": "xgboost",
        "n_estimators": 500,
        "max_depth": 8,
        "learning_rate": 0.05,
        "dataset": "v2.3.1",
    },
    tags=["production-candidate", "xgboost", "q2-2026"],
)

# Log dataset as artifact with lineage
dataset_artifact = wandb.Artifact(
    "training-data", type="dataset",
    metadata={"samples": 1_245_000, "features": 128}
)
dataset_artifact.add_dir("data/processed/")
run.use_artifact(dataset_artifact)

# Training loop with real-time logging
# (train_epoch, evaluate, scheduler, and the loaders are your own training code)
for epoch in range(100):
    train_loss = train_epoch(model, train_loader)
    val_metrics = evaluate(model, val_loader)

    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_auc": val_metrics["auc_roc"],
        "val_precision": val_metrics["precision_at_10"],
        "learning_rate": scheduler.get_last_lr()[0],
    })

# Log model artifact with lineage
model_artifact = wandb.Artifact(
    "recommendation-model", type="model",
    metadata={
        "framework": "xgboost",
        "auc_roc": val_metrics["auc_roc"],
    }
)
model_artifact.add_file("models/recommendation.pkl")
run.log_artifact(model_artifact)

# Link to model registry
run.link_artifact(
    model_artifact,
    "recommendation-engine",
    aliases=["candidate", "v3.1.0"]
)

run.finish()
```
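
Pulling that model back down in a CI or serving job is a single `use_artifact` call. A sketch, assuming the artifact and alias names from the block above:

```python
import wandb

run = wandb.init(project="recommendation-engine", job_type="deploy")

# Fetch the candidate model by alias and download its files locally
artifact = run.use_artifact("recommendation-model:candidate", type="model")
model_dir = artifact.download()  # returns a local directory path
print(f"Model files in {model_dir}, metadata: {artifact.metadata}")

run.finish()
```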

### W&B's Weak Spots

- SaaS-first — self-hosted (W&B Server) exists but costs $$$
- Pricing scales with usage — can get expensive with large teams and many runs
- Vendor lock-in — migrating off W&B is painful (proprietary storage format)
- No data versioning — tracks artifacts but doesn't version raw datasets like DVC

## Head-to-Head Comparison

| Capability | MLflow | DVC | W&B |
|-----------|--------|-----|-----|
| Experiment tracking | Good | Basic (needs Studio) | Excellent |
| Model registry | Good | Git tags (basic) | Excellent |
| Data versioning | Hashes only | Excellent | Artifacts only |
| Pipeline reproducibility | Projects (basic) | Excellent | Sweeps only |
| Collaboration UI | Basic | CLI-first | Excellent |
| Self-hosted | Free | Free | Expensive |
| Vendor lock-in | Low | None | High |
| Setup complexity | Medium | Low | Low (SaaS) |
| Cost (50-person team) | $0 (OSS) | $0 (OSS) | $15K-50K/yr |

## The Decision Framework

### Choose MLflow if:

- You want a self-hosted, open-source solution
- Your primary need is experiment tracking + model registry
- You want framework flexibility and REST API access
- Budget is zero but you have DevOps capacity to run it

### Choose DVC if:

- Data versioning is critical (NLP, computer vision with large datasets)
- You want Git-native workflows (PRs for model changes)
- Reproducibility of the full pipeline matters most
- Your team is comfortable with CLI tools

### Choose W&B if:

- Team collaboration and visualization are priorities
- You have budget and prefer SaaS
- You run many experiments and need sweep orchestration
- Data scientists need a polished UI to adopt the tooling

## The Hybrid Approach We Recommend

For most production teams, no single tool covers everything. Our recommended stack:

- DVC for data versioning and pipeline reproducibility
- MLflow for model registry and deployment lifecycle
- Grafana for production model monitoring (drift, latency, throughput)

This gives you full lineage — from raw data (DVC) through training (MLflow tracking) to production (MLflow registry + Grafana monitoring) — without vendor lock-in and at zero license cost.
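
The glue between the two tools is small. A minimal sketch of a training entrypoint that stamps each MLflow run with the exact revision that produced its data, assuming the tracking URI and experiment name from the MLflow section and a repo where `dvc.lock` is committed:

```python
import subprocess
import mlflow

def git_output(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True).strip()

mlflow.set_tracking_uri("https://mlflow.internal.company.com")
mlflow.set_experiment("recommendation-engine-v3")

with mlflow.start_run():
    # The Git commit pins code, dvc.lock, and params.yaml together,
    # so this one value is enough to `git checkout` + `dvc pull`
    # the exact training inputs later.
    mlflow.log_params({
        "git_commit": git_output("rev-parse", "HEAD"),
        "git_branch": git_output("rev-parse", "--abbrev-ref", "HEAD"),
        "dvc_lock_hash": git_output("hash-object", "dvc.lock"),
    })
    # ... training + mlflow.log_metrics / log_model as shown earlier ...
```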

## Getting Started

Pick the tool that solves your most painful problem today. If models keep breaking in production because nobody knows which data trained them: start with DVC. If you need to promote models through staging/production: start with MLflow's registry. If your team won't adopt anything without a great UI: start with W&B.

Building an MLOps stack for your team? We've implemented model versioning pipelines for organizations at every scale. [Reach out at techsaas.cloud/contact](https://techsaas.cloud/contact) and we'll help you design a registry that fits your workflow, not the other way around.
