# MLflow vs DVC vs Weights & Biases: Model Versioning That Actually Works in Production
Here's a stat that should make every ML team uncomfortable: according to a 2025 Gartner survey, only 54% of ML models that pass evaluation ever make it to production. The number one blocker isn't model quality — it's operational: teams can't reproduce the training run, can't trace which data produced which model, and can't roll back when a deployed model starts drifting.
We've deployed model registries for teams ranging from 3-person startups to 50-person ML orgs. The tooling choice matters less than you think. What matters is the versioning discipline you build around it. But since you need to pick one, here's our honest comparison.
## The Core Problem: Model Lineage
Every production ML model needs an answer to five questions:
1. What code produced this model? (Git commit)
2. What data was it trained on? (Dataset version + hash)
3. What hyperparameters were used? (Config snapshot)
4. What metrics did it achieve? (Evaluation results)
5. What environment ran the training? (Dependencies, GPU, framework version)
If you can't answer all five for every model in production, you have a versioning problem. Let's look at how each tool solves it.
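One way to keep teams honest is to treat these answers as a single required record. Here's a minimal sketch, our own illustrative structure rather than anything from the tools below, of what that record looks like:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelLineage:
    """Answers the five lineage questions for one deployed model."""
    git_commit: str       # 1. code: Git commit that produced the model
    dataset_version: str  # 2. data: dataset version tag...
    dataset_hash: str     #    ...plus a content hash of the data
    hyperparams: dict = field(default_factory=dict)  # 3. config snapshot
    metrics: dict = field(default_factory=dict)      # 4. evaluation results
    environment: dict = field(default_factory=dict)  # 5. deps, GPU, framework
```

Each tool below captures some or all of these fields; the differences are in where the record lives and how enforceable it is.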
## MLflow: The Open-Source Standard
MLflow is the most widely adopted ML experiment tracker and model registry. It's framework-agnostic, supports local or remote deployment, and the API is straightforward enough that data scientists actually use it.
### What MLflow Does Well
### Production Setup
```python
import mlflow
from mlflow.tracking import MlflowClient

# Configure remote tracking server
mlflow.set_tracking_uri("https://mlflow.internal.company.com")
mlflow.set_experiment("recommendation-engine-v3")

# Training run with full lineage
with mlflow.start_run(run_name="xgboost-tuned-2026-05") as run:
    # Log parameters (named dict so the same values feed training)
    params = {
        "model_type": "xgboost",
        "n_estimators": 500,
        "max_depth": 8,
        "learning_rate": 0.05,
        "dataset_version": "v2.3.1",
        "dataset_hash": "sha256:a4f2e8...",
        "train_samples": 1_245_000,
        "feature_count": 128,
    }
    mlflow.log_params(params)

    # Train model
    model = train_xgboost(X_train, y_train, params)

    # Log metrics
    metrics = evaluate_model(model, X_test, y_test)
    mlflow.log_metrics({
        "auc_roc": metrics["auc_roc"],
        "precision_at_10": metrics["precision_at_10"],
        "recall_at_10": metrics["recall_at_10"],
        "inference_latency_p99_ms": metrics["latency_p99"],
    })

    # Log model with signature for input validation
    from mlflow.models import infer_signature
    signature = infer_signature(X_test[:5], model.predict(X_test[:5]))
    mlflow.xgboost.log_model(
        model,
        artifact_path="model",
        signature=signature,
        registered_model_name="recommendation-engine",
        pip_requirements=["xgboost==2.1.0", "scikit-learn==1.5.2"],
    )

    # Log training data sample for debugging
    mlflow.log_artifact("data/train_sample.parquet", "data")

# Promote to production
client = MlflowClient()
latest = client.get_latest_versions("recommendation-engine", stages=["None"])[0]
client.transition_model_version_stage(
    name="recommendation-engine",
    version=latest.version,
    stage="Production",
    archive_existing_versions=True,
)
```
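Serving code can then resolve the model by stage rather than by a pinned version number, so a registry promotion is all it takes to roll forward or back. A minimal sketch using MLflow's `models:/` URI scheme (`X_batch` is a placeholder for your inference features):

```python
import mlflow.pyfunc

# Loads whichever version currently holds the Production stage
model = mlflow.pyfunc.load_model("models:/recommendation-engine/Production")
predictions = model.predict(X_batch)  # X_batch: placeholder feature frame
```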
### MLflow Deployment Config
```yaml
# docker-compose.yml for MLflow tracking server
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:2.18.0
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:${MLFLOW_DB_PASSWORD}@postgres:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
      --host 0.0.0.0
      --port 5000
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    ports:
      - "5000:5000"
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: ${MLFLOW_DB_PASSWORD}
    volumes:
      - mlflow-db:/var/lib/postgresql/data

volumes:
  mlflow-db:
```

### MLflow's Weak Spots
## DVC: Git for Data and Models
DVC takes a fundamentally different approach: it extends Git itself. Your model files, datasets, and pipeline definitions are tracked in Git via DVC metafiles, while the actual large files live in remote storage (S3, GCS, etc.). This means your existing Git workflow — branches, PRs, code review — now covers data and models too.
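The same Git-pinning works from Python: `dvc.api` can read a file exactly as it existed at any Git revision, so training or analysis code can pin itself to a specific data version. A hedged sketch (the repo URL is a placeholder):

```python
import dvc.api
import pandas as pd

# Open the training set as it existed at Git tag v2.3.1
with dvc.api.open(
    "data/processed/train.parquet",
    repo="https://github.com/your-org/recommendation-engine",  # placeholder
    rev="v2.3.1",
    mode="rb",  # parquet is binary
) as f:
    train_df = pd.read_parquet(f)
```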
### What DVC Does Well
- Reproducible pipeline DAGs defined in dvc.yaml

### Production Pipeline
```yaml
# dvc.yaml — defines the reproducible ML pipeline
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    params:
      - preprocess.min_samples
      - preprocess.feature_selection
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.parquet
    params:
      - train.model_type
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/recommendation.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
    plots:
      - plots/roc_curve.json:
          x: fpr
          y: tpr
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/recommendation.pkl
      - data/processed/test.parquet
    metrics:
      - metrics/eval_metrics.json:
          cache: false
```

```yaml
# params.yaml — hyperparameters tracked in Git
preprocess:
  min_samples: 100
  feature_selection: "mutual_info"
train:
  model_type: "xgboost"
  n_estimators: 500
  max_depth: 8
  learning_rate: 0.05
```

```bash
# Reproduce the entire pipeline
dvc repro

# Compare metrics between branches
dvc metrics diff main experiment/new-features

# Push artifacts to remote storage
dvc push

# Pull a specific model version
git checkout v2.3.1
dvc pull models/recommendation.pkl
```

### DVC's Weak Spots
## Weights & Biases: The Collaboration Platform
W&B is the most polished option in terms of user experience. The experiment tracking dashboard is genuinely excellent: real-time metric charts, hyperparameter sweep visualization, and team collaboration features that actually get used.
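Those sweeps are driven by a small config dict handed to `wandb.sweep`, with `wandb.agent` running the trials. A hedged sketch: the `train` function body and metric value here are placeholders.

```python
import wandb

def train():
    # Each agent trial receives its hyperparameters via wandb.config
    with wandb.init() as run:
        cfg = run.config
        # ... train a model with cfg.max_depth, cfg.learning_rate ...
        run.log({"val_auc": 0.9})  # placeholder: log the real metric here

sweep_config = {
    "method": "bayes",  # also: "grid" or "random"
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "max_depth": {"values": [6, 8, 10]},
        "learning_rate": {"min": 0.01, "max": 0.1},
    },
}

sweep_id = wandb.sweep(sweep_config, project="recommendation-engine")
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials
```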
### What W&B Does Well
### Production Integration
```python
import wandb

# Initialize run with full config tracking
run = wandb.init(
    project="recommendation-engine",
    config={
        "model_type": "xgboost",
        "n_estimators": 500,
        "max_depth": 8,
        "learning_rate": 0.05,
        "dataset": "v2.3.1",
    },
    tags=["production-candidate", "xgboost", "q2-2026"],
)

# Log dataset as artifact with lineage
dataset_artifact = wandb.Artifact(
    "training-data", type="dataset",
    metadata={"samples": 1_245_000, "features": 128},
)
dataset_artifact.add_dir("data/processed/")
run.use_artifact(dataset_artifact)

# Training loop with real-time logging
for epoch in range(100):
    train_loss = train_epoch(model, train_loader)
    val_metrics = evaluate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_auc": val_metrics["auc_roc"],
        "val_precision": val_metrics["precision_at_10"],
        "learning_rate": scheduler.get_last_lr()[0],
    })

# Log model artifact with lineage
model_artifact = wandb.Artifact(
    "recommendation-model", type="model",
    metadata={
        "framework": "xgboost",
        "auc_roc": val_metrics["auc_roc"],
    },
)
model_artifact.add_file("models/recommendation.pkl")
run.log_artifact(model_artifact)

# Link to model registry
run.link_artifact(
    model_artifact,
    "recommendation-engine",
    aliases=["candidate", "v3.1.0"],
)

run.finish()
```
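On the deployment side, a job can pull the model back down by alias rather than by version. A sketch assuming the registry link above:

```python
import wandb

run = wandb.init(project="recommendation-engine", job_type="deploy")
# Fetch whichever model version currently carries the "candidate" alias
artifact = run.use_artifact("recommendation-model:candidate", type="model")
model_dir = artifact.download()  # local directory containing the .pkl
run.finish()
```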
### W&B's Weak Spots
## Head-to-Head Comparison
| | MLflow | DVC | W&B |
|-----------|--------|-----|-----|
| Core approach | Tracking server + model registry | Git extension for data and pipelines | Hosted experiment platform |
| Hosting | Self-hosted or managed | Your Git repo + remote storage (S3, GCS) | SaaS |
| Registry workflow | Stage transitions (Staging/Production) | Git tags + `dvc pull` | Artifact aliases + registry links |
| UI | Functional | CLI-first | Most polished |
| License cost | Open source | Open source | Free tier; paid plans for teams |
## The Decision Framework
**Choose MLflow if:** you want an open-source, self-hostable, framework-agnostic registry with staged promotion that your data scientists will actually use.

**Choose DVC if:** your team already lives in Git and your biggest pain is tracing which data and pipeline produced which model.

**Choose W&B if:** experiment visibility and team collaboration matter most, and your team won't adopt a tool without a polished UI.
## The Hybrid Approach We Recommend
For most production teams, no single tool covers everything. Our recommended stack:

- **DVC** for dataset and pipeline versioning in Git
- **MLflow tracking** for experiment lineage
- **MLflow registry** for staged model promotion
- **Grafana** for monitoring the deployed model
This gives you full lineage — from raw data (DVC) through training (MLflow tracking) to production (MLflow registry + Grafana monitoring) — without vendor lock-in and at zero license cost.
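The glue between the layers is small: log the Git revision (which pins the DVC metafiles) and the resolved DVC data location into every MLflow run. A hedged sketch, where `get_git_rev` is our own helper name:

```python
import subprocess

import dvc.api
import mlflow

def get_git_rev() -> str:
    # One commit pins code, dvc.yaml, params.yaml, and .dvc metafiles together
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

with mlflow.start_run():
    rev = get_git_rev()
    # Where DVC's remote stores this revision's training data
    data_url = dvc.api.get_url("data/processed/train.parquet", rev=rev)
    mlflow.log_params({"git_commit": rev, "train_data_url": data_url})
    # ... training, metric logging, and model registration as shown earlier ...
```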
## Getting Started
Pick the tool that solves your most painful problem today. If models keep breaking in production because nobody knows which data trained them: start with DVC. If you need to promote models through staging/production: start with MLflow's registry. If your team won't adopt anything without a great UI: start with W&B.
Building an MLOps stack for your team? We've implemented model versioning pipelines for organizations at every scale. [Reach out at techsaas.cloud/contact](https://techsaas.cloud/contact) and we'll help you design a registry that fits your workflow, not the other way around.
## Need help with data engineering?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.