Deploying ML Models to Production

The notebook works perfectly. The model metrics look great. You're ready to ship. Then you hit production and everything breaks in ways you didn't anticipate. If this sounds familiar, you're not alone — it's the most common story in applied ML.

Here's everything I wish I'd known before my first production ML deployment.

The Training-Serving Skew Problem

The single most common production ML failure is training-serving skew: the data your model sees at inference time doesn't look like the data it trained on. This can happen because of preprocessing inconsistencies, data pipeline differences, or slow distribution drift over time.

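# Training preprocessing: compute the statistics once and keep them
# (X_train and X_input are assumed to be arrays with one row per sample)
saved_mean = X_train.mean(axis=0)
saved_std = X_train.std(axis=0)
X_train = (X_train - saved_mean) / saved_std

# Serving preprocessing (WRONG: recomputes statistics from the incoming data)
X_input = (X_input - X_input.mean(axis=0)) / X_input.std(axis=0)

# Serving preprocessing (CORRECT: reuses the saved training statistics)
X_input = (X_input - saved_mean) / saved_std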

Always serialize your preprocessing parameters alongside your model. Never recompute statistics from the incoming data at inference time.
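One way to keep preprocessing parameters and the model together is to write them into a single artifact. This is a minimal sketch using only the standard library. The file name `model_artifact.json`, the helper names, and the one-number "model" are all made up for illustration, and a real project would more likely use its framework's own serialization format.

```python
import json
import statistics
import tempfile
from pathlib import Path


def fit_scaler(values):
    """Compute standardization parameters from training data only."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def save_artifact(directory, model_weights, scaler):
    """Write weights and preprocessing parameters into one file so they ship together."""
    path = Path(directory) / "model_artifact.json"  # hypothetical file name
    path.write_text(json.dumps({"weights": model_weights, "scaler": scaler}))
    return path


def load_artifact(path):
    """Read the combined artifact back at serving time."""
    return json.loads(Path(path).read_text())


def standardize(value, scaler):
    """Apply the saved training statistics to a single serving-time value."""
    return (value - scaler["mean"]) / scaler["std"]


train_values = [2.0, 4.0, 6.0, 8.0]
scaler = fit_scaler(train_values)

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = save_artifact(tmp, model_weights=[0.5], scaler=scaler)
    restored = load_artifact(artifact_path)

# The serving path never touches train_values; it only sees the artifact.
scaled = standardize(5.0, restored["scaler"])
```

Because the scaler travels inside the same file as the weights, it is hard to deploy one without the other.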

Model Versioning Is Non-Negotiable

Treat your models like software releases. Every model that goes to production should have a version number, a corresponding dataset snapshot, a performance benchmark, and a git commit hash for the training code. Without this, debugging production issues becomes archaeology.
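As a rough sketch, the four pieces of metadata above fit in a small record that gets stored next to the model. The class name `ModelRelease`, its field names, and the `fingerprint` helper are my own invention, and every value below is a placeholder rather than a real result.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelRelease:
    version: str           # release number for the model itself
    dataset_snapshot: str  # identifier of the exact training data
    benchmark: dict        # metrics from the evaluation run
    git_commit: str        # commit of the training code

    def to_json(self):
        """Serialize with sorted keys so the output is stable across runs."""
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self):
        """Short stable hash of the record, handy for tagging logged predictions."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


release = ModelRelease(
    version="1.0.0",                       # placeholder
    dataset_snapshot="snapshot-id-here",   # placeholder
    benchmark={"auc": 0.0},                # placeholder, fill from your eval run
    git_commit="commit-hash-here",         # placeholder
)
tag = release.fingerprint()
```

Logging `tag` with every prediction makes it possible to trace a bad output back to the exact model, data, and code that produced it.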

Latency Is a Feature

A model that takes 800ms to respond might be technically accurate but practically useless. Before deploying, profile your full inference path: preprocessing, model forward pass, postprocessing, network overhead. Know your p50, p95, and p99 latency numbers before your users do.

Your model's accuracy is irrelevant if users have moved on before it responds.
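Getting those percentiles does not take much tooling. Here is a minimal sketch that times a callable and reports p50, p95, and p99 with the standard library. `fake_inference_path` is a hypothetical stand-in for your real preprocessing, forward pass, and postprocessing, and it leaves out network overhead, which you would need to measure from the client side.

```python
import statistics
import time


def measure_latencies_ms(fn, payload, runs=200):
    """Call fn(payload) repeatedly and return per-call wall-clock latencies in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of latencies."""
    # quantiles(n=100) yields 99 cut points; index k - 1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def fake_inference_path(payload):
    """Hypothetical stand-in for the full inference path."""
    return sum(x * x for x in payload)


latencies = measure_latencies_ms(fake_inference_path, list(range(1000)))
report = latency_percentiles(latencies)
```

Run this against production-shaped inputs on production-shaped hardware; numbers from a laptop mostly describe the laptop.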

Monitoring: More Than Loss Metrics

Log everything you can afford to log. At minimum, you want:

Input feature distributions over time
Prediction score distributions
Business outcome metrics (not just model metrics)
Inference latency percentiles
Error rates and fallback triggers
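For the first item, one common way to put a number on "the input distribution moved" is the population stability index (PSI). That choice is mine for this sketch, not the only option, and the bin edges and sample values below are made up. What PSI value should trigger an alert is something to tune on your own data.

```python
import math


def histogram_fractions(values, edges):
    """Fraction of values per bin; values outside the edges land in the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference, live, edges, eps=1e-6):
    """PSI = sum over bins of (live - ref) * ln(live / ref); 0 means identical."""
    ref = histogram_fractions(reference, edges)
    cur = histogram_fractions(live, edges)
    psi = 0.0
    for r, c in zip(ref, cur):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) for empty bins
        psi += (c - r) * math.log(c / r)
    return psi


edges = [0, 1, 2, 3, 4]
training_sample = [0.5, 1.5, 2.5, 3.5] * 25   # evenly spread across all bins
live_same = list(training_sample)
live_shifted = [2.5, 3.5] * 50                # everything moved to the top bins

psi_same = population_stability_index(training_sample, live_same, edges)
psi_shifted = population_stability_index(training_sample, live_shifted, edges)
```

The same function works for prediction scores: treat the scores from a validation run as the reference and the scores from the last hour or day as the live sample.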

Shadow Mode and Canary Deployments

Never replace a production model in one shot. Start in shadow mode: send the new model a copy of live traffic and log its predictions without serving them. Once those predictions look sane, move to a canary deployment that routes a small percentage of real traffic to the new model and increases it gradually. The cost of getting this wrong is your production SLA.
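Both patterns fit in one small request handler. This is a sketch under simplifying assumptions: the function names and the `canary_percent` parameter are hypothetical, models are plain callables, and the shadow call runs inline, whereas a real service would usually run it asynchronously so it cannot add latency.

```python
import hashlib


def bucket(request_id, buckets=100):
    """Map a request ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def handle_request(request_id, features, current_model, candidate_model,
                   canary_percent=0, shadow_log=None):
    """Serve canary traffic from the candidate; shadow it on everything else."""
    if bucket(request_id) < canary_percent:
        return candidate_model(features)

    served = current_model(features)
    if shadow_log is not None:
        try:
            # Shadow mode: record the candidate's answer but never return it.
            shadow_log.append((request_id, served, candidate_model(features)))
        except Exception as exc:
            # A failing candidate must not affect what the user receives.
            shadow_log.append((request_id, served, repr(exc)))
    return served


def current_model(features):
    return "current"


def candidate_model(features):
    return "candidate"


log = []
shadow_result = handle_request("req-1", {}, current_model, candidate_model,
                               canary_percent=0, shadow_log=log)
full_rollout = handle_request("req-1", {}, current_model, candidate_model,
                              canary_percent=100)
```

Hashing the request ID (or a user ID) keeps the assignment stable, so the same caller sees the same model as `canary_percent` ramps up.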

Plan for Model Failure

Your model will fail. Plan what happens when it does. A good fallback strategy might be a simpler heuristic model, a cached response, or a graceful degradation that tells users the feature is temporarily unavailable. Fail loudly and safely, not silently and catastrophically.
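Here is roughly what "fail loudly and safely" can look like in code: a minimal sketch, assuming the model is a plain callable and the fallback is a simple rule. The logger name, the `recent_purchases` feature, and the threshold in `heuristic_fallback` are all invented for the example.

```python
import logging

logger = logging.getLogger("model_serving")  # hypothetical logger name


def heuristic_fallback(features):
    """Hypothetical simple rule used when the model is unavailable."""
    return 1 if features.get("recent_purchases", 0) > 3 else 0


def predict_with_fallback(model, features, fallback=heuristic_fallback):
    """Return (prediction, source) and log loudly whenever the fallback is used."""
    try:
        prediction = model(features)
        if prediction is None:
            raise ValueError("model returned no prediction")
        return prediction, "model"
    except Exception:
        # logger.exception records the full traceback, so the failure is visible.
        logger.exception("model failed; serving fallback")
        return fallback(features), "fallback"


def broken_model(features):
    raise RuntimeError("model server unreachable")


result, source = predict_with_fallback(broken_model, {"recent_purchases": 5})
```

Returning the source alongside the prediction lets you count fallback triggers, which is one of the metrics worth monitoring.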