Designing Scalable ML Architectures

Most ML systems start as notebooks and end up as distributed nightmares. Not because the data science was bad, but because the architecture was never designed to scale. Here are the patterns I've found most reliable for building ML systems that hold up under pressure.

Separate Training from Serving

The training environment and serving environment should be treated as completely separate systems. They have different availability requirements, different compute profiles, and different failure modes. Conflating them creates fragile systems where a training run can take down inference, or where inference requirements compromise training flexibility.
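
One way to enforce this boundary is to make a serialized artifact the only thing the two systems share. The sketch below is a toy illustration under that assumption, not a production setup: `train_and_export`, `load_artifact`, and `predict` are hypothetical names, and the "model" is a one-feature linear fit so the example stays self-contained.

```python
import json
import os
import tempfile


def train_and_export(xs, ys, path):
    """Training side: fit y = w*x + b by least squares and write an artifact."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    with open(path, "w") as f:
        json.dump({"schema_version": 1, "w": w, "b": b}, f)


def load_artifact(path):
    """Serving side: depends only on the artifact format, not on training code."""
    with open(path) as f:
        artifact = json.load(f)
    if artifact.get("schema_version") != 1:
        raise ValueError("unsupported artifact schema")
    return artifact


def predict(artifact, x):
    """Serving-side inference using nothing but the loaded parameters."""
    return artifact["w"] * x + artifact["b"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "model.json")
    train_and_export([0.0, 1.0, 2.0], [1.0, 3.0, 5.0], path)
    print(predict(load_artifact(path), 3.0))
```

Because the serving functions never import the training function, a failed or runaway training run has no code path into inference. The versioned artifact is the whole contract between the two.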

The Feature Store: Worth the Investment

A feature store is a centralized system for defining, storing, and serving computed features: values that are expensive to compute and needed by multiple models. Without one, the same feature gets reimplemented in multiple places, and the copies drift apart. When the training pipeline and the serving path disagree on how a feature is computed, you get training-serving skew, one of the harder bugs to spot because nothing crashes.

A feature store is infrastructure for feature consistency. If you have more than two models sharing features, you need one.
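
The core idea can be shown without any particular product. The sketch below is a hypothetical in-memory stand-in: the `FeatureStore` class, its methods, and the example features are invented for illustration and are not the API of any real feature store. Each feature is defined once, and both the training path and the serving path read through that one definition.

```python
from typing import Callable, Dict, List


class FeatureStore:
    """Toy in-memory feature store: one definition per feature, shared by all readers."""

    def __init__(self) -> None:
        self._definitions: Dict[str, Callable[[dict], float]] = {}
        self._cache: Dict[tuple, float] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        # Refuse a second definition so two teams cannot silently diverge.
        if name in self._definitions:
            raise ValueError(f"feature {name!r} is already defined")
        self._definitions[name] = fn

    def get(self, name: str, entity_id: str, raw: dict) -> float:
        """Compute the feature once per entity and reuse the stored value."""
        key = (name, entity_id)
        if key not in self._cache:
            self._cache[key] = self._definitions[name](raw)
        return self._cache[key]

    def get_vector(self, names: List[str], entity_id: str, raw: dict) -> List[float]:
        return [self.get(n, entity_id, raw) for n in names]


store = FeatureStore()
store.register("order_total", lambda r: float(sum(r["order_amounts"])))
store.register("order_count", lambda r: float(len(r["order_amounts"])))

raw_user = {"order_amounts": [20.0, 35.5, 4.5]}  # illustrative data
features = ["order_total", "order_count"]

# Training and serving both go through the same definitions.
training_row = store.get_vector(features, "user-1", raw_user)
serving_row = store.get_vector(features, "user-1", raw_user)
```

A real feature store also has to handle freshness and versioning of stored values, which this sketch ignores.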

Event-Driven Model Triggering

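Polling for new data to retrain models is fragile and wasteful: poll too often and you burn compute on empty checks, poll too rarely and models go stale. In an event-driven architecture, the arrival of new data publishes a message to a queue, and that message triggers the downstream pipeline steps. This is more reliable and easier to reason about. Tools like Kafka, Pub/Sub, or even lightweight Redis Streams work well here.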

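# Event-driven retraining trigger (simplified)
from datetime import datetime, timezone

from flask import Flask, request

app = Flask(__name__)
RETRAIN_THRESHOLD = 10_000  # placeholder value; tune per dataset
# training_queue is assumed to be a message-queue client exposing publish()

@app.route('/webhook/data-updated', methods=['POST'])
def handle_data_update():
    event = request.get_json(silent=True) or {}
    # Only enqueue a retraining job once enough new rows have arrived
    if event.get('rows_added', 0) <= RETRAIN_THRESHOLD:
        return {'status': 'skipped'}, 200
    training_queue.publish({
        'model_id': event['model_id'],
        'dataset_version': event['dataset_version'],
        'triggered_at': datetime.now(timezone.utc).isoformat()
    })
    return {'status': 'queued'}, 202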
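
On the other side of the queue, a worker consumes these events and starts training. The sketch below uses Python's standard `queue.Queue` as a stand-in for a real broker, and `run_training` is a hypothetical placeholder for whatever launches a training job. It is only meant to show that the producer and the trainer share nothing except the message format.

```python
import queue
from typing import Callable, List, Tuple


def drain_training_queue(
    q: "queue.Queue[dict]",
    run_training: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Process every pending event, skipping repeats for the same model and dataset version."""
    seen = set()
    started = []
    while True:
        try:
            event = q.get_nowait()
        except queue.Empty:
            break
        key = (event["model_id"], event["dataset_version"])
        if key not in seen:
            seen.add(key)
            run_training(*key)
            started.append(key)
        q.task_done()
    return started


if __name__ == "__main__":
    q = queue.Queue()
    q.put({"model_id": "m1", "dataset_version": "v1"})
    q.put({"model_id": "m1", "dataset_version": "v1"})  # duplicate delivery
    print(drain_training_queue(q, lambda m, d: print("training", m, d)))
```

The duplicate check is a design choice rather than a requirement of the pattern: many brokers can deliver the same message more than once, so it is usually worth making the consumer idempotent.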

Model Registry as Truth

Every model that has ever been deployed should live in a model registry with metadata: who trained it, what data it used, what its eval metrics were, and what version of the codebase produced it. MLflow, Weights & Biases, and Vertex AI all offer this. Without it, you cannot say with confidence which model is in production, reproduce it, or roll back to a known-good version.
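
As a rough illustration of the minimum a registry entry should capture, here is a hypothetical in-memory sketch. The `ModelRecord` and `ModelRegistry` names and fields are assumptions made for this example and do not mirror the MLflow, Weights & Biases, or Vertex AI APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class ModelRecord:
    model_id: str
    version: int
    trained_by: str
    dataset_version: str
    git_commit: str
    eval_metrics: Dict[str, float]
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class ModelRegistry:
    """Append-only: records are never edited, so history stays auditable."""

    def __init__(self) -> None:
        self._records: Dict[str, List[ModelRecord]] = {}

    def register(self, model_id, trained_by, dataset_version,
                 git_commit, eval_metrics) -> ModelRecord:
        versions = self._records.setdefault(model_id, [])
        record = ModelRecord(model_id, len(versions) + 1, trained_by,
                             dataset_version, git_commit, dict(eval_metrics))
        versions.append(record)
        return record

    def latest(self, model_id: str) -> ModelRecord:
        return self._records[model_id][-1]

    def history(self, model_id: str) -> List[ModelRecord]:
        return list(self._records[model_id])


# Illustrative values only.
registry = ModelRegistry()
registry.register("churn", "alice", "2024-05-01", "abc1234", {"auc": 0.91})
registry.register("churn", "bob", "2024-06-01", "def5678", {"auc": 0.93})
```

With records like these, "what is running and where did it come from" becomes a lookup (`registry.latest("churn")`) rather than an investigation, and `history` gives you the rollback candidates.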

Graceful Degradation by Default

Every model call should have a defined fallback. If the model is slow, return a default. If the model is unavailable, serve from cache. If the model's confidence is below a threshold, escalate to a human. Design your fallback strategy before you design your model.
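
The three fallbacks above can be sketched as a single wrapper around the model call. This is a minimal illustration under stated assumptions: `predict_with_fallback` and `Decision` are hypothetical names, the model is assumed to return a `(value, confidence)` pair, the cache is a plain dict, and the timeout and confidence thresholds are arbitrary defaults.

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Any, Callable, Dict, Hashable, Tuple

# Shared worker pool so a slow model call can be abandoned without blocking the caller.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


@dataclass
class Decision:
    value: Any
    source: str  # "model", "cache", "default", or "human_review"


def predict_with_fallback(
    model_fn: Callable[[Hashable], Tuple[Any, float]],
    key: Hashable,
    cache: Dict[Hashable, Any],
    default: Any,
    timeout_s: float = 0.1,
    min_confidence: float = 0.7,
) -> Decision:
    future = _EXECUTOR.submit(model_fn, key)
    try:
        value, confidence = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Model is too slow: return the default (the call keeps running in the background).
        return Decision(default, "default")
    except Exception:
        # Model is unavailable: serve the last known answer if there is one.
        if key in cache:
            return Decision(cache[key], "cache")
        return Decision(default, "default")
    if confidence < min_confidence:
        # Low confidence: hand the case to a human instead of acting on it.
        return Decision(value, "human_review")
    cache[key] = value  # remember confident answers for later outages
    return Decision(value, "model")
```

Returning the `source` alongside the value is deliberate: it lets you count how often each fallback fires, which is usually the first sign that a model is degrading.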

The Hardest Part