The Short Version

MLOps (Machine Learning Operations) is the set of practices for deploying and maintaining machine learning models in production reliably.

The problem it solves: Models that work in notebooks don’t automatically work in production.

Data scientists build models. MLOps gets those models running reliably, at scale, with monitoring and maintenance. Without MLOps, you have impressive demos that never deliver business value.

Think of it like DevOps for machine learning - but harder, because ML systems have additional complexity that traditional software doesn’t.


Why MLOps Matters

The Deployment Gap

Most ML projects fail to reach production. Common estimates suggest only 10-20% of models ever get deployed.

The reasons aren’t usually about model quality. They’re about:

  • Can’t reproduce the training environment
  • Can’t serve predictions at required latency
  • No monitoring when model degrades
  • Can’t update models without breaking things
  • No governance or approval process

Building a model is maybe 20% of the work. Operating it is the other 80%.

ML Systems Are Different

Traditional software:

  • Deterministic behavior
  • Code defines functionality
  • Testing is relatively straightforward
  • Debugging follows clear paths

ML systems:

  • Probabilistic behavior
  • Data + code + model define functionality
  • Testing is genuinely hard
  • Debugging involves data, features, and model interactions

These differences mean DevOps practices aren’t sufficient. You need ML-specific operational practices.


Core MLOps Concepts

Model Versioning

Models change over time. You need to track:

  • Which model version is deployed
  • What data it was trained on
  • What parameters were used
  • What performance it achieved

This enables:

  • Rollback if new models underperform
  • Reproduction of results
  • Audit trails for compliance
  • Comparison across versions
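
A minimal sketch of what this looks like with MLflow - the model name, parameters, and data reference below are placeholders, and registering the model assumes a tracking server with a registry backend:

  import mlflow
  import mlflow.sklearn
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import roc_auc_score
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1_000, random_state=42)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

  with mlflow.start_run():
      params = {"n_estimators": 200, "max_depth": 6}
      model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

      # What was used and what it achieved
      mlflow.log_params(params)
      mlflow.log_metric("test_auc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
      mlflow.set_tag("training_data", "s3://bucket/churn/2024-06-01")  # placeholder data reference

      # Log the model artifact; registering it under a name assumes a tracking
      # server with a model registry backend
      mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")

Every run logged this way can be compared, reproduced, or rolled back to later, instead of reconstructed from memory.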

Feature Engineering

Features are the inputs to models - transformed, aggregated, derived data.

Challenges:

  • Features developed in notebooks don’t translate to production
  • Training and serving features can diverge (training-serving skew)
  • Feature computation is often duplicated across teams
  • Historical features are hard to reproduce

Feature stores address this by:

  • Centralizing feature definitions
  • Ensuring consistency between training and serving
  • Enabling feature reuse across projects
  • Maintaining point-in-time correctness
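
As a rough illustration, here is the shape of this with Feast, assuming a configured repository with a hypothetical user_features feature view - the same definitions back both training and serving:

  import pandas as pd
  from feast import FeatureStore

  store = FeatureStore(repo_path=".")  # assumes a configured Feast repository

  features = [
      "user_features:avg_order_value_30d",  # hypothetical feature names
      "user_features:order_count_30d",
  ]

  # Training: historical values joined as of each event timestamp,
  # which is what point-in-time correctness means in practice
  entity_df = pd.DataFrame({
      "user_id": [1001, 1002],
      "event_timestamp": pd.to_datetime(["2024-05-01", "2024-05-03"]),
  })
  training_df = store.get_historical_features(
      entity_df=entity_df, features=features
  ).to_df()

  # Serving: the same feature definitions, read from the online store
  online_features = store.get_online_features(
      features=features, entity_rows=[{"user_id": 1001}]
  ).to_dict()

Because both paths read from the same definitions, the transformation logic exists in exactly one place.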

Model Serving

Getting predictions from models. Options include:

Batch inference:

  • Run model on dataset periodically
  • Store predictions for lookup
  • Good for: Recommendations, scoring, reports

Online inference:

  • Predictions on individual requests in real-time
  • Low latency requirements
  • Good for: Search ranking, fraud detection, personalization

Embedded:

  • Model runs in application code
  • No separate serving infrastructure
  • Good for: Edge devices, latency-critical applications

Each pattern has different infrastructure requirements.
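
For online inference, a common minimal setup is a small HTTP service wrapping the model. A sketch with FastAPI - the model path and request shape are placeholders:

  import joblib
  from fastapi import FastAPI
  from pydantic import BaseModel

  app = FastAPI()
  model = joblib.load("model.joblib")  # placeholder path to a versioned model artifact


  class PredictionRequest(BaseModel):
      features: list[float]


  @app.post("/predict")
  def predict(request: PredictionRequest) -> dict:
      # One low-latency prediction per request
      score = float(model.predict_proba([request.features])[0, 1])
      return {"score": score}

Run it with uvicorn and deploy it like any other containerized service, which also makes rollback a standard operation rather than an ML-specific one.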

Model Monitoring

Models degrade. Monitoring catches problems:

Data drift:

  • Input data distribution changes from training
  • Example: User behavior shifts after COVID

Concept drift:

  • Relationship between inputs and outputs changes
  • Example: Economic conditions change what predicts loan default

Model performance:

  • Accuracy, precision, recall over time
  • Business metrics tied to model predictions

Infrastructure:

  • Latency, throughput, errors
  • Resource utilization

Without monitoring, you won’t know your model is failing until business impact becomes obvious.
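
A basic data drift check can be as small as comparing a feature's live distribution against its training distribution, for example with a two-sample Kolmogorov-Smirnov test - the data and threshold here are purely illustrative:

  import numpy as np
  from scipy.stats import ks_2samp

  rng = np.random.default_rng(42)

  # Stand-in data: the feature as seen at training time vs. in production
  training_values = rng.normal(loc=0.0, scale=1.0, size=10_000)
  production_values = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted mean

  statistic, p_value = ks_2samp(training_values, production_values)
  if p_value < 0.01:  # illustrative threshold; tune per feature and traffic volume
      print(f"Drift detected: KS statistic={statistic:.3f}, p={p_value:.2e}")

Tools like Evidently and WhyLabs run checks like this across every feature and add dashboards and alerting on top.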

Model Retraining

Models need updates:

  • Scheduled: Retrain on a fixed cadence (weekly, monthly) regardless of drift
  • Triggered: Retrain when drift exceeds threshold
  • Continuous: Ongoing learning from new data

Retraining pipelines need to be:

  • Automated (not manual notebook runs)
  • Tested (new model validated before deployment)
  • Governed (approval before production)
  • Reversible (rollback if problems occur)
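
A sketch of the promotion gate at the heart of such a pipeline, with hypothetical helper names and illustrative numbers - a candidate only replaces the current model if it clears the gate, and the old version stays available for rollback:

  from dataclasses import dataclass


  @dataclass
  class EvaluationResult:
      auc: float
      model_uri: str


  def promote_if_better(candidate: EvaluationResult,
                        current: EvaluationResult,
                        min_improvement: float = 0.005) -> str:
      """Return the model URI that should serve traffic after this run."""
      # Tested: the candidate is evaluated on the same holdout as the current model.
      # Governed: promotion only happens when the gate passes; a human approval
      # step could sit here instead of an automatic comparison.
      # Reversible: the current model URI is never deleted, so rollback is just
      # pointing serving back at it.
      if candidate.auc >= current.auc + min_improvement:
          return candidate.model_uri
      return current.model_uri


  # Illustrative values only
  current = EvaluationResult(auc=0.874, model_uri="models:/churn-classifier/7")
  candidate = EvaluationResult(auc=0.881, model_uri="models:/churn-classifier/8")
  print(promote_if_better(candidate, current))  # -> models:/churn-classifier/8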

MLOps Maturity Levels

Level 0: Manual

  • Data scientists develop models in notebooks
  • Deployment is manual, ad-hoc
  • No automation, no monitoring
  • Works for: Exploration, prototyping

Level 1: ML Pipeline Automation

  • Automated training pipelines
  • Consistent, reproducible training
  • Some monitoring
  • Works for: Stable models with infrequent updates

Level 2: CI/CD for ML

  • Automated testing of data, models, and code
  • Continuous training with new data
  • Automated deployment with validation
  • Full monitoring and alerting
  • Works for: Production-critical models at scale

Most organizations are at Level 0 or early Level 1. Level 2 requires significant investment.


Common MLOps Challenges

Training-Serving Skew

The model behaves differently in production than it did during training.

Causes:

  • Different feature computation code
  • Different data preprocessing
  • Missing features in production
  • Timing differences in feature availability

Solutions:

  • Feature stores that serve training and production
  • Shared feature computation code
  • Monitoring for skew detection
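
The shared-code solution is often as simple as putting transformations in one module that both the training pipeline and the serving service import - a minimal sketch with hypothetical names:

  # features/transforms.py - imported by BOTH the training pipeline and the
  # serving service, so the logic exists in exactly one place
  import math


  def order_value_features(order_total: float, order_count_30d: int) -> dict:
      """Derive model inputs from raw order fields (hypothetical example)."""
      return {
          "log_order_total": math.log1p(order_total),
          "orders_per_day_30d": order_count_30d / 30.0,
      }

When the logic lives in one importable place, the two code paths can't silently diverge.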

Reproducibility

Can’t recreate the model that’s in production.

Causes:

  • Notebooks don’t capture environment
  • Random seeds not fixed
  • Data changed since training
  • Dependencies not pinned

Solutions:

  • Version control for code, data, and models
  • Containerized training environments
  • Data versioning or snapshots
  • Experiment tracking tools (MLflow, Weights & Biases)
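
Several of these habits fit in a few lines at the top of a training script - a sketch, with the seed and the recorded details as examples rather than prescriptions:

  import random
  import subprocess

  import numpy as np

  SEED = 42
  random.seed(SEED)
  np.random.seed(SEED)

  # Record exactly which code and dependencies produced this model
  git_commit = subprocess.run(
      ["git", "rev-parse", "HEAD"], capture_output=True, text=True
  ).stdout.strip()
  installed_packages = subprocess.run(
      ["pip", "freeze"], capture_output=True, text=True
  ).stdout

  with open("run_manifest.txt", "w") as f:
      f.write(f"git_commit: {git_commit}\n")
      f.write(f"seed: {SEED}\n\n")
      f.write(installed_packages)

Experiment tracking tools capture most of this automatically, but the principle is the same: everything that influenced the model gets recorded somewhere it can be found later.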

Data Dependencies

Data problems break ML systems.

  • Upstream data changes without notice
  • Data quality degrades
  • Data arrives late or not at all
  • Schema changes break feature computation

MLOps requires tight integration with data architecture and data quality practices.
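
Lightweight checks at the boundary catch many of these problems before they reach the model. A plain-pandas sketch - column names and thresholds are placeholders:

  import pandas as pd

  EXPECTED_COLUMNS = {
      "user_id": "int64",
      "order_total": "float64",
      "created_at": "datetime64[ns]",
  }


  def validate_batch(df: pd.DataFrame, max_null_fraction: float = 0.05) -> list[str]:
      """Return a list of problems found in an incoming data batch."""
      problems = []
      for column, dtype in EXPECTED_COLUMNS.items():
          if column not in df.columns:
              problems.append(f"missing column: {column}")
          elif str(df[column].dtype) != dtype:
              problems.append(f"unexpected dtype for {column}: {df[column].dtype}")
          elif df[column].isna().mean() > max_null_fraction:
              problems.append(f"too many nulls in {column}")
      return problems


  batch = pd.DataFrame({
      "user_id": [1, 2],
      "order_total": [19.99, None],
      "created_at": pd.to_datetime(["2024-06-01", "2024-06-02"]),
  })
  print(validate_batch(batch))  # -> ['too many nulls in order_total']

Checks like this belong where data enters the ML system, not buried inside feature code.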

Organizational Challenges

  • Data scientists who don’t want to do operations
  • Engineers who don’t understand ML
  • Nobody owning end-to-end model lifecycle
  • Unclear handoffs between teams

Structure matters. Models need owners who care about production performance, not just training accuracy.


MLOps Infrastructure

Essential Components

Experiment tracking: Track parameters, metrics, and artifacts from training runs. Tools: MLflow, Weights & Biases, Neptune

Model registry: Store, version, and manage models. Tools: MLflow, cloud-native registries

Feature store: Centralized feature management. Tools: Feast, Tecton, cloud-native options

Orchestration: Coordinate training and deployment pipelines. Tools: Airflow, Kubeflow Pipelines, Prefect

Serving: Deploy and run models in production. Tools: Seldon, KServe, cloud-native endpoints

Monitoring: Track model and data drift, performance. Tools: Evidently, WhyLabs, custom solutions

Build vs Buy

Most teams shouldn’t build MLOps infrastructure from scratch.

Use managed services when:

  • Speed matters
  • Team is small
  • Use cases are standard
  • Budget available

Build custom when:

  • Specific requirements not met by tools
  • Scale justifies investment
  • In-house expertise available
  • Vendor lock-in concerns

The ecosystem is maturing rapidly. What required custom builds two years ago may have managed options now.


Getting Started with MLOps

If You Have No MLOps

  1. Start with experiment tracking (it’s the foundation)
  2. Add model versioning and registry
  3. Automate training pipelines
  4. Add basic monitoring
  5. Expand incrementally

Don’t try to build everything at once.

If You Have Basic MLOps

  1. Identify pain points (what breaks most often?)
  2. Add feature store if feature engineering is a bottleneck
  3. Improve monitoring and alerting
  4. Automate more of the deployment process
  5. Build governance processes

If ML Is Business Critical

  1. Audit your current state against best practices
  2. Invest in reliability and governance
  3. Build organizational capability, not just tools
  4. Plan for scale



Get Help

MLOps sits at the intersection of data engineering, software engineering, and data science. Getting it right requires expertise across all three.

If you’re trying to get ML models to production reliably, or struggling with models that degrade in production, book a call to discuss your challenges.