The Short Version
MLOps (Machine Learning Operations) is the set of practices for deploying and maintaining machine learning models in production reliably.
The problem it solves: Models that work in notebooks don’t automatically work in production.
Data scientists build models. MLOps gets those models running reliably, at scale, with monitoring and maintenance. Without MLOps, you have impressive demos that never deliver business value.
Think of it like DevOps for machine learning - but harder, because ML systems have additional complexity that traditional software doesn’t.
Why MLOps Matters
The Deployment Gap
Most ML projects fail to reach production. Common estimates suggest only 10-20% of models ever get deployed.
The reasons aren’t usually about model quality. They’re about:
- Can’t reproduce the training environment
- Can’t serve predictions at required latency
- No monitoring when model degrades
- Can’t update models without breaking things
- No governance or approval process
Building a model is maybe 20% of the work. Operating it is the other 80%.
ML Systems Are Different
Traditional software:
- Deterministic behavior
- Code defines functionality
- Testing is relatively straightforward
- Debugging follows clear paths
ML systems:
- Probabilistic behavior
- Data + code + model define functionality
- Testing is genuinely hard
- Debugging involves data, features, and model interactions
These differences mean DevOps practices aren’t sufficient. You need ML-specific operational practices.
Core MLOps Concepts
Model Versioning
Models change over time. You need to track:
- Which model version is deployed
- What data it was trained on
- What parameters were used
- What performance it achieved
This enables:
- Rollback if new models underperform
- Reproduction of results
- Audit trails for compliance
- Comparison across versions
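To make this concrete, here is a minimal sketch of version tracking with MLflow (one common choice). The experiment name, dataset, and model are illustrative; the point is that every training run records its parameters, a fingerprint of the data, its metrics, and a registered model version you can later compare against or roll back to.

```python
# Minimal sketch of model versioning with MLflow (names and paths are illustrative).
import hashlib

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("churn-model")  # hypothetical experiment name

data = pd.read_csv("training_data.csv")  # hypothetical training snapshot
X, y = data.drop(columns=["churned"]), data["churned"]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 8}

with mlflow.start_run():
    # Record which data the model was trained on (here, a hash of the snapshot).
    mlflow.log_param("data_hash", hashlib.sha256(data.to_csv().encode()).hexdigest()[:12])
    mlflow.log_params(params)

    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    mlflow.log_metric("val_auc", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

    # Registering the model creates a new version you can deploy, compare, or roll back to.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```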
Feature Engineering
Features are the inputs to models - transformed, aggregated, derived data.
Challenges:
- Features developed in notebooks don’t translate to production
- Training and serving features can diverge (training-serving skew)
- Feature computation is often duplicated across teams
- Historical features are hard to reproduce
Feature stores address this by:
- Centralizing feature definitions
- Ensuring consistency between training and serving
- Enabling feature reuse across projects
- Maintaining point-in-time correctness
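Point-in-time correctness is the least obvious of these, so here is a minimal sketch of the idea using a pandas as-of join: for each training label, use only feature values that were known at or before the label's timestamp. Column names are illustrative; a real feature store does this bookkeeping for you.

```python
# Point-in-time correct join: each label row gets the latest feature value
# available at its event_time, never a future value. Columns are illustrative.
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-03-01", "2024-03-15", "2024-03-10"]),
    "churned": [0, 1, 0],
})

features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-02-20", "2024-03-10", "2024-03-05"]),
    "orders_30d": [4, 1, 7],
})

# merge_asof picks the most recent feature row at or before each event_time,
# which prevents leaking future information into the training set.
training_set = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training_set[["user_id", "event_time", "orders_30d", "churned"]])
```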
Model Serving
Getting predictions from models. Options include:
Batch inference:
- Run model on dataset periodically
- Store predictions for lookup
- Good for: Recommendations, scoring, reports
Online inference:
- Predictions on individual requests in real-time
- Low latency requirements
- Good for: Search ranking, fraud detection, personalization
Embedded:
- Model runs in application code
- No separate serving infrastructure
- Good for: Edge devices, latency-critical applications
Each pattern has different infrastructure requirements.
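As an illustration of the online pattern, here is a minimal sketch of a prediction endpoint using FastAPI and a pre-trained scikit-learn model. The model path and feature names are assumptions; a production setup would add input validation, health checks, and autoscaling.

```python
# Minimal online-inference sketch with FastAPI; the model path and feature
# names are illustrative, not a production-ready serving setup.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained model artifact


class PredictionRequest(BaseModel):
    orders_30d: float
    days_since_last_login: float


@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # Feature order must match the order used at training time.
    features = [[request.orders_30d, request.days_since_last_login]]
    score = float(model.predict_proba(features)[0][1])
    return {"churn_probability": score}

# Run with: uvicorn serve:app   (assuming this file is named serve.py)
```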
Model Monitoring
Models degrade. Monitoring catches problems:
Data drift:
- Input data distribution changes from training
- Example: User behavior shifts after COVID
Concept drift:
- Relationship between inputs and outputs changes
- Example: Economic conditions change what predicts loan default
Model performance:
- Accuracy, precision, recall over time
- Business metrics tied to model predictions
Infrastructure:
- Latency, throughput, errors
- Resource utilization
Without monitoring, you won’t know your model is failing until business impact becomes obvious.
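A drift check doesn't have to be elaborate to be useful. Here is a minimal sketch that compares a feature's recent production values against its training distribution using a two-sample Kolmogorov-Smirnov test; the data and threshold are illustrative, and dedicated monitoring tools wrap this kind of check with dashboards and alerting.

```python
# Sketch of a simple data-drift check: compare a feature's recent production
# distribution against its training distribution. Data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=50, scale=10, size=5000)    # stand-in for training data
production_values = rng.normal(loc=58, scale=10, size=5000)  # stand-in for recent requests

result = ks_2samp(training_values, production_values)

if result.pvalue < 0.01:
    # In a real pipeline this would raise an alert or trigger investigation/retraining.
    print(f"Drift detected: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.2e}")
else:
    print("No significant drift detected")
```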
Model Retraining
Models need updates:
- Scheduled: Retrain on a fixed cadence (weekly, monthly) regardless of drift
- Triggered: Retrain when drift exceeds threshold
- Continuous: Ongoing learning from new data
Retraining pipelines need to be:
- Automated (not manual notebook runs)
- Tested (new model validated before deployment)
- Governed (approval before production)
- Reversible (rollback if problems occur)
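Here is a tool-agnostic sketch of the promotion gate that makes a retraining pipeline governed and reversible: a candidate model only replaces the current one if it clears an absolute quality bar and does not regress. The thresholds and metric are illustrative.

```python
# Sketch of the promotion gate in a triggered retraining pipeline.
# Thresholds and the choice of AUC as the metric are illustrative.
MIN_AUC = 0.80          # minimum quality bar for any deployed model
MIN_IMPROVEMENT = 0.0   # candidate must be at least as good as the current model


def should_promote(candidate_auc: float, current_auc: float) -> bool:
    """Decide whether a freshly retrained model should replace the current one."""
    if candidate_auc < MIN_AUC:
        return False                    # fails the absolute quality bar
    if candidate_auc - current_auc < MIN_IMPROVEMENT:
        return False                    # no better than what's already deployed
    return True


# Example: drift triggered a retrain; the candidate scored 0.84, the current model 0.81.
if should_promote(candidate_auc=0.84, current_auc=0.81):
    print("Promote candidate (register a new version, keep the old one for rollback)")
else:
    print("Reject candidate; current model stays in place")
```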
MLOps Maturity Levels
Level 0: Manual
- Data scientists develop models in notebooks
- Deployment is manual, ad-hoc
- No automation, no monitoring
- Works for: Exploration, prototyping
Level 1: ML Pipeline Automation
- Automated training pipelines
- Consistent, reproducible training
- Some monitoring
- Works for: Stable models with infrequent updates
Level 2: CI/CD for ML
- Automated testing of data, models, and code
- Continuous training with new data
- Automated deployment with validation
- Full monitoring and alerting
- Works for: Production-critical models at scale
Most organizations are at Level 0 or early Level 1. Level 2 requires significant investment.
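To make Level 2's "automated testing of data, models, and code" concrete: the tests can be ordinary pytest checks that CI runs before a candidate model is allowed to deploy. The paths, schema, and threshold below are illustrative.

```python
# Illustrative pytest checks a CI pipeline might run before deploying a model.
# The artifact paths, columns, and threshold are assumptions for the sketch.
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

MODEL_PATH = "artifacts/candidate_model.joblib"
VALIDATION_DATA = "artifacts/validation.csv"


def test_model_meets_quality_bar():
    model = joblib.load(MODEL_PATH)
    data = pd.read_csv(VALIDATION_DATA)
    X, y = data.drop(columns=["churned"]), data["churned"]
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    assert auc >= 0.80, f"Candidate AUC {auc:.3f} below deployment threshold"


def test_model_handles_expected_schema():
    model = joblib.load(MODEL_PATH)
    row = pd.DataFrame([{"orders_30d": 3.0, "days_since_last_login": 12.0}])
    assert model.predict(row).shape == (1,)
```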
Common MLOps Challenges
Training-Serving Skew
The model behaves differently in production than in training.
Causes:
- Different feature computation code
- Different data preprocessing
- Missing features in production
- Timing differences in feature availability
Solutions:
- Feature stores that serve training and production
- Shared feature computation code
- Monitoring for skew detection
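The most direct fix is to have exactly one implementation of each feature. A minimal sketch: both the training pipeline and the serving endpoint import the same function, so there is nothing to diverge. Field names here are illustrative.

```python
# features.py - single source of truth for feature computation, imported by
# both the training pipeline and the serving code. Field names are illustrative.
from datetime import datetime


def compute_features(user_record: dict, as_of: datetime) -> dict:
    """Derive model inputs from a raw user record, identically in training and serving."""
    return {
        "days_since_last_login": (as_of - user_record["last_login"]).days,
        "orders_30d": user_record["orders_30d"],
        "is_new_user": int((as_of - user_record["signup_date"]).days < 30),
    }


# The training pipeline and the serving endpoint both call the same function:
#   X_train = [compute_features(r, r["label_time"]) for r in historical_records]
#   features = compute_features(live_record, datetime.utcnow())
```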
Reproducibility
Can’t recreate the model that’s in production.
Causes:
- Notebooks don’t capture environment
- Random seeds not fixed
- Data changed since training
- Dependencies not pinned
Solutions:
- Version control for code, data, and models
- Containerized training environments
- Data versioning or snapshots
- Experiment tracking tools (MLflow, Weights & Biases)
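Here is a minimal sketch of the mechanical side of reproducibility: fix the seeds, fingerprint the training data, and record the environment. Experiment tracking tools capture most of this automatically; the file names are illustrative.

```python
# Sketch of capturing the minimum needed to reproduce a training run:
# fixed seeds, a data fingerprint, and the exact package versions.
import hashlib
import json
import random
import sys
from importlib.metadata import version

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

with open("training_data.csv", "rb") as f:          # hypothetical training snapshot
    data_hash = hashlib.sha256(f.read()).hexdigest()

run_record = {
    "seed": SEED,
    "data_sha256": data_hash,
    "python": sys.version,
    "packages": {pkg: version(pkg) for pkg in ["numpy", "pandas", "scikit-learn"]},
}

# In practice this record would go to an experiment tracker alongside the model.
with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```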
Data Dependencies
Data problems break ML systems:
- Upstream data changes without notice
- Data quality degrades
- Data arrives late or not at all
- Schema changes break feature computation
MLOps requires tight integration with data architecture and data quality practices.
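A lightweight validation step at the start of the pipeline catches many of these problems before they reach feature computation. Here is a minimal sketch; the expected columns, dtypes, and thresholds are illustrative.

```python
# Lightweight checks on incoming data before feature computation runs.
# Expected columns, dtypes, and thresholds are illustrative.
import pandas as pd

EXPECTED_COLUMNS = {"user_id": "int64", "orders_30d": "int64", "last_login": "object"}
MAX_NULL_FRACTION = 0.05


def validate_input(df: pd.DataFrame) -> list[str]:
    problems = []
    for column, dtype in EXPECTED_COLUMNS.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"unexpected dtype for {column}: {df[column].dtype}")
        elif df[column].isna().mean() > MAX_NULL_FRACTION:
            problems.append(f"too many nulls in {column}")
    if len(df) == 0:
        problems.append("no rows received")
    return problems


df = pd.read_csv("daily_extract.csv")  # hypothetical upstream extract
issues = validate_input(df)
if issues:
    raise ValueError(f"Upstream data failed validation: {issues}")
```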
Organizational Challenges
- Data scientists who don’t want to do operations
- Engineers who don’t understand ML
- Nobody owning end-to-end model lifecycle
- Unclear handoffs between teams
Structure matters. Models need owners who care about production performance, not just training accuracy.
MLOps Infrastructure
Essential Components
Experiment tracking: Track parameters, metrics, and artifacts from training runs. Tools: MLflow, Weights & Biases, Neptune
Model registry: Store, version, and manage models. Tools: MLflow, cloud-native registries
Feature store: Centralized feature management. Tools: Feast, Tecton, cloud-native options
Orchestration: Coordinate training and deployment pipelines. Tools: Airflow, Kubeflow Pipelines, Prefect
Serving: Deploy and run models in production. Tools: Seldon, KServe, cloud-native endpoints
Monitoring: Track model and data drift, performance. Tools: Evidently, WhyLabs, custom solutions
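These components connect. For example, serving code can pull whatever model version the registry currently marks for production instead of hard-coding a file path, so promotion and rollback don't require redeploying the service. A minimal MLflow sketch, where the model name and stage label are illustrative:

```python
# Sketch of serving code loading the current production model from a registry
# (MLflow here; the model name and stage label are illustrative).
import mlflow.pyfunc
import pandas as pd

# Loads whichever version the registry currently marks as Production.
model = mlflow.pyfunc.load_model("models:/churn-model/Production")

# Illustrative input matching the hypothetical model's feature schema.
batch = pd.DataFrame([{"orders_30d": 3, "days_since_last_login": 12}])
predictions = model.predict(batch)
```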
Build vs Buy
Most teams shouldn’t build MLOps infrastructure from scratch.
Use managed services when:
- Speed matters
- Team is small
- Use cases are standard
- Budget available
Build custom when:
- Specific requirements not met by tools
- Scale justifies investment
- In-house expertise available
- Vendor lock-in concerns
The ecosystem is maturing rapidly. What required custom builds two years ago may have managed options now.
Getting Started with MLOps
If You Have No MLOps
- Start with experiment tracking (it’s the foundation)
- Add model versioning and registry
- Automate training pipelines
- Add basic monitoring
- Expand incrementally
Don’t try to build everything at once.
If You Have Basic MLOps
- Identify pain points (what breaks most often?)
- Add feature store if feature engineering is a bottleneck
- Improve monitoring and alerting
- Automate more of the deployment process
- Build governance processes
If ML Is Business Critical
- Audit your current state against best practices
- Invest in reliability and governance
- Build organizational capability, not just tools
- Plan for scale
Related Reading
AI and Data Architecture
AI Governance
Data Foundations
Related Topics
- Building Data Teams - Hiring for ML capabilities
- Data Platform Scaling - Infrastructure for ML at scale
- What Is Technical Debt? - ML technical debt patterns
Get Help
MLOps sits at the intersection of data engineering, software engineering, and data science. Getting it right requires expertise across all three.
If you’re trying to get ML models to production reliably, or struggling with models that degrade in production, book a call to discuss your challenges.