Model Deployment: Types, Strategies and Best Practices


Learn the best MLOps practices for deploying machine learning models to production.

Introduction: The Art of Deploying ML Systems

Machine Learning is a complicated domain, and running ML software in a production environment is not trivial. A Machine Learning system has several moving pieces, and each of them comes with its own peculiarities and challenges.

Since ML became popular in business, the methods and approaches for deploying models have varied. Data scientists started with very rudimentary manual processes: copying model files directly onto instances, or moving them by hand to SFTP servers or shared drives. That was the past.

The field has vastly professionalized its methods. We have learned from standard software engineering and, more recently, from DevOps. This progression toward safer and more automated processes for deploying and upgrading ML systems has given rise to a brand-new area of knowledge: MLOps, or Machine Learning Operations.

MLOps now materializes as strategies, approaches, frameworks, and even roles. MLOps engineers are the people who build, maintain, and scale the systems that keep Machine Learning running healthily in production. Everything related to tracking, versioning, deploying, monitoring, and upgrading production ML falls within the scope of the term.

The field, although still maturing, is already flooded with frameworks, open-source libraries, and SaaS tools. More recently, many of these tools have pivoted to catch the Generative AI wave, and terms such as GenAIOps or LLMOps are becoming widespread among practitioners. They deal with the same components and challenges as MLOps but also cover some of the peculiarities of LLMs. Later on, we will discuss which tools and frameworks are best suited to different use cases.

What is ML Deployment?

Generally, deploying a Machine Learning system entails setting up a trained Machine Learning model in a production environment, where it is required to provide predictions, either on demand or asynchronously. The consumers of those predictions can be users of your business application or other software components of your tool or solution.

An ML model makes a prediction when it is fed specific features. These features represent the entity you are making a prediction for, which can be a user or any other entity in your software solution (e.g., a movie on Netflix can have features such as number of views, length, etc.).

Deploying an ML model requires a set of sequential steps. A dataset needs to be built first; the model is then trained on it, updating its parameters for the given data. This is what we call the “training process”. Training generates an artifact, essentially a file containing the model’s parameters (or weights). Other by-products include the evaluation metrics computed after training, which can later be used to automate the decision of whether a model should be promoted to production.

These are some of the most common entities that are created just before deployment to production: Features, Dataset, Model Parameters, and Evaluation Metrics.

Regardless of how the model will be served in production (a topic we cover in the following section), at least these entities are always present.
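To make these entities concrete, here is a minimal sketch using scikit-learn and joblib (the churn dataset and file names are purely illustrative): it builds the dataset, trains the model parameters, saves the artifact, and records the evaluation metrics that a later promotion step could inspect.

```python
import json
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one row per entity (e.g. a user) with its features and label.
df = pd.read_csv("churn_dataset.csv")
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training process: fit the model parameters on the dataset.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Artifact: the serialized model, i.e. its learned parameters (weights).
joblib.dump(model, "model.joblib")

# By-product: evaluation metrics, usable later to decide whether to promote the model.
metrics = {"roc_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
```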

More complex ML systems involve additional entities. In particular, when the system is deployed together with specific business logic, extra artifacts are sometimes needed: think of mapping files, or ad-hoc models for filtering, transforming, and so on.

Types of ML Systems

An ML system can be materialized in a production environment in different ways. Essentially, the deployment approach changes depending on how fresh the predictions need to be: batch, online, or near real-time. This requirement, together with the availability of an internet connection, also influences the infrastructure to be used (cloud or on-device).


Batch Predictions

This is probably the simplest way to run a Machine Learning model. It consists of a process that uses a trained ML model to make predictions in bulk for a given set of entities and their corresponding features. This type of deployment scales well, processing vast amounts of data efficiently, cost-effectively, and consistently.

Batch prediction jobs typically run on top of job scheduling or orchestration platforms such as Apache Airflow or Dagster, or cloud-native solutions such as SageMaker jobs. In most cases, these jobs are executed on ephemeral compute instances, allowing for optimal resource allocation.

Models are fetched from a model store (which can be a simple S3 bucket or a more sophisticated platform with model promotion mechanisms that may allow A/B testing or controlled upgrades).
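As a rough sketch of such a job, the script below assumes the model artifact sits in an S3 bucket and that features arrive as a Parquet file with an entity_id column (all names and paths are hypothetical); an orchestrator like Airflow or Dagster would simply schedule it.

```python
import boto3
import joblib
import pandas as pd

def run_batch_predictions(bucket: str, model_key: str,
                          features_path: str, output_path: str) -> None:
    # Fetch the trained model artifact from the model store (here, a plain S3 bucket).
    boto3.client("s3").download_file(bucket, model_key, "/tmp/model.joblib")
    model = joblib.load("/tmp/model.joblib")

    # Score all entities in bulk and persist the predictions.
    features = pd.read_parquet(features_path)
    feature_cols = [c for c in features.columns if c != "entity_id"]
    features["prediction"] = model.predict_proba(features[feature_cols])[:, 1]
    features[["entity_id", "prediction"]].to_parquet(output_path)

if __name__ == "__main__":
    # Hypothetical locations; in practice these come from the orchestrator's configuration.
    run_batch_predictions(
        bucket="my-model-store",
        model_key="models/churn/latest/model.joblib",
        features_path="s3://my-data/features/latest.parquet",
        output_path="s3://my-data/predictions/latest.parquet",
    )
```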

Online Predictions

When batch predictions are not fast enough and we need to make predictions on demand, almost in real time, a very typical setup is to deploy ML models inside web services (e.g., REST services).

FastAPI and Flask are the most common Python frameworks used for this. Model serving frameworks such as BentoML, TorchServe, Seldon, or KServe can also be useful, as they simplify packaging the model into a Docker image, provide well-designed and clearly defined APIs, and offer built-in performance optimization capabilities.

Online prediction systems are characterized by their responsiveness, providing predictions at low latency, and by their scalability, since they can easily be scaled horizontally to handle increased traffic.

Features are fetched from a feature store or provided in the request. Models are fetched from a model store (typically loaded at service start time) or baked into the service container image; continuous deployment processes take care of either approach.
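A minimal FastAPI sketch of this pattern is shown below; it assumes the model artifact is baked into the container image and that features are sent in the request body (names are illustrative).

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# The model is loaded once at service start time, not on every request.
model = joblib.load("model.joblib")

class PredictionRequest(BaseModel):
    # Features are provided in the request; alternatively, they could be
    # fetched from a feature store using an entity id.
    features: list[float]

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    score = model.predict_proba([request.features])[0][1]
    return {"prediction": float(score)}
```

Served with uvicorn behind a load balancer, this endpoint scales horizontally by simply adding replicas of the same container.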

There is also a specific subset of ML systems that offer online predictions with a slightly different architecture: streaming, asynchronous models. This setup embeds the model in a data streaming consumer (e.g., using Kafka, Kinesis, or a queue as input). The model makes predictions as streaming inputs arrive, and the predictions are stored in a database. They are typically served through another web service, which offers extremely low latency because the predictions have already been pre-computed.
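A sketch of the streaming variant, using the kafka-python client and a hypothetical Redis store for the pre-computed predictions (topic, host, and message format are assumptions):

```python
import json
import joblib
import redis
from kafka import KafkaConsumer

model = joblib.load("model.joblib")
store = redis.Redis(host="localhost", port=6379)

# The model is embedded in a streaming consumer; predictions are pre-computed
# as events arrive and later served by a separate, very low-latency web service.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for event in consumer:
    features = event.value["features"]
    score = float(model.predict_proba([features])[0][1])
    store.set(f"prediction:{event.value['entity_id']}", score)
```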

Types of ML Deployments

Several deployment strategies have emerged to mitigate risks and ensure smooth transitions when updating models in production environments. These strategies not only apply to the models themselves but also to the entire ML pipeline, including data preprocessing, feature engineering, and post-processing steps.

Canary deployments and blue/green deployments are two popular approaches for rolling out new ML models with minimal risk. In a canary deployment, a new model version is gradually introduced to a small subset of users or traffic, allowing teams to monitor its performance and detect any issues before a full rollout. This approach is particularly useful for catching unexpected behaviors in real-world scenarios.

Blue/green deployments, on the other hand, involve maintaining two identical production environments (blue and green). The new model is deployed to the inactive environment, thoroughly tested, and then traffic is switched over, enabling a quick rollback if problems arise. Depending on the model, this setup can be quite costly: if both environments host an LLM, for example, the GPU cost can be very high.

Shadow deployments take this a step further by running the new model alongside the existing one in production, comparing outputs without affecting user-facing results. This method provides valuable insights into how the new model would perform under real conditions without any risk to current operations.
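In its simplest form, a shadow setup just calls both models on the same input while only the champion's output is returned; the sketch below (hypothetical model objects and logger) records both predictions for offline comparison.

```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(champion, challenger, features):
    # Only the champion's prediction affects user-facing results.
    champion_pred = champion.predict_proba([features])[0][1]

    # The challenger runs on the same input purely for comparison; failures or
    # slow responses here must never impact the user-facing path (in practice
    # this call is usually made asynchronously).
    try:
        challenger_pred = challenger.predict_proba([features])[0][1]
        logger.info("shadow_comparison champion=%.4f challenger=%.4f",
                    champion_pred, challenger_pred)
    except Exception:
        logger.exception("challenger failed in shadow mode")

    return champion_pred
```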

A/B testing is another crucial technique in the ML deployment toolkit, allowing teams to compare the performance of different models or model versions in a controlled manner. By randomly assigning users or requests to different variants, organizations can gather statistically significant results on model performance, user engagement, or business metrics. For batch processes, offline simulations and backtesting play a vital role in validating model performance before deployment. These methods involve running new models on historical data to assess their accuracy and impact.
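Deterministic, hash-based assignment is a common way to split users between variants so that each user always sees the same model; a minimal sketch (salt and split are illustrative):

```python
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.5,
                   salt: str = "model-ab-test-1") -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing the user id with a salt keeps assignments stable across requests
    while remaining effectively random across the user population.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign_variant("user-42"))  # the same user always lands in the same variant
```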

By combining these deployment strategies – canary, blue/green, shadow and A/B testing – organizations can create a robust framework for safely and efficiently updating their ML models in production, ensuring continuous improvement while minimizing risks to business operations.

Challenges in ML Model Deployment

Re-training Frequency

Determining the optimal cadence for model retraining is crucial for maintaining prediction accuracy in dynamic environments. Factors influencing this decision include input data changing over time (data drift), shifting relationships between variables (concept drift), and evolving business needs.

Automated monitoring systems can trigger retraining based on performance degradation or data shifts. Tools like MLflow and Kubeflow provide frameworks for managing the retraining lifecycle, while feature stores ensure consistency between training and serving data. The challenge lies in balancing model freshness with computational costs and operational complexity.
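As an example of such a trigger, a scheduled job could compare the live distribution of a feature against the one seen at training time and kick off retraining when they diverge; the sketch below uses SciPy's two-sample Kolmogorov–Smirnov test (threshold and data are illustrative).

```python
import numpy as np
from scipy.stats import ks_2samp

def needs_retraining(training_feature: np.ndarray, live_feature: np.ndarray,
                     p_threshold: float = 0.01) -> bool:
    # A very small p-value means the live distribution differs significantly
    # from the training distribution, i.e. the feature has drifted.
    statistic, p_value = ks_2samp(training_feature, live_feature)
    return p_value < p_threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    training_sample = rng.normal(0.0, 1.0, 10_000)    # feature as seen at training time
    production_sample = rng.normal(0.4, 1.0, 10_000)  # drifted live feature
    print(needs_retraining(training_sample, production_sample))  # True -> trigger retraining
```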

Champion-Challenger Mechanisms

This approach involves continuously evaluating new model candidates against the current production model (champion). Challengers are tested in a controlled environment, often through A/B testing or shadow deployments. If a challenger consistently outperforms the champion across key metrics, it may be promoted to production. Platforms like Seldon Core and AWS SageMaker offer built-in support for champion-challenger setups. This mechanism enables continuous model improvement while mitigating the risks associated with deploying new models.
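A minimal promotion check could look like the sketch below, comparing champion and challenger on the same evaluation window (metric names and margin are illustrative).

```python
def should_promote(champion_metrics: dict, challenger_metrics: dict,
                   keys=("roc_auc", "precision"), margin: float = 0.01) -> bool:
    """Promote the challenger only if it beats the champion on every key metric
    by at least `margin`, so that models are not swapped on noise."""
    return all(
        challenger_metrics[k] >= champion_metrics[k] + margin
        for k in keys
    )

# Illustrative metrics gathered from A/B testing or a shadow deployment.
champion = {"roc_auc": 0.87, "precision": 0.81}
challenger = {"roc_auc": 0.89, "precision": 0.83}
print(should_promote(champion, challenger))  # True -> the challenger becomes the new champion
```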

Model Production Promotions

The process of moving models from development to production involves several stages, including validation, staging, and gradual rollout. A robust promotion pipeline ensures that only well-tested, high-performing models reach production. Tools like MLflow Model Registry and Google Cloud AI Platform facilitate this process by providing versioning, approval workflows, and deployment automation. The challenge is to create a seamless yet controlled path from experimentation to production while maintaining model integrity and performance.

Monitoring and Observability

Effective ML systems require comprehensive monitoring across the entire pipeline, from data ingestion to prediction serving. Key areas include model performance metrics, data drift detection, resource utilization, and system health. Tools like Prometheus, Grafana, and specialized ML monitoring platforms like Arize AI or Fiddler AI provide insights into model behavior in production. The challenge is to design monitoring systems that can detect subtle issues early, enabling proactive maintenance and ensuring model reliability.
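For instance, a prediction service can expose its own metrics with the Prometheus Python client, which Prometheus scrapes and Grafana plots and alerts on (metric names below are illustrative).

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total predictions served",
                      ["model_version"])
LATENCY = Histogram("model_inference_latency_seconds", "Inference latency in seconds")

def predict_with_metrics(model, features, model_version="v3"):
    start = time.perf_counter()
    prediction = model.predict_proba([features])[0][1]
    LATENCY.observe(time.perf_counter() - start)
    PREDICTIONS.labels(model_version=model_version).inc()
    return prediction

# Expose the /metrics endpoint that Prometheus scrapes.
start_http_server(8000)
```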

Versioning and Reproducibility

Maintaining version control for models, data, and code is essential for reproducibility and auditing. This involves tracking model lineage, including training data, hyperparameters, and dependencies. Git-based version control, combined with ML-specific tools like DVC (Data Version Control) and MLflow, helps manage these aspects. The challenge lies in creating a comprehensive versioning strategy that covers all components of the ML pipeline while remaining manageable and accessible to both data scientists and operations teams.

Scaling and Performance Optimization

As ML models move to production, they often face challenges in handling high request volumes and maintaining low latency. Techniques like model quantization (reduces model size by converting weights from 32-bit floating point to lower precision formats like 8-bit integers), pruning (removes unnecessary weights/connections from neural networks to make them smaller and faster), and distillation (trains a smaller “student” model to mimic a larger “teacher” model’s behavior, preserving performance while reducing size) can reduce model size and inference time. Containerization and orchestration tools like Docker and Kubernetes enable flexible scaling of ML services. GPU acceleration and specialized hardware (e.g., TPUs) can significantly boost performance in terms of latency and throughput. The challenge is balancing model accuracy with computational efficiency and cost-effectiveness at scale.
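As a concrete example of one of these techniques, PyTorch's dynamic quantization converts a model's linear layers to 8-bit integers in a couple of lines; the model below is a toy stand-in for a trained network.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained model.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))

# Dynamic quantization: Linear layer weights are stored as 8-bit integers,
# shrinking the artifact and typically speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    output = quantized(torch.randn(1, 256))
```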

Regulatory Compliance and Ethics

Deploying ML models in regulated industries (e.g., finance, healthcare) requires adherence to specific guidelines and ethical considerations. This involves ensuring model interpretability, fairness, and privacy protection. Tools like IBM AI Fairness 360 and Google’s What-If Tool help assess model fairness and bias. SHAP (SHapley Additive exPlanations) aids in model interpretability. The challenge is integrating these considerations throughout the ML lifecycle, from data collection to model deployment and monitoring, while maintaining compliance with evolving regulations.

Best Practices for Model Deployment

Automate when needed. Solve real problems.

This is the first point I want to make clear. Over-engineering is a real risk, and it is very easy to fall into when building machine learning systems. There are so many risks and uncertainties, and so much literature about production ML, that professionals might feel the need to cover every possible case and invest heavily in automation. But not all risks need to be addressed, and not every problem is actually important.

For example, consider a startup developing a recommendation system for an e-commerce platform. Initially, they might be tempted to implement a complex, fully automated ML pipeline with real-time model updates, extensive A/B testing, and sophisticated monitoring. However, if the platform only has a few thousand monthly active users and product catalogs change infrequently, such a system would be overkill. Instead, a simpler approach with weekly batch predictions and manual model updates could suffice, allowing the team to focus on more pressing business needs like user acquisition or improving the core product features.

Another scenario: a large financial institution implementing a fraud detection system. Here, automation becomes crucial due to the high stakes involved and the need for rapid response to emerging fraud patterns. Implementing automated model retraining triggered by performance degradation or data drift detection, along with a robust CI/CD pipeline for model deployment, would be justified. The system could automatically retrain on new data daily, run validation tests, and deploy updated models if they meet predefined performance thresholds.

The value of ML is the first thing that has to be proven. ROI is a metric that business managers always aim to optimize. Therefore, progressive and continuous improvement is what will make an ML system thrive.

To demonstrate ML value, start with a simple baseline model and measure its impact on key business metrics. For instance, an online retailer implementing a product recommendation system could begin with a basic collaborative filtering algorithm. They might measure the click-through rate (CTR) and conversion rate of recommended products compared to a control group without recommendations. If the initial model shows a 5% increase in CTR and a 2% increase in conversions, this provides a clear ROI that justifies further investment in the ML system.

As the system proves its worth, gradually introduce more sophisticated techniques. The retailer might evolve their recommendation system to incorporate deep learning models that consider user browsing history, demographic data, and current trends. Each iteration should be benchmarked against the previous version and tied to business outcomes. For example, if the advanced model increases the average order value by 10% compared to the baseline, it provides a compelling case for continued development and resource allocation.

Make ML system upgrades automatic. Reduce manual errors.

Following on from the previous point: once you have a system that is proven to add value, you can start automating. Automating ML system upgrades minimizes human error and ensures consistency in the deployment process. Implement a CI/CD pipeline specifically designed for ML workflows. For example, use tools like Jenkins or GitLab CI to automate the following steps:

1. Trigger model retraining when new data is available or on a set schedule.
2. Run automated tests to validate model performance, data integrity, and system compatibility.
3. If tests pass, automatically package the model and its dependencies into a container.
4. Deploy the container to a staging environment for final validation.
5. If staging validation is successful, automatically promote the model to production.

A concrete example: A credit scoring company automates their model upgrade process. When new financial data is ingested daily, it triggers a pipeline that retrains the model, evaluates its performance against the current production model, and runs a suite of fairness and bias checks. If the new model outperforms the old one and passes all checks, it’s automatically deployed to a canary environment where it processes 5% of incoming credit applications. After 24 hours, if the model’s performance in the canary environment meets predefined criteria, it’s automatically promoted to handle 100% of the traffic.

Use CI/CD. Deploy quickly and safely.

Implementing continuous integration and deployment (CI/CD) for ML systems ensures that code changes, data updates, and model improvements are consistently and reliably integrated into the production environment. This approach helps maintain code quality, reduces integration issues, and accelerates the delivery of ML features.

For instance, a team developing a natural language processing model for customer service automation could set up a CI/CD pipeline using GitHub Actions and Kubernetes. Here’s how it might work:

1. Data scientists commit code changes and update training data in a Git repository.
2. GitHub Actions automatically triggers unit tests, integration tests, and model performance evaluations.
3. If tests pass, the pipeline builds a Docker image containing the updated model and its serving code.
4. The image is pushed to a container registry and a new Kubernetes deployment is created in a staging environment.
5. Automated smoke tests are run in the staging environment to verify basic functionality (see the sketch after this list).
6. If staging tests pass, the pipeline updates the production Kubernetes deployment with a rolling update strategy, ensuring zero downtime.
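The staging smoke test in step 5 can be as small as the sketch below, which posts a known-good input to a hypothetical staging endpoint and fails the pipeline on any unexpected response.

```python
import sys
import requests

STAGING_URL = "https://staging.example.com/predict"  # hypothetical staging endpoint

def smoke_test() -> None:
    payload = {"text": "Where is my order?"}  # minimal, known-good input
    response = requests.post(STAGING_URL, json=payload, timeout=5)

    # Fail the CI/CD job if the service is unhealthy or the response is malformed.
    if response.status_code != 200 or "prediction" not in response.json():
        print("Smoke test failed:", response.status_code, response.text)
        sys.exit(1)

if __name__ == "__main__":
    smoke_test()
```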

This CI/CD setup allows the team to rapidly iterate on their NLP model, testing new features or data preprocessing techniques with confidence. It also provides a clear audit trail of changes and makes it easy to roll back to previous versions if issues are detected.

Track ML artifacts. Keep your history safe.

Tracking ML artifacts is crucial for reproducibility, debugging, and compliance. Artifacts include not just the trained models, but also datasets, feature engineering code, hyperparameters, and evaluation metrics. Use specialized ML tracking tools to maintain this history.

For example, a healthcare company developing models for disease prediction could use DagsHub to track its experiments:

For each model iteration, DagsHub can track:
- Code versions through Git integration
- Data versions through DVC integration
- Experiment parameters and results through integration with experiment tracking tools (like MLflow, Weights & Biases, etc.)
- Model artifacts through DVC or Git LFS

In addition:
- Data scientists can use Git tags and branches to mark versions
- The DagsHub UI provides experiment visualization and comparison through its integrations with experiment tracking tools
- When deploying a model, versioning can be tracked through Git commits and DVC versions
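Through the MLflow integration, logging one model iteration can look like this sketch (tracking URI, experiment name, parameters, and file names are illustrative placeholders):

```python
import mlflow

# DagsHub can host an MLflow tracking server for the repository.
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("disease-prediction")

with mlflow.start_run():
    # Hyperparameters and the dataset version used for this iteration.
    mlflow.log_params({"model": "xgboost", "max_depth": 6, "dataset_version": "v12"})

    # ... training happens here ...

    # Evaluation metrics and the resulting model artifact.
    mlflow.log_metrics({"roc_auc": 0.91, "recall": 0.84})
    mlflow.log_artifact("model.joblib")
```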

This tracking system becomes invaluable when the healthcare company needs to audit their models for regulatory compliance or when investigating unexpected model behavior in production. They can easily trace back from a deployed model to the exact data and code used to create it, facilitating thorough analysis and validation.

Roll out models gradually. Be conservative.

Gradual rollouts minimize the risk of widespread issues when deploying new ML models. This approach allows you to catch potential problems early and limit their impact. Implement techniques like canary releases or blue-green deployments to achieve this.

Consider a large social media platform implementing a new content moderation model:

1. Start with a shadow deployment: the new model runs alongside the current production model, processing the same inputs but not affecting user-facing decisions. This allows comparison of the models’ outputs without risk.
2. If the shadow deployment shows promising results, move to a canary release: deploy the new model to handle 1% of user-generated content and monitor key metrics like false positive rate, latency, and user reports (a minimal traffic-splitting sketch follows this list).
3. Gradually increase traffic to the new model: 5%, 10%, 25%, etc.
4. At each stage, compare metrics to the old model and predefined thresholds.
5. If issues arise at any stage, quickly roll back to the previous model.
6. Once the new model has proven itself with real traffic, complete the rollout to 100%.
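At the service layer, the canary split can be as simple as routing a fixed share of requests to the new model, as in the sketch below (model objects and return values are illustrative); in practice this is often handled by the load balancer or service mesh instead.

```python
import random

CANARY_SHARE = 0.01  # start by sending 1% of traffic to the new model

def moderate_content(content_features, current_model, new_model):
    # Route a small, configurable share of traffic to the canary model.
    is_canary = random.random() < CANARY_SHARE
    model = new_model if is_canary else current_model
    decision = model.predict([content_features])[0]
    # Return which model handled the request so metrics can be compared per model.
    return decision, "canary" if is_canary else "stable"
```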

This conservative approach ensures that any unforeseen issues (e.g., biases in certain types of content, performance degradation under real-world load) are caught early and affect only a small portion of users.

Set up early impact monitoring. Fix proactively.

Early impact monitoring allows you to detect and address issues before they significantly affect your system’s performance or user experience. Set up comprehensive monitoring that covers both technical metrics and business KPIs.

For instance, an e-commerce company using ML for dynamic pricing could implement the following monitoring strategy:

Technical metrics:
- Model inference latency (alert if >100ms for the 95th percentile)
- Prediction confidence scores (flag if the average drops below 0.8)
- Input data distribution (detect drift using KL divergence; see the sketch after this list)

Business KPIs:
- Conversion rate (alert if it drops >5% compared to the 7-day moving average)
- Average order value (notify if outside ±10% of the expected range)
- Inventory turnover rate (flag significant deviations from forecast)

On top of these metrics:
- Set up a real-time dashboard using tools like Grafana or Datadog to visualize them
- Implement automated alerts that trigger when metrics exceed predefined thresholds
- Create an on-call rotation for the ML team to respond to alerts
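The input-drift check above can be implemented, for example, with SciPy's entropy function, which gives the KL divergence between binned distributions (bin count, threshold, and data are illustrative).

```python
import numpy as np
from scipy.stats import entropy

def kl_drift(reference: np.ndarray, live: np.ndarray, bins: int = 20) -> float:
    # Bin both samples on the same edges and compare the resulting distributions.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    live_hist, _ = np.histogram(live, bins=edges)

    # Add a small constant to avoid division by zero in empty bins.
    ref_p = (ref_hist + 1e-6) / (ref_hist.sum() + 1e-6 * bins)
    live_p = (live_hist + 1e-6) / (live_hist.sum() + 1e-6 * bins)
    return float(entropy(live_p, ref_p))  # KL(live || reference)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    reference_prices = rng.lognormal(3.0, 0.5, 50_000)  # prices seen at training time
    live_prices = rng.lognormal(3.2, 0.6, 5_000)        # today's (shifted) prices
    print(kl_drift(reference_prices, live_prices) > 0.1)  # True -> raise a drift alert
```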

With this system in place, the team can quickly identify and investigate issues. For example, if they notice a sudden drop in conversion rates coupled with unusually low prediction confidence scores, they might discover a data quality issue in a recent feature update. By catching this early, they can roll back the problematic feature or adjust the model before it significantly impacts the revenue.

Do load tests. Know your system’s maximum RPM.

Load testing is crucial for understanding your ML system’s performance limits and ensuring it can handle expected traffic spikes. Conduct regular load tests to determine the maximum requests per minute (RPM) your system can handle while maintaining acceptable latency and accuracy.

Here’s an example of how a team might conduct load testing for a real-time image classification API:

1. Set up a staging environment that mirrors the production setup, including model serving infrastructure, databases, and caching layers.
2. Use a load testing tool like Apache JMeter or Locust to simulate various traffic patterns (a Locust sketch follows this list):
   - Steady increase in RPM until performance degrades
   - Sudden traffic spikes (e.g., 10x normal load for 5 minutes)
   - Sustained high load over several hours
3. Monitor and record key metrics during tests:
   - Response time (average and 95th percentile)
   - Error rate
   - Model accuracy (using a set of pre-labeled test images)
   - CPU and memory usage
   - Database query times
4. Identify the breaking point: the RPM at which the system fails to meet SLAs (e.g., >500ms latency or >1% error rate).
5. Analyze results to find bottlenecks. For instance, the team might discover that:
   - The model serving layer maxes out at 1000 RPM per instance
   - Database connections become saturated at 5000 RPM
   - Accuracy drops by 5% when CPU utilization exceeds 80%
6. Use these insights to:
   - Set auto-scaling rules (e.g., add new model serving instances when RPM exceeds 800)
   - Optimize database queries or implement caching to reduce database load
   - Implement rate limiting to prevent system overload
7. After making improvements, re-run load tests to verify increased capacity.
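The Locust script for step 2 can look like the sketch below (endpoint, payload, and host are illustrative); it is run with `locust -f loadtest.py --host https://staging.example.com`, ramping up users from the web UI or via the --users and --spawn-rate flags.

```python
from locust import HttpUser, task, between

class ImageClassificationUser(HttpUser):
    # Each simulated user waits 0.5-2 seconds between requests.
    wait_time = between(0.5, 2)

    @task
    def classify(self):
        # Hypothetical endpoint and payload for the image classification API.
        with open("test_image.jpg", "rb") as f:
            self.client.post("/v1/classify", files={"image": f}, name="/v1/classify")
```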

By regularly performing these load tests, the team ensures their ML system can handle expected growth and temporary traffic spikes, such as those during marketing campaigns or holiday seasons.

Conclusion

Machine Learning deployment has evolved from rudimentary manual processes to sophisticated, automated systems. The emergence of MLOps has professionalized the field, introducing robust strategies and tools for deploying, monitoring, and maintaining ML models in production environments.

Successful ML deployment requires a deep understanding of various deployment types – from batch predictions to real-time online systems – and the ability to choose the right approach based on specific business needs and technical constraints. Equally important is the implementation of effective deployment strategies such as canary releases, blue/green deployments, and A/B testing, which minimize risks and ensure smooth transitions when updating models.

While the challenges in ML deployment are numerous – from determining optimal retraining frequencies to ensuring regulatory compliance – a set of best practices has emerged to address these issues. These include automating judiciously, proving ML value through incremental improvements, implementing CI/CD pipelines, tracking ML artifacts meticulously, rolling out models gradually, setting up proactive monitoring, and conducting thorough load testing.

The key to successful ML deployment lies in striking a balance between sophistication and practicality. Over-engineering should be avoided, and solutions should be tailored to the specific needs and scale of each project. As the field continues to mature, staying updated with the latest tools and methodologies while adhering to these core best practices will be crucial for organizations aiming to leverage the full potential of machine learning in production environments.

Ultimately, effective ML deployment is not just about technical excellence, but about delivering tangible business value while managing risks and ensuring system reliability. By following the strategies and best practices outlined in this article, organizations can build robust, scalable, and impactful machine learning systems that drive innovation and competitive advantage in today’s data-driven world.

