Machine Learning Experiment Tracking: Your Ultimate Guide

Learn the importance of machine learning experiment tracking and how to implement it effectively to streamline your ML workflows.

Machine learning workflows consist of several critical stages, including data collection, preprocessing, model development, training, evaluation, and deployment. Each of these stages involves careful decision-making, from choosing preprocessing techniques to selecting the best-performing algorithm and deciding where to deploy the solution. The model training phase in particular requires many decisions to find the best-performing hyperparameters for a model based on metrics, resource usage, or inference time.

The road to the best-performing hyperparameters involves many trials: making new decisions, running experiments, and analyzing and comparing their results. Across these trials, it is easy to lose track of all the combinations you have tried in pursuit of a better ML model. Experiment tracking is the practice of saving all the experiment-related information that matters to your project. While what you track depends on project requirements, common metadata includes:

- Scripts or notebooks used to run the experiments
- Environment configuration files
- Dataset versions used for training, validation, and testing
- ML and DL (deep learning) evaluation metrics
- Business-relevant KPIs such as click-through rate, Customer Acquisition Cost (CAC), Customer Retention Rate, Churn Rate, and Return on Investment (ROI)
- Model weights and training parameters
- Performance visualizations such as confusion matrix-based metrics and ROC curves

In this article, you will learn about machine learning experiment tracking, its importance, and implementation details to streamline your machine learning experimentation process.

Why is ML Experiment Tracking Important?

In the machine learning development lifecycle, experiment tracking plays a crucial role in ensuring the efficiency and reproducibility of model development. Here are some of the key reasons why experiment tracking is essential:

Avoiding Redundancy and Resource Waste in Iterative Model Training

Machine learning model training is an iterative process in which multiple changes to the algorithms and their hyperparameters are tried out. A slight change in a parameter (a variable the model learns during training, such as weights and coefficients) or a hyperparameter (a model configuration set before training, such as the learning rate or number of epochs) can significantly change model performance. If you lose track of all the changes you have tried, you are likely to repeat the same experiment multiple times, wasting resources and time, or to skip an experiment that would have produced good results. This is one of the most important reasons to keep track of all experiments, including minor ones, in the modeling phase.

Achieving Reproducibility in ML Experiments

Reproducibility is one of the core requirements of machine learning projects. Imagine working on an industry or research project and achieving good model performance, only for someone else running the same code with the same setup to fail to produce the same results. In that case, other parties cannot trust your model. If you do not keep track of the precise changes to the model, it becomes difficult to debug and reproduce its results. By documenting every aspect of an experiment, including the dataset version, hyperparameters, code, and environment settings, ML experiment tracking ensures that others (including your future self) can replicate any experiment with ease.

Better Model Comparison, Tuning, and Auditing

Experiment tracking systems usually log all experiment findings to a single repository. This lets you systematically compare the outcomes of different models or experiments side by side and identify which one worked best for you. Moreover, you can quickly compare performance metrics across iterations, which supports efficient fine-tuning: model elements like the learning rate or architecture can be adjusted based on the model's past performance.

For organizations in regulated environments like healthcare and finance, keeping an audit trail of all experiments is often required for regulatory purposes. Experiment tracking helps prepare audit documentation that records the data used, the methodologies, and the decisions made during the development process.

Improved Collaboration

Large-scale machine learning projects are often the result of team efforts. In the model training phase, for example, different team members can try out different sets of models and hyperparameters to speed up the modeling process. But this division of work also creates issues, such as two team members working on the same set of models and hyperparameters, which wastes time and resources. Experiment tracking allows teams to work together seamlessly by providing a centralized location where every team member can view and contribute to past and ongoing experiments.

Having everything in one place also helps avoid redundancy, as team members are aware of what has already been tried. Moreover, they can check the status of various model development efforts and plan resource allocation accordingly.

Figure: Benefits of ML Experiment Tracking

Machine Learning Experiment Tracking vs MLOps: What is the Difference?

In machine learning, the terms experiment tracking and MLOps are often confused with each other, but they are quite different and serve different purposes in the broader ML development landscape. MLOps (Machine Learning Operations) is a set of tools and best practices that are designed to automate, scale, and manage the end-to-end machine learning lifecycle. This means it deals with every part of the ML development lifecycle from data collection and preprocessing to scheduling distributed training jobs, deployment, monitoring, and maintenance. Just as DevOps deals with streamlining the entire software development process, MLOps focuses on moving the machine learning models to production.

Figure: Experiment Tracking vs MLOps

Experiment tracking is one component of the MLOps ecosystem that focuses on a specific phase of the development process: experimentation and iterative model development. This is the stage where data scientists and ML engineers run trials, test different algorithms, tune hyperparameters, and compare model performance to select the best-performing model. While experiment tracking covers the model development and evaluation stages of MLOps, once a model is selected and ready for deployment, the responsibilities shift to other aspects of MLOps. Note that experiment tracking is still important even if your models never go to production, especially for research projects and POCs (Proofs of Concept) where you will want to revisit your experiments and findings for future analysis.

Experiment Tracking vs. Experiment Management: What is the Difference?

The terms experiment tracking and experiment management are closely related, but they are two distinct processes. Since experiment tracking has already been covered, this section only needs to introduce experiment management.

Experiment management is the process of coordinating and organizing experiments, workflows, and processes. While experiment tracking focuses on individual runs, experiment management covers the planning, scheduling, and optimization of the entire experimentation workflow. It ensures that experiments are well managed and aligned with project goals, timelines, and resource availability.

Figure: Experiment Tracking vs Experiment Management

Experiment management works by setting objectives for different experiments and managing dependencies among them. In this way, it optimizes the use of compute resources like GPUs and TPUs and ensures that the results from multiple experiments feed into the broader project lifecycle.

Best Practices in ML Experiment Tracking

Now that you know about the importance of experiment tracking and how it differs from experiment management and MLOps, you should also be aware of some of the best practices that can help you get the most out of the experiment tracking process.

Consistent Naming Conventions for Experiments

As you tend to perform several trials in the model training phase, using a consistent naming convention for your experiments becomes vital. Vague names such as ‘model1’ or ‘model_2’ hold little value because it is hard to tell what an experiment contains from them. Always include key indicators like the model type, dataset, and purpose of the experiment so you can easily locate specific runs later. For example, a name like ‘ResNet50-augmented-imagenet-exp-01’ conveys the model architecture, dataset, and experiment number. Including key hyperparameters such as the learning rate or batch size in the name can also be informative.
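
As a quick illustration, the helper below composes a run name from the model type, dataset, and an experiment counter, optionally appending a key hyperparameter; the exact fields and separators are just one possible convention, not a fixed standard:

def build_run_name(model, dataset, exp_id, lr=None):
    """Compose a descriptive, sortable experiment name (one possible convention)."""
    name = f"{model}-{dataset}-exp-{exp_id:02d}"
    if lr is not None:
        name += f"-lr{lr}"  # optionally encode key hyperparameters
    return name

print(build_run_name("ResNet50", "augmented-imagenet", 1))
# ResNet50-augmented-imagenet-exp-01
print(build_run_name("ResNet50", "augmented-imagenet", 2, lr=0.001))
# ResNet50-augmented-imagenet-exp-02-lr0.001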

Versioning Data, Models, and Code

Experiment tracking is not only concerned with hyperparameters and ML models; it also covers the data and code (scripts and notebooks). It is essential to version each component so that experiments can be reproduced and compared across different stages of development. Versioning records the dataset version, model iteration, and code commit used in each experiment. You can use tools like Git for code versioning and DVC for data versioning, as in the sketch below.
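
For instance, you can capture the exact code commit and a fingerprint of the training data alongside each run. This is a minimal sketch that assumes your code lives in a Git repository and the dataset sits at a hypothetical local path:

import hashlib
import subprocess

def current_git_commit():
    """Return the Git commit hash of the working directory (assumes a Git repository)."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

def dataset_fingerprint(path):
    """Hash the dataset file so each run records exactly which data version it used."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

run_metadata = {
    "code_version": current_git_commit(),
    "data_version": dataset_fingerprint("data/iris.csv"),  # hypothetical dataset path
}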

Logging Metrics and Hyperparameters Accurately

To compare the performance of different models and hyperparameter settings, you need to track metrics and hyperparameters. Metrics like accuracy, loss, precision, recall, and F1-score, and hyperparameters like learning rate, batch size, and the optimization algorithm, should be tracked according to your project needs. Where your project allows, use automated logging tools to minimize the errors that come with manual logging in documents or spreadsheets.

Choosing the Right Experiment Tracking Mechanism

There are multiple ways to track your models' metadata, and the right choice depends on team size, the number of experiments you plan to run, the complexity of the models, and the level of detail you need from experiment metadata. Four of the main methods used for experiment tracking today are as follows:

- Manual Tracking: Use a spreadsheet or document to record every experiment by hand. With each trial, you write the findings into the file. This is sufficient when tracking needs are minimal, for example a handful of trials or a very small set of hyperparameters, but it is not ideal for most real-world use cases.
- Automated Versioning: Use a version control tool like Git to store your experiment metadata. Although Git is not built for experiment tracking, you can store hyperparameters, model configurations, and results as files in a repository (text documents, code, configuration files, and even serialized models), with each experiment captured as a commit (a minimal sketch of this approach follows this list). However, Git is not optimized for ML use cases with large datasets and model files, and its features for visualizing models and comparing experiments are limited.
- Dedicated Experiment Tracking Tools: Use specialized software designed to record, organize, and compare data from machine learning experiments. Almost all of these tools provide a user interface for comparing experiments, visualizing model results, and building custom dashboards, along with APIs to log experiment data and monitor progress in real time.
- Model Cards: Use standardized documents, essentially markdown files, that capture a model's overview, intended use, performance metrics, training and evaluation data, limitations, and ethical considerations. Model cards are key drivers of discoverability, reproducibility, and transparency, helping stakeholders make informed decisions about deploying models in various contexts.
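
As a minimal sketch of the Git-based approach, the snippet below writes an experiment's configuration and results to a JSON file and commits it; the file path, metric values, and commit message are illustrative only, and it assumes the script runs inside an existing Git repository:

import json
import os
import subprocess

# Record the experiment configuration and results as a plain file (illustrative values)
experiment = {
    "model": "DecisionTreeClassifier",
    "hyperparameters": {"max_depth": 3},
    "metrics": {"accuracy": 0.95},
}

os.makedirs("experiments", exist_ok=True)
with open("experiments/exp_001.json", "w") as f:
    json.dump(experiment, f, indent=2)

# Version the file with Git so every experiment maps to a commit
subprocess.run(["git", "add", "experiments/exp_001.json"], check=True)
subprocess.run(["git", "commit", "-m", "Experiment 001: decision tree, max_depth=3"], check=True)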

Tips for Integrating Tracking Systems With Existing ML Pipelines

Integrating experiment-tracking systems into ML pipelines can save a lot of time and effort. Ideally, integrate the tracking system into your workflow from the beginning of the project so that all experiments are logged automatically. Most modern experiment tracking platforms offer APIs that you can embed in your training scripts to log experiment and model-related data automatically. Additionally, you can set up alerts or automated reporting to monitor experiments in real time and identify issues early in the training process, as in the sketch below.
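
The sketch below illustrates the idea: a tracking call inside the training loop logs a metric every epoch, and a simple check prints an alert when validation loss stops improving. The training step and loss values are placeholders, and MLFlow is used only as an example logging API:

import random
import mlflow

def train_one_epoch(epoch):
    """Placeholder for your real training step; returns a dummy validation loss."""
    return 1.0 / (epoch + 1) + random.random() * 0.05

best_val_loss = float("inf")
patience, epochs_without_improvement = 3, 0

with mlflow.start_run(run_name="pipeline-integration-demo"):
    for epoch in range(20):
        val_loss = train_one_epoch(epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Alert: no improvement for {patience} epochs at epoch {epoch}")
                break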

Experiment Tracking for LLMs vs. Traditional ML Models

As large language models rise in popularity, they bring experiment tracking needs with their own set of challenges and best practices, largely because they involve larger datasets, longer training times, and more complex metrics. Let's take a detailed look at what to focus on for LLM experiment tracking.

Unique Challenges with LLMs

Experimentation with LLMs (including fine-tuning) often involves large datasets and long training times that make tracking challenging. Because LLMs take longer to train, logging intermediate checkpoints and evaluating performance periodically becomes vital for monitoring progress (see the sketch below). Moreover, LLMs typically run on specialized hardware such as TPUs or multiple GPUs, which adds complexity to resource usage tracking and fine-tuning.
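
As an illustration, intermediate checkpoints can be logged as artifacts at a fixed step interval so that long-running jobs can be inspected and resumed. The sketch below uses MLFlow only as an example, with a placeholder save_checkpoint() standing in for your framework's own checkpointing call:

import os
import mlflow

CHECKPOINT_EVERY = 1000  # steps; tune to your training budget

def save_checkpoint(step):
    """Placeholder: a real run would serialize model and optimizer state here."""
    os.makedirs("checkpoints", exist_ok=True)
    path = f"checkpoints/step_{step}.pt"
    with open(path, "wb") as f:
        f.write(b"")  # dummy content for the sketch
    return path

with mlflow.start_run(run_name="llm-finetune-demo"):
    for step in range(0, 5000, CHECKPOINT_EVERY):
        # ... a chunk of training would happen here ...
        ckpt_path = save_checkpoint(step)
        mlflow.log_artifact(ckpt_path)  # keep the checkpoint with the run
        mlflow.log_metric("checkpoint_step", step)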

Differences in Parameter Tracking and Evaluation Metrics

Unlike traditional models, LLMs often involve millions or billions of parameters that require more robust tooling and infrastructure. LLM evaluation metrics like BLEU, ROUGE, perplexity, and RAGAs also differ significantly from traditional ML metrics like accuracy or precision and are more complex. For this reason, you may need tracking tools that support these metrics and allow custom evaluations for specific tasks like text generation or summarization.
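
As one concrete example, perplexity is typically derived from the average cross-entropy loss of an evaluation run and can be logged alongside it. A minimal sketch, assuming the loss value comes from your own evaluation loop and using MLFlow only as an example logger:

import math
import mlflow

eval_loss = 2.31  # placeholder: average cross-entropy loss from your evaluation loop
perplexity = math.exp(eval_loss)

with mlflow.start_run(run_name="llm-eval-demo"):
    mlflow.log_metric("eval_loss", eval_loss)
    mlflow.log_metric("perplexity", perplexity)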

Moreover, be aware that model-based LLM evaluation approaches like LLM-as-a-Judge or RAGAs are not always supported by dedicated experiment tracking tools. In that case, you need to fall back on traditional methods (such as a document or model cards) to keep track of the metric-related information.

Special Considerations for Tracking Resource Usage and Fine-tuning LLMs

As LLMs are resource- and memory-intensive models, it is important to track resource usage (GPU/TPU hours, memory consumption, etc.) alongside experiment metrics to optimize performance and cost-effectiveness. Moreover, LLMs are usually fine-tuned for organizational use cases, typically through transfer learning, which creates the need for precise tracking of both the pre-training and fine-tuning stages. By recording the exact configurations used in fine-tuning, you ensure the reproducibility of your model and can highlight areas where its performance can be optimized further.

While fine-tuning is a well-established concept in machine learning, LLMs have given rise to a new type of experimentation: prompt-based experimentation. This approach does not rely on training the model on data but on changing the input prompts to guide model behavior. Tracking these prompts is crucial so that you can compare their performance and replicate results consistently whenever needed. It also keeps you from wasting time revisiting prompt formulations that did not work well in the first place, letting you focus on the ones that performed best.
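
A minimal way to make prompt experiments comparable is to log each prompt template, the model's response, and a quality score together. The sketch below uses MLFlow parameters, metrics, and text artifacts as an example; the response and score are placeholders for whatever your evaluation produces:

import mlflow

prompt_template = "Summarize the following support ticket in two sentences:\n{ticket}"
response = "..."     # placeholder: the model's output on a fixed evaluation input
quality_score = 4.2  # placeholder: human or automated rating of the response

with mlflow.start_run(run_name="prompt-v3-two-sentence-summary"):
    mlflow.log_param("prompt_template", prompt_template)
    mlflow.log_metric("quality_score", quality_score)
    mlflow.log_text(response, "response.txt")  # keep the raw output with the run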

Most of the dedicated experiment tracking tools on the market support LLM experiment tracking; you will see a list of these tools in the upcoming section.

How to Implement Experiment Tracking?

Now that you know what experiment tracking is and why it matters, it is time to delve into the technical and implementation side.

Organizations usually rely on specialized experiment tracking software that can record, organize, and compare data from machine learning experiments. These software/tools are specifically designed for all kinds of ML, DL, and LLM models, and can seamlessly integrate with different kinds of frameworks and platforms like TensorFlow, PyTorch, HuggingFace, etc. Let’s have a look at the market-leading experiment tracking tools one by one:

MLFlow

MLFlow is one of the most popular open-source tools for managing the end-to-end machine learning lifecycle. It provides an experiment tracking component to help you log, organize, and compare machine learning experiments with ease.

Source: https://towardsdatascience.com/experiment-tracking-with-mlflow-in-10-minutes-f7c2128b8f2c

Here are some of the key features of MLFlow:

- It allows logging of experiment-related information such as hyperparameters, metrics, and output files (artifacts).
- It stores a history of all experiment runs, including data versioning, code snapshots, and environment details.
- You can compare different runs through a simple UI to determine the best-performing models based on your metrics and hyperparameters.
- It provides a model registry that lets you store, version, and transition models through different stages (e.g., “staging” to “production”), ensuring that only the best model is deployed to production.

To know more about MLFlow, you can refer to this official documentation.

Comet

Comet is a cloud-based experiment tracking platform designed to track, compare, and manage machine learning experiments. It supports effective experimentation, collaboration, and reproducibility by letting data scientists log parameters, metrics, model artifacts, and visualizations. Flexibility and comprehensiveness are two of Comet's main characteristics, which make it suitable for all kinds of organizations and teams working on ML projects.

Source: https://www.comet.com/site/products/ml-experiment-tracking/

Some of the key features of Comet include:

- It enables real-time logging of experiments and allows you to monitor metrics, hyperparameters, and model performance as training progresses.
- You can compare multiple experiments using the Comet UI and visualize performance metrics, hyperparameter trends, and confusion matrices for better decision-making.
- It can track code versions, datasets, and hyperparameters, enabling ML reproducibility. Comet can also integrate with Git to capture the code snapshot for each experiment.
- Just like MLFlow, Comet provides a centralized platform where you can share experiment results and collaborate on model performance improvements.

To know more about Comet, you can refer to this official documentation.

W&B

Weights and Biases (W&B) is another widely adopted platform for machine learning experiment tracking and collaboration. W&B integrates seamlessly with popular ML frameworks and is easy to adopt without disrupting existing workflows. It supports tracking of machine learning experiments, hyperparameter tuning, and model versioning. You can also monitor your resource usage to identify training bottlenecks and avoid wasting expensive resources.

Source: https://docs.wandb.ai/guides/track/

Here are some of the main features of the W&B tracking tool:

- Just like Comet, W&B supports real-time logging and tracking of experiments to monitor ML models as they progress.
- It provides a wide variety of visualization tools to compare experiments, track performance metrics, and plot results across models.
- It offers a special feature called hyperparameter sweeps that lets you explore multiple combinations of hyperparameters automatically.
- It enables reproducibility by automatically tracking the code, datasets, and hyperparameters used in each experiment.

To know more about W&B, you can refer to this official documentation.

Neptune

Neptune is another web-based, versatile, and user-friendly platform designed for machine learning (ML) experiment tracking and model management. It serves the same purpose of logging, organizing, and analyzing ML experiments effectively. Neptune is particularly well suited to large teams working on many experiments simultaneously, thanks to its extensive collaboration, comparison, and version control features.

Source: https://kozodoi.me/blog/20210430/tracking-experiments

Some of the key features of Neptune are as follows:

- It provides a centralized repository where all experiments, metrics, models, and logs are stored, providing transparency when searching, comparing, and retrieving past experiments.
- It provides a flexible API and a nested metadata structure that lets you customize the metadata logging structure however you want.
- It provides strong version control to ensure that each experiment's code, data, and model configurations are fully tracked.
- It also allows you to create custom dashboards to combine different metadata types in one view, check resource usage, and see your learning curves, image predictions, and confusion matrices to debug model quality.

To know more about Neptune, you can refer to this official documentation.

DVC

Data Version Control (DVC) is an open-source version control system that is specially designed to track not just code, but also data, models, and machine learning pipelines. The power of DVC lies in its integration capability with Git which allows machine learning teams to manage experiments, version datasets, track models, and share workflows effectively. Using DVC for experiment tracking ensures that users can organize and reproduce complex ML workflows in a scalable manner.

Source: https://dvc.org/doc/start/experiments/experiment-tracking

Here are some of the key features of DVC:

- It uses a DAG-based (Directed Acyclic Graph) approach to keep track of all changes made to datasets, parameters, and code.
- It enables developers to define machine learning pipelines with stages like data preprocessing, model training, evaluation, and more.
- It works with various storage backends, such as AWS S3, Google Cloud Storage, Azure Blob Storage, and local storage, to store datasets and model files.
- It provides collaboration capabilities by allowing team members to share experiments, datasets, and models through Git and remote storage.

To know more about DVC, you can refer to this official documentation.

DagsHub

While the tools above each offer interesting features for logging ML model metadata, learning each of them takes time. If you are looking for a single solution that makes experiment tracking easier, DagsHub fills that role. DagsHub is an ideal tool for developers who want a unified platform that offers more than just experiment tracking. It is tightly integrated with Git and DVC and provides full version control of data, models, and experiments, making it well suited to projects where reproducibility and collaboration are crucial. DagsHub also integrates directly with MLFlow, the most popular tracking tool, so you can track experiments using MLFlow's API while benefiting from DagsHub's additional features like data versioning and collaboration.

Source: https://dagshub.com/blog/launching-dagshub-integration-with-databricks-mlflow/

Some of its key features include:

- With its Git and DVC integration, DagsHub provides end-to-end versioning capability.
- Since DagsHub is built on top of open-source tools like MLFlow, DVC, and Git, it allows users to avoid vendor lock-in and gives them the flexibility to move their projects across platforms.
- Collaboration is a key feature of DagsHub: by providing a GitHub-like experience for machine learning projects, it allows teams to easily share and collaborate on experiments.
- DagsHub provides a visual interface to view and manage all your experiments.
- With its lineage tracking system, DagsHub allows users to trace results and model outputs back to specific versions of data and code.

Step-by-step Guide on Setting up Experiment Tracking

Now that you know the tools, let's walk through a simple practical example of storing a model and its related metadata for the well-known Iris dataset. Before you begin, install the following Python dependencies:

pip install mlflow
pip install comet-ml
pip install neptune
pip install wandb
pip install dvc

Once installed you can proceed with loading all the necessary dependencies as follows:

import mlflow
import mlflow.sklearn
import wandb
import neptune
from comet_ml import Experiment
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

Next, you can load the dataset, split it, train a simple decision tree classifier, and check its accuracy as follows:

# Load the dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

# Train a Decision Tree model
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Evaluate model accuracy
accuracy = accuracy_score(y_test, model.predict(X_test))

Now all that remains is storing this trained model, its hyperparameters, and metrics in the experiment tracking platforms. Let's take a look at the code for each platform:

MLFlow

In MLFlow, you start a run with the start_run() method and then use the log_param(), log_metric(), and log_model() methods to log the hyperparameters, metrics, and models needed for your use case:

# Start an MLFlow run
with mlflow.start_run():
    # Log model parameters and metrics
    mlflow.log_param("max_depth", 3)
    mlflow.log_metric("accuracy", accuracy)

    # Log the model
    mlflow.sklearn.log_model(model, "decision_tree_model")

Comet

To use Comet for experiment logging, you need to create an account on Comet and obtain an API key that will be used for authentication. Once you are connected to Comet with the API key, you can use the log_parameter(), log_metric(), and log_model() methods to log all the necessary experiment details as follows:

# Create a new Comet experiment
experiment = Experiment(api_key="your_comet_api_key", project_name="iris-classification")

# Log parameters, metrics, and model to Comet
experiment.log_parameter("max_depth", 3)
experiment.log_metric("accuracy", accuracy)
experiment.log_model("DecisionTreeClassifier", model)

W&B

To use W&B, you need to initialize the wandb instance using the init() method and then you can use the log() method to log all the hyperparameters and other experiment details as follows:

# Initialize a new W&B run
wandb.init(project="iris-classification", name="decision_tree")

# Log parameters and metrics to W&B
wandb.log({"max_depth": 3, "accuracy": accuracy})
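
The snippet above logs only parameters and metrics; if you also want to version the trained model file with the run, W&B Artifacts can be used. A minimal sketch, assuming the model is first serialized to a hypothetical model.pkl:

import joblib

# Serialize the trained model to disk (hypothetical file name)
joblib.dump(model, "model.pkl")

# Log the file as a W&B artifact and close the run
artifact = wandb.Artifact("decision_tree_model", type="model")
artifact.add_file("model.pkl")
wandb.log_artifact(artifact)
wandb.finish()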

Neptune

Similar to Comet, you will have to obtain an API key from the Neptune platform for logging your experiments. Begin by initializing a run with the init_run() method, providing the API token and project name; you can then log any experiment information in Neptune by assigning values to dictionary-like fields, as follows:

# Initialize Neptune run
run = neptune.init_run(
    project="your_workspace/your_project",
    api_token="your_neptune_api_token"
)

# Log parameters and metrics to Neptune
run["parameters/max_depth"] = 3
run["metrics/accuracy"] = accuracy
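
# Optionally attach the serialized model file as well before stopping the run
# (a hedged sketch: this assumes the model was saved to a hypothetical model.pkl,
# e.g. with joblib.dump(model, "model.pkl"))
run["model/artifact"].upload("model.pkl")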

# Stop the run
run.stop()

DVC

DVC works a little differently from the other experiment tracking tools; here you use the terminal or command prompt to save the experiment-related details as follows:

dvc add data/iris.csv
dvc add model.pkl
git add data/iris.csv.dvc model.pkl.dvc
git commit -m "Tracked dataset and model"

DagsHub

To use DagsHub for experiment tracking, you need to set up a DagsHub repository and then, using DVC and MLFlow (as explained above), you can track every aspect of your model training process. A minimal sketch of pointing MLFlow at a DagsHub repository follows.
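
As an example, once you have created a DagsHub repository you can point MLFlow at the tracking server DagsHub exposes for it and reuse the MLFlow logging code from earlier; the username, repository name, and token below are placeholders you would replace with your own:

import os
import mlflow

# Placeholders: replace with your own DagsHub username, repository, and access token
os.environ["MLFLOW_TRACKING_USERNAME"] = "<username>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your-dagshub-token>"
mlflow.set_tracking_uri("https://dagshub.com/<username>/<repo-name>.mlflow")

# The same MLFlow logging code as before now records runs in the DagsHub repository
with mlflow.start_run():
    mlflow.log_param("max_depth", 3)
    mlflow.log_metric("accuracy", accuracy)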

Choosing the Right Tool Based on Your Use Case

Since multiple tools are available for experiment tracking, it can be quite confusing to choose the one best suited for your ML use case. So, here are some of the considerations for choosing the right one as per your requirements:

MLflow

- Type: Open-source
- Key Features: Flexible; supports the entire ML lifecycle, including experiment tracking, model deployment, and a model registry.
- Use Case: Ideal for teams looking for an open-source solution with broad functionality.

Comet

- Type: Open-source & Enterprise
- Key Features: Real-time tracking, customizable dashboards, and powerful collaboration features.
- Use Case: Great for teams needing detailed visualizations and enterprise-grade reporting.

Weights & Biases (W&B)

- Type: Cloud-based & Enterprise
- Key Features: Experiment tracking, collaboration, and integration with popular deep learning frameworks (e.g., TensorFlow, PyTorch).
- Use Case: Popular for projects focused on deep learning and large-scale experiments.

Neptune.ai

- Type: Cloud-based & Enterprise
- Key Features: Easy integration, real-time monitoring, versioning, and dashboards.
- Use Case: Suited for tracking experiments and models in large teams with a focus on collaboration.

DVC (Data Version Control)

- Type: Open-source
- Key Features: Designed for versioning datasets and models, integrated with Git.
- Use Case: Ideal for teams using version control for large datasets and models, with a focus on reproducibility.

DagsHub

- Type: Open-source
- Key Features: Supports version control for code, data, models, and experiments using Git, DVC, and MLflow. Collaboration-friendly, integrates with popular ML tools, and provides a visual interface for tracking experiments.
- Use Case: Ideal for teams looking for an open-source, collaboration-focused platform to manage machine learning projects, track experiments, and version datasets and models.

Conclusion

After reading this article, you now know the importance of machine learning experiment tracking. You have seen a list of popular tools that are often used by organizations to manage their experiments in the modeling phase. Systematically recording and organizing experiments enables data scientists and teams to maintain reproducibility, collaboration, and efficiency throughout the model development lifecycle.

As machine learning continues to grow, especially with large-scale models like LLMs, a robust experiment tracking system is essential for ensuring transparency, traceability, and performance optimization. By following the strategies and best practices discussed in this article, you are ready to implement effective experiment tracking and drive better outcomes in your machine learning projects.