TL;DR
Every machine learning project starts simple – that’s a good thing. But at some point, you need to promote it to production-grade – and that transition should be as easy and as seamless as possible. Let’s see how we can boil it down to 3 lines of code, with DagsHub, and get from a notebook to a central source of truth including your code, data, models, and experiments which you can share with team members.
Machine Learning Has Scrappy Beginnings – It’s a Feature, Not a Bug!
You’ve just started working on a new machine learning project. You want to show results ASAP, so that other stakeholders in your organization understand the business value of the model, and you can continue working on the project and build the model into a real-world machine learning application.
Starting scrappy is the right way to go – this usually means a notebook. If you can, you might even ask for a non-sensitive data sample and use Colab since it provides powerful compute resources and is easily shareable. Your goal is to arrive at some result as fast as you can, so you don’t want to get bogged down in unnecessary processes and tooling – after all, if this direction is a dead end, you might throw everything out the window, so all that infrastructure and process investment would be a waste.
Throughout the building process, you might extract some code into functions for more convenient use, and even commit it to the team’s machine learning utility repo, but a lot of the meat will remain in the notebook itself.
A few days or weeks later, you show your company’s stakeholders the results, and they’re excited! Let’s get this to production ASAP! You know this means the project will be more long-term, and that requires more rigorous processes and tooling to make sure that the data, models, experiments, and code are tracked, work can be split and shared between team members, and the project has a central source of truth.
Now that you’ve spent so much time in the prototyping phase, that’s a non-trivial amount of work. So you put it off for later. You need to get to production. Processes can come later. But what if that didn’t need to be the case? What if you could organize your project and get all those benefits with just 3 lines of code? Let’s see how you can do this with DagsHub so that you never need to compromise again!
What is DagsHub?
If you’re already familiar with DagsHub, skip this part and get to the juice of the next section – if not, read on.
DagsHub is a platform for managing and organizing machine learning projects. It provides a central source of truth for your code, data, models, experiments and more, and enables you to collaborate more effectively and get your models to production. It is based on popular open source tools like Git, MLflow, DVC, and Label Studio so that you aren’t reinventing the wheel, but use agreed-upon formats and tools for everything.
3 Lines of Code to Upgrade Your Machine Learning Project
Starting from a Colab (or local) notebook, let’s see how you can do the following with 3 lines of code:
1. Track Data
2. Track Experiments
3. Track Notebooks + Code
The prerequisites to these three lines are simply installing the DagsHub client and creating a DagsHub repo.
To install the client, simply:
pip install dagshub
And don’t forget to import it:
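import dagshub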
Then, to create a repo, sign up to DagsHub and click the “Create” button in the top right of the page.
You can either create a blank repository, use a project template, or – if you already have a code repo you’d like to connect – connect an existing repo to DagsHub. With integrations for all popular Git providers, you’ll be able to add data, experiments, and notebooks to existing repos (this will create Git commits where necessary).
1. Track Data
Let’s assume that in scrappy mode, you just got a CSV, or a bunch of image files that you uploaded to GDrive, into a folder named data, and the drive is mounted to your Colab notebook. The first line of code we’ll use is:
dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="drive/MyDrive/data", remote_path="data")
If you have your data in a different folder, just change the local_path= argument.
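For reference, here’s a minimal sketch of what this step can look like end-to-end in Colab, assuming your data lives in drive/MyDrive/data as above (the drive.mount call is only needed if Drive isn’t already mounted):

import dagshub
from google.colab import drive

# Mount Google Drive so the data folder is accessible from the notebook (Colab only)
drive.mount("/content/drive")

# Upload the folder's contents to a "data" directory in your DagsHub repo
dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="drive/MyDrive/data", remote_path="data")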
✅ Phase 1 DONE! You should now see a folder named data with your data files on DagsHub.
2. Track Experiments
For experiment tracking, we’ll use MLflow – the most popular open source experiment tracking tool. DagsHub is integrated with MLflow, which means that every project comes with a fully configured MLflow server, ready to log experiments and models. Assuming your code uses MLflow for experiment tracking, you’ll only need one line to track the experiment and its model to DagsHub.
We’ll use the following:
dagshub.init(repo_owner="<repo_owner>", repo_name="<repo_name>")
✅ Phase 2 DONE! If you go to the experiment table in your repository, located at https://dagshub.com/<repo_owner>/<repo_name>/experiments/, you’ll be able to see your first experiment. In the MLflow UI associated with your repository (located at https://dagshub.com/<repo_owner>/<repo_name>.mlflow), you’ll also be able to see the actual model that was logged.
Note: The easiest way to instrument your code with MLflow is to use the autolog API, which supports most standard ML libraries. For scikit-learn, for example, the code you need is:
import mlflow

with mlflow.start_run(run_name="my_run"):
    mlflow.sklearn.autolog()
    # Add code to train your scikit-learn model
    ...
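Putting the two lines together, here’s a minimal runnable sketch – the Iris dataset and LogisticRegression model are just stand-ins for your own data and training code:

import dagshub
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Point MLflow at the MLflow server that comes with your DagsHub repo
dagshub.init(repo_owner="<repo_owner>", repo_name="<repo_name>")

with mlflow.start_run(run_name="my_run"):
    # Automatically log parameters, metrics, and the trained model
    mlflow.sklearn.autolog()
    X, y = load_iris(return_X_y=True)
    LogisticRegression(max_iter=200).fit(X, y)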
3. Track Code/Notebook
If you’re working locally, you can use the same line of code from “1. Track Data” to upload your code files too.
dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="path/to/notebook.ipynb", remote_path="notebook_to_production.ipynb")
However, if you’re in Google Colab, that’s a more involved process (you’d need to upload the notebook from its save location in GDrive, which might be hard to find). That’s why we created a dedicated save_notebook function, which simply saves the notebook to DagsHub.
dagshub.notebook.save_notebook(repo="<repo_owner>/<repo_name>")
✅ Phase 3 DONE! You’ll now see your notebook in your DagsHub project, which concludes our 3-line adventure.
For Your Next Machine Learning Project
To make sure my promises hold true, I created a sample project that goes through these steps. You can find it on DagsHub.
Now that we’ve seen how easy it is to organize a machine learning project and make it production-ready, you won’t need to worry about organizing your next project from the get-go. You can move fast and get initial results, then easily go through this process to organize the project and share it with other stakeholders and collaborators.
Good luck building!