Machine learning kickstart project
We will build a comprehensive machine learning project, structured so that it scales in terms of compute, team, and system design for almost any smart solution.
Setup
Let’s start by forking this repo and setting up the virtual environment:
$ git xet clone "https://xethub.com/${XET_USER_NAME}/kickstart_ml.git"
$ cd kickstart_ml
$ python -m venv .venv && . .venv/bin/activate
(.venv) $ pip install -r requirements.txt
(.venv) $ git checkout base # this will be our starting point
Train
Before we start, we check out a baseline branch: git checkout -b baseline
Run the train.ipynb Jupyter Notebook. This will:
Download the Titanic dataset.
Build a model.
Run evaluation.
Save the model, the data and the metrics to files.
We can run the entire notebook as follows:
(cd notebooks && ipython -c "%run train.ipynb") # raw
For more sophisticated execution, use papermill.
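As a rough sketch of what a papermill run could look like: papermill executes a notebook and writes an executed copy to a separate file. The runs/ output folder, the run name, and the data_path parameter below are hypothetical, and injected parameters are normally paired with a notebook cell tagged parameters.

```python
from pathlib import Path
from typing import Optional


def executed_copy_path(notebook: str, run_name: str, out_dir: str = "runs") -> str:
    """Return where the executed copy of `notebook` is written for this run."""
    return (Path(out_dir) / f"{Path(notebook).stem}-{run_name}.ipynb").as_posix()


def run_notebook(notebook: str, run_name: str, parameters: Optional[dict] = None) -> str:
    """Execute `notebook` with papermill and return the path of the executed copy."""
    import papermill as pm  # imported lazily; requires `pip install papermill`

    output = executed_copy_path(notebook, run_name)
    Path(output).parent.mkdir(parents=True, exist_ok=True)
    # execute_notebook(input_path, output_path, parameters=...) runs every cell
    pm.execute_notebook(notebook, output, parameters=parameters or {})
    return output


# Hypothetical usage (data_path is an assumed parameter name, not from the repo):
# run_notebook("notebooks/train.ipynb", "baseline", {"data_path": "../data/titanic.csv"})
```

Keeping the executed copy separate from the source notebook means every run leaves a record of its outputs without touching train.ipynb itself.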
Let’s push it into our repository and merge it to main.
git add . && git commit -m "baseline training" && git push
git checkout main && git merge baseline && git push
Congratulations! You did your first project!
Next step
Data
Should we save our data in the same repo? That is absolutely possible, but as your project scales and more data is ingested from other services, you might want different permissions for adding and removing data, and to manage it separately from your machine learning code.
In our example, we save our machine learning datasets, but we could also store A/B test data, database backup dumps, and any other type of data.
Create a repository here and call it “kickstart_data”.
And add the data straight to it:
import os
import pyxet

username = os.getenv("XET_USER_NAME")  # Python does not expand ${XET_USER_NAME} inside strings
fs = pyxet.XetFS()
# a transaction is needed for write
with fs.transaction(f"xet://{username}/kickstart_data/main/"):
    fs.cp("data/titanic.csv", f"xet://{username}/kickstart_data/main/titanic.csv")
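Since the same xet:// URLs recur throughout this walkthrough, a small helper can build them from the XET_USER_NAME environment variable and fail early when it is unset. This is only a sketch; xet_url is a hypothetical helper, not part of pyxet.

```python
import os


def xet_url(repo: str, branch: str, path: str = "") -> str:
    """Build a xet:// URL for the user named in XET_USER_NAME."""
    username = os.environ.get("XET_USER_NAME")
    if not username:
        # Fail early instead of producing a URL such as xet://None/...
        raise RuntimeError("Set the XET_USER_NAME environment variable first")
    return f"xet://{username}/{repo}/{branch}/{path.lstrip('/')}"


# Example: xet_url("kickstart_data", "main", "titanic.csv")
```

The result can be passed wherever a xet:// string is expected, for example as the destination of the copy above.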
We can delete our local data file: rm -rf data.
We use a git submodule to clone the kickstart_data repository instead.
git submodule add --force "https://xethub.com/${XET_USER_NAME}/kickstart_data" data
If you don’t see the titanic.csv file inside the folder, try:
(cd data && git pull && git xet checkout --)
which will materialise the file from a pointer. This is very important if you are working with large datasets.
Let’s adjust our Jupyter Notebook to load the data from “local” and not save it.
...
# df = pd.read_csv("xet://xdssio/titanic/main/titanic.csv") <-- delete
df = pd.read_csv("../data/titanic.csv")
...
# df.to_csv('../data/data.csv', index=False) <-- delete
We can push our changes, and now we manage our data and project with Git 💪!
(cd notebooks && ipython -c "%run train.ipynb") # retrain (for testing)
git add . && git commit -m "moving data to submodule" && git push
Now other teammates can upload data, and we can pull it.
This is great for reproducibility and for re-training cycles, as we show later.
Deployment
Let’s build a FastAPI app to give us predictions.
This could also live in a separate repo and be pulled in as a submodule, but for the sake of simplicity we’ll keep everything in the same repo.
First we create an app branch and install our new requirements.
git checkout -b app
pip install fastapi uvicorn pytest
To save time, we simply copy it from a ready app branch.
import pyxet
fs = pyxet.XetFS()
fs.cp("xdssio/kickstart_ml/app/server", "server")
Test with:
pytest server/tests
Deploy with:
uvicorn server.app:app --reload
Query:
curl -X 'POST' \
  'http://127.0.0.1:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[ { "Pclass": 0, "SibSp": 0, "Parch": 0 } ]'
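The same query can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; build_predict_request and predict are hypothetical helper names, and the shape of the response is assumed to be JSON.

```python
import json
import urllib.request


def build_predict_request(rows, base_url="http://127.0.0.1:8000"):
    """Create the POST request for /predict without sending it."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, base_url="http://127.0.0.1:8000"):
    """Send the request to a running server and decode the JSON response."""
    with urllib.request.urlopen(build_predict_request(rows, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running locally:
# predict([{"Pclass": 0, "SibSp": 0, "Parch": 0}])
```

Separating request construction from sending makes the payload easy to inspect in a test without a live server.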
Let’s have it as part of our project:
git add server && git commit -m "add fastapi app" && git push
As a best practice, let’s have our production code in a prod branch.
git checkout -b prod && git push
Experiments
Our model is pretty simple - let’s up our game a bit.
git checkout -b experiment1
Let’s add XGBoost and some feature engineering:
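from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# numeric features: fill missing values with the mean, then standardise
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# categorical features: one-hot encode
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['Age', 'Fare']),
    ('cat', cat_pipeline, ['Sex', 'Embarked'])
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier())
])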
To get the full code… we’ll copy it from the existing version:
import pyxet
fs = pyxet.XetFS()
fs.cp("xdssio/titanic/experiment1/notebooks/train.ipynb", "notebooks/train.ipynb")
If we run our tests (pytest server/tests), we’ll see that they fail. We need to fix our app.
We change the server/app Query object and the example in our tests:
# server/app.py
from pydantic import BaseModel

class Query(BaseModel):
    Age: float
    Fare: float
    Sex: str
    Embarked: str

# server/tests/app_test.py
import pytest

@pytest.fixture
def example():
    return [{"Sex": "male", "Age": 22.0, "Fare": 7.25, "Embarked": "S"}]
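The tests fail because the old request example no longer carries the fields the new pipeline expects. A dependency-free way to see the mismatch is sketched below, using a plain dataclass as a stand-in for the pydantic Query model; the stand-in and the missing_fields helper are assumptions for illustration only.

```python
from dataclasses import dataclass, fields


@dataclass
class Query:
    """Stand-in for the pydantic model in server/app.py (illustration only)."""
    Age: float
    Fare: float
    Sex: str
    Embarked: str


def missing_fields(payload: dict) -> list:
    """List the Query fields that a request payload does not provide."""
    return [f.name for f in fields(Query) if f.name not in payload]


old_example = {"Pclass": 0, "SibSp": 0, "Parch": 0}
new_example = {"Sex": "male", "Age": 22.0, "Fare": 7.25, "Embarked": "S"}

print(missing_fields(old_example))  # every Query field is missing from the old payload
print(missing_fields(new_example))  # nothing is missing from the new payload
```

The real application gets this validation from the BaseModel subclass, so the dataclass here only demonstrates why the old fixture has to change.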
XetHub currently doesn’t support git workflows, but in the future these tests could run automatically before a merge as part of standard CI/CD.
git add . \
&& git commit -m "experiment with Sex, Age, Fare, Embarked and XGBoost" \
&& git push
Merge new model
git checkout prod
First we compare the results to check that the new model is better. We’ll compare the accuracy in the weighted avg row.
This can be done from any branch.
import pyxet
import os
import pandas as pd
username = os.getenv('XET_USER_NAME')
results = []
for branch in ["prod", "experiment1"]:
    results.append(pd.read_csv(pyxet.open(f"xet://{username}/kickstart_ml/{branch}/metrics/results.csv")))
df = pd.concat(results)
df = df[df['target']=='weighted avg']
df[['precision','recall','f1-score','accuracy','branch']]
   precision    recall  f1-score  accuracy       branch
3   0.729375  0.731844  0.729102  0.731844         prod
3   0.780827  0.782123  0.780847  0.782123  experiment1
Looks good!
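To make the weighted avg row concrete: each per-class metric is averaged with weights proportional to the number of true samples in that class. The sketch below computes this with the standard library only. It assumes the metrics file follows the layout of scikit-learn's classification_report, which the column names suggest but the source does not confirm.

```python
from collections import Counter


def weighted_avg_report(y_true, y_pred):
    """Return support-weighted precision, recall and F1, plus plain accuracy."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / n  # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = n / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}


report = weighted_avg_report([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
print(report)
```

Support-weighted recall always equals accuracy, which is why the recall and accuracy columns in the comparison above show the same values.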
Let’s merge the new model to prod:
git merge experiment1 && git push
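The merge decision can also be written down as an explicit rule rather than judged by eye. The sketch below reuses the numbers from the comparison above; should_merge and its min_gain threshold are hypothetical, and accuracy as the deciding metric is an assumption.

```python
import csv
import io

# Weighted avg rows from the comparison above, one per branch
RESULTS_CSV = """precision,recall,f1-score,accuracy,branch
0.729375,0.731844,0.729102,0.731844,prod
0.780827,0.782123,0.780847,0.782123,experiment1
"""


def should_merge(rows, candidate, baseline="prod", metric="accuracy", min_gain=0.0):
    """Return True when the candidate beats the baseline by more than min_gain."""
    scores = {row["branch"]: float(row[metric]) for row in rows}
    return scores[candidate] - scores[baseline] > min_gain


rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))
print(should_merge(rows, "experiment1"))
```

A positive min_gain guards against merging a model whose improvement is within noise; the right value depends on the size of the evaluation set.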
Congratulations, you are managing your ML project like a boss!
Retrain with more data
Let’s imagine you get more data from the backend, which is saved to our data repo.
We simulate it by just adding data there:
import os
import pyxet

username = os.getenv("XET_USER_NAME")
fs = pyxet.XetFS()
with fs.transaction(f"xet://{username}/kickstart_data/main/"):
    fs.cp("data/titanic.csv", f"xet://{username}/kickstart_data/main/titanic2.csv")
We can have any naming convention for our “training-cycle-jobs” branches.
git checkout -b retrain
(cd data && git pull && git xet checkout --) # this fetches the data locally
We update our train.ipynb:
import glob
# df = pd.read_csv("../data/titanic.csv") # replace this
df = pd.concat(map(pd.read_csv, glob.glob('../data/*.csv')))
Retrain: (cd notebooks && ipython -c "%run train.ipynb")
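One thing to watch when combining every CSV in the data folder: in this simulation titanic2.csv is a copy of titanic.csv, so a plain concatenation doubles every row. The sketch below shows the same glob-and-concatenate idea with the standard library and an optional de-duplication step; load_rows is a hypothetical helper, and whether duplicates should be dropped depends on your data.

```python
import csv
import glob


def load_rows(pattern, drop_duplicates=False):
    """Concatenate the rows of every CSV file matching `pattern`."""
    rows, seen = [], set()
    for path in sorted(glob.glob(pattern)):  # sorted for a reproducible order
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                key = tuple(sorted(row.items()))
                if drop_duplicates and key in seen:
                    continue  # identical row already loaded (possibly from the same file)
                seen.add(key)
                rows.append(row)
    return rows


# Example: load_rows("../data/*.csv", drop_duplicates=True)
```

With pandas, the equivalent would be to call drop_duplicates() on the concatenated DataFrame.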
We can compare the results and merge to production like before.
Some options:
You can revert all your models and deployments.
If you decide to give every training run a new branch name, like retrain/v1 for example, you could always compare them and create dashboards and alerts.
You can automate this “checkout-get-data-train-merge” cycle with those few lines of code.
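One way the checkout-get-data-train-merge cycle could be scripted is sketched below. It strings together the shell commands used earlier in this walkthrough; the function names, the prod default, and the dry-run behaviour are assumptions, not an official XetHub workflow.

```python
import subprocess


def retrain_commands(branch, target="prod"):
    """Return the shell commands for one retraining cycle, in order."""
    return [
        f"git checkout -b {branch}",
        "(cd data && git pull && git xet checkout --)",  # fetch the latest data
        '(cd notebooks && ipython -c "%run train.ipynb")',  # retrain
        f'git add . && git commit -m "retrain {branch}" && git push',
        f"git checkout {target} && git merge {branch} && git push",
    ]


def run_cycle(branch, target="prod", dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    commands = retrain_commands(branch, target)
    for command in commands:
        if dry_run:
            print(command)
        else:
            subprocess.run(command, shell=True, check=True)
    return commands


run_cycle("retrain/v1")  # dry run: only prints the commands
```

In practice the final merge step would sit behind a metrics comparison, so that a retrained model reaches the target branch only when it is at least as good as the current one.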
For more about XetHub, use cases and examples, check out these: