Machine learning kickstart project

We will build a comprehensive machine learning project that scales in terms of compute, team, and system design, and that can serve as a starting point for almost any smart solution.

Setup

Let’s start by forking this repo and setting up the virtual environment:

$ git xet clone "https://xethub.com/${XET_USER_NAME}/kickstart_ml.git"
$ cd kickstart_ml
$ python -m venv .venv && . .venv/bin/activate
(.venv) $ pip install -r requirements.txt
(.venv) $ git checkout base # this will be our starting point

Train

Before we start, we check out a baseline branch: git checkout -b baseline

Run the train.ipynb Jupyter Notebook. This will:

Download the Titanic dataset.
Build a model.
Run evaluation.
Save the model, the data and the metrics to files.

We can run the entire notebook as follows: (cd notebooks && ipython -c "%run train.ipynb") # raw
For more sophisticated execution, use papermill.
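
As a minimal sketch of what a papermill run could look like, the helper below only builds the command line for the papermill CLI (papermill INPUT OUTPUT -p NAME VALUE). The helper name, the output path, and the branch parameter are hypothetical; train.ipynb would also need a cell tagged "parameters" for injected values to take effect.

```python
import shlex


def papermill_command(notebook, output, parameters=None):
    """Build the argv list for a papermill CLI run of `notebook`."""
    command = ["papermill", notebook, output]
    for name, value in (parameters or {}).items():
        # Each parameter is passed as: -p NAME VALUE
        command += ["-p", name, str(value)]
    return command


# Hypothetical output path and parameter, for illustration only.
cmd = papermill_command(
    "notebooks/train.ipynb",
    "notebooks/train_output.ipynb",
    {"branch": "baseline"},
)
print(shlex.join(cmd))
```

Once papermill is installed, the list can be passed to subprocess.run(cmd, check=True) to execute the notebook and keep the executed copy as a record of the run.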

Let’s push it into our repository and merge it to main.

git add . && git commit -m "baseline training" && git push 
git checkout main && git merge baseline && git push

Congratulations! You finished your first project!

Next step

Data

Should we save our data in the same repo? That is absolutely possible, but as your project scales and more data is ingested from other services, you might want different permissions for adding and removing data, and you might want to manage the data differently from your machine learning code.

In our example, we save our machine learning datasets, but the same repo could also hold A/B test data, database backup dumps, and any other type of data.

Create a repository here and call it “kickstart_data”.

And add the data straight to it:

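import os

import pyxet

# Python does not expand ${XET_USER_NAME} inside strings, so read it from the environment
user = os.environ["XET_USER_NAME"]
fs = pyxet.XetFS()
# a transaction is needed for write
with fs.transaction(f"xet://{user}/kickstart_data/main/"):
    fs.cp("data/titanic.csv", f"xet://{user}/kickstart_data/main/titanic.csv")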

We can delete our local data file: rm -rf data.

We use a git submodule to clone the kickstart_data repository instead.

git submodule add --force "https://xethub.com/${XET_USER_NAME}/kickstart_data" data

If you don’t see the titanic.csv file inside the folder, try (cd data && git pull && git xet checkout --) which will materialise the file from a pointer.
- This is very important in case you’re using big data.

Let’s adjust our Jupyter Notebook to load the data from “local” and not save it.

...
# df = pd.read_csv("xet://xdssio/titanic/main/titanic.csv") <-- delete
df = pd.read_csv("../data/titanic.csv")
...
# df.to_csv('../data/data.csv', index=False) <-- delete

We can push our changes, and now we manage our data and project with Git 💪!

(cd notebooks && ipython -c "%run train.ipynb") # retrain (for testing)
git add . && git commit -m "moving data to submodule" && git push 

Now other teammates can upload data, and we can pull it.

This is great for reproducibility and for retraining cycles, as we show later.

Deployment

Let’s build a FastAPI app to give us predictions.

This could also be done in a different repo and pulled in as a submodule, but for the sake of simplicity we’ll keep it all in the same repo.

First we create an app branch and install our new requirements.

git checkout -b app
pip install fastapi uvicorn pytest

To save time, we simply copy it from a ready app branch.

import pyxet

fs = pyxet.XetFS()
fs.cp("xdssio/kickstart_ml/app/server", "server")

Test with: pytest server/tests
Deploy with: uvicorn server.app:app --reload

Query:

curl -X 'POST' \
  'http://127.0.0.1:8000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "Pclass": 0,
    "SibSp": 0,
    "Parch": 0
  }
]'
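
The same query can be sent from Python. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the shape of the response is not specified here, so predict simply returns whatever JSON the server sends back.

```python
import json
import urllib.request


def build_predict_request(rows, base_url="http://127.0.0.1:8000"):
    """Create a POST request for /predict with `rows` as a JSON array."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(rows).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, base_url="http://127.0.0.1:8000"):
    """Send the request and decode the JSON response (the server must be running)."""
    with urllib.request.urlopen(build_predict_request(rows, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need a running server.
request = build_predict_request([{"Pclass": 0, "SibSp": 0, "Parch": 0}])
print(request.get_method(), request.full_url)
```

With the app running locally, predict([{"Pclass": 0, "SibSp": 0, "Parch": 0}]) sends the same payload as the curl command above.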

Let’s have it as part of our project:

git add server && git commit -m "add fastapi app" && git push

As a best practice, let’s keep our production code in a prod branch.

git checkout -b prod && git push

Experiments

Our model is pretty simple - let’s up our game a bit.

git checkout -b experiment1

Let’s add XGBoost and some feature engineering:

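from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

num_pipeline = Pipeline([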
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['Age', 'Fare']),
    ('cat', cat_pipeline, ['Sex', 'Embarked'])
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier())
])

To get the full code… we’ll copy it from the existing version:

import pyxet

fs = pyxet.XetFS()

fs.cp("xdssio/titanic/experiment1/notebooks/train.ipynb", "notebooks/train.ipynb")

If we run our tests (pytest server/tests), we’ll see that they fail! We need to fix our app…

We change the Query object in server/app.py and the example in our tests:

# server/app.py
class Query(BaseModel):
    Age: float
    Fare: float
    Sex: str
    Embarked: str

# server/tests/app_test.py
@pytest.fixture
def example():
    return [{"Sex": "male", "Age": 22.0, "Fare": 7.25, "Embarked": "S"}]

Currently XetHub doesn’t support Git workflows, but in the future these tests could run automatically before a merge as part of standard CI/CD.

git add . \
  && git commit -m "experiment with Sex, Age, Fare, Embarked and XGBoost" \
  && git push

Merge new model

git checkout prod

First we compare the results to check that the new model is better. We’ll compare the accuracy on the weighted avg row.

This can be done from any branch.

import pyxet
import os
import pandas as pd

username = os.getenv('XET_USER_NAME')
results = []
for branch in ["prod", "experiment1"]:
    results.append(pd.read_csv(pyxet.open(f"xet://{username}/kickstart_ml/{branch}/metrics/results.csv")))

df = pd.concat(results)
df = df[df['target']=='weighted avg']
df[['precision','recall','f1-score','accuracy','branch']]

   precision    recall  f1-score  accuracy       branch
3   0.729375  0.731844  0.729102  0.731844         prod
3   0.780827  0.782123  0.780847  0.782123  experiment1

Looks good!
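
The same check can be turned into a small merge gate. This is a sketch under assumptions: the column layout (target, precision, recall, f1-score, accuracy, branch) is inferred from the columns selected above, the function names are hypothetical, and the numbers are the ones printed in the table.

```python
import csv
import io


def weighted_avg_metric(results_csv, metric="accuracy"):
    """Return `metric` from the 'weighted avg' row of a results.csv-style text."""
    for row in csv.DictReader(io.StringIO(results_csv)):
        if row["target"] == "weighted avg":
            return float(row[metric])
    raise ValueError("no 'weighted avg' row found")


def should_promote(prod_csv, candidate_csv, metric="accuracy", min_gain=0.0):
    """True when the candidate beats prod by more than `min_gain`."""
    gain = weighted_avg_metric(candidate_csv, metric) - weighted_avg_metric(prod_csv, metric)
    return gain > min_gain


# Assumed file layout; the values are copied from the comparison table.
HEADER = "target,precision,recall,f1-score,accuracy,branch\n"
prod = HEADER + "weighted avg,0.729375,0.731844,0.729102,0.731844,prod\n"
candidate = HEADER + "weighted avg,0.780827,0.782123,0.780847,0.782123,experiment1\n"
print(should_promote(prod, candidate))  # True: accuracy improves by about 0.05
```

Raising min_gain makes the gate stricter, which helps when small differences could be noise from the train/test split.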

Let’s merge the new model to prod:

git merge experiment1 && git push

Congratulations, you are managing your ML project like a boss!

Retrain with more data

Let’s imagine you get more data from the backend, which is saved to our data repo.

We simulate it by just adding data there:

import os

import pyxet

user = os.environ["XET_USER_NAME"]  # ${XET_USER_NAME} is not expanded inside Python strings
fs = pyxet.XetFS()
with fs.transaction(f"xet://{user}/kickstart_data/main/"):
    fs.cp("data/titanic.csv", f"xet://{user}/kickstart_data/main/titanic2.csv")

We can have any naming convention for our “training-cycle-jobs” branches.

git checkout -b retrain
(cd data && git pull && git xet checkout --) # fetch the new data locally

We update train.ipynb to read every CSV file in the data folder:

import glob

# df = pd.read_csv("../data/titanic.csv")  # replace this
df = pd.concat(map(pd.read_csv, glob.glob('../data/*.csv')))
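
Two details are worth keeping in mind with this pattern: glob.glob returns files in no guaranteed order, and pd.concat keeps each file's own row index by default. A hedged variant that sorts the paths and resets the index might look like this; the helper name and the two tiny files are hypothetical stand-ins for titanic.csv and titanic2.csv.

```python
import glob
import os
import tempfile

import pandas as pd


def load_all_csv(folder):
    """Read every CSV in `folder` in sorted order and stack them into one frame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    return pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)


# Two identical hypothetical files, mirroring the simulated "new data" above.
with tempfile.TemporaryDirectory() as folder:
    for name in ("titanic.csv", "titanic2.csv"):
        pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]}).to_csv(
            os.path.join(folder, name), index=False
        )
    df = load_all_csv(folder)

# Total rows versus distinct rows.
print(len(df), len(df.drop_duplicates()))
```

Because the simulated titanic2.csv is a copy of titanic.csv, every row appears twice here; with real incoming data, a drop_duplicates() call is one way to guard against re-ingested rows.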

Retrain: (cd notebooks && ipython -c "%run train.ipynb")

We can compare the results and merge to production like before.

Some options:

You can revert all your models and deployments.
If you decide to give every training run its own branch name, for example retrain/v1, you can always compare runs and build dashboards and alerts on top of them.
You can automate this “checkout-get-data-train-merge” cycle with a few lines of code.
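
As a rough sketch of that automation, the script below chains the shell commands used earlier in this walkthrough. The function names and the commit message are hypothetical, and the run defaults to a dry run that only prints the commands; the final merge into prod is left out on purpose so it can follow the metrics comparison.

```python
import subprocess


def retrain_commands(branch, message="retrain with new data"):
    """Shell commands for one checkout, get-data, train, and push cycle."""
    return [
        f"git checkout -b {branch}",
        "(cd data && git pull && git xet checkout --)",
        '(cd notebooks && ipython -c "%run train.ipynb")',
        f'git add . && git commit -m "{message}" && git push',
    ]


def run_cycle(branch, dry_run=True):
    """Print the commands, and execute them in order unless `dry_run` is set."""
    commands = retrain_commands(branch)
    for command in commands:
        print(command)
        if not dry_run:
            # check=True stops the cycle at the first failing step.
            subprocess.run(command, shell=True, check=True)
    return commands


executed = run_cycle("retrain/v1")
```

Calling run_cycle("retrain/v2", dry_run=False) from the project root would execute the steps for real.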

For more about XetHub, use cases, and examples, check out these:

pyxet 0.0.4 documentation
