Machine learning kickstart project

We will build a comprehensive machine learning project structured so that it scales in terms of compute, team, and system design for almost any smart solution.

Setup

Let’s start by forking this repo, cloning our fork, and setting up a virtual environment:

$ git xet clone "https://xethub.com/${XET_USER_NAME}/kickstart_ml.git"
$ cd kickstart_ml
$ python -m venv .venv && . .venv/bin/activate
(.venv) $ pip install -r requirements.txt
(.venv) $ git checkout base # this will be our starting point

Train

Before we start, we create a baseline branch: git checkout -b baseline

Run the train.ipynb Jupyter Notebook - this will:

  1. Download the Titanic dataset.

  2. Build a model.

  3. Run evaluation.

  4. Save the model, the data and the metrics to files.

  • We can run the entire notebook as follows: (cd notebooks && ipython -c "%run train.ipynb") # raw

  • For more sophisticated execution, use papermill.
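A minimal sketch of the papermill route, assuming papermill is installed (pip install papermill). The runs/ output folder and the run_name argument are hypothetical names chosen for illustration; they are not part of the repository.

```python
from pathlib import Path


def executed_copy_path(notebook, run_name, out_dir="runs"):
    """Where the executed copy of the notebook is written (hypothetical layout)."""
    nb = Path(notebook)
    return Path(out_dir) / f"{nb.stem}-{run_name}{nb.suffix}"


def run_notebook(notebook, run_name, parameters=None):
    """Execute a notebook with papermill and keep the executed copy."""
    # Third-party dependency, imported lazily so the path helper works without it.
    import papermill as pm

    nb = Path(notebook)
    out = executed_copy_path(nb, run_name)
    out.parent.mkdir(parents=True, exist_ok=True)
    pm.execute_notebook(
        str(nb),
        str(out),
        parameters=parameters or {},
        # Run from the notebook's folder so relative paths such as ../data
        # resolve the same way as with the %run command.
        cwd=str(nb.parent),
    )
    return out
```

Calling run_notebook("notebooks/train.ipynb", "baseline") from the repository root would leave an executed copy at runs/train-baseline.ipynb, which is handy for inspecting a run after the fact. Passing parameters only makes sense if train.ipynb has a cell tagged parameters, which this tutorial does not confirm.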

Let’s push it into our repository and merge it to main.

git add . && git commit -m "baseline training" && git push -u origin baseline
git checkout main && git merge baseline && git push

Congratulations! You completed your first project!

Next step

Data

Should we save our data in the same repo? That is absolutely possible, but as your project scales and more data is ingested from other services, you might want different permissions for adding and removing data, and to manage it separately from your machine learning code.

  • In our example, we save our machine learning datasets, but the data repo could also hold A/B test data, database backup dumps, and any other type of data.

Create a repository here and call it “kickstart_data”.

And add the data straight to it:

import os
import pyxet

user = os.environ["XET_USER_NAME"]  # Python does not expand ${...} inside strings
repo = f"xet://{user}/kickstart_data/main"

fs = pyxet.XetFS()
# a transaction is needed for write
with fs.transaction(f"{repo}/"):
    fs.cp("data/titanic.csv", f"{repo}/titanic.csv")

We can delete our local data file: rm -rf data.

Instead, we add the kickstart_data repository as a Git submodule in its place.

git submodule add --force "https://xethub.com/${XET_USER_NAME}/kickstart_data" data
  • If you don’t see the titanic.csv file inside the folder, try (cd data && git pull && git xet checkout --) which will materialise the file from a pointer.

    • This is very important in case you’re using big data.
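A quick way to confirm that the file was materialised is to check that it parses as a CSV with the columns we expect. This is a minimal sketch that assumes nothing about the pointer format; it only assumes that a pointer stub will not have a CSV header containing the Titanic columns. The helper name and the column set are illustrative.

```python
import csv
from pathlib import Path

# Columns used by the baseline query later in this tutorial.
REQUIRED_COLUMNS = frozenset({"Pclass", "SibSp", "Parch"})


def looks_materialized(path, required=REQUIRED_COLUMNS):
    """Return True if `path` exists and its CSV header contains `required`."""
    p = Path(path)
    if not p.is_file():
        return False
    # errors="replace" keeps the check from crashing on unexpected bytes.
    with p.open(newline="", encoding="utf-8", errors="replace") as fh:
        header = next(csv.reader(fh), [])
    return set(required).issubset(header)
```

For example, looks_materialized("data/titanic.csv") returning False would be the cue to run the git xet checkout command above.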

Let’s adjust our Jupyter Notebook to load the data from “local” and not save it.

...
# df = pd.read_csv("xet://xdssio/titanic/main/titanic.csv") <-- delete
df = pd.read_csv("../data/titanic.csv")
...
# df.to_csv('../data/data.csv', index=False) <-- delete

We can push our changes, and now we manage our data and project with Git 💪!

(cd notebooks && ipython -c "%run train.ipynb") # retrain (for testing)
git add . && git commit -m "moving data to submodule" && git push 

Now other teammates can upload data, and we can pull it.

This is great for reproducibility and for retraining cycles, as we show later.

Deployment

Let’s build a FastAPI app to give us predictions.

  • This could also be done in a separate repo with a submodule, but for simplicity we’ll keep everything in the same repo.

First we create an app branch and install our new requirements.

git checkout -b app
pip install fastapi uvicorn pytest

To save time, we simply copy it from a ready app branch.

import pyxet

fs = pyxet.XetFS()
fs.cp("xdssio/kickstart_ml/app/server", "server")
  • Test with: pytest server/tests

  • Deploy with: uvicorn server.app:app --reload

  • Query:

    curl -X 'POST' \
      'http://127.0.0.1:8000/predict' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '[
      {
        "Pclass": 0,
        "SibSp": 0,
        "Parch": 0
      }
    ]'
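The same query can be sent from Python. This is a minimal sketch using only the standard library, assuming the server is running locally at the address from the curl example; the function names are illustrative, and the shape of the JSON response is not specified in this tutorial.

```python
import json
import urllib.request

PREDICT_URL = "http://127.0.0.1:8000/predict"  # same address as the curl example


def build_predict_request(rows, url=PREDICT_URL):
    """Build the POST request that mirrors the curl call."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, url=PREDICT_URL, timeout=10):
    """Send the rows to the running server and decode its JSON reply."""
    request = build_predict_request(rows, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, predict([{"Pclass": 0, "SibSp": 0, "Parch": 0}]) sends the same payload as the curl command.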
    

Let’s have it as part of our project:

git add server && git commit -m "add fastapi app" && git push -u origin app

As a best practice, let’s have our production code in a prod branch.

git checkout -b prod && git push -u origin prod

Experiments

Our model is pretty simple - let’s up our game a bit.

git checkout -b experiment1

Let’s add XGBoost and some feature engineering:

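from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# numeric features: fill missing values with the mean, then standardise
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# categorical features: one-hot encode
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
# route each group of columns to its own pipeline
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['Age', 'Fare']),
    ('cat', cat_pipeline, ['Sex', 'Embarked'])
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier())
])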

To get the full code… we’ll copy it from the existing version:

import pyxet

fs = pyxet.XetFS()

fs.cp("xdssio/titanic/experiment1/notebooks/train.ipynb", "notebooks/train.ipynb")

If we run our tests (pytest server/tests), we’ll see that they fail! We need to fix our app.

We change the server/app Query object and the example in our tests:

# server/app.py
from pydantic import BaseModel

class Query(BaseModel):
    Age: float
    Fare: float
    Sex: str
    Embarked: str

# server/tests/app_test.py
import pytest

@pytest.fixture
def example():
    return [{"Sex": "male", "Age": 22.0, "Fare": 7.25, "Embarked": "S"}]
  • XetHub doesn’t currently support Git workflows, but in the future these tests could run automatically before a merge as part of a standard CI/CD pipeline.

git add . \
  && git commit -m "experiment with Sex, Age, Fare, Embarked and XGBoost" \
  && git push -u origin experiment1

Merge new model

git checkout prod

First we compare the results to check that the new model is better. We’ll compare the accuracy on the weighted avg row.

  • This can be done from any branch.

import pyxet
import os
import pandas as pd

username = os.getenv('XET_USER_NAME')
results = []
for branch in ["prod", "experiment1"]:
    results.append(pd.read_csv(pyxet.open(f"xet://{username}/kickstart_ml/{branch}/metrics/results.csv")))

df = pd.concat(results)
df = df[df['target']=='weighted avg']
df[['precision','recall','f1-score','accuracy','branch']]
   precision    recall  f1-score  accuracy       branch
3   0.729375  0.731844  0.729102  0.731844         prod
3   0.780827  0.782123  0.780847  0.782123  experiment1

Looks good!
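Eyeballing the table works for two branches, but the decision can also be written down as a rule. This is a minimal sketch, assuming each row is a dict with the branch, target, and metric columns shown above; the function names and the min_gain threshold of 0.01 are hypothetical choices, not part of the project.

```python
def pick_row(rows, branch, target="weighted avg"):
    """Return the metrics row for one branch and one report target."""
    for row in rows:
        if row["branch"] == branch and row["target"] == target:
            return row
    raise KeyError(f"no '{target}' row for branch '{branch}'")


def is_improvement(rows, candidate, baseline="prod", metric="f1-score", min_gain=0.01):
    """True if the candidate beats the baseline on `metric` by at least `min_gain`."""
    gain = float(pick_row(rows, candidate)[metric]) - float(pick_row(rows, baseline)[metric])
    return gain >= min_gain
```

With the DataFrame from the comparison above, is_improvement(df.to_dict("records"), "experiment1") would apply the rule to the two branches. Requiring a minimum gain is one way to avoid merging on differences that are only noise.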

Let’s merge the new model to prod:

git merge experiment1 && git push

Congratulations, you are managing your ML project like a boss!

Retrain with more data

Let’s imagine you get more data from the backend, which is saved to our data repo.

We simulate it by just adding data there:

import os
import pyxet

user = os.environ["XET_USER_NAME"]  # Python does not expand ${...} inside strings
repo = f"xet://{user}/kickstart_data/main"

fs = pyxet.XetFS()
with fs.transaction(f"{repo}/"):
    fs.cp("data/titanic.csv", f"{repo}/titanic2.csv")

We can use any naming convention for our “training-cycle-jobs” branches.

git checkout -b retrain
(cd data && git pull && git xet checkout --) # this gets us the data locally

We update train.ipynb to read every CSV file in the data folder:

import glob

# df = pd.read_csv("../data/titanic.csv")  # replace this
df = pd.concat(map(pd.read_csv, glob.glob('../data/*.csv')))

Retrain: (cd notebooks && ipython -c "%run train.ipynb")
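One thing to watch in this simulation: titanic2.csv is a copy of titanic.csv, so concatenating every CSV doubles each row. Below is a dependency-free sketch for loading all files while dropping exact duplicate rows. It uses the standard csv module rather than pandas, assumes well-formed CSV files with a header row, and the function name is illustrative.

```python
import csv
import glob
import os


def load_unique_rows(folder):
    """Read every CSV in `folder` and keep each distinct row once."""
    seen = set()
    rows = []
    # sorted() makes the file order, and so the row order, deterministic.
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                key = tuple(sorted(row.items()))  # column-order independent
                if key not in seen:
                    seen.add(key)
                    rows.append(row)
    return rows
```

In the notebook itself, chaining .drop_duplicates() after pd.concat(...) would have the same effect. Whether to drop duplicates depends on the data; here it only removes the copied rows.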

We can compare the results and merge to production like before.

Some options:

  • You can revert all your models and deployments.

  • If you give every training run its own branch name, for example retrain/v1, you can always compare runs and build dashboards and alerts.

  • You can automate this “checkout-get-data-train-merge” cycle with those few lines of code.
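As a rough illustration of that automation, here is a minimal sketch that strings together the commands used in this section. The function names, the origin remote, and the dry_run switch are assumptions made for the example. The merge into prod is deliberately left out so that it can follow a comparison of the results.

```python
import shlex
import subprocess


def retrain_steps(branch, message):
    """Shell steps for one retraining cycle, in the order used in this section."""
    b, m = shlex.quote(branch), shlex.quote(message)
    return [
        f"git checkout -b {b}",
        "(cd data && git pull && git xet checkout --)",     # get the latest data
        '(cd notebooks && ipython -c "%run train.ipynb")',  # retrain
        f"git add . && git commit -m {m} && git push -u origin {b}",
    ]


def run_cycle(branch, message, dry_run=True):
    """Print the steps (dry run) or execute them, stopping at the first failure."""
    steps = retrain_steps(branch, message)
    for step in steps:
        if dry_run:
            print(step)
        else:
            subprocess.run(step, shell=True, check=True)
    return steps
```

Running run_cycle("retrain-v2", "retrain with new data") prints the four commands without touching the repository; passing dry_run=False executes them from the repository root.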

For more about XetHub, use cases and examples, check out these: