# Machine learning kickstart project
We will build a comprehensive machine learning project that scales in terms of compute, team, and system design
for almost any smart solution.
## Setup
Let’s start by [forking this repo](https://xethub.com/xdssio/kickstart_ml) and setting up a virtual environment:
```bash
$ git xet clone "https://xethub.com/${XET_USER_NAME}/kickstart_ml.git"
$ cd kickstart_ml
$ python -m venv .venv && . .venv/bin/activate
(.venv) $ pip install -r requirements.txt
(.venv) $ git checkout base # this will be our starting point
```
## Train
Before we start, we create a *baseline* branch: `git checkout -b baseline`
Run the *train.ipynb* Jupyter Notebook. This will:
1. Download the [Titanic dataset](https://www.kaggle.com/c/titanic).
2. Build a model.
3. Run evaluation.
4. Save the model, the data and the metrics to files.
- We can run the entire notebook as follows:
`(cd notebooks && ipython -c "%run train.ipynb")`
- For more sophisticated execution, use [papermill](https://papermill.readthedocs.io/en/latest/index.html).
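
The papermill route could look like the sketch below. It only builds and runs the `papermill` command line with its `-p name value` parameter flag. The helper names, the output file name `train.out.ipynb`, and the `test_size` parameter are hypothetical, and the sketch assumes `papermill` is installed and that *train.ipynb* has a cell tagged `parameters`.

```python
import shlex
import subprocess


def papermill_command(notebook, output, parameters=None):
    """Build the papermill CLI call as an argument list."""
    cmd = ["papermill", notebook, output]
    for name, value in (parameters or {}).items():
        # -p injects one parameter into the notebook's "parameters" cell
        cmd += ["-p", name, str(value)]
    return cmd


def run_notebook(notebook, output, parameters=None, cwd="notebooks"):
    """Execute the notebook; raises CalledProcessError if the run fails."""
    subprocess.run(papermill_command(notebook, output, parameters), cwd=cwd, check=True)


if __name__ == "__main__":
    # Show the command without running it (test_size is a made-up parameter)
    print(shlex.join(papermill_command("train.ipynb", "train.out.ipynb", {"test_size": 0.2})))
```

Keeping the executed copy in a separate output notebook leaves *train.ipynb* itself unchanged between runs.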
Let’s push it to our repository and merge it into *main*.
```bash
git add . && git commit -m "baseline training" && git push -u origin baseline
git checkout main && git merge baseline && git push
```
**Congratulations!** You completed your first project!
# Next step
## Data
Should we save our data in the same repo? That is absolutely possible, but as your project scales and more data is ingested
from other services, you might want different permissions for adding and removing data, and you might want to manage it
differently from your machine learning code.
- In our example we store only machine learning datasets, but the same repository could also hold A/B test data, database backup dumps,
and any other kind of data.
Create a repository [here](https://xethub.com/xet/create) and call it “kickstart_data”.
And add the data straight to it:
```python
import os
import pyxet

fs = pyxet.XetFS()
# Python does not expand ${XET_USER_NAME}, so read it from the environment
username = os.getenv("XET_USER_NAME")
# a transaction is needed for write
with fs.transaction(f"xet://{username}/kickstart_data/main/"):
    fs.cp("data/titanic.csv", f"xet://{username}/kickstart_data/main/titanic.csv")
```
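
Shell-style `${XET_USER_NAME}` placeholders are expanded by the shell, not by Python, so inside Python the user name has to be read from the environment explicitly. A minimal sketch of one way to do that, using a hypothetical helper `xet_url` (not part of pyxet) that assembles the `xet://user/repo/branch/path` addresses used throughout this guide:

```python
import os


def xet_url(repo, branch, path="", user=None):
    """Build xet://<user>/<repo>/<branch>/<path>, reading the user from the environment."""
    user = user or os.environ.get("XET_USER_NAME")
    if not user:
        raise ValueError("set XET_USER_NAME or pass user=")
    return f"xet://{user}/{repo}/{branch}/{path.lstrip('/')}"


if __name__ == "__main__":
    # "alice" is a placeholder user name
    print(xet_url("kickstart_data", "main", "titanic.csv", user="alice"))
```

Failing early on a missing user name avoids writing to an address such as `xet://None/...` by accident.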
We can now delete our local data directory: `rm -rf data`.
We use a [git submodule](https://git-scm.com/docs/git-submodule) to bring in the *kickstart_data* repository instead.
```bash
git submodule add --force "https://xethub.com/${XET_USER_NAME}/kickstart_data" data
```
- If you don’t see the *titanic.csv* file inside the folder, try
`(cd data && git pull && git xet checkout --)`
which will materialise the file from a pointer.
- This matters most with large datasets, where files may stay as pointers until they are needed.
Let’s adjust our Jupyter Notebook to load the data from “local” and not save it.
```python
...
# df = pd.read_csv("xet://xdssio/titanic/main/titanic.csv") <-- delete
df = pd.read_csv("../data/titanic.csv")
...
# df.to_csv('../data/data.csv', index=False) <-- delete
```
We can push our changes, and now we manage our data and project with Git 💪!
```bash
(cd notebooks && ipython -c "%run train.ipynb") # retrain (for testing)
git add . && git commit -m "moving data to submodule" && git push
```
Now other teammates can upload data, and we can pull it.
This is great for reproducibility and for re-training cycles, as we show later.
## Deployment
Let’s build a [FastAPI](https://fastapi.tiangolo.com/lo/) app to give us predictions.
- This could also live in a different repo and be included as a **submodule**, but for the sake of simplicity we’ll keep it
all in the same repo.
First we create an app branch and install our new requirements.
```bash
git checkout -b app
pip install fastapi uvicorn pytest
```
To save time, we simply copy the server code from a ready-made *app* branch.
```python
import pyxet
fs = pyxet.XetFS()
fs.cp("xdssio/kickstart_ml/app/server", "server")
```
- Test with: `pytest server/tests`
- Deploy with: `uvicorn server.app:app --reload`
- Query:
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/predict' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '[
{
"Pclass": 0,
"SibSp": 0,
"Parch": 0
}
]'
```
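
The same query can be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call above. The helper names are hypothetical, and the shape of the response depends on the server code, which is not shown here, so the sketch just returns whatever JSON comes back.

```python
import json
import urllib.request


def build_predict_request(rows, base_url="http://127.0.0.1:8000"):
    """Create the POST /predict request without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(rows).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, base_url="http://127.0.0.1:8000", timeout=10):
    """Send the request to a running server and decode the JSON response."""
    request = build_predict_request(rows, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict([{"Pclass": 0, "SibSp": 0, "Parch": 0}])` sends the same payload as the `curl` example.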
Let’s have it as part of our project:
```bash
git add server && git commit -m "add fastapi app" && git push -u origin app
```
As a best practice, let’s have our *production* code in a *prod* branch.
```bash
git checkout -b prod && git push -u origin prod
```
## Experiments
Our model is pretty simple - let’s up our game a bit.
`git checkout -b experiment1`
Let’s add [xgboost](https://xgboost.readthedocs.io/en/stable/) and some feature engineering:
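```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# numeric features: fill missing values with the mean, then standardise
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# categorical features: one-hot encode
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder())
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['Age', 'Fare']),
    ('cat', cat_pipeline, ['Sex', 'Embarked'])
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier())
])
```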
To get the full code… we’ll copy it from the existing version:
```python
import pyxet
fs = pyxet.XetFS()
fs.cp("xdssio/titanic/experiment1/notebooks/train.ipynb", "notebooks/train.ipynb")
```
If we run our tests (`pytest server/tests`), we’ll see that they **fail**! We need to fix our app…
We change the *Query* object in server/app.py and the example in our tests:
```python
# server/app.py
from pydantic import BaseModel


class Query(BaseModel):
    Age: float
    Fare: float
    Sex: str
    Embarked: str


# server/tests/app_test.py
import pytest


@pytest.fixture
def example():
    return [{"Sex": "male", "Age": 22.0, "Fare": 7.25, "Embarked": "S"}]
```
- XetHub doesn’t currently support Git workflows, but in the future these tests could run automatically before a merge
as part of a standard CI/CD pipeline.
```bash
git add . \
&& git commit -m "experiment with Sex, Age, Fare, Embarked and XGBoost" \
&& git push -u origin experiment1
```
# Merge new model
`git checkout prod`
First we compare the results to check that the new model is better. We’ll compare the *accuracy* in the *weighted avg* row.
- This can be done from any branch.
```python
import os

import pandas as pd
import pyxet

username = os.getenv('XET_USER_NAME')
results = []
for branch in ["prod", "experiment1"]:
    # read each branch's metrics file straight from the remote repository
    results.append(pd.read_csv(pyxet.open(f"xet://{username}/kickstart_ml/{branch}/metrics/results.csv")))
df = pd.concat(results)
df = df[df['target'] == 'weighted avg']
df[['precision', 'recall', 'f1-score', 'accuracy', 'branch']]
```
```text
   precision    recall  f1-score  accuracy       branch
3   0.729375  0.731844  0.729102  0.731844         prod
3   0.780827  0.782123  0.780847  0.782123  experiment1
```
Looks good!
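
If this check should run without a human reading the table, the comparison can be reduced to a yes/no answer. The sketch below is one possible approach using only the standard library. It assumes the results CSV has the `target` and `accuracy` columns used above; the function names, the `min_gain` threshold, and the exact CSV layout are assumptions.

```python
import csv
import io


def weighted_accuracy(csv_text):
    """Return the accuracy of the 'weighted avg' row of a results CSV."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["target"] == "weighted avg":
            return float(row["accuracy"])
    raise ValueError("no 'weighted avg' row found")


def should_promote(prod_csv, candidate_csv, min_gain=0.0):
    """True when the candidate beats production by more than min_gain."""
    return weighted_accuracy(candidate_csv) - weighted_accuracy(prod_csv) > min_gain


# Sample inputs built from the numbers in the table above
HEADER = "target,precision,recall,f1-score,accuracy,branch\n"
PROD = HEADER + "weighted avg,0.729375,0.731844,0.729102,0.731844,prod\n"
EXPERIMENT = HEADER + "weighted avg,0.780827,0.782123,0.780847,0.782123,experiment1\n"

if __name__ == "__main__":
    print(should_promote(PROD, EXPERIMENT))
```

A non-zero `min_gain` guards against promoting a model whose improvement is within noise.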
Let’s merge the new model to prod:
`git merge experiment1 && git push`
Congratulations, you are managing your ML project like a boss!
# Retrain with more data
Let’s imagine you get more data from the backend, which is saved to our data repo.
We simulate this by simply adding a file there:
```python
import os
import pyxet

fs = pyxet.XetFS()
username = os.getenv("XET_USER_NAME")  # ${...} is shell syntax; Python needs the env lookup
with fs.transaction(f"xet://{username}/kickstart_data/main/"):
    fs.cp("data/titanic.csv", f"xet://{username}/kickstart_data/main/titanic2.csv")
```
We can have any naming convention for our “training-cycle-jobs” branches.
```bash
git checkout -b retrain
(cd data && git pull && git xet checkout --) # fetch the new data locally
```
We update our *train.ipynb*:
```python
import glob
# df = pd.read_csv("../data/titanic.csv") # replace this
df = pd.concat(map(pd.read_csv, glob.glob('../data/*.csv')))
```
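
`glob.glob` returns files in arbitrary order, so the row order of the combined frame can differ between machines. A small sketch of one way to make it reproducible, with a hypothetical helper `training_files` that returns the CSV paths sorted by name:

```python
from pathlib import Path


def training_files(data_dir="../data"):
    """Return every CSV in data_dir, sorted by path for a reproducible order."""
    return sorted(str(path) for path in Path(data_dir).glob("*.csv"))
```

In the notebook, `pd.concat(map(pd.read_csv, training_files()), ignore_index=True)` could then replace the `glob.glob` call. Because the simulated *titanic2.csv* is a copy of *titanic.csv*, the combined frame contains every row twice; `df.drop_duplicates()` removes the copies if that is not wanted.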
Retrain: `(cd notebooks && ipython -c "%run train.ipynb")`
We can compare the results and merge to production like before.
Some options:
- You can revert all your models and deployments.
- If you give every training run its own branch name, for example `retrain/v1`, you can always compare
them and create dashboards and alerts.
- You can automate this “checkout-get-data-train-merge” cycle with a few lines of code.
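
As a rough sketch of such automation, the commands used in this section can be collected into a plan and run in order. The function names and the default commit message are hypothetical, and the merge into *prod* is deliberately left out so that the metrics comparison stays a separate step.

```python
import shlex
import subprocess


def retrain_plan(branch, message="retrain with new data"):
    """List the shell commands of one checkout, get-data, train, push cycle."""
    safe_branch = shlex.quote(branch)
    return [
        f"git checkout -b {safe_branch}",
        "(cd data && git pull && git xet checkout --)",
        '(cd notebooks && ipython -c "%run train.ipynb")',
        f"git add . && git commit -m {shlex.quote(message)} && git push -u origin {safe_branch}",
    ]


def run_plan(commands, dry_run=True):
    """Print each command; also execute it unless dry_run is set."""
    for command in commands:
        print(command)
        if not dry_run:
            # shell=True because the commands rely on subshells and &&
            subprocess.run(command, shell=True, check=True)
```

Calling `run_plan(retrain_plan("retrain/v1"))` only prints the commands; pass `dry_run=False` to execute them from the repository root.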
For more about XetHub, use cases and examples, check out these: