Quickstart

XetHub is a cloud storage platform with Git capabilities. It is a great place to store your data, models, logs, and code with versioning. The pyxet library allows you to easily access XetHub files directly from Python.

Installation

Set up your virtualenv with:

$ python -m venv .venv
$ . .venv/bin/activate

Then, install pyxet with:

$ pip install pyxet

Authentication

Sign up on XetHub and obtain a username and personal access token. Write the token down, since you will need it to authenticate.

There are three ways to authenticate with XetHub:

Command Line

xet login -e <email> -u <username> -p <personal_access_token>

xet login writes the authentication information to ~/.xetconfig.

Environment Variable

Environment variables are sometimes more convenient:

export XET_USER_EMAIL=<email>
export XET_USER_NAME=<username>
export XET_USER_TOKEN=<personal_access_token>
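
Before starting a Python session, it can help to confirm that all three variables are actually visible to the process. The following is a minimal sketch using only the standard library; the helper name missing_xet_vars is hypothetical and is not part of pyxet.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("XET_USER_EMAIL", "XET_USER_NAME", "XET_USER_TOKEN")


def missing_xet_vars(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to os.environ; pass a dict to check other mappings.
    This helper is illustrative only and is not part of pyxet.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_xet_vars()
    if missing:
        print("Missing XetHub variables:", ", ".join(missing))
    else:
        print("All XetHub variables are set.")
```

If any names are reported as missing, export them in the same shell that launches Python.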

In Python

Finally, in a notebook or other non-persistent environment, you can authenticate directly from Python. Note that this must be the first thing you run, before any other pyxet operation:

import pyxet
pyxet.login(<username>, <personal_access_token>, <email>)

Demo

To verify that pyxet is working, let's load a CSV file directly into a pandas DataFrame, using pyxet's support for fsspec.

import pyxet            # make xet:// protocol available
import pandas as pd     # assumes pip install pandas has been run

df = pd.read_csv('xet://XetHub/titanic/main/titanic.csv')
df

This should return something like:

Out[3]:
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]

Working with a Blob Store

pyxet provides a Python SDK (with a CLI on the way!) for interacting with XetHub repositories as blob stores, while leveraging the power of Git branches and versioning.

A XetHub URL for pyxet is in the form:

xet://<repo_owner>/<repo_name>/<branch>/<path_to_file>
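
To make the URL layout concrete, here is a small sketch that splits a xet:// URL into its four components. The XetPath type and parse_xet_url function are hypothetical helpers written for illustration and are not part of pyxet. The sketch also assumes that the branch name contains no slashes, which the URL form above does not guarantee.

```python
from typing import NamedTuple


class XetPath(NamedTuple):
    """Components of a xet:// URL (illustrative type, not part of pyxet)."""
    owner: str
    repo: str
    branch: str
    path: str


def parse_xet_url(url: str) -> XetPath:
    """Split xet://<repo_owner>/<repo_name>/<branch>/<path_to_file>.

    Assumes the branch name contains no '/' characters.
    """
    prefix = "xet://"
    if not url.startswith(prefix):
        raise ValueError(f"not a xet:// URL: {url!r}")
    # Split at most three times so the file path keeps its own slashes.
    parts = url[len(prefix):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected owner, repo, and branch in {url!r}")
    owner, repo, branch = parts[:3]
    path = parts[3] if len(parts) == 4 else ""
    return XetPath(owner, repo, branch, path)


print(parse_xet_url("xet://XetHub/titanic/main/titanic.csv"))
```

For the Titanic example used earlier, the owner is XetHub, the repository is titanic, the branch is main, and the path is titanic.csv.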

Unlike traditional blob stores, the branch component of the URL lets you choose whether to use the most recent version of a file or directory or to reference a particular branch or commit.

To work with a repository as a file system, use pyxet.XetFS, which implements fsspec. Here are some simple ways to access information from an existing repository:

import pyxet

fs = pyxet.XetFS()  # fsspec filesystem

fs.info("XetHub/titanic/main/titanic.csv")  
# returns repo level info: {'name': 'XetHub/main/titanic.csv', 'size': 61194, 'type': 'file'}

fs.open("XetHub/titanic/main/titanic.csv", 'r').read(20)
# returns first 20 characters: 'PassengerId,Survived'

fs.get("XetHub/titanic/main/data/", "data", recursive=True)  
# download remote directory recursively into a local data folder

fs.ls("XetHub/titanic/main/data/", detail=False)  
# returns ['data/titanic_0.parquet', 'data/titanic_1.parquet']
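
Because XetFS follows the fsspec interface, generic fsspec-style helpers should work with it. The sketch below sums the file sizes in a directory using ls with detail=True. It assumes that XetFS returns the usual fsspec entry dictionaries with 'size' and 'type' keys, as the fs.info output above suggests; directory_size is a hypothetical helper, not a pyxet function.

```python
def directory_size(fs, path):
    """Sum the sizes, in bytes, of the files directly under `path`.

    `fs` is any fsspec-style filesystem whose ls(path, detail=True)
    returns a list of dicts with 'size' and 'type' keys (assumed here
    for pyxet.XetFS). Subdirectories are not descended into.
    """
    entries = fs.ls(path, detail=True)
    # Treat a missing or None size as zero bytes.
    return sum(
        entry.get("size") or 0
        for entry in entries
        if entry.get("type") == "file"
    )


# Example usage (requires pyxet and network access):
#   import pyxet
#   fs = pyxet.XetFS()
#   directory_size(fs, "XetHub/titanic/main/data/")
```

Taking the filesystem as a parameter keeps the helper independent of pyxet, so the same function can be tried against a local fsspec filesystem first.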

pyxet also allows you to write to repositories with Git versioning.