Quickstart

XetHub is a cloud storage platform with Git capabilities. It is a great place to store your data, models, logs, and code with versioning. The pyxet library allows you to easily access XetHub files directly from Python.

Installation

Set up your virtualenv with:

$ python -m venv .venv
$ . .venv/bin/activate

Then, install pyxet with:

$ pip install pyxet

Authentication

Sign up on XetHub and obtain a username and personal access token. Record both, since you will need them to authenticate.

There are three ways to authenticate with XetHub:

Command Line

xet login -e <email> -u <username> -p <personal_access_token>

xet login writes the authentication information to ~/.xetconfig.

Environment Variable

Environment variables are sometimes more convenient:

export XET_USER_EMAIL=<email>
export XET_USER_NAME=<username>
export XET_USER_TOKEN=<personal_access_token>
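
Before starting a long-running job, it can help to confirm that the three variables above are actually visible to the Python process. The following is a minimal sketch of such a check; the helper name missing_xet_vars is hypothetical and not part of pyxet, and the check only looks at whether the variables are set, not whether the credentials are valid.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("XET_USER_EMAIL", "XET_USER_NAME", "XET_USER_TOKEN")


def missing_xet_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; pyxet does not provide it.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_xet_vars()
    if missing:
        print("Missing XetHub credentials:", ", ".join(missing))
    else:
        print("All XetHub credential variables are set.")
```

Passing a plain dict instead of relying on os.environ makes the helper easy to exercise in tests without touching the real environment.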

In Python

Finally, in a notebook or other non-persistent environment, you can authenticate directly from Python. Note that this must be the first thing you run, before any other operation:

import pyxet
pyxet.login(<username>, <personal_access_token>, <email>)

Demo

To verify that pyxet is working, let’s load a CSV file directly into a Pandas dataframe, leveraging pyxet’s support for Python fsspec.

import pyxet            # make xet:// protocol available
import pandas as pd     # assumes pip install pandas has been run

df = pd.read_csv('xet://XetHub/titanic/main/titanic.csv')
df

This should return something like:

Out[3]:
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]

Working with a Blob Store

pyxet provides a Python SDK (with a CLI on the way!) for interacting with XetHub repositories as blob stores, while leveraging the power of Git branches and versioning.

A XetHub URL for pyxet is in the form:

xet://<repo_owner>/<repo_name>/<branch>/<path_to_file>

Unlike traditional blob stores, the <branch> component of the URL lets you choose between the most recent version of a file or directory and the version on a particular branch or commit.
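
To make the URL layout concrete, here is a small sketch that splits a xet:// URL into its four components and rewrites it to point at another branch. The names XetPath, parse_xet_url, and with_branch are hypothetical helpers written for this illustration, not part of pyxet, and the sketch assumes the branch name contains no "/" characters.

```python
from typing import NamedTuple


class XetPath(NamedTuple):
    owner: str
    repo: str
    branch: str
    path: str  # path inside the repository; empty for the repository root


def parse_xet_url(url: str) -> XetPath:
    """Split xet://<repo_owner>/<repo_name>/<branch>/<path_to_file> into parts."""
    prefix = "xet://"
    if not url.startswith(prefix):
        raise ValueError(f"not a xet:// URL: {url!r}")
    # Split at most three times so the file path keeps its own slashes.
    parts = url[len(prefix):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected owner/repo/branch in {url!r}")
    path = parts[3] if len(parts) == 4 else ""
    return XetPath(parts[0], parts[1], parts[2], path)


def with_branch(url: str, branch: str) -> str:
    """Return the same URL pointed at a different branch."""
    p = parse_xet_url(url)
    return f"xet://{p.owner}/{p.repo}/{branch}/{p.path}"


if __name__ == "__main__":
    url = "xet://XetHub/titanic/main/titanic.csv"
    print(parse_xet_url(url))
    # "experiment" is a made-up branch name used only for illustration.
    print(with_branch(url, "experiment"))
```

Swapping only the branch component is what lets the same code read either the latest data or a pinned version without changing the file path.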

To work with a repository as a file system, use pyxet.XetFS, which implements fsspec. Here are some simple ways to access information from an existing repository:

import pyxet

fs = pyxet.XetFS()  # fsspec filesystem

fs.info("XetHub/titanic/main/titanic.csv")  
# returns repo level info: {'name': 'XetHub/main/titanic.csv', 'size': 61194, 'type': 'file'}

fs.open("XetHub/titanic/main/titanic.csv", 'r').read(20)
# returns first 20 characters: 'PassengerId,Survived'

fs.get("XetHub/titanic/main/data/", "data", recursive=True)  
# download remote directory recursively into a local data folder

fs.ls("XetHub/titanic/main/data/", detail=False)  
# returns ['data/titanic_0.parquet', 'data/titanic_1.parquet']

pyxet also allows you to write to repositories with Git versioning.