Quickstart
XetHub is a cloud storage platform with Git capabilities. It is a great place to store your data, models, logs, and code with versioning. The pyxet library allows you to easily access XetHub files directly from Python.
Installation
Set up your virtualenv with:
$ python -m venv .venv
$ . .venv/bin/activate
Then, install pyxet with:
$ pip install pyxet
Authentication
Sign up on XetHub and obtain a username and personal access token. Store the token somewhere safe; you will need it to authenticate.
There are three ways to authenticate with XetHub:
Command Line
xet login -e <email> -u <username> -p <personal_access_token>
xet login writes the authentication information to ~/.xetconfig.
Environment Variable
Environment variables are sometimes more convenient (note that the shell does not allow spaces around =):
export XET_USER_EMAIL=<email>
export XET_USER_NAME=<username>
export XET_USER_TOKEN=<personal_access_token>
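Before starting a Python session, it can help to confirm that all three variables are visible to the process. The following is a minimal sketch using only the standard library; the helper name missing_xet_credentials is hypothetical and is not part of pyxet.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("XET_USER_EMAIL", "XET_USER_NAME", "XET_USER_TOKEN")


def missing_xet_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; pyxet does not provide it.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_xet_credentials()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All XetHub credential variables are set.")
```

An empty list means every variable is set; any names returned are the ones still to export.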
In Python
Finally, in a notebook or other non-persistent environment, you can authenticate directly from Python. Note that this must be the first thing you run, before any other pyxet operation:
import pyxet
pyxet.login(<username>, <personal_access_token>, <email>)
Demo
To verify that pyxet is working, let’s load a CSV file directly into a Pandas dataframe, leveraging pyxet’s support for Python fsspec.
import pyxet # make xet:// protocol available
import pandas as pd # assumes pip install pandas has been run
df = pd.read_csv('xet://XetHub/titanic/main/titanic.csv')
df
This should return something like:
Out[3]:
     PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0              1         0       3  ...   7.2500    NaN         S
1              2         1       1  ...  71.2833    C85         C
2              3         1       3  ...   7.9250    NaN         S
3              4         1       1  ...  53.1000   C123         S
4              5         0       3  ...   8.0500    NaN         S
..           ...       ...     ...  ...      ...    ...       ...
886          887         0       2  ...  13.0000    NaN         S
887          888         1       1  ...  30.0000    B42         S
888          889         0       3  ...  23.4500    NaN         S
889          890         1       1  ...  30.0000   C148         C
890          891         0       3  ...   7.7500    NaN         Q
[891 rows x 12 columns]
Working with a Blob Store
pyxet provides a Python SDK (with a CLI on the way!) for interacting with XetHub repositories as blob stores, while leveraging the power of Git branches and versioning.
A XetHub URL for pyxet is in the form:
xet://<repo_owner>/<repo_name>/<branch>/<path_to_file>
Unlike traditional blob stores, XetHub lets you specify a branch in the URL, so you can choose whether to use the most recent version of a file or directory or to reference a particular branch or commit.
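To make the URL layout concrete, here is a rough sketch that splits a xet:// URL into its four components and rebuilds it against a different branch. The names XetPath, parse_xet_url, and with_branch are hypothetical helpers, not part of pyxet, and the sketch assumes the branch name contains no "/" characters.

```python
from typing import NamedTuple


class XetPath(NamedTuple):
    """Components of a xet:// URL (hypothetical type, not a pyxet class)."""
    owner: str
    repo: str
    branch: str
    path: str  # empty when the URL points at the branch root


def parse_xet_url(url: str) -> XetPath:
    """Split xet://<repo_owner>/<repo_name>/<branch>/<path_to_file>."""
    prefix = "xet://"
    if not url.startswith(prefix):
        raise ValueError(f"not a xet:// URL: {url!r}")
    # Split at most three times so the file path keeps its own slashes.
    parts = url[len(prefix):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected owner/repo/branch in {url!r}")
    owner, repo, branch = parts[:3]
    path = parts[3] if len(parts) == 4 else ""
    return XetPath(owner, repo, branch, path)


def with_branch(url: str, branch: str) -> str:
    """Return the same file location on another branch."""
    p = parse_xet_url(url)
    return f"xet://{p.owner}/{p.repo}/{branch}/{p.path}"


if __name__ == "__main__":
    url = "xet://XetHub/titanic/main/titanic.csv"
    print(parse_xet_url(url))
    # "experiment" is a made-up branch name used only for illustration.
    print(with_branch(url, "experiment"))
```

Swapping only the branch component is what lets the same code read either the latest data or a pinned version.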
To work with a repository as a file system, use pyxet.XetFS, which implements fsspec.
Here are some simple ways to access information from an existing repository:
import pyxet
fs = pyxet.XetFS() # fsspec filesystem
fs.info("XetHub/titanic/main/titanic.csv")
# returns repo level info: {'name': 'XetHub/main/titanic.csv', 'size': 61194, 'type': 'file'}
fs.open("XetHub/titanic/main/titanic.csv", 'r').read(20)
# returns first 20 characters: 'PassengerId,Survived'
fs.get("XetHub/titanic/main/data/", "data", recursive=True)
# download remote directory recursively into a local data folder
fs.ls("XetHub/titanic/main/data/", detail=False)
# returns ['data/titanic_0.parquet', 'data/titanic_1.parquet']
pyxet also allows you to write to repositories with Git versioning.