This is a Python library that binds to Ballista, the Apache Arrow distributed query engine.
- Start Ballista schedulers and executors from Python
- Execute distributed SQL queries (with DataFusion backend)
- Use the DataFrame API to read files and execute distributed queries (with DataFusion backend)
- Support for CSV, Parquet, and Avro formats
- Python UDFs
- JSON
- Support reading JSON
- Support distributed Python UDFs and UDAFs
- Support distributed query execution against Python DataFrame libraries that DataFusion's Python bindings already support, such as Polars, Pandas, and cuDF (this will require new features in Ballista)
- Query a Parquet file using SQL
- Query a Parquet file using the DataFrame API
- Start a scheduler from within a Python process
- Start an executor from within a Python process
pip install ballista
# or
python -m pip install ballista
This assumes that you have Rust and Cargo installed. We use the workflow recommended by PyO3 and maturin.
Bootstrap:
# fetch this repo
git clone git@github.com:apache/arrow-ballista-python.git
# change to the repository directory
cd arrow-ballista-python
# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# if python -V gives python 3.7
python -m pip install -r requirements-37.txt
# if python -V gives python 3.8/3.9/3.10
python -m pip install -r requirements-310.txt
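The choice between the two pinned requirements files can also be made programmatically. The following is a minimal sketch, not part of this repository: the helper name `requirements_file` is hypothetical, and the mapping simply mirrors the comments above (3.7 uses `requirements-37.txt`, 3.8 through 3.10 use `requirements-310.txt`).

```python
import sys


def requirements_file(version_info=None):
    """Return the pinned requirements file name for a Python version.

    Hypothetical helper: the mapping follows the bootstrap comments,
    nothing more. Versions outside 3.7-3.10 are rejected because no
    pinned file is listed for them.
    """
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if (major, minor) == (3, 7):
        return "requirements-37.txt"
    if major == 3 and 8 <= minor <= 10:
        return "requirements-310.txt"
    raise ValueError(f"no pinned requirements file for Python {major}.{minor}")
```

Calling `requirements_file()` with no argument inspects the running interpreter, so it should be run inside the activated venv.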
Whenever Rust code changes (your changes or via git pull):
# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
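The rebuild-and-test cycle above can be wrapped in a small script so the two steps always run in order. This is a sketch under stated assumptions: `run_dev_cycle` and `DEV_CYCLE` are hypothetical names, and the script assumes the venv is already activated so that `maturin` is on the `PATH`.

```python
import subprocess
import sys

# The two steps from the text, in order: rebuild the extension, then test.
DEV_CYCLE = [
    ["maturin", "develop"],
    [sys.executable, "-m", "pytest"],
]


def run_dev_cycle(run=subprocess.run):
    """Run each step in order and stop at the first failure.

    `run` is injectable so the sequence can be checked without
    actually invoking maturin or pytest.
    """
    for cmd in DEV_CYCLE:
        # check=True makes subprocess.run raise CalledProcessError
        # on a non-zero exit, so pytest never runs on a failed build.
        run(cmd, check=True)
```

Using `sys.executable` for pytest keeps the test run tied to the same interpreter (and venv) that launched the script.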
To change test dependencies, edit requirements.in and run:
# install pip-tools (this only needs to be done once), also consider running in venv
python -m pip install pip-tools
# change requirements.in and then run
python -m piptools compile --generate-hashes -o requirements-37.txt
# or run this if you are on python 3.8/3.9/3.10
python -m piptools compile --generate-hashes -o requirements-310.txt
To update dependencies, run with -U:
python -m piptools compile -U --generate-hashes -o requirements-310.txt
More details here
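The pip-tools invocations above differ only in the output file and the optional `-U` flag, which a short helper can make explicit. This is an illustrative sketch: `piptools_compile_command` is a hypothetical name, and it uses only the flags that already appear in the commands above.

```python
import sys


def piptools_compile_command(output_file, upgrade=False):
    """Build the pip-tools compile command as an argument list.

    Hypothetical helper: it only assembles the command, it does not
    run it. `upgrade=True` adds -U to update pinned dependencies.
    """
    cmd = [sys.executable, "-m", "piptools", "compile"]
    if upgrade:
        cmd.append("-U")
    cmd += ["--generate-hashes", "-o", output_file]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains requirements.in.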