BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
BlazingSQL is a SQL interface for cuDF, with various features to support large scale data science workflows and enterprise datasets.
Query Data Stored Externally - a single line of code can register remote storage solutions, such as Amazon S3.
Simple SQL - incredibly easy to use: run a SQL query and the results are GPU DataFrames (GDFs).
Interoperable - GDFs are immediately accessible to any RAPIDS library for data science workloads.
Try our 5-min Welcome Notebook to start using BlazingSQL and RAPIDS AI.
Getting Started
Here are two copy-and-paste reproducible BlazingSQL snippets; keep scrolling to find example notebooks below.
Create and query a table from a cudf.DataFrame with progress bar:
import cudf
df = cudf.DataFrame()
df['key'] = ['a', 'b', 'c', 'd', 'e']
df['val'] = [7.6, 2.9, 7.1, 1.6, 2.2]
from blazingsql import BlazingContext
bc = BlazingContext(enable_progress_bar=True)
bc.create_table('game_1', df)
bc.sql('SELECT * FROM game_1 WHERE val > 4') # the query progress will be shown
This is the recommended way of building all of the BlazingSQL components and dependencies from source. It ensures that all the dependencies are available to the build process.
The build process will check out the BlazingSQL repository and build and install it into the conda environment.
cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh
NOTE: You can do ./build.sh -h to see more build options.
NOTE: You can perform static analysis with cppcheck with the command cppcheck --project=compile_commands.json in any of the cpp project build directories.
$CONDA_PREFIX now has a folder for the blazingsql repository.
Storage plugins
To build without the storage plugins (AWS S3, Google Cloud Storage), use the following arguments:
The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
Apache Arrow on GPU
The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow are supported.
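The columnar idea can be illustrated in plain Python (a toy sketch of the layout only; Arrow actually stores each column as a contiguous, typed buffer, which is what enables zero-copy interchange between GPU libraries):

```python
# Toy comparison of row-oriented vs column-oriented layouts.
rows = [
    {"key": "a", "val": 7.6},
    {"key": "b", "val": 2.9},
    {"key": "c", "val": 7.1},
]

# Columnar: one contiguous sequence per column, as in Arrow/cuDF.
columns = {
    "key": [r["key"] for r in rows],
    "val": [r["val"] for r in rows],
}

# A whole-column operation touches a single buffer instead of
# hopping across row objects.
total = sum(columns["val"])
```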
Describe the bug
I'm trying the new Hive querying feature (in v11) using pyhive. The pyhive connection and cursor are created correctly. When I call the create_table() operation, the command fails with an HDFS path-not-found error. The error message shows that the HDFS path is collected correctly, so the pyhive connection and the metadata collection in get_hive_table() may be working correctly.
Error snippet: "ParseSchemaError: [ParseSchema Error] Path 'hdfs://.../../../' does not exist."
Steps/Code to reproduce bug
from blazingsql import BlazingContext
from pyhive import hive
bc = BlazingContext()
cursor = hive.connect('your_hive_ip_address').cursor()
bc.create_table("hive_db_name.hive_table_name", cursor) # fails here
Expected behavior
bc.create_table() should succeed
Additional context
No HDFS registration step was performed while testing the pyhive-based connection.
Hello, I executed a simple demo. I can see the query result is correct, but at the end there is a weird message:
Demo:
from blazingsql import BlazingContext
bc = BlazingContext()
import time
nation = bc.create_table('nation', '/tmp/nation_0_0.parquet')
sql = "select n_nationkey, n_comment from nation"
result_gdf = bc.sql(sql).get()
print(result_gdf)
Output:
(blazingsql) root@8d16c30f8ae3:~# python hola.py
connection established
columns = n_nationkey n_comment
0 0 haggle. carefully final deposits detect slyly...
1 1 al foxes promise slyly according to the regula...
2 2 y alongside of the pending deposits. carefully...
3 3 eas hang ironic, silent packages. slyly regula...
4 4 y above the carefully unusual theodolites. fin...
5 5 ven packages wake quickly. regu
6 6 refully final requests. regular, ironi
7 7 l platelets. regular accounts x-ray: unusual, ...
8 8 ss excuses cajole slyly across the packages. d...
9 9 slyly express asymptotes. regular deposits ha...
10 10 efully alongside of the slyly final dependenci...
11 11 nic deposits boost atop the quickly final requ...
12 12 ously. final, express gifts cajole a
13 13 ic deposits are blithely about the carefully r...
14 14 pending excuses haggle furiously deposits. pe...
15 15 rns. blithely bold courts among the closely re...
16 16 s. ironic, unusual asymptotes wake blithely r
17 17 platelets. blithely pending dependencies use f...
18 18 c dependencies. furiously express notornis sle...
19 19 ular asymptotes are about the furious multipli...
20 20 ts. silent requests haggle. closely express pa...
21 21 hely enticingly express accounts. even, final
22 22 requests against the platelets use never acco...
23 23 eans boost carefully special requests. account...
24 24 y final packages. slow foxes cajole quickly. q...
resultToken = 14785780709824492479
interpreter_path = 127.0.0.1
interpreter_port = 8891
handle = [<numba.cuda.cudadrv.driver.IpcHandle object at 0x7ff86d2819b0>]
client = <pyblazing.api.PyConnector object at 0x7ff86d2daef0>
calciteTime = 301
ralTime = 310
totalTime = 851
error_message =
Exception ignored in: <function ResultSetHandle.__del__ at 0x7ff86d2e42f0>
Traceback (most recent call last):
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/api.py", line 310, in __del__
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/api.py", line 229, in free_result
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/api.py", line 47, in _send_request
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/blazingdb/protocol/__init__.py", line 71, in send
TypeError: can only concatenate str (not "tuple") to str
Exception ignored in: <function PyConnector.__del__ at 0x7ff86d2d6c80>
Traceback (most recent call last):
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/api.py", line 64, in __del__
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/api.py", line 108, in close_connection
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/api.py", line 46, in _send_request
File "/miniconda3/envs/blazingsql/lib/python3.7/site-packages/blazingdb/protocol/__init__.py", line 61, in __init__
TypeError: can only concatenate str (not "tuple") to str
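The TypeError itself is easy to reproduce in isolation: concatenating a stray tuple (often produced by an accidental trailing comma) to a string raises exactly this error (a generic sketch, not pyblazing's actual code):

```python
host = "127.0.0.1"

message = "connecting to " + host   # str + str: fine
bad_suffix = (host,)                # a tuple, e.g. from a trailing comma

try:
    message = "connecting to " + bad_suffix  # str + tuple: raises
except TypeError as exc:
    error_text = str(exc)
```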
Step 10/16 : RUN conda install -y -c conda-forge -c defaults -c nvidia -c rapidsai -c blazingsql/label/cuda10.0 blazingsql-calcite blazingsql-orchestrator blazingsql-ral blazingsql-python python=3.7 cudatoolkit=10.0
---> Running in e811e50d46c5
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- blazingsql-orchestrator
- blazingsql-calcite
Current channels:
- https://conda.anaconda.org/conda-forge/linux-64
- https://conda.anaconda.org/conda-forge/noarch
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
- https://conda.anaconda.org/nvidia/linux-64
- https://conda.anaconda.org/nvidia/noarch
- https://conda.anaconda.org/rapidsai/linux-64
- https://conda.anaconda.org/rapidsai/noarch
- https://conda.anaconda.org/blazingsql/label/cuda10.0/linux-64
- https://conda.anaconda.org/blazingsql/label/cuda10.0/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
The command '/bin/sh -c conda install -y -c conda-forge -c defaults -c nvidia -c rapidsai -c blazingsql/label/cuda10.0 blazingsql-calcite blazingsql-orchestrator blazingsql-ral blazingsql-python python=3.7 cudatoolkit=10.0' returned a non-zero code: 1
Describe the bug
Since 25.10.19 I have not been able to get BlazingSQL to work.
CODE:
python3.7
from blazingsql import BlazingContext
bc = BlazingContext()
... Output:
WARNING: blazingsql-orchestrator was not automativally started, its probably already running
WARNING: blazingsql-engine was not automativally started, its probably already running
^CTraceback (most recent call last):
File "cs_dataprep.py", line 17, in
bc = BlazingContext()
File "/home/REMOVED/miniconda3/envs/blazing2/lib/python3.7/site-packages/pyblazing/apiv2/context.py", line 173, in __init__
internal_api.SetupOrchestratorConnection(orchestrator_host_ip, orchestrator_port)
File "/home/REMOVED/miniconda3/envs/blazing2/lib/python3.7/site-packages/pyblazing/api.py", line 904, in SetupOrchestratorConnection
client.connect(orchestrator_host_ip, orchestrator_port)
File "/home/REMOVED/miniconda3/envs/blazing2/lib/python3.7/site-packages/pyblazing/api.py", line 353, in connect
self.orchestrator_port, requestBuffer)
File "/home/REMOVED/miniconda3/envs/blazing2/lib/python3.7/site-packages/pyblazing/api.py", line 311, in send_request
return client.send(requestBuffer, expect_response)
File "/home/REMOVED/miniconda3/envs/blazing2/lib/python3.7/site-packages/blazingdb/protocol/__init__.py", line 69, in send
length = struct.unpack('I', self.connection.socket.recv(4))[0]
KeyboardInterrupt
Exception ignored in: <function BlazingContext.__del__ at 0x7fccc269bbf8>
Traceback (most recent call last):
File "/home/REMOVED/miniconda3/envs/blazing2/lib/python3.7/site-packages/pyblazing/apiv2/context.py", line 212, in __del__
if self.need_shutdown:
AttributeError: 'BlazingContext' object has no attribute 'need_shutdown'
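For reference, the framing the hang points at (struct.unpack('I', sock.recv(4))) is a standard 4-byte length prefix: the receiver blocks until the peer sends the length, which is why a dead orchestrator leaves the call stuck. A generic sketch of that framing (not pyblazing's actual protocol code):

```python
import struct

def frame(payload: bytes) -> bytes:
    # Prefix the payload with its length as an unsigned 32-bit int.
    return struct.pack('I', len(payload)) + payload

def unframe(data: bytes) -> bytes:
    # Read the 4-byte length first, then exactly that many bytes.
    (length,) = struct.unpack('I', data[:4])
    return data[4:4 + length]
```

Round-tripping works as expected: unframe(frame(b"hello")) gives back b"hello".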
Environment overview (please complete the following information)
However, when I run bc = BlazingContext() I get the following:
AttributeError: 'BlazingContext' object has no attribute 'processes'
FileNotFoundError: [Errno 2] No such file or directory: 'blazingsql-orchestrator': 'blazingsql-orchestrator'
Perhaps there is something else I need to do before I call bc = BlazingContext() that I'm not aware of?
I'm not sure whether Dask has a way to specify the number of workers / maximum memory to be used for a job; if it does, we should inherit that and apply it at the query level. This means that for every query we should have the option to limit workers and memory.
Recommended Modes
Fixed Mode
In this mode, workers / memory / both are fixed per process.
Step Mode
In this mode, workers / memory / both can take a range of values (a lower and an upper bound) plus a step value. Using the step value, the allocation should increment toward the upper bound.
Auto Mode
It simply runs Step Mode with some default values, or with values derived from the size of the data to be processed and the available resources.
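The three modes could be modeled roughly as follows (a sketch with hypothetical names such as StepRange and QueryLimits, not an existing BlazingSQL API):

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class StepRange:
    """A lower/upper bound plus a step value; hypothetical name."""
    lower: int
    upper: int
    step: int

    def values(self) -> Iterator[int]:
        # Walk from the lower bound toward the upper bound in steps.
        return iter(range(self.lower, self.upper + 1, self.step))

@dataclass
class QueryLimits:
    """Per-query resource limits; all names here are hypothetical."""
    workers: StepRange
    memory_gb: StepRange

# Fixed mode: lower == upper, so there is exactly one value.
fixed = QueryLimits(StepRange(4, 4, 1), StepRange(16, 16, 1))

# Step mode: try 2, 4, 6, 8 workers as load grows.
stepped = QueryLimits(StepRange(2, 8, 2), StepRange(8, 32, 8))
```

Auto mode would then just construct a QueryLimits from defaults or from the observed data size.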
# tag column names
column_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
# 5m row table (given column names) from taxi_00.csv
bc.create_table('taxi', '/home/winston/bsql-demos/taxi_00.csv', names=column_names)
# find january instances with 20 in the fare amount
query = '''
select
*
from taxi
where key like '%-01-%'
and fare_amount like '%20%'
'''
# query the table
results = bc.sql(query).get()
# extract cudf dataframe
df = results.columns
# how's it look?
df.head()
Expected behavior
Return instances where key has -01- (January) in its value and fare_amount has 20 in its value.
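In plain Python terms, SQL LIKE '%x%' is a substring match, so the expected filter is equivalent to this sketch (note that applying LIKE to the numeric fare_amount column implies a cast to string; the sample rows are made up):

```python
rows = [
    {"key": "2014-01-05 10:00:00", "fare_amount": 20.5},
    {"key": "2014-02-11 09:30:00", "fare_amount": 20.0},
    {"key": "2014-01-20 18:45:00", "fare_amount": 7.5},
]

# key LIKE '%-01-%' AND fare_amount LIKE '%20%'
expected = [
    r for r in rows
    if "-01-" in r["key"] and "20" in str(r["fare_amount"])
]
```

Only the first sample row satisfies both conditions.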
Environment overview
Environment location: Cloud (Google)
Method of cuDF install: conda
Environment details
Please run and paste the output of the print_env.sh script here, to gather any other relevant environment details
Why do we need UDFs?
As the users and use cases grow, they may bring different types of functions required for their use cases. Providing a way to register their own logic/code as a UDF callable from SQL will help users accomplish their tasks.
How does it need to be implemented?
As BlazingSQL is built around Python, the developer should be able to write his/her own function in Python and register it with a decorator.
Additional context
Different types of UDF would be appreciated:
A simple UDF that works on a column
A UDF on a window / sliding window
The ability to run a UDF in clustered mode (not sure, just throwing my idea out)
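A decorator-based registration could look roughly like this (a hypothetical sketch; register_udf and UDF_REGISTRY are illustrative names, not part of BlazingSQL):

```python
# Hypothetical UDF registry; nothing here is existing BlazingSQL API.
UDF_REGISTRY = {}

def register_udf(name):
    def wrap(fn):
        UDF_REGISTRY[name] = fn
        return fn
    return wrap

@register_udf("fahrenheit")
def to_fahrenheit(celsius):
    # A simple column-wise UDF applied element by element.
    return celsius * 9 / 5 + 32

# The engine would then resolve SELECT fahrenheit(temp) ... through
# the registry:
result = [UDF_REGISTRY["fahrenheit"](v) for v in [0, 100]]
```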
Yesterday I was able to create a table based on a cudf, but today I'm having some errors. I have already recreated the whole instance and repeated all the steps, but I'm facing errors similar to the following: b'In function ddlCreateTableService: cannot create the table: Could not create table'
There is no more information. Any ideas how I can trace the error?
In addition, it seems that in version 0.4.2 there will be some changes in how BlazingContext launches processes; could it be related to this? When will this release be conda-installable?
If I execute BlazingContext() again and try to create the table, I can get two different kinds of errors:
1) Already connected to the Orchestrator
b'In function ddlCreateTableService: cannot create the table: Connection to server failed.'
WARNING: blazingsql-orchestrator was not automativally started, its probably already running
WARNING: blazingsql-engine was not automativally started, its probably already running
WARNING: blazingsql-algebra was not automativally started, its probably already running
Already connected to the Orchestrator
Unexpected error on create_table, can only concatenate str (not "tuple") to str
Steps/Code to reproduce bug
bc.gcs('dir_name', project_id='xxx', bucket_name='xxx', use_default_adc_json_file=False, adc_json_file='../home/david/Downloads/a3e4838767e8.json')
Expected behavior
Using JSON from service account credentials.
Environment overview (please complete the following information)
Docker Pull
>>> bc.create_table('tbl', df)
<pyblazing.apiv2.context.BlazingTable object at 0x7f7c4969cb90>
>>> df
a b
0 0 1.5
1 1 1.5
2 0 1.5
3 1 1.5
4 0 1.5
5 1 1.5
6 0 1.5
7 1 1.5
8 0 1.5
9 1 1.5
>>> bc.sql('SELECT CASE WHEN a <> 1 THEN 0 ELSE b END as s, a, b from tbl')
30956
s a b
0 0.000000e+00 0 1.5
1 4.940656e-324 1 1.5
2 0.000000e+00 0 1.5
3 4.940656e-324 1 1.5
4 0.000000e+00 0 1.5
5 4.940656e-324 1 1.5
6 0.000000e+00 0 1.5
7 4.940656e-324 1 1.5
8 0.000000e+00 0 1.5
9 4.940656e-324 1 1.5
>>> bc.sql('SELECT CASE WHEN a <> 1 THEN 0 ELSE b END as s, a, b from tbl').s.sum()
30956
2.5e-323
>>> int(bc.sql('SELECT CASE WHEN a <> 1 THEN 0 ELSE b END as s, a, b from tbl').s.sum())
30956
0
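For comparison, the expected semantics of the CASE expression in plain Python; the 4.940656e-324 values above are denormal doubles, which suggests the integer literal 0 and the double column b are being combined at the wrong width (this is a sketch of the expected output, not the engine's code path):

```python
a = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
b = [1.5] * 10

# CASE WHEN a <> 1 THEN 0 ELSE b END
s = [0.0 if ai != 1 else bi for ai, bi in zip(a, b)]

expected_sum = sum(s)  # five rows take the ELSE branch: 5 * 1.5 = 7.5
```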
The doc URL is https://blog.blazingdb.com/data-lake-to-ai-blazingsql-rapids-initial-benchmark-aa753031ac8b
To see the E2E workflow, see our Public Github Repo.
You can see the full workload at the link above, but we want to go over a few code snippets to show you how you can expect to interact with BlazingSQL.
The Public Github Repo URL is https://github.com/BlazingDB/blazingsql-public-demos/blob/master/mortgage-xgboost/e2e.py, but it returns a 404.
Describe the bug
All queries run slower on the first launch of BlazingSQL. If I launch a Python kernel with BlazingSQL, all queries and create_table statements run slower. If I restart the kernel and launch BlazingSQL again, it gets dramatically faster.
Steps/Code to reproduce bug
I launched a new GCP server with CUDA 10.0 installed.
I installed miniconda and then installed bsql with dask-cuda and jupyterlab:
The above code will run dramatically faster if I restart the kernel and run it again. Something seems to make the first launched kernel run slower than it should.
Expected behavior
It should run equally fast on the first launch and on the nth launch.
Environment overview (please complete the following information)
n1-standard-8 w/ 4 Tesla T4 GPUs, CUDA 10.0, Ubuntu 16.04
Method of BSQL install: [conda, Docker, or from source]
conda blazingsql-nightly
Additional context
Add any other context about the problem here.
I am trying to do a fresh source build but it seems to be failing with the following stack trace:
-- The following REQUIRED packages have been found:
* GTest
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/conda/envs/bzsqlenv/blazingdb-communication/build
[ 4%] Building CXX object CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/Server.cc.o
[ 9%] Building CXX object CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/Client.cc.o
[ 13%] Building CXX object CMakeFiles/blazingdb-manager.dir/src/blazingdb/manager/Manager.cc.o
[ 18%] Building CXX object CMakeFiles/blazingdb-manager.dir/src/blazingdb/transport/io/fd_reader_writer.cpp.o
[ 22%] Building CXX object CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/io/fd_reader_writer.cpp.o
[ 27%] Building CXX object CMakeFiles/blazingdb-transport.dir/src/blazingdb/manager/Manager.cc.o
/opt/conda/envs/bzsqlenv/blazingdb-communication/src/blazingdb/transport/io/fd_reader_writer.cpp:4:19: fatal error: zmq.hpp: No such file or directory
compilation terminated.
CMakeFiles/blazingdb-manager.dir/build.make:153: recipe for target 'CMakeFiles/blazingdb-manager.dir/src/blazingdb/transport/io/fd_reader_writer.cpp.o' failed
make[2]: *** [CMakeFiles/blazingdb-manager.dir/src/blazingdb/transport/io/fd_reader_writer.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /opt/conda/envs/bzsqlenv/blazingdb-communication/src/blazingdb/transport/Client.cc:4:0:
/opt/conda/envs/bzsqlenv/blazingdb-communication/include/blazingdb/network/TCPSocket.h:13:19: fatal error: zmq.hpp: No such file or directory
compilation terminated.
CMakeFiles/blazingdb-transport.dir/build.make:75: recipe for target 'CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/Client.cc.o' failed
make[2]: *** [CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/Client.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /opt/conda/envs/bzsqlenv/blazingdb-communication/src/blazingdb/transport/Server.cc:4:0:
/opt/conda/envs/bzsqlenv/blazingdb-communication/include/blazingdb/network/TCPSocket.h:13:19: fatal error: zmq.hpp: No such file or directory
compilation terminated.
CMakeFiles/blazingdb-transport.dir/build.make:88: recipe for target 'CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/Server.cc.o' failed
make[2]: *** [CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/Server.cc.o] Error 1
/opt/conda/envs/bzsqlenv/blazingdb-communication/src/blazingdb/transport/io/fd_reader_writer.cpp:4:19: fatal error: zmq.hpp: No such file or directory
compilation terminated.
CMakeFiles/blazingdb-transport.dir/build.make:153: recipe for target 'CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/io/fd_reader_writer.cpp.o' failed
make[2]: *** [CMakeFiles/blazingdb-transport.dir/src/blazingdb/transport/io/fd_reader_writer.cpp.o] Error 1
In file included from /opt/conda/envs/bzsqlenv/blazingdb-communication/src/blazingdb/manager/Manager.cc:6:0:
/opt/conda/envs/bzsqlenv/blazingdb-communication/include/blazingdb/network/TCPSocket.h:13:19: fatal error: zmq.hpp: No such file or directory
compilation terminated.
CMakeFiles/blazingdb-manager.dir/build.make:101: recipe for target 'CMakeFiles/blazingdb-manager.dir/src/blazingdb/manager/Manager.cc.o' failed
make[2]: *** [CMakeFiles/blazingdb-manager.dir/src/blazingdb/manager/Manager.cc.o] Error 1
CMakeFiles/Makefile2:104: recipe for target 'CMakeFiles/blazingdb-manager.dir/all' failed
make[1]: *** [CMakeFiles/blazingdb-manager.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
In file included from /opt/conda/envs/bzsqlenv/blazingdb-communication/src/blazingdb/manager/Manager.cc:6:0:
/opt/conda/envs/bzsqlenv/blazingdb-communication/include/blazingdb/network/TCPSocket.h:13:19: fatal error: zmq.hpp: No such file or directory
compilation terminated.
CMakeFiles/blazingdb-transport.dir/build.make:166: recipe for target 'CMakeFiles/blazingdb-transport.dir/src/blazingdb/manager/Manager.cc.o' failed
make[2]: *** [CMakeFiles/blazingdb-transport.dir/src/blazingdb/manager/Manager.cc.o] Error 1
CMakeFiles/Makefile2:649: recipe for target 'CMakeFiles/blazingdb-transport.dir/all' failed
make[1]: *** [CMakeFiles/blazingdb-transport.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
######################################################################### Build failed blazingdb-communication @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Describe the problems or issues found in the documentation
When installing with Conda, changing the first two commands from this:
mkdir /blazingsql # Make a blazingsql directory in the root folder for Apache Calcite schema management. This requirement will be removed soon.
chown <user_name> /blazingsql
from blazingsql import BlazingContext
import cudf
bc = BlazingContext(dask_client=client)
bc.create_table('lineitem', 's3://bsql_data/tpch_sf1/lineitem/0_0_0.parquet') #this is a user error but a common one
bc.s3('bsql_data', bucket_name='blab', access_key_id='', secret_key='')
bc.create_table('lineitem', 's3://bsql_data/tpch_sf1/lineitem/0_0_0.parquet')
If a user accidentally creates a table before registering the file system, that table doesn't get created properly and it can't be dropped without restarting. When a table fails to be created, it should leave no state behind that would impact creating that same table again.
Describe the bug
Sometimes running the queries with JOIN we see crashes like:
ERROR: CUDA Runtime call cudaStreamSynchronize(stream) in line 283 of file /conda/envs/bzsqlenv/blazingdb-ral/src/Interpreter/interpreter_cpp.cu failed with an illegal memory access was encountered (77).
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
Aborted (core dumped)
1. On the official website, you can see that the data processed by BlazingSQL and Spark is 15.6 GB, which is less than the 16 GB of memory on a T4 GPU. If the data volume exceeds 16 GB, can the T4 GPU still process it?
2. I didn't see an introduction to distributed execution on the official website, including a description of flexible scaling. Is the distributed capability of Apache Arrow used?
Error: TypeError: expected str, bytes or os.PathLike object, not NoneType
Using the install script from this repo's README.md (CUDA 10, Python 3.6), I was able to successfully install BlazingSQL v0.11 on a P-100 instance in Google Colab (after installing RAPIDS AI v0.11, importing cuDF, and making a test Series), but ran into an error trying to import BlazingContext.
Steps/Code to reproduce bug Here's the notebook I ran in Google Colab. It's just the default "Get Started" notebook from rapids.ai with BlazingSQL v0.11 install script added in the cell below the RAPIDS install cell.
Conda install script (run in Google Colab after running rapids-colab.sh and importing cuDF):
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-0b19b5b41f48> in <module>()
----> 1 from blazingsql import BlazingContext
2 frames
/usr/local/lib/python3.6/site-packages/blazingsql/__init__.py in <module>()
1 from pyblazing.apiv2 import S3EncryptionType
2 from pyblazing.apiv2 import DataType
----> 3 from pyblazing.apiv2.context import BlazingContext
/usr/local/lib/python3.6/site-packages/pyblazing/apiv2/context.py in <module>()
41 os.path.join(
42 os.getenv("CONDA_PREFIX"),
---> 43 'lib/blazingsql-algebra.jar'))
44 jpype.addClassPath(
45 os.path.join(
/usr/lib/python3.6/posixpath.py in join(a, *p)
78 will be discarded. An empty last part will result in a path that
79 ends with a separator."""
---> 80 a = os.fspath(a)
81 sep = _get_sep(a)
82 path = a
TypeError: expected str, bytes or os.PathLike object, not NoneType
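The traceback reduces to os.getenv("CONDA_PREFIX") returning None (the variable isn't set in Colab's default environment), which os.path.join then rejects:

```python
import os

# Simulate the Colab environment where CONDA_PREFIX is unset.
os.environ.pop("CONDA_PREFIX", None)

prefix = os.getenv("CONDA_PREFIX")  # None when the variable is unset
try:
    os.path.join(prefix, "lib/blazingsql-algebra.jar")
except TypeError as exc:
    err = str(exc)
```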
Provided alias column names are not being applied to the columns of query results, which are instead generically titled $f0, $f1 ... $fn.
Context
After creating a table ("taxi"), I'm trying to:
extract hour, month, and year from each row of a datetime column (key) with each being a new column titled hours, months, and years (respectively)
find the difference between 2 columns with dropoff and pickup longitude as a new column longitude_distance
find the difference between 2 columns with dropoff and pickup latitudes as a new column latitude_distance
but the new time column names (hours, months, years) are being output as $f0, $f1 and $f2, and the distance column names (longitude_distance, latitude_distance) are being output as $f3 and $f4.
Here's the query and execution:
# define the query
query = '''
SELECT hour(key) as hours, month(key) as months, year(key) - 2000 as years,
dropoff_longitude - pickup_longitude as longitude_distance,
dropoff_latitude - pickup_latitude as latitude_distance,
passenger_count FROM main.taxi
'''
# run query on table
X_train = bc.sql(query).get()
# extract dataframe
X_train_gdf = X_train.columns
# how's that look?
X_train_gdf.head()
The dataframe with incorrect column names is displayed in the output and can be reproduced by downloading and running the notebook locally. Currently there is a manual correction fix in place.
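A manual correction of the kind mentioned could be as simple as renaming the generic columns back to the aliases in SELECT-list order (a sketch; with a cudf DataFrame the last step would be df.rename(columns=rename_map)):

```python
# Map the generic names to the aliases in the order they appear
# in the SELECT list of the query above.
generic = ['$f0', '$f1', '$f2', '$f3', '$f4']
aliases = ['hours', 'months', 'years',
           'longitude_distance', 'latitude_distance']
rename_map = dict(zip(generic, aliases))

# With a cudf/pandas DataFrame this would be:
#   X_train_gdf = X_train_gdf.rename(columns=rename_map)
```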
Expected behavior
Environment overview
Environment location: local and Google Cloud (same issue in both)
Describe the bug
Cannot register HDFS with the BlazingContext; the HDFS registration step fails to connect to HDFS. Tested on both the stable and nightly versions of BlazingSQL.
RAL.log file shows the error trace: "|TRACE|deregisterFileSystem: filesystem authority not found".
Steps/Code to reproduce bug
from blazingsql import BlazingContext
import cudf
bc = BlazingContext()
bc.hdfs('test_dir', host='', port=<port_number>, user='', kerberos_ticket='/path/to/keytablefile.keytab')
Expected behavior
bc.hdfs() should register HDFS successfully without any failures.
Environment overview (please complete the following information)
Environment location: Docker ( and conda install of blazingSQL packages)
Method of cuDF install: conda
If method of install is [Docker], provide docker pull & docker run commands used
Environment details
Please run and paste the output of the print_env.sh script here, to gather any other relevant environment details
Additional context
Add any other context about the problem here.
Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
----For BlazingSQL Developers---- How and where should this be implemented?
In what part of the code should the feature be implemented? What should the APIs and/or classes look like?
**Other design considerations**
What components of the engine could be affected by this? What functions should we make sure we use/reuse?
Testing considerations?
What sort of unit tests and/or end-to-end tests should be implemented to test this?
Our data center has 20 servers, each with 4 Graphic cards.
Can we build one cluster and make full use of all resources (80 NVIDIA 2080 Ti GPUs) to support maximum parallelism?
Thanks!
In BlazingSQL (0.4.4) I want to execute distributed SQL on several GPUs. According to the documentation, the first steps are:
from blazingsql import BlazingContext
import cudf
import dask_cudf
import dask
from dask.distributed import Client
client = Client('127.0.0.1:8786')
However, on client = Client('127.0.0.1:8786') I got the following error: OSError: Timed out trying to connect to 'tcp://127.0.0.1:8786' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f97eb566e80>: ConnectionRefusedError: [Errno 111] Connection refused
I would like to know if there is something else I need to do or if the error is because of something not working in my environment.
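A ConnectionRefusedError at that address usually just means no Dask scheduler is listening on 127.0.0.1:8786 yet; one typically has to be started first, for example with the dask / dask-cuda command-line tools (8786 is the scheduler's default port; these are service-startup commands, shown as a sketch of a minimal local setup):

```shell
# Start a scheduler; it listens on port 8786 by default.
dask-scheduler &

# Start GPU workers pointing at the scheduler (dask-cuda-worker
# comes from the dask-cuda package).
dask-cuda-worker 127.0.0.1:8786 &
```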
-- The following OPTIONAL packages have been found:
* PythonLibs
-- The following REQUIRED packages have been found:
* aws-cpp-sdk-core, <https://aws.amazon.com/sdk-for-cpp/>
AWS SDK for C++ allows to integrate any C++ application with AWS services. Module: aws-cpp-sdk-core
* aws-cpp-sdk-s3, <https://aws.amazon.com/sdk-for-cpp/>
AWS SDK for C++ allows to integrate any C++ application with AWS services. Module: aws-cpp-sdk-s3
* aws-cpp-sdk-kms, <https://aws.amazon.com/sdk-for-cpp/>
AWS SDK for C++ allows to integrate any C++ application with AWS services. Module: aws-cpp-sdk-kms
* aws-cpp-sdk-s3-encryption, <https://aws.amazon.com/sdk-for-cpp/>
AWS SDK for C++ allows to integrate any C++ application with AWS services. Module: aws-cpp-sdk-s3-encryption
* storage_client, <https://github.com/googleapis/google-cloud-cpp>
Google Cloud Client Library for C++
* Threads
* GTest
-- The following OPTIONAL packages have not been found:
* PkgConfig
-- Configuring done
-- Generating done
-- Build files have been written to: /conda/envs/bsql/blazingsql/engine/build
make -j all && ctest
Scanning dependencies of target blazingsql-engine
[ 0%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/config/BlazingConfig.cpp.o
[ 2%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/config/GPUManager.cu.o
[ 2%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/exception/RalException.cpp.o
[ 2%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/Schema.cpp.o
[ 3%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/ParquetParser.cpp.o
[ 3%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/operators/OrderBy.cpp.o
[ 5%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/operators/JoinOperator.cpp.o
[ 5%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/operators/GroupBy.cpp.o
[ 6%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_provider/UriDataProvider.cpp.o
[ 7%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/Traits/RuntimeTraits.cpp.o
[ 8%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/CSVParser.cpp.o
[ 9%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/JSONParser.cpp.o
[ 10%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/utilities/RalColumn.cpp.o
[ 10%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/GDFParser.cpp.o
[ 13%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/utilities/StringUtils.cpp.o
[ 13%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/Config/Config.cpp.o
[ 13%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/OrcParser.cpp.o
[ 14%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/ArgsUtil.cpp.o
[ 14%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/data_parser/ParserUtil.cpp.o
[ 15%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/io/DataLoader.cpp.o
[ 15%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/utilities/CommonOperations.cpp.o
[ 16%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/Interpreter/interpreter_cpp.cu.o
[ 18%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/CalciteInterpreter.cpp.o
[ 18%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/CalciteExpressionParsing.cpp.o
[ 18%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/utilities/TableWrapper.cpp.o
[ 18%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/ColumnManipulation.cu.o
[ 19%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/ResultSetRepository.cpp.o
[ 22%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/GDFColumn.cu.o
[ 22%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/JoinProcessor.cpp.o
[ 22%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/GDFCounter.cu.o
[ 22%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/LogicalFilter.cpp.o
[ 24%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/cython/initialize.cpp.o
[ 24%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/Utils.cu.o
[ 24%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/QueryState.cpp.o
[ 25%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/cython/io.cpp.o
[ 26%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/CodeTimer.cpp.o
[ 27%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/parser/expression_utils.cpp.o
[ 27%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/cython/errors.cpp.o
[ 28%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/cython/engine.cpp.o
[ 29%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/cuDF/Allocator.cpp.o
[ 29%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/cuDF/generator/sample_generator.cu.o
[ 30%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/communication/factory/MessageFactory.cpp.o
[ 32%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/communication/network/Client.cpp.o
[ 32%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/communication/CommunicationData.cpp.o
[ 32%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/communication/network/Server.cpp.o
[ 34%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/distribution/NodeColumns.cpp.o
[ 34%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/distribution/NodeSamples.cpp.o
[ 35%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/distribution/Exception.cpp.o
In file included from /conda/envs/bsql/blazingsql/engine/src/CodeTimer.cpp:8:0:
/conda/envs/bsql/blazingsql/engine/src/CodeTimer.h:11:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:452: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/CodeTimer.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/CodeTimer.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 36%] Building CUDA object CMakeFiles/blazingsql-engine.dir/src/distribution/primitives_util.cu.o
[ 36%] Building CXX object CMakeFiles/blazingsql-engine.dir/src/distribution/primitives.cpp.o
In file included from /conda/envs/bsql/blazingsql/engine/src/communication/network/Client.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/communication/network/Client.h:3:47: fatal error: blazingdb/manager/NodeDataMessage.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:608: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/communication/network/Client.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/communication/network/Client.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/communication/network/Server.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/communication/network/Server.h:3:41: fatal error: blazingdb/transport/Message.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:621: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/communication/network/Server.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/communication/network/Server.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/communication/CommunicationData.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/communication/CommunicationData.h:4:38: fatal error: blazingdb/transport/Node.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:634: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/communication/CommunicationData.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/communication/CommunicationData.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/operators/OrderBy.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/operators/OrderBy.h:5:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
In file included from /conda/envs/bsql/blazingsql/engine/src/operators/GroupBy.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/operators/GroupBy.h:5:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:101: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/operators/OrderBy.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/operators/OrderBy.cpp.o] Error 1
CMakeFiles/blazingsql-engine.dir/build.make:127: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/operators/GroupBy.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/operators/GroupBy.cpp.o] Error 1
/conda/envs/bsql/blazingsql/engine/src/cython/initialize.cpp:19:50: fatal error: blazingdb/transport/io/reader_writer.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:517: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/cython/initialize.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/cython/initialize.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/distribution/NodeSamples.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/distribution/NodeSamples.h:5:38: fatal error: blazingdb/transport/Node.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:673: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/distribution/NodeSamples.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/distribution/NodeSamples.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/distribution/primitives.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/distribution/primitives.h:6:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:686: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/distribution/primitives.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/distribution/primitives.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/operators/JoinOperator.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/operators/JoinOperator.h:4:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:114: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/operators/JoinOperator.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/operators/JoinOperator.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/communication/messages/ComponentMessages.h:3:0,
from /conda/envs/bsql/blazingsql/engine/src/communication/factory/MessageFactory.h:3,
from /conda/envs/bsql/blazingsql/engine/src/communication/factory/MessageFactory.cpp:1:
/conda/envs/bsql/blazingsql/engine/src/communication/messages/GPUComponentMessage.h:4:41: fatal error: blazingdb/transport/Address.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:595: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/communication/factory/MessageFactory.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/communication/factory/MessageFactory.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/distribution/NodeColumns.cpp:1:0:
/conda/envs/bsql/blazingsql/engine/src/distribution/NodeColumns.h:5:38: fatal error: blazingdb/transport/Node.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:660: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/distribution/NodeColumns.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/distribution/NodeColumns.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/io/DataLoader.cpp:2:0:
/conda/envs/bsql/blazingsql/engine/src/io/DataLoader.h:16:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/io/DataLoader.cpp.o] Error 1
CMakeFiles/blazingsql-engine.dir/build.make:348: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/io/DataLoader.cpp.o' failed
In file included from /conda/envs/bsql/blazingsql/engine/src/CalciteInterpreter.h:9:0,
from /conda/envs/bsql/blazingsql/engine/src/CalciteInterpreter.cpp:1:
/conda/envs/bsql/blazingsql/engine/src/io/DataLoader.h:16:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
In file included from /conda/envs/bsql/blazingsql/engine/src/LogicalFilter.cpp:19:0:
/conda/envs/bsql/blazingsql/engine/src/CodeTimer.h:11:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
In file included from /conda/envs/bsql/blazingsql/engine/src/cython/io.cpp:2:0:
/conda/envs/bsql/blazingsql/engine/src/cython/../io/DataLoader.h:16:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
CMakeFiles/blazingsql-engine.dir/build.make:426: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/LogicalFilter.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/LogicalFilter.cpp.o] Error 1
CMakeFiles/blazingsql-engine.dir/build.make:374: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/CalciteInterpreter.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/CalciteInterpreter.cpp.o] Error 1
CMakeFiles/blazingsql-engine.dir/build.make:530: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/cython/io.cpp.o' failed
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/cython/io.cpp.o] Error 1
In file included from /conda/envs/bsql/blazingsql/engine/src/cython/../CalciteInterpreter.h:9:0,
from /conda/envs/bsql/blazingsql/engine/src/cython/engine.cpp:2:
/conda/envs/bsql/blazingsql/engine/src/cython/../io/DataLoader.h:16:39: fatal error: blazingdb/manager/Context.h: No such file or directory
compilation terminated.
make[2]: *** [CMakeFiles/blazingsql-engine.dir/src/cython/engine.cpp.o] Error 1
CMakeFiles/blazingsql-engine.dir/build.make:556: recipe for target 'CMakeFiles/blazingsql-engine.dir/src/cython/engine.cpp.o' failed
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h: In function ‘const char* _cudaGetErrorEnum(cudaError_t)’:
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorSystemNotReady’ not handled in switch [-Wswitch]
switch(error) {
^
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorIllegalState’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureUnsupported’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureInvalidated’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureMerge’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureUnmatched’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureUnjoined’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureIsolation’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorStreamCaptureImplicit’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:38:8: warning: enumeration value ‘cudaErrorCapturedEvent’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h: In function ‘const char* _cudaGetErrorEnum(CUresult)’:
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_ILLEGAL_STATE’ not handled in switch [-Wswitch]
switch(error) {
^
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_SYSTEM_NOT_READY’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_UNSUPPORTED’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_INVALIDATED’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_MERGE’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_UNMATCHED’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_UNJOINED’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_ISOLATION’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_STREAM_CAPTURE_IMPLICIT’ not handled in switch [-Wswitch]
/conda/envs/bsql/blazingsql/engine/src/Interpreter/helper_cuda.h:220:8: warning: enumeration value ‘CUDA_ERROR_CAPTURED_EVENT’ not handled in switch [-Wswitch]
CMakeFiles/Makefile2:756: recipe for target 'CMakeFiles/blazingsql-engine.dir/all' failed
make[1]: *** [CMakeFiles/blazingsql-engine.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
cp libblazingsql-engine.so /conda/envs/bsql/lib/libblazingsql-engine.so
cp: cannot stat 'libblazingsql-engine.so': No such file or directory
The command '/bin/bash -c source activate bsql && cd $CONDA_PREFIX && git clone https://github.com/BlazingDB/blazingsql.git && cd blazingsql && export CUDACXX=/usr/local/cuda/bin/nvcc && conda/recipes/blazingsql/build.sh' returned a non-zero code: 1
**Flattening JSON**
The documentation says all JSON functionality is supported via cuIO, but it is unclear whether this can be performed from SQL.
Describe the solution you'd like
Reading a JSON document using SQL functions and flattening it into a table, plus other functions that help with data validation; similarly for XML documents.
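Absent a built-in SQL function, nested JSON can be flattened on the client side before the result is registered as a table. A minimal sketch in plain Python; the dot-joined column-naming scheme is an illustrative assumption, not BlazingSQL behavior:

```python
import json

def flatten(record, prefix=""):
    """Flatten one nested JSON object into a single-level dict,
    joining nested keys with dots (naming scheme is an assumption)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

doc = json.loads('{"id": 1, "meta": {"color": "cyan", "size": {"w": 3}}}')
row = flatten(doc)
# row == {"id": 1, "meta.color": "cyan", "meta.size.w": 3}
```

A list of such flattened rows could then be loaded into a cudf.DataFrame and registered with `bc.create_table` as usual.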
When I try to read the attached table twice, it works the first time but fails on the second read. I also checked with cuDF, which reads the file correctly.
BlazingContext ready
0 1 2 3 4 ... 17 18 19 20 21
0 17820 AAAAAAAAAAAABAJK 2000-07-09 None regular somas past the fluffy braids engage up... ... cyan Oz Unknown 57 8MJdS2aYdSydZ9Zw6llsXatytb6AUrj42owNPSpbbr0ARL
1 17821 AAAAAAAAAAAABAJL 2000-08-23 None quiet idle hockey players would was. enticing ... ... metallic Pallet Unknown 33 bmqB7bBbNCSJvGawoJGX25VQD24bX
[2 rows x 22 columns]
Worked 0 time
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
Aborted (core dumped)
Environment overview (please complete the following information)
Environment location: Docker
Method of cuDF install: Docker (Docker nightly on 16th October)
Additional context
This is most probably related to the issue of reading files in a certain order that was discussed on Slack.
BlazingSQL is unable to find S3 files that are confirmed to actually be in S3.
ParseSchemaError                          Traceback (most recent call last)
ParseSchemaError: [ParseSchema Error] Path '/test/rocky2/restid=000002/' does not exist.
File or directory paths are expected to be in one of the following formats:
For local file paths: '/folder0/folder1/fileName.extension'
For local file paths with wildcard: '/folder0/folder1/*fileName*.*'
For local directory paths: '/folder0/folder1/'
For s3 file paths: 's3://registeredFileSystemName/folder0/folder1/fileName.extension'
For s3 file paths with wildcard: '/folder0/folder1/*fileName*.*'
For s3 directory paths: 's3://registeredFileSystemName/folder0/folder1/'
For gs file paths: 'gs://registeredFileSystemName/folder0/folder1/fileName.extension'
For gs file paths with wildcard: '/folder0/folder1/*fileName*.*'
For gs directory paths: 'gs://registeredFileSystemName/folder0/folder1/'
For HDFS file paths: 'hdfs://registeredFileSystemName/folder0/folder1/fileName.extension'
For HDFS file paths with wildcard: '/folder0/folder1/*fileName*.*'
For HDFS directory paths: 'hdfs://registeredFileSystemName/folder0/folder1/'
Exception ignored in: 'cio.parseSchemaPython'
cio.ParseSchemaError: [ParseSchema Error] Path '/test/rocky2/restid=000002/' does not exist.
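The error text enumerates the accepted path shapes, so a small client-side pre-check can catch malformed paths before they reach the engine. A rough sketch; the accepted schemes are taken from the message above, while the helper itself is hypothetical, not part of the BlazingSQL API:

```python
# Accepted remote URI schemes, per the ParseSchema error message.
SCHEMES = ("s3://", "gs://", "hdfs://")

def looks_valid(path):
    """Rough pre-check that a path matches one of the documented shapes.
    Illustrative helper only, not part of the BlazingSQL API."""
    if path.startswith(SCHEMES):
        # remote paths need a registered filesystem name plus a path segment
        rest = path.split("://", 1)[1]
        return bool(rest) and "/" in rest
    # local paths must be absolute: file, wildcard, or directory
    return path.startswith("/")

assert looks_valid("s3://myfs/folder0/file.parquet")
assert looks_valid("/test/rocky2/restid=000002/")
assert not looks_valid("relative/path.csv")
```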
I start the BlazingContext and set the s3 details like so:
Would like to be able to specify compression like one can in cudf.read_csv.
Describe alternatives you've considered
Calling read_csv directly and then creating a table, but this is complicated and uses more memory when the query only requires a few columns.
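The workaround's cost can be illustrated in plain Python: gzip decompression is done manually before parsing, which is exactly the extra full materialization the feature request wants to avoid (the sample data is illustrative):

```python
import csv
import gzip
import io

# Simulate a gzip-compressed CSV in memory (illustrative data).
raw = "key,val\na,7.6\nb,2.9\n".encode()
compressed = gzip.compress(raw)

# Workaround: decompress the whole file first, then parse it.
# This materializes every row and column, even if a query
# would only need a few columns.
text = gzip.decompress(compressed).decode()
rows = list(csv.DictReader(io.StringIO(text)))
# rows[0] == {"key": "a", "val": "7.6"}
```

A `compression` parameter on the table-creation path would let the engine do this internally and read only what the query needs.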
Why do we need window functions?
When working with big data, window functions help slice things out, such as removing duplicates with RANK / ROW_NUMBER / DENSE_RANK; without these built-in functions it is hard and complex to remove duplicates. A few use cases are listed below:
Running totals within groups
Arithmetic calculations like MAX, MIN, AVG within a group
FIRST_VALUE / LAST_VALUE / NTH_VALUE within a group
The above are a few use cases of standard window functions. There is also calculating or filling data over a sliding window (by range or rows) within a group, e.g. sales for the last 7 days from the current day (sliding by range from the current date in the row back over the last 7 days within the window).
Describe the solution you'd like
For implementation please refer to Hive/Presto/Pandas/Spark SQL/MySQL/Postgres implementations
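The requested semantics can be sketched in plain Python; this is only an illustration of ROW_NUMBER and a running total per partition, not BlazingSQL code:

```python
from collections import defaultdict

# (group, amount) rows, illustrative data
rows = [("a", 10), ("b", 5), ("a", 20), ("a", 30), ("b", 15)]

# Sketch of ROW_NUMBER() OVER (PARTITION BY group) and a
# running SUM(amount) OVER (PARTITION BY group), in input order.
counters = defaultdict(int)
totals = defaultdict(int)
result = []
for group, amount in rows:
    counters[group] += 1          # row number within the partition
    totals[group] += amount      # running total within the partition
    result.append((group, amount, counters[group], totals[group]))

# result[2] == ("a", 20, 2, 30)   # second 'a' row, running total 10+20
```

Deduplication then falls out naturally: keeping only rows where the row number equals 1 keeps one row per group.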
Primary code included (the app crashes every time), but I am struggling to reproduce it standalone. The same healthcheck run in the same container from an empty Python env (no surrounding GPU app running) passes.
Maybe there's a way to pull out lower-level logs/traces? Let me know.
import asyncio
import uuid

import cudf
from blazingsql import BlazingContext

bc = BlazingContext()

async def run_query(sql, tables):
    out = None
    for table_name in tables:
        bc.create_table(table_name, tables[table_name])
    try:
        out = bc.sql(sql)
    except Exception as e:
        print('blazing run_query err', e)
        raise
    finally:
        for table_name in tables:
            bc.drop_table(table_name)
    return out

async def healthcheck():
    print('start')
    nines = cudf.DataFrame({'a': [9, 9, 9, 9, 9], 'b': [0, 1, 2, 3, 4]})
    table_name = 'u' + uuid.uuid4().hex
    tables = {table_name: nines}
    gdf = await run_query('select SUM(a * b) as cross_product from ' + table_name, tables)
    print('end')

asyncio.run(healthcheck())
=>
forge-etl-python_1 | 2019-11-27T20:50:55.091791578Z start
forge-etl-python_1 | 2019-11-27T20:50:55.642078285Z #
forge-etl-python_1 | 2019-11-27T20:50:55.642138234Z # A fatal error has been detected by the Java Runtime Environment:
forge-etl-python_1 | 2019-11-27T20:50:55.642144365Z #
forge-etl-python_1 | 2019-11-27T20:50:55.642162646Z # SIGSEGV (0xb) at pc=0x00007f99c1bcea10, pid=6, tid=0x00007f9a36183740
forge-etl-python_1 | 2019-11-27T20:50:55.642169205Z #
forge-etl-python_1 | 2019-11-27T20:50:55.642181950Z # JRE version: OpenJDK Runtime Environment (8.0_192-b01) (build 1.8.0_192-b01)
forge-etl-python_1 | 2019-11-27T20:50:55.642187416Z # Java VM: OpenJDK 64-Bit Server VM (25.192-b01 mixed mode linux-amd64 compressed oops)
forge-etl-python_1 | 2019-11-27T20:50:55.642191729Z # Problematic frame:
forge-etl-python_1 | 2019-11-27T20:50:55.642527428Z # C [libblazingsql-engine.so+0xfaa10] gdf_column_cpp::dtype() const+0x0
forge-etl-python_1 | 2019-11-27T20:50:55.642561387Z #
forge-etl-python_1 | 2019-11-27T20:50:55.642569283Z # Core dump written. Default location: /opt/graphistry/apps/forge/etl-server-python/core or core.6
forge-etl-python_1 | 2019-11-27T20:50:55.642576476Z #
forge-etl-python_1 | 2019-11-27T20:50:55.644013778Z # An error report file with more information is saved as:
forge-etl-python_1 | 2019-11-27T20:50:55.644034209Z # /tmp/hs_err_pid6.log
forge-etl-python_1 | 2019-11-27T20:50:55.677110517Z #
forge-etl-python_1 | 2019-11-27T20:50:55.677153111Z # If you would like to submit a bug report, please visit:
forge-etl-python_1 | 2019-11-27T20:50:55.677159727Z # http://www.azulsystems.com/support/
forge-etl-python_1 | 2019-11-27T20:50:55.677164188Z # The crash happened outside the Java Virtual Machine in native code.
forge-etl-python_1 | 2019-11-27T20:50:55.677168627Z # See problematic frame for where to report the bug.
forge-etl-python_1 | 2019-11-27T20:50:55.677172691Z #
ubuntu@ip-172-31-28-246:~/graphistry$
Environment overview (please complete the following information)
Describe the bug
A clear and concise description of what the bug is.
Calcite becomes unstable after I run a create_table passing only the table name and file path.
Steps/Code to reproduce bug
When I create a table with the following code:
bc.create_table('nation', table_list)
That is, without these params: delimiter='|', dtype=column_types, names=column_names
Expected behavior
The SQL script must be executed correctly.
Environment overview (please complete the following information)
Docker
Additional context
I got this error message on Algebra:
ERROR: org.hibernate.AssertionFailure - an assertion failure occured (this may indicate a bug in Hibernate, but is more likely due to unsafe use of the session)
org.hibernate.AssertionFailure: null id in com.blazingdb.calcite.catalog.domain.CatalogColumnImpl entry (don't flush the Session after an exception occurs)
at org.hibernate.event.def.DefaultFlushEntityEventListener.checkId(DefaultFlushEntityEventListener.java:82)
at org.hibernate.event.def.DefaultFlushEntityEventListener.getValues(DefaultFlushEntityEventListener.java:190)
at org.hibernate.event.def.DefaultFlushEntityEventListener.onFlushEntity(DefaultFlushEntityEventListener.java:147)
at org.hibernate.event.def.AbstractFlushingEventListener.flushEntities(AbstractFlushingEventListener.java:219)
at org.hibernate.event.def.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:99)
at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:50)
at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1216)
at org.hibernate.impl.SessionImpl.managedFlush(SessionImpl.java:383)
at org.hibernate.transaction.JDBCTransaction.commit(JDBCTransaction.java:133)
at com.blazingdb.calcite.catalog.repository.DatabaseRepository.createDatabase(DatabaseRepository.java:58)
at com.blazingdb.calcite.catalog.repository.DatabaseRepository.updateDatabase(DatabaseRepository.java:156)
at com.blazingdb.calcite.catalog.connection.CatalogServiceImpl.dropTable(CatalogServiceImpl.java:36)
at com.blazingdb.calcite.application.CalciteService.processRequest(CalciteService.java:78)
at com.blazingdb.calcite.application.TCPService.run(TCPService.java:134)
at java.base/java.lang.Thread.run(Thread.java:834)
Waiting for messages in TCP port: 8890
I think pyBlazing should validate the params before executing the SQL script.
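One way to validate parameters before the request ever reaches Calcite would be to sniff the delimiter and header on the client side. A rough sketch with Python's standard csv module; this helper is hypothetical, not part of pyBlazing:

```python
import csv

def sniff_csv_params(sample):
    """Guess delimiter and header presence from a text sample.
    Illustrative pre-validation; raises csv.Error if the sample
    cannot be parsed, instead of destabilizing the catalog."""
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)
    has_header = sniffer.has_header(sample)
    return dialect.delimiter, has_header

# A pipe-delimited sample shaped like the TPC-H nation table.
sample = "0|ALGERIA|0|haggle. carefully final deposits\n" \
         "1|ARGENTINA|1|al foxes\n"
delimiter, has_header = sniff_csv_params(sample)
# delimiter == "|"
```

Failing fast with a clear Python exception would avoid leaving a half-flushed Hibernate session on the Calcite side.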
As we all know, you do not always scan the whole table; sometimes we only look at the incremental load, for which we need to apply a (sub)partition key to the table.
In the distributed world it is good to have sort-key options, both per-worker and global, which gives the developer an option for optimization.
Describe the solution you'd like
Look at the Hive implementation.
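Hive-style partitioning encodes the partition key in the directory name (e.g. restid=000002, as in the S3 issue above); pruning then reduces to filtering paths before any data is read. A minimal sketch in plain Python; the paths and the helper are illustrative:

```python
def partition_value(path, key):
    """Extract a Hive-style 'key=value' segment from a path.
    Illustrative helper; returns None if the key is absent."""
    for segment in path.strip("/").split("/"):
        if segment.startswith(key + "="):
            return segment.split("=", 1)[1]
    return None

# Prune partitions before scanning: only matching dirs are read.
paths = [
    "/test/rocky2/restid=000001/part-0.parquet",
    "/test/rocky2/restid=000002/part-0.parquet",
]
selected = [p for p in paths if partition_value(p, "restid") == "000002"]
# selected == ["/test/rocky2/restid=000002/part-0.parquet"]
```

With this layout, an incremental load touches only the partitions whose key values match the query predicate.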
What is your question?
Hello. I'm trying to filter my data set using the LIKE filter, but I always get the same result regardless of the filter value I use, so I'm wondering whether LIKE is actually supported.
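For reference, SQL LIKE semantics can be emulated client-side by translating the pattern into a regular expression, which is useful for checking what a LIKE filter should have returned. A quick sketch in plain Python, not BlazingSQL internals:

```python
import re

def like(value, pattern):
    """Emulate SQL LIKE: '%' matches any run of characters,
    '_' matches exactly one; everything else is literal."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value) is not None

assert like("cyan", "cy%")
assert like("metallic", "%tal%")
assert not like("cyan", "c_n")   # 'c_n' matches exactly 3 characters
```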