qwshen / spark-flight-connector

A Spark connector that reads data from and writes data to Arrow Flight endpoints using Arrow Flight and Flight SQL.

License: GNU General Public License v3.0

Java 92.05% Scala 7.95%
apache-arrow apache-spark arrow arrow-flight dremio sql apache-flight data-source-api flight-sql spark-connector
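For orientation, a minimal read with this connector might look like the sketch below. This is a sketch only: the format name and option spellings are taken from the issues that follow, and the host, credentials and table name are hypothetical placeholders.

// Minimal read sketch (Scala, spark-shell). Option names follow the issues below;
// host, credentials and table are hypothetical placeholders.
val df = spark.read.format("flight")
  .option("host", "<flight-endpoint-host>")
  .option("port", "32010")
  .option("user", "<user>")
  .option("password", "<password>")
  .option("table", """"my_space"."my_table"""")
  .load()

df.printSchema()
df.show(10)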

spark-flight-connector's Issues

Spark job count doubles for the same DataFrame and the same operation

Hi Wayne

I have a Parquet file in Dremio and used the query below to read it in my Spark setup:
val df = spark.read
  .option("host", "[SPARK_IP]")
  .option("port", 32010)
  .option("tls.enabled", false)
  .option("tls.verifyServer", false)
  .option("user", "user")
  .option("password", "password")
  .option("partition.size", 320)
  .option("partition.byColumn", "COLUMN1")
  .flight(""""dremio_space"."file"""")
  .filter("COLUMN == 22")

After the query above, I executed df.count: 320 Spark jobs ran to compute the count (count = 15). When I ran df.count a second time, 640 jobs ran (count = 30), and on a third run, 960 jobs ran (count = 45).
Do you have any idea why both the number of jobs and the count keep increasing with each execution?
The same issue happens when I run a GROUP BY on the DataFrame; see the sketch below.
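
For reference, below is a minimal way to check whether the DataFrame itself is stable across repeated actions. It assumes the format("flight") / "table" option spelling used in the other issue in this repository and keeps the [SPARK_IP] placeholder; caching before repeated actions rules out fresh reads from the Flight endpoint between counts. This is a diagnostic sketch, not a confirmed fix.

// Diagnostic sketch only: read once, cache, then run the same actions repeatedly.
// format("flight") and the "table" option are assumed from the connector's other issue;
// [SPARK_IP] is the placeholder from the report above.
val flightDf = spark.read.format("flight")
  .option("host", "[SPARK_IP]")
  .option("port", "32010")
  .option("user", "user")
  .option("password", "password")
  .option("table", """"dremio_space"."file"""")
  .load()
  .filter("COLUMN == 22")

val cached = flightDf.cache()
cached.count()   // first action materialises the cache
cached.count()   // repeated counts should return the same value without re-reading from Flight
cached.groupBy("COLUMN1").count().show()   // the GROUP BY case mentioned above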

Please let me know what I should do to fix this issue.

Thanks
Nagaraja M M

Spark UNAUTHENTICATED issue with Dremio

Hi

I have set up Dremio v23.0.1 with a master and an executor, plus Spark v3.2.2, on Linux machines; all nodes run on separate VMs.
I can log in to Dremio and it works fine.
I also tried connecting to Arrow Flight from a Python script; that works and is able to fetch data.

I built spark-flight-connector into a JAR and used the command below to start Spark:
./spark-shell --master local[*] \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --jars ./spark-flight-connector_3.2.1-1.0.1.jar \
  --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
  --conf spark.sql.catalog.iceberg_catalog.warehouse=file:///home/name/tem/data

But when I try the Spark code below to fetch data, it throws an UNAUTHENTICATED error:
spark.read.format("flight").option("host", "").option("port", "32010").option("user", "test").option("password", "test").option("table", """"name"."table"""").load

Please help me understand how to read data from Arrow Flight using Spark.

Thanks
Nagaraj M M
