Comments (7)

superhadooper commented on May 18, 2024

Here is an example Scala snippet that reproduces the issue:
spark.read.bigquery("<large_table>").select("_col0").count()

We can clearly see, using a network traffic monitor, that all columns are pulled to Spark (instead of 0 columns) with the current JAR (gs://spark-lib/bigquery/spark-bigquery-latest.jar).

from spark-bigquery-connector.

davidrabinowitz commented on May 18, 2024

Debugging this issue, it seems that not always spark provides those fields to the connector. In the upcoming release (0.12.0-beta) I will add additional logging so that the pushed-down columns and filters (the WHERE conditions) are more transparent.

aray commented on May 18, 2024

Do you have a reproducible example where Spark does not provide the columns?

davidrabinowitz commented on May 18, 2024

Version 0.12.0-beta adds logging for the columns and filters the connector receives from the Spark DataSource API, and for those it pushes down to BigQuery.

@superhadooper Can you please try again?
@aray As mentioned in the README and as you found out, the BigQuery Storage API does not allow a zero-column projection. The suggestion is to select the smallest field for the count, to minimize data transfer. The new logging should help clarify what is being pushed down.

aray commented on May 18, 2024

@davidrabinowitz Thanks for updating the README with a workaround. I would quickly note, though, that count(col) is only equivalent to count(*) if col is non-null.
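
To make the difference concrete, here is a minimal plain-Scala sketch (no Spark involved). It models a nullable column as a Seq[Option[Int]], with None standing in for SQL NULL; the object and method names are hypothetical and chosen only for illustration.

```scala
// Hypothetical model of a nullable column: None stands in for SQL NULL.
object CountSemantics {
  // count(*): every row counts, whatever it contains.
  def countStar(column: Seq[Option[Int]]): Long = column.size.toLong

  // count(col): only rows where col is not NULL count.
  def countCol(column: Seq[Option[Int]]): Long =
    column.count(_.isDefined).toLong

  def main(args: Array[String]): Unit = {
    val column = Seq(Some(1), None, Some(3), None)
    println(s"count(*)   = ${countStar(column)}") // 4 rows in total
    println(s"count(col) = ${countCol(column)}")  // 2 rows are non-null
  }
}
```

The two results only agree when the column contains no NULL values.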

I see two logical ways to solve this.

  1. Push the requirement upstream that the BigQuery Storage API needs to support zero-column projections. Since the API is still in beta, maybe there is still a chance to change it?
  2. Special-case zero-column projections in this connector: run select count(*) from $t where $f in BigQuery and then generate that number of empty rows.

Do you see any other options?
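
As a rough illustration of option 2, here is a plain-Scala sketch that does not use Spark or BigQuery. Everything in it is an assumption made for the example: buildScan, countRows and scanColumns are hypothetical names, countRows stands in for running the count(*) query, and scanColumns stands in for a regular column read.

```scala
// Sketch of option 2. All names are hypothetical, not connector APIs.
object ZeroColumnScan {
  type Row = Seq[Any]

  // requiredColumns: the projection handed over by Spark.
  // countRows:       stand-in for `select count(*) from $t where $f`.
  // scanColumns:     stand-in for a normal read of the given columns.
  def buildScan(
      requiredColumns: Seq[String],
      countRows: () => Long,
      scanColumns: Seq[String] => Iterator[Row]): Iterator[Row] =
    if (requiredColumns.isEmpty) {
      // No column data is needed, only the number of rows,
      // so emit that many empty rows without reading any data.
      val n = countRows()
      Iterator.iterate(0L)(_ + 1).takeWhile(_ < n).map(_ => Seq.empty[Any])
    } else {
      scanColumns(requiredColumns)
    }

  def main(args: Array[String]): Unit = {
    // Pretend the table has 3 rows; the scan function is never called here.
    val rows = buildScan(Seq.empty, () => 3L, _ => Iterator.empty)
    println(rows.size)
  }
}
```

The empty rows are produced lazily, so the row count can exceed Int.MaxValue without materializing anything.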

To your prior comment:

"it seems that not always spark provides those fields to the connector"

I'm curious because column projection pushdown is used by many other sources, so if you can reproduce that, then it's a bug in Spark that needs to be fixed.

For reference, the Spark ORC source had a similar issue with zero-column projections that I fixed a little over 3 years ago: apache/spark#15898

davidrabinowitz commented on May 18, 2024

@aray Thanks for your notes. Please note that df.select(col).count() does not necessarily mean count(col), as Spark seems to read the rows regardless of their content.

I tried to add special treatment for count, and it would have worked with an RDD rather than a DataFrame. With a DataFrame the connector is limited to the DataSource API, and unfortunately there is no hook for performing the count() action. We haven't given up on this, and we will definitely try other approaches, including at the API level.

I can't seem to find the case where I had issues with column projections, perhaps because caching was involved? I have added further logging in the latest release (0.12.0-beta) to help debug such cases.

davidrabinowitz commented on May 18, 2024

Should work now.
