When a simple count(*) is run on a table, Spark pushes down a zero column projection.


Zero column projection not handled correctly, about googleclouddataproc/spark-bigquery-connector

Comments (7)

superhadooper commented on May 18, 2024

Here is an example Scala snippet that reproduces the issue:
spark.read.bigquery("<large_table>").select("_col0").count()

Using a network traffic monitor, we can clearly see that all columns are pulled to Spark (instead of zero columns) with the current JAR (gs://spark-lib/bigquery/spark-bigquery-latest.jar).

from spark-bigquery-connector.

davidrabinowitz commented on May 18, 2024

Debugging this issue it seems that not always spark provides those fields to the connector. In the upcoming release (0.12.0-beta) I will add additional logging so that the pushed down columns and filters (the WHERE conditions) will be more transparent.


aray commented on May 18, 2024

Do you have a reproducible example where Spark does not provide the columns?


davidrabinowitz commented on May 18, 2024

Version 0.12.0-beta adds logging for the columns and filters the connector receives from the Spark DataSource API, and for those it pushes down to BigQuery.

@superhadooper Can you please try again?
@aray As mentioned in the README and as you found out, the BigQuery Storage API does not allow us to have a zero column projection. The suggestion is to select the smallest field for the count, for minimal data transfer. The new logging should help to understand what is being pushed down.


aray commented on May 18, 2024

@davidrabinowitz Thanks for updating the README with a workaround. I would quickly note, though, that count(col) is only equivalent to count(*) if col is non-null.
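
To illustrate the SQL semantics behind that note, here is a minimal sketch in plain Scala with no Spark dependency. The object name and the sample rows are hypothetical; a nullable column is modeled as Option[String], which is an assumption made only for this example.

```scala
// Hypothetical in-memory stand-in for a table with one nullable column.
object CountSemantics {
  // None models a NULL value in the column.
  val rows: Seq[Option[String]] = Seq(Some("a"), None, Some("c"), None)

  // count(*) counts every row, regardless of content.
  val countStar: Int = rows.size

  // count(col) counts only the rows where col is non-null.
  val countCol: Int = rows.count(_.isDefined)
}
```

The two values differ as soon as the column holds at least one null, which is why the workaround matches count(*) only for non-null columns.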

I see two logical ways to solve this.

1. Push the requirement upstream: the BigQuery Storage API needs to support zero column projections. Since the API is still in beta, maybe there is still a chance to change it?
2. Special-case zero column projections in this connector: run select count(*) from $t where $f in BigQuery and then generate the given number of empty rows.
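
The second option could be sketched roughly as follows. This is not the connector's code: EmptyRow and runCountQuery are hypothetical names, and the count query is passed in as a plain function so the sketch stays self-contained and makes no assumption about how the connector would actually call BigQuery.

```scala
object ZeroColumnProjection {
  // Hypothetical stand-in for a row that carries no columns.
  final case class EmptyRow()

  // runCountQuery is a hypothetical hook representing
  // "select count(*) from $t where $f" executed in BigQuery.
  def emptyRows(runCountQuery: () => Long): Iterator[EmptyRow] = {
    val n = runCountQuery()
    // Lazily produce exactly n empty rows without materializing them all.
    Iterator.iterate(0L)(_ + 1).takeWhile(_ < n).map(_ => EmptyRow())
  }
}
```

For example, ZeroColumnProjection.emptyRows(() => 5L) yields five empty rows, so a downstream row count would see the same number BigQuery reported, with no column data transferred.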

Do you see any other options?

To your prior comment:

"it seems that not always spark provides those fields to the connector"

I'm curious because column projection pushdown is used by many other sources, so if you can reproduce that, then it's a bug in Spark that needs to be fixed.

For reference, the Spark ORC source had a similar issue with zero column projections that I fixed a little over 3 years ago: apache/spark#15898


davidrabinowitz commented on May 18, 2024

@aray Thanks for your notes. Please note that df.select(col).count() does not necessarily mean count(col), as Spark seems to read the rows regardless of their content.

I have tried to add special treatment for count, and it would have worked if it were an RDD rather than a DataFrame. With a DataFrame the connector is limited to the DataSource API, and unfortunately there is no hook for performing the count() action. We haven't given up on this, and we will definitely try other approaches, including at the API level.

I can't seem to find the case where I had issues with column projections, perhaps because caching was involved? I have added further logging in the latest release (0.12.0-beta) to help debug such cases.


davidrabinowitz commented on May 18, 2024

Should work now.

