Comments (7)
Here is an example Scala snippet that reproduces the issue:
spark.read.bigquery("<large_table>").select("_col0").count()
Using the current JAR (gs://spark-lib/bigquery/spark-bigquery-latest.jar), a network traffic monitor clearly shows that all columns are pulled into Spark (rather than zero columns).
from spark-bigquery-connector.
Debugging this issue, it seems that Spark does not always provide those fields to the connector. In the upcoming release (0.12.0-beta) I will add additional logging so that the pushed-down columns and filters (the WHERE conditions) are more transparent.
Do you have a reproducible example where Spark does not provide the columns?
Version 0.12.0-beta adds logging for the columns and filters the connector receives from the Spark DataSource API, and for which of them it pushes down to BigQuery.
@superhadooper Can you please try again?
@aray as mentioned in the README and as you found out, the BigQuery Storage API does not allow us to request a zero-column projection. The suggestion is to select the smallest field for the count, to minimize data transfer. The new logging should help clarify what is being pushed down.
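As a pure-Scala sketch of the workaround above (not connector code; the column names and byte widths are made up for illustration), one could pick the narrowest column of a table and use it in df.select(narrowest).count():

```scala
// Hypothetical helper: given approximate byte widths per column,
// pick the narrowest one to minimize data transfer for a count.
object SmallestColumn {
  def pick(widths: Map[String, Int]): String =
    widths.minBy(_._2)._1

  def main(args: Array[String]): Unit = {
    // Illustrative widths, not read from a real BigQuery schema.
    val widths = Map("name" -> 64, "flag" -> 1, "payload" -> 1024)
    println(pick(widths)) // flag
  }
}
```

A real implementation would derive the widths from the table's schema; this only shows the selection logic.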
@davidrabinowitz Thanks for updating the README with a workaround. I would quickly note, though, that count(col) is only equivalent to count(*) if col is non-null.
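The distinction can be sketched in plain Scala (no Spark required), modeling a nullable column as a sequence of Options: count(*) counts every row, while count(col) skips NULLs, so the two agree only when the column has no NULLs.

```scala
// Sketch of the SQL semantics: count(*) vs count(col) on a column with a NULL.
object CountSemantics {
  val rows: Seq[Option[String]] = Seq(Some("a"), None, Some("b"))

  val countStar: Int = rows.size                 // count(*): counts all rows -> 3
  val countCol: Int  = rows.count(_.isDefined)   // count(col): skips NULLs  -> 2

  def main(args: Array[String]): Unit =
    println(s"count(*) = $countStar, count(col) = $countCol")
}
```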
I see two logical ways to solve this.
- Push the requirement upstream: the BigQuery Storage API needs to support zero-column projections. Since the API is still in beta, maybe there is still a chance to change it?
- Special-case zero-column projections in this connector: run select count(*) from $t where $f in BigQuery and then generate the given number of empty rows.
Do you see any other options?
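The second option can be sketched in plain Scala. Here fetchCount is a hypothetical stub standing in for the SELECT COUNT(*) round trip to BigQuery; a real implementation would issue that query through the connector's client.

```scala
// Sketch of the zero-column-projection special case: fetch only the row
// count from BigQuery, then synthesize that many empty rows locally.
object ZeroColumnProjection {
  type Row = Seq[Any]

  // Hypothetical stand-in for `SELECT COUNT(*) FROM $t WHERE $f` in BigQuery.
  def fetchCount(table: String, filter: String): Long = 42L

  // Generate `n` empty rows; no column data ever crosses the network.
  def emptyRows(n: Long): Iterator[Row] =
    Iterator.fill(n.toInt)(Seq.empty[Any])

  def main(args: Array[String]): Unit = {
    val n = fetchCount("<large_table>", "true")
    println(emptyRows(n).size) // 42
  }
}
```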
To your prior comment:
it seems that not always spark provides those fields to the connector
I'm curious, because column projection pushdown is used by many other sources, so if you can reproduce that, it's a bug in Spark that needs to be fixed.
For reference, the Spark ORC source had a similar issue with zero-column projections that I fixed a little over 3 years ago: apache/spark#15898
@aray Thanks for your notes. Please notice that df.select(col).count() does not necessarily mean count(col), as Spark seems to read the rows regardless of their content.
I have tried to add special treatment for count, and it would have worked if this were an RDD rather than a DataFrame. With DataFrames the connector is limited to the DataSource API, and unfortunately there is no hook for performing the count() action. We haven't given up on this, and we will definitely try other approaches, including at the API level.
I can't seem to find the case where I had issues with column projections, perhaps because caching was involved? I have added further logging in the latest release (0.12.0-beta) to help debug such cases.
Should work now.