Comments (5)
More stack trace output from a failed executor:
15/10/20 15:26:01 INFO bigquery.DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:2125889 locations: [] toString(): gs://starship/hadoop/tmp/bigquery/job_201510201525_0024/shard-1/data-*.avro[2125889 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_201510201525_0024_m_000001_0 Status:'
15/10/20 15:26:01 INFO bigquery.DynamicFileListRecordReader: Adding new file 'data-000000000000.avro' of size 0 to knownFileSet.
15/10/20 15:26:01 INFO bigquery.DynamicFileListRecordReader: Moving to next file 'gs://starship/hadoop/tmp/bigquery/job_201510201525_0024/shard-1/data-000000000000.avro' which has 0 bytes. Records read so far: 0
15/10/20 15:26:02 WARN bigquery.DynamicFileListRecordReader: Got non-null delegateReader during close(); possible premature close() call.
15/10/20 15:26:02 WARN rdd.NewHadoopRDD: Exception in RecordReader.close()
java.lang.IllegalStateException
at com.google.cloud.hadoop.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:158)
at com.google.cloud.hadoop.io.bigquery.AvroRecordReader.close(AvroRecordReader.java:110)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.close(DynamicFileListRecordReader.java:239)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.org$apache$spark$rdd$NewHadoopRDD$$anon$$close(NewHadoopRDD.scala:170)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$3.apply(NewHadoopRDD.scala:136)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$3.apply(NewHadoopRDD.scala:136)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:60)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/10/20 15:26:02 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 (TID 31)
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at com.google.cloud.hadoop.io.bigquery.AvroRecordReader.initialize(AvroRecordReader.java:49)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:176)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/10/20 15:26:02 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 34
15/10/20 15:26:02 INFO executor.Executor: Running task 1.3 in stage 6.0 (TID 34)
15/10/20 15:26:02 INFO spark.CacheManager: Partition rdd_25_1 not found, computing it
15/10/20 15:26:02 INFO rdd.NewHadoopRDD: Input split: gs://starship/hadoop/tmp/bigquery/job_201510201525_0024/shard-1/data-*.avro[2125889 estimated records]
15/10/20 15:26:02 INFO bigquery.DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:2125889 locations: [] toString(): gs://starship/hadoop/tmp/bigquery/job_201510201525_0024/shard-1/data-*.avro[2125889 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_201510201525_0024_m_000001_0 Status:'
15/10/20 15:26:02 INFO bigquery.DynamicFileListRecordReader: Adding new file 'data-000000000000.avro' of size 0 to knownFileSet.
15/10/20 15:26:02 INFO bigquery.DynamicFileListRecordReader: Moving to next file 'gs://starship/hadoop/tmp/bigquery/job_201510201525_0024/shard-1/data-000000000000.avro' which has 0 bytes. Records read so far: 0
15/10/20 15:26:02 WARN bigquery.DynamicFileListRecordReader: Got non-null delegateReader during close(); possible premature close() call.
15/10/20 15:26:02 WARN rdd.NewHadoopRDD: Exception in RecordReader.close()
java.lang.IllegalStateException
at com.google.cloud.hadoop.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:158)
at com.google.cloud.hadoop.io.bigquery.AvroRecordReader.close(AvroRecordReader.java:110)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.close(DynamicFileListRecordReader.java:239)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.org$apache$spark$rdd$NewHadoopRDD$$anon$$close(NewHadoopRDD.scala:170)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$3.apply(NewHadoopRDD.scala:136)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$3.apply(NewHadoopRDD.scala:136)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:60)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/10/20 15:26:02 ERROR executor.Executor: Exception in task 1.3 in stage 6.0 (TID 34)
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at com.google.cloud.hadoop.io.bigquery.AvroRecordReader.initialize(AvroRecordReader.java:49)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:176)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Thanks for the report, @nevillelyh. I'm going to start looking at this. AFAIU, BigQuery shouldn't be producing 0-length files, but instead 0-record files.
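For illustration only (not from the thread; file name and record schema are made up), a minimal sketch of why a zero-record Avro file is never zero bytes: the container format writes a header (magic bytes, embedded schema, sync marker) before any records, which is also why DataFileReader.openReader fails with "Not an Avro data file" on a 0-byte object.

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}

object ZeroRecordAvroDemo {
  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"Row","fields":[{"name":"id","type":"long"}]}""")
    val out = new File("zero-records.avro")  // hypothetical local file
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, out)  // container header is written here, before any records
    writer.close()              // no records appended
    println(s"zero-record file size: ${out.length} bytes")  // non-zero, unlike the exported shard
  }
}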
As a workaround, consider setting mapred.bq.input.sharded.export.enable to false (or calling AbstractBigQueryInputFormat.setEnableShardedOutput(conf, false)). The overall job will run more slowly, since BigQuery won't be writing records to GCS as frequently, but it may allow you to make progress.
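A minimal sketch of that workaround, assuming a Hadoop Configuration that is otherwise already set up for the BigQuery connector (project, dataset, table, and so on); the property key, constant, and helper are the ones named above:

import org.apache.hadoop.conf.Configuration
import com.google.cloud.hadoop.io.bigquery.{AbstractBigQueryInputFormat, BigQueryConfiguration}

val conf = new Configuration()  // or sc.hadoopConfiguration in a Spark job

// Either set the property directly...
conf.set("mapred.bq.input.sharded.export.enable", "false")
// ...or via the constant...
conf.set(BigQueryConfiguration.ENABLE_SHARDED_EXPORT_KEY, "false")
// ...or through the helper mentioned above.
AbstractBigQueryInputFormat.setEnableShardedOutput(conf, false)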
I tried adding the following line:
conf.set(BigQueryConfiguration.ENABLE_SHARDED_EXPORT_KEY, "false")
But I am now getting this stack trace:
Caused by: java.lang.IllegalArgumentException: Split should be instance of UnshardedInputSplit.
at com.google.cloud.hadoop.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.createRecordReader(AbstractBigQueryInputFormat.java:158)
at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.createRecordReader(AbstractBigQueryInputFormat.java:145)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
... 3 more
@nevillelyh - apologies for the broken-ness here. I've opened PR #17 to fix unsharded exports. For the original issue of Avro files being invalid with sharded exports: BigQuery is writing an initial 0-byte file to GCS before writing the finalized object, which does not seem like valid behavior (and is different from how JSON is exported); a bug is open with the BigQuery team on this. I have not yet seen the 0-byte file issue with unsharded exports.
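As a rough diagnostic (a sketch only; the path is copied from the log above, and the GCS connector must already be configured for the gs:// scheme), you could list the shard directory and flag any zero-byte objects:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

val conf = new Configuration()
val pattern = new Path("gs://starship/hadoop/tmp/bigquery/job_201510201525_0024/shard-1/data-*.avro")
val fs = FileSystem.get(new URI("gs://starship"), conf)
// globStatus returns null when nothing matches, hence the Option wrapper
val statuses = Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
statuses.foreach { s =>
  val marker = if (s.getLen == 0) "  <-- zero-byte placeholder" else ""
  println(s"${s.getPath}  ${s.getLen} bytes$marker")
}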
I've done some testing and BigQuery is no longer writing 0-length files as part of sharded exports.