
Comments (15)

saifellafi commented on August 23, 2024

I get a slightly different error if I invoke it like this:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.github.saurfang.sas.spark").load("/tmp/file.sas7bdat")
df.count


org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 0.0 failed 1 times, most recent failure: Lost task 14.0 in stage 0.0 (TID 14, localhost): java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.github.saurfang.sas.util.PrivateMethodCaller.apply(PrivateMethodExposer.scala:11)
at com.github.saurfang.sas.mapred.SasRecordReader.<init>(SasRecordReader.scala:110)
at com.github.saurfang.sas.mapred.SasInputFormat.getRecordReader(SasInputFormat.scala:15)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.StackOverflowError
at java.io.FileInputStream.read(FileInputStream.java:255)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:149)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:435)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:249)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:275)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:227)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:195)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:98)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:920)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:932)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:932)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:932)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:932)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:932)
at com.ggasoftware.parso.SasFileParser.readNextPage(SasFileParser.java:932)

saifellafi commented on August 23, 2024

Today I found out that there is a problematic row in the data, probably a badly formatted number: 3,000.5

saurfang commented on August 23, 2024

Interesting. I didn't expect that you could have a malformed number in SAS, since columns are typed. I can see if I can introduce a permissive parsing mode that skips the bad rows, just like spark-csv does.
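
For what it's worth, a minimal sketch of what such a permissive mode could look like, similar in spirit to spark-csv's DROPMALFORMED mode; rawRecords and parseRow are placeholders for the reader's internals, not the library's actual API:

import scala.util.Try
import org.apache.spark.sql.Row

// Sketch only: wrap each row parse in a Try and silently drop records whose
// parse throws, instead of failing the whole task.
def parsePermissively(rawRecords: Iterator[Array[AnyRef]],
                      parseRow: Array[AnyRef] => Row): Iterator[Row] =
  rawRecords.flatMap(record => Try(parseRow(record)).toOption)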

saifellafi commented on August 23, 2024

I am investigating the data. I know the problematic row, but we have more than 400 columns. As soon as I identify the failing column, I will let you know!

saifellafi commented on August 23, 2024

Sadly this error is coming up very frequently for me. When trying to load huge SAS files, I can load up to a specific row, after which it fails. Unfortunately I am unable to share the data, and I cannot identify why it is failing.

saurfang commented on August 23, 2024

Upon closer inspection, the stack trace you attached shows a StackOverflowError. I pushed a new commit that might fix your problem. Do you know how to build an assembly jar and try again with the latest master?

saifellafi commented on August 23, 2024

I did a pull from master and recompiled the jar file. Allow me to share the new, different stack trace of the error. On my end, after some research, we found many different reasons for problematic rows. The most common one is that SAS null values are represented as -9999900, a sentinel that overloads a legitimate value of the column's data type.
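
As a side note, a minimal sketch of how such a sentinel could be nulled out after a successful load; "some_numeric_col" is a hypothetical column name, and the cast assumes a numeric (double) column:

import org.apache.spark.sql.functions.{col, lit, when}

// Replace the -9999900 sentinel with null in a hypothetical numeric column.
val cleaned = df.withColumn(
  "some_numeric_col",
  when(col("some_numeric_col") === -9999900, lit(null).cast("double"))
    .otherwise(col("some_numeric_col")))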

Please take a look at the "Caused by" stack trace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 97 in stage 3.0 failed 1 times, most recent failure: Lost task 97.0 in stage 3.0 (TID 203, localhost): java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.github.saurfang.sas.util.PrivateMethodCaller.apply(PrivateMethodExposer.scala:11)
at com.github.saurfang.sas.mapred.SasRecordReader.readNext$lzycompute$1(SasRecordReader.scala:118)
at com.github.saurfang.sas.mapred.SasRecordReader.readNext$1(SasRecordReader.scala:117)
at com.github.saurfang.sas.mapred.SasRecordReader.next(SasRecordReader.scala:130)
at com.github.saurfang.sas.mapred.SasRecordReader.next(SasRecordReader.scala:20)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.util.Arrays.copyOfRange(Arrays.java:3521)
at com.ggasoftware.parso.SasFileParser.processByteArrayWithData(SasFileParser.java:1091)
at com.ggasoftware.parso.SasFileParser.readNext(SasFileParser.java:887)
... 37 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

saurfang commented on August 23, 2024

Hmm. I just pushed a commit where I ignore the problematic bytes, though I'm not at all confident that is the correct way to solve your problem. If you're still stuck on this, do you mind giving it a try? It would also be helpful if you could reproduce this with a dummy dataset that I can try.

saifellafi commented on August 23, 2024

Hi,

We are still facing the java.lang.reflect.InvocationTargetException.
Please let me know if there is any news on this issue.

saurfang commented on August 23, 2024

Again, if you can't give me a minimal dataset that I can use to reproduce your problem, it is close to impossible for me to debug the issue you are facing.

fernandrez commented on August 23, 2024

Good day @saurfang, here is a SAS file that is also generating a java.lang.reflect.InvocationTargetException. My best guess is that any parso failure ends up surfacing as this kind of exception in spark-sas7bdat.

https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/FY2015-FR-HCRIS-Data-File.zip

The file was found in the following repo:
https://bitbucket.org/jaredhobbs/sas7bdat/issues/17/example-of-a-non-working-file

As stated in that issue, a common error is generated in parso.BinDecompressor.

I am looking forward to helping fix this kind of error. Currently I do not have the means to look into the SAS files, but I will find a way soon. Please let me know if any progress can be made. Thank you very much @saurfang.

fernandrez commented on August 23, 2024

Good day @saurfang, hope everything is going well. Do you have access to the file provided in the previous comment?

saurfang commented on August 23, 2024

Thanks @fernandrez for providing sample data that exhibits the problem. However, I wasn't able to figure out what exactly caused the parser to choke. I don't believe you are seeing the same error that @saifellafi was facing.

You can see below that the parser fails to decompress the data stream. Since I don't have access to SAS at the moment, I am afraid I cannot be of further help.

16/01/31 00:18:23 INFO SasRecordReader: Bitness: x86
Compressed: SASYZCR2
Endianness: LITTLE_ENDIANNESS
Name: PRDS_HOSP10_YR2012
File type: DATA
Date created: Mon Apr 14 10:49:06 EDT 2014
Date modified: Mon Apr 14 10:49:06 EDT 2014
SAS release: 9.0301M2
SAS server type: X64_7PRO
OS name: 
OS type: 
Header Length: 1024
Page Length: 48128
Page Count: 1267
Row Length: 47737
Row Count: 5941
Mix Page Row Count: 1
Columns Count: 5281

16/01/31 00:18:23 INFO SasRecordReader: Shrunk 8192 bytes.
16/01/31 00:18:23 ERROR BinDecompressor: Unknown marker 8 at offset  12574
16/01/31 00:18:23 INFO SasRecordReader: Read 578560 bytes and 0 records (0/34 on last page).

STHITAPRAJNAS commented on August 23, 2024

I am facing the same InvocationTargetException error while doing a count check on my huge SAS file. My aim is just to convert it into a Parquet file and, in turn, into a Hive table.
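
For context, the conversion being attempted looks roughly like the sketch below, assuming a HiveContext is available; the path and table name are placeholders:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val sasDF = hiveContext.read
  .format("com.github.saurfang.sas.spark")
  .load("/path/to/huge_file.sas7bdat")

// Write the data as Parquet and register it as a Hive table in one step.
sasDF.write.format("parquet").mode("overwrite").saveAsTable("my_table")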

Did anyone get it to work ??

16/03/16 15:19:45 WARN TaskSetManager: Lost task 12.0 in stage 2.0 (TID 55, lpdn0160.kdc.capitalone.com): java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.github.saurfang.sas.util.PrivateMethodCaller.apply(PrivateMethodExposer.sca

andrewrothstein commented on August 23, 2024

Doesn't look like it. Can I recommend using bisection to narrow your huge SAS file down to the record (or set of records) that is causing the parser to choke?
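
For illustration, a rough driver-side sketch of that bisection, assuming df.limit(n) stops scanning before the bad record (which may not hold for every input format; subsetting the file in SAS itself with FIRSTOBS=/OBS= is the more reliable variant):

import scala.util.Try

// df: the DataFrame read via the com.github.saurfang.sas.spark format (as above).
// totalRowsUpperBound is a placeholder: any rough upper bound on the row count.
val totalRowsUpperBound = 10000000

// Binary search for the largest prefix of rows that can be read without an
// exception; the failing record then sits just past the returned row count.
def largestReadablePrefix(lo: Int, hi: Int): Int =
  if (lo >= hi) lo
  else {
    val mid = lo + (hi - lo + 1) / 2
    if (Try(df.limit(mid).count()).isSuccess)
      largestReadablePrefix(mid, hi)      // the first `mid` rows read cleanly
    else
      largestReadablePrefix(lo, mid - 1)  // the failure is at or before row `mid`
  }

val lastGoodRow = largestReadablePrefix(0, totalRowsUpperBound)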
