Cannot read SAS File · saurfang/spark-sas7bdat (22 comments, closed)

saurfang commented on August 23, 2024
Cannot read SAS File


Comments (22)

saurfang commented on August 23, 2024

I have collected the discussion and speculation so far in #38. Please contribute any helpful clues there. Thank you! The biggest challenge for us is to identify the smallest/simplest possible dataset that reproduces this problem, and that cannot be done without your help. @nelson2005 @vivard @myloginid @TangoTom If we don't have a reproducible dataset, there is nothing we can do to debug this issue.

I strongly suspect the issue has to do with compressed datasets and/or specific SAS versions. Therefore, if the dataset you are having trouble with is one you generated yourself, it would be helpful to take a public dataset, read it into SAS, and export it as sas7bdat using the same settings; that would likely let us reproduce the issue without exposing your proprietary data. For example, you can find large CSV datasets at https://www.kaggle.com/datasets

I will start a bounty on #38 using Gitcoin, where any of you may choose to contribute additional funds to attract developers. I will open the bounty by the end of this week or when any of you can provide a better sample dataset, whichever comes first.


thesuperzapper commented on August 23, 2024

@saurfang this is resolved


Tagar commented on August 23, 2024

We also have a similar issue on the new release:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at com.epam.parso.impl.SasFileParser.readNext(SasFileParser.java:493)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.github.saurfang.sas.util.PrivateMethodCaller.apply(PrivateMethodExposer.scala:27)
at com.github.saurfang.sas.parso.SasFileParserWrapper.readNext(ParsoWrapper.scala:80)
at com.github.saurfang.sas.mapred.SasRecordReader.readNext$lzycompute$1(SasRecordReader.scala:125)
at com.github.saurfang.sas.mapred.SasRecordReader.readNext$1(SasRecordReader.scala:124)
at com.github.saurfang.sas.mapred.SasRecordReader.next(SasRecordReader.scala:137)
at com.github.saurfang.sas.mapred.SasRecordReader.next(SasRecordReader.scala:33)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:203)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)

As in your case, we're getting "Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0".
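
For reference, the read that produces this trace is just the standard data-source path; a minimal sketch (the file path is a placeholder, and the format name is the one spark-sas7bdat documents):

```scala
// Minimal sketch of the failing read (placeholder path): any action that
// scans the rows, such as count(), surfaces the exception above.
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("/data/example.sas7bdat")
df.count()
```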


mattiasmoser commented on August 23, 2024

We have the same issue with a SAS file with a few tens of thousands of records. If we take a 100-record sample file, it works fine.

For us, this is NOT a problem in the most recent version only. We tried all versions back to 1.0.0 (using the jars provided at https://spark-packages.org/package/saurfang/spark-sas7bdat) and all reported the same error.

The most relevant part of the error message is this (this one is from version 1.0.0, but all other versions reported the same error):

Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at com.epam.parso.impl.SasFileParser.readNext(SasFileParser.java:493)
... 29 more


printsev commented on August 23, 2024

@mattiasmoser @Tagar
To me this actually sounds like a parso issue. Does it happen for you if you use the parso library directly? If so, it might be worth creating a new issue in the parso project: https://github.com/epam/parso/issues
Would it be possible for you to share the SAS file so we can dig into it?


mattiasmoser commented on August 23, 2024

@Tagar, @TangoTom
Would you be able to share the SAS files that cause your issue? Our data contains some proprietary information; if I find the time, I can try to redact/remove anything proprietary, but I might not get to it for a while.


TangoTom commented on August 23, 2024

I wrote a simple Java program that reads the SAS file observation by observation with the parso library.
The SAS file that crashes with spark-sas7bdat works with my Java program.
Do you think this is a good sign that parso is not the cause of the crash?

Code below:
```java
import com.epam.parso.SasFileReader;
import com.epam.parso.impl.SasFileReaderImpl;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

public class readSasFile {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream(new File(args[0]));
            SasFileReader sasFileReader = new SasFileReaderImpl(fis);
            long n = 0;
            // Read one observation at a time, printing a dot every 1,000
            // rows as a progress indicator, then print the total row count.
            while (sasFileReader.readNext() != null) {
                n += 1;
                if (n % 1000 == 0) System.out.print(".");
            }
            System.out.println();
            System.out.println(n);
        } catch (FileNotFoundException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}
```


printsev commented on August 23, 2024

Yes, so parso might be working correctly. I would also convert the rows to strings to ensure the output is correct. If the file is not that large, readAll would be easier (though it could consume a lot of memory).
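
For example, a parso-only sketch along those lines (in Scala for brevity; the file name is a placeholder):

```scala
import java.io.FileInputStream
import com.epam.parso.impl.SasFileReaderImpl

// Print each row's values as strings so the parsed output can be
// inspected, not just counted. "test.sas7bdat" is a placeholder path.
val reader = new SasFileReaderImpl(new FileInputStream("test.sas7bdat"))
var row = reader.readNext()
while (row != null) {
  println(row.map(v => if (v == null) "NULL" else v.toString).mkString("|"))
  row = reader.readNext()
}
```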


TangoTom commented on August 23, 2024

I nullified a dataset that previously failed to be read, and then it could be read. My conclusion is that it contained data that passed through parso but made spark-sas7bdat fail (however that is possible).
That conclusion is probably premature and not based on solid data, so I hope to find another dataset that I can release publicly.


witwall commented on August 23, 2024

I successfully used version 2.0.0 to convert a 15 GB sas7bdat file to Parquet.

I tried both sparklyr and pyspark.


myloginid commented on August 23, 2024

@saurfang - We faced the same issue when reading from HDFS and S3. The Python sas7bdat library (https://pypi.org/project/sas7bdat/) can read the same datasets correctly.

If you can provide me some pointers on where the issue could be, I would be happy to submit a PR.

cc - @saurabh2086


vivard commented on August 23, 2024

Hey,
I faced the same issue on one specific dataset.
For this dataset, if I limit the number of rows so that the log shows

INFO YarnScheduler: Adding task set 0.0 with 1 task.

it works. But as soon as the log shows that line with more than 1 task, I get the error:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

It must be something about the dataset, because I have read other datasets with more than 30 tasks.
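
If it helps, a workaround sketch based on that observation (assuming the underlying mapred input format honors the standard Hadoop minimum-split-size setting, which I have not verified):

```scala
// Workaround sketch: force a single input split so only one task reads the
// file. Assumes the mapred input format honors mapred.min.split.size.
spark.sparkContext.hadoopConfiguration
  .setLong("mapred.min.split.size", Long.MaxValue)
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("/path/to/dataset.sas7bdat") // placeholder path
```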


printsev commented on August 23, 2024

@vivard:
Are you able to share the SAS dataset you experience the issue with? And does it work if you use just parso, without spark-sas7bdat?


vivard commented on August 23, 2024

I replaced all values with NULL for confidentiality, but the issue is the same.
I use parso 2.0.7 and spark-sas7bdat 2.0.0-s_2.10 on Spark 2.2.0.2.
small_mdt_cc_201806.zip

I didn't try using just parso.


nelson2005 commented on August 23, 2024

I'm experiencing the same problem, only for specific datasets. It's repeatable: if I resave the sas7bdat (from SPDS), the same sas7bdat is unreadable each time.


Tagar commented on August 23, 2024

> @vivard:
> Are you able to share the SAS dataset you experience the issue with? And does it work if you use just parso, without spark-sas7bdat?

@printsev did you notice any issue reading the dataset @vivard shared with the parso library?
Thank you!


nelson2005 commented on August 23, 2024

I adapted @TangoTom's code to be parso-only:

```scala
import java.io.FileInputStream
import com.epam.parso.impl.SasFileReaderImpl

val fis = new FileInputStream("test.sas7bdat")
val reader = new SasFileReaderImpl(fis)
var n: Long = 0
while (reader.readNext != null) {
  n += 1
  if (n % 10000 == 0) print(".")
}
println("")
println(n)
```

and it ran without incident:

... lots of dots ...
28975153

I'm willing to help troubleshoot if someone can offer guidance. This issue doesn't affect most files, but it is quite troublesome on the files it does affect. Does anyone have a recommendation for how I could place a bug bounty on this issue?


saurfang commented on August 23, 2024

If parso can parse the data file by itself, then the bug is unlikely to be in parso. My best guess is that the bug is where we determine the appropriate splits for dividing the sas7bdat file for parallel processing: https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/scala/com/github/saurfang/sas/mapred/SasRecordReader.scala since everything else is just a thin wrapper over parso, thanks to @mulya's contribution.

One potential way to debug this is to create a local multi-threaded parso reader, without Spark, to replicate and validate the splitting logic. That might help identify and isolate the issue. You might also discover functions, interfaces, and encapsulation that could be contributed back to parso, which could greatly simplify this package too.
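
Before building that, a quick comparison of a single-threaded parso count against the Spark count on the same file can at least confirm the divergence. A rough sketch (paths are placeholders, and `spark` is assumed to be a spark-shell session):

```scala
import java.io.FileInputStream
import com.epam.parso.impl.SasFileReaderImpl

// Ground truth: single-threaded parso row count over the whole file.
def parsoCount(path: String): Long = {
  val reader = new SasFileReaderImpl(new FileInputStream(path))
  var n = 0L
  while (reader.readNext() != null) n += 1
  n
}

val path = "test.sas7bdat" // placeholder: a file that reproduces the bug
val expected = parsoCount(path)

// Spark count over the same file: when it runs with more than one split,
// a mismatch (or the IndexOutOfBoundsException) implicates the split logic.
val actual = spark.read
  .format("com.github.saurfang.sas.spark")
  .load(path)
  .count()

println(s"parso=$expected spark=$actual match=${expected == actual}")
```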


nelson2005 commented on August 23, 2024

Yes, I agree that it's likely in the splitting logic; those splitting functions have come up in other GitHub issues as well.
Unfortunately, I'm unable to dig into this myself, but I would be happy to fund a bounty.


Tagar commented on August 23, 2024

@saurfang - thanks for submitting the bounty!

> or when any of you can provide a better sample dataset

@vivard uploaded a file on Aug 13; see
#32 (comment)

Was that file helpful to reproduce the issue?


saurfang commented on August 23, 2024

@Tagar I am not 100% sure. AFAIK @vivard didn't verify that the file works with parso alone. If someone can provide code or create a PR that demonstrates a failing test case using this data (and corresponding code that parses it correctly using parso), that would be very helpful.

Everyone, please feel free to chip in to the bounty at https://gitcoin.co/issue/saurfang/spark-sas7bdat/38/1394


thesuperzapper commented on August 23, 2024

If anyone has files which used to fail because of this issue, please check them against PR #44 and report back.
