
Comments (64)

gitcoinbot avatar gitcoinbot commented on July 22, 2024 1

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This issue now has a funding of 0.4 ETH (92.12 USD @ $230.29/ETH) attached to it.

from spark-sas7bdat.

saurfang avatar saurfang commented on July 22, 2024 1

💰 A crowdfund contribution worth 199.80000 DAI (199.8 USD @ $1.0/DAI) has been attached to this funded issue from @saurfang.💰

Want to chip in also? Add your own contribution here.

Thanks @nelson2005 for the $200 contribution! Please consider this transaction as receipt.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024 1

@cristamd
If I understand correctly, this means the dataset will be read into a single partition, which is limited in size to 2GB:

https://issues.apache.org/jira/browse/SPARK-1476

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024 1

@Tagar I was able to try both the no split logic and the split changes in #44 with a 27G SAS file and the system was able to process, sort, and save the file as a CSV with both options. Obviously only using one partition was much slower, but both jobs completed. I'm seeing some differences in the output at first glance, but I'll have to dig in a little more to find out if the data being parsed and then saved out is actually different. Hopefully I'll have a chance to do that tomorrow.

This is what I saw in the spark history server for the read stage when using the splitting logic:
[screenshot: read stage with the splitting logic]

This is what I saw when using the non-splitting logic:
[screenshot: read stage with the non-splitting logic]

The row count is the same for both, so I'm hopeful that the data will match as well.

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024 1

I was finally able to run the comparison between the non-partitioned parser and the partitioned one in #44 and can confirm they produce the same dataframe for our 27GB SAS file. Nice work 👍

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024 1

@Tagar
I will see about moving to the new HadoopRDD API, which uses org.apache.hadoop.mapreduce rather than org.apache.hadoop.mapred, meaning we can specify min and max split sizes (rather than just a hint for how many splits we want).
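
For illustration, a rough sketch of what bounding split sizes might look like with the mapreduce API (these are the standard Hadoop configuration keys, not options documented by spark-sas7bdat in this thread; spark is assumed to be a SparkSession, e.g. from spark-shell, and the byte values are arbitrary examples):

// Standard Hadoop split-size bounds honored by org.apache.hadoop.mapreduce
// FileInputFormat implementations.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024) // 128 MB
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024) // 256 MB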

from spark-sas7bdat.

printsev avatar printsev commented on July 22, 2024 1

Hi everyone, great to hear that the Spark version has improved! So the next steps seem to be the following:

  1. try with parso itself
  2. if the number of records is less than it should be, create a new issue in parso, and we will try to help.

P.S. It would be great if you could check the number of rows using the parso object API, without converting to CSV, to rule out potential issues in CSVWriter.
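
For reference, a minimal sketch of such a check using the parso object API (assuming parso 2.x's SasFileReaderImpl; the file path is a placeholder):

import java.io.FileInputStream
import com.epam.parso.impl.SasFileReaderImpl

val in = new FileInputStream("/path/to/file.sas7bdat") // placeholder path
val reader = new SasFileReaderImpl(in)

// Row count reported in the file's metadata.
println(s"metadata row count: ${reader.getSasFileProperties.getRowCount}")

// Row count obtained by actually iterating over the rows.
var readRows = 0L
while (reader.readNext() != null) readRows += 1
println(s"rows actually read: $readRows")

in.close()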

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024 1

@thesuperzapper So, I ran it using just parso and got the same count as with the spark-sas7bdat package. Before I move this issue to the parso page, I'm going to triple check that this file did not get corrupted when I moved it to S3.

from spark-sas7bdat.

saurfang avatar saurfang commented on July 22, 2024 1

@nelson2005 I agree. @thesuperzapper can you claim this bounty by visiting https://gitcoin.co/issue/saurfang/spark-sas7bdat/38/1394? I can initiate payout once you submit your work in gitcoin.

In order to accept payout, you might need to install metamask and create your own ethereum wallet if you don't have one already. You may use the faucet mentioned here to get some starting ETH to pay transaction fees when associating your github account and claiming issues.

Let me know if you hit any roadblocks. I highly recommend you try it out since it allows us to provably pay out this bounty. If you find it too overwhelming, I totally understand; please let me know on this issue and ping me your preferred way (e.g. PayPal email) to receive the bounty money.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

To clarify, this problem also appears on non-compressed sas7bdat; that is, the issue still exists with SAS code like

options compress=no;

That being said, making compressed files splittable (#35) would be most helpful. I'm only suggesting that this problem is possibly not related to SAS internal compression.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

Okay, I've really tried to figure out how to fund this and it's maddening. I installed MetaMask, which asks for all my permissions and a blood sample. It seems insanely complicated to set up some Ethereum system just to fund a bounty. @saurfang can I just send you some cash on paypal or something? I'd be happy to double what you started with. Perhaps you use weixin and can contact me OOB there. I haven't figured out how to send you a private message on github either- it seems my lameness knows no bounds.
Thanks- tingbaide

from spark-sas7bdat.

saurfang avatar saurfang commented on July 22, 2024

Sorry to hear that, @nelson2005. Cryptocurrency doesn't have a great onboarding UX yet 😢 Your experience is very valuable and I will pass it along or even contribute to a better solution.

Meanwhile you can message me on twitter or maybe on gitter which you can log in with your github account (and I believe you can click on my profile and do a private chat.) We can then figure out what's the best way to get your money into this issue. Thanks a lot for your patience!

from spark-sas7bdat.

gitcoinbot avatar gitcoinbot commented on July 22, 2024

💰 A crowdfund contribution worth 0.04000 DAI (0.04 USD @ $1.0/DAI) has been attached to this funded issue from @saurfang.💰

Want to chip in also? Add your own contribution here.

from spark-sas7bdat.

gitcoinbot avatar gitcoinbot commented on July 22, 2024

💰 A crowdfund contribution worth 199.80000 DAI (199.8 USD @ $1.0/DAI) has been attached to this funded issue from @saurfang.💰

Want to chip in also? Add your own contribution here.

from spark-sas7bdat.

kraj007 avatar kraj007 commented on July 22, 2024

Hi @saurfang,
we have carried out tests loading different .sas7bdat file sizes into Parquet format:
first dataset ~ 420 MB (4 partitions, 4 tasks),
second dataset ~ 850 MB (7 partitions, 7 tasks), assuming Spark's default partition size of 128 MB for optimal performance.

In the output Parquet files we observed that only the first partition has data; the remaining partitions are empty files.
For small files < 128 MB it works fine, but as the file size increases above 128 MB we see this issue.
Can you please help or guide us to resolve this issue?

from spark-sas7bdat.

saurfang avatar saurfang commented on July 22, 2024

@kraj007 Sorry to hear that, but like I said in the other issue, I do not have any time to work on this project anymore. Your best bet would be to identify a dataset that can replicate your issue and does not contain any private information so you can make it available to download here. Hopefully, somebody might be able to pick it up and help you either because they are solving the issue for themselves or they are attracted by the bounty attached to this issue. In the latter case, you are also welcome to contribute to the bounty to increase the visibility and attractiveness.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@saurfang I ran into the problem fairly frequently, maybe 1 out of 10 datasets. Unfortunately, I'm unable to contribute one now. :(

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024

If anyone is looking for a workaround, changing the SasInputFormat to always return false for isSplittable prevents the file from being split and causes it to return the correct data in the DataFrame. Obviously this has a massive performance impact for large files since the processing will only use one node. An example of the necessary changes is available here: https://github.com/PerkinElmer/spark-sas7bdat/tree/no-split
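
The essence of that change, as a hedged sketch (the real SasInputFormat has its own type parameters and record reader; Hadoop spells the method isSplitable, and the pre-#44 reader used the old org.apache.hadoop.mapred API):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapred.FileInputFormat

// Illustrative only: returning false here means each .sas7bdat file becomes a
// single split, so it is read by exactly one task.
abstract class NoSplitInputFormat[K, V] extends FileInputFormat[K, V] {
  override protected def isSplitable(fs: FileSystem, filename: Path): Boolean = false
}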

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024

@nelson2005 I tested with a 27GB SAS file and it appears you are correct, unfortunately. Thanks for the heads up.

Update: I tested this again, since apparently my SAS file got corrupted while uploading it to S3. With a valid SAS file I am at least able to do a count of the 28G file using one partition. I'll try another job that does a data export to make sure all the data makes it through.

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@nelson2005 @cristamd did you guys try spark-sas7bdat with #44 ?
It also upgrades parso to 2.0.10 which has some fixes for compressed char data in sas files.

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@nelson2005 @cristamd Here are some pre-built jars for Spark 2.2, with #44 applied for convenience.

spark-sas7bdat_2.11-2.1.0_SNAPSHOT.zip

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@nelson2005 @cristamd Here are some pre-built jars for Spark 2.2, with #44 applied for convenience.

spark-sas7bdat_2.11-2.1.0_SNAPSHOT.zip

Thank you @thesuperzapper ..
We're on Spark 2.3 .. let us know if you have a version precompiled for that?
Will ask our users if they can test this.

Thanks!

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@Tagar, I am pretty sure that Spark 2.3 should work with those files, tell me if not.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@Tagar I have not had a chance to try #44 yet.

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@Tagar It seems it's not quite there but it's very close! I'm able to get a row count from one of my data sets that would error out prior to the jars you shared above. However, it looks like the split size may be cutting off some rows at the beginning or end of the split. I cannot provide the data file, but I can try to help by providing some metadata. Row mismatch of: 148

Source:
RowCount: 282124814
ColCount: 40
CreationDate: Jan 2016
Compressed
Number of Deleted Rows: 53700 (Not sure if this affects anything)

Target:
RowCount: 282124666
ColCount: 40

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@cristamd can you confirm that your 27GB file was one of the sas7bdat files that failed to load with the pre-#44 code?

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@PCaff, @thesuperzapper is driving these fixes. Hopefully he can help with this one.

I cannot provide the data file, but I can try to help by providing some metadata. Row mismatch of: 148

I think it's the only unsuccessful report ..

However, it looks like the split size may be cutting off some rows at the beginning or end of the split.

I think you should force this file to be read with just one partition, so there are no splits, and see if that fixes the 148 lost records in the output. If records are still missing with a single partition, it's probably a parso issue and not spark-sas7bdat.

Also it would be nice to see exactly which records didn't make it to the target.

@thesuperzapper I wonder if parso or spark-sas7bdat has an option to detect bad records?
Not sure how this happens.

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024

@nelson2005 My 27G file didn't fail to load with the pre #44 code (job completed successfully), but it only read the first few pages of data so instead of a million plus rows it was only around 100,000 rows.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@Tagar you can't read big files into one partition due to spark partition size limits. Additionally, I think it's been established that parso reads these files without incident, for example in #32

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024

@nelson2005 I was able to read the 27G file using one partition without any issues as long as the file was a valid SAS file. With a corrupted SAS file I got a bunch of memory errors. I think our cluster is running Spark 2.2.1 on YARN.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@cristamd was that loading all the data into a single partition, or just doing a count() or some other operation? Does that mean there's no 2GB partition limit?

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024

@nelson2005 I read the 27GB SAS file with one partition, repartitioned it to 400 partitions and then wrote it out to a partitioned CSV file in S3. That worked fine without any issues. If you look at the second screenshot in my post from ~12 days ago you can see the Spark History Server output for the job with the single partition. Again, this was running Spark (2.2.1) via YARN on AWS EMR (5.12.1), so maybe the 2G partition limit applies elsewhere.
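
For anyone wanting to reproduce that workflow, a hedged sketch (the format name follows the library's documented usage; the bucket paths and partition count are placeholders; spark is a SparkSession):

val df = spark.read
  .format("com.github.saurfang.sas.spark")        // spark-sas7bdat data source
  .load("s3://my-bucket/input/big_file.sas7bdat") // one partition with the no-split build

df.repartition(400)                               // spread rows across the cluster before writing
  .write
  .option("header", "true")
  .csv("s3://my-bucket/output/big_file_csv/")     // partitioned CSV output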

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@cristamd how did you force it to read into one partition? I would like to give this a go to help pinpoint the issue.

from spark-sas7bdat.

cristamd avatar cristamd commented on July 22, 2024

@PCaff I made a small change to the code for the SasInputFormat to have isSplittable always return false. I have a branch on my fork of the repo at https://github.com/PerkinElmer/spark-sas7bdat/tree/no-split that already has the change if you want to check it out and compile it.

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

I wonder if we should have an equivalent parameter maxPartitions that can be set to 1 to force one single partition.

minPartitions (Default: 0)

  • Int: The minimum number of splits to perform. (Adherence is not guaranteed)

See comment here #44 (comment)

If we had maxPartitions, like we have minPartitions, testing would be easier for such cases.

cc @thesuperzapper
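
As a sketch of how that could look from the user side (minPartitions is the option described above from #44; maxPartitions is only a proposal here and does not exist; the path is a placeholder):

val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .option("minPartitions", "0")     // existing hint: minimum number of splits (not guaranteed)
  // .option("maxPartitions", "1")  // proposed: cap splits to force a single partition
  .load("/data/file.sas7bdat")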

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@PCaff can you try with the latest jar at the bottom of #44.

And if it still has the issue, can you please post the stderr logs for a single executor (INFO level)? The pieces I am interested in are:

XX/X/XX XX:XX:XX INFO HadoopRDD: XXXXXXXXXXXXXXXXXXXXXX
XX/X/XX XX:XX:XX INFO SasRecordReader: XXXXXXXXXXXXXXXXXXXXXX

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@thesuperzapper That would be a very good change.
Thank you for all your improvements here.

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@thesuperzapper
I cleansed it a little bit for obvious reasons.
exec_logs.txt

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@PCaff can you see if the error still happens with no splits?

Use the latest jar from: #44, and specify maxSplitSize larger than the file. (I am not 100% sure if it will let you do that)

If that doesn't work, I will make you a no split version to check.

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@thesuperzapper
Sadly, it did not work. The file updated overnight and now the output is missing even more records (roughly 1 million).

Is it possible the split logic is not capturing the mix pages correctly?

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@PCaff were you able to force this to be a single split like @thesuperzapper suggested?

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@Tagar I was waiting for @thesuperzapper to provide the no split version he mentioned in his previous post.

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@PCaff got it. I wasn't sure if you tried to "specify maxSplitSize larger than the file" as it could force one single partition.

I understand now that you wait for @thesuperzapper to provide a separate no split build.

Thanks.

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@Tagar Oh yeah, I'm sorry. I tried that method, but with no success :(

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@PCaff here is a version which will never split files, please check if it still has the issue:

spark-sas7bdat_2.11-2.1.0_SNAPSHOT_v6_NOSPLIT.zip

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

FWIW, there are python and R implementations of sas7bdat readers in the wild that could be useful references regarding the file format.

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@thesuperzapper the no split jar did not work either. I still have the same count of records after the dataframe is loaded.

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

FWIW, there are python and R implementations of sas7bdat readers in the wild that could be useful references regarding the file format.

@nelson2005, the https://github.com/epam/parso library is superior to other known sas7bdat reader implementations; for example, it's the only one that supports reading compressed files.

Look for example here: BioStatMatt/sas7bdat#12 (comment) - that reader also doesn't seem to be actively maintained.

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

@thesuperzapper the no split jar did not work either. I still have the same count of records after the dataframe is loaded.

It's most likely an issue with the parso reader then, which saurfang/spark-sas7bdat is based on.

cc parso library developers @printsev @FriedEgg @Yana-Guseva

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@PCaff can you try reading that file with https://github.com/epam/parso directly?

from spark-sas7bdat.

kraj007 avatar kraj007 commented on July 22, 2024

FWIW, there are python and R implementations of sas7bdat readers in the wild that could be useful references regarding the file format.

I have tried the sas7bdat Python package.
It is slow compared to the Spark package: it took hours to convert a compressed 3 GB SAS file to CSV. But it works great for small compressed SAS files.

We could also look at pandas.read_sas with Dask (http://docs.dask.org/en/latest/spark.html), which integrates well with pandas.

from spark-sas7bdat.

Tagar avatar Tagar commented on July 22, 2024

Didn't know pandas has a sas7bdat reader too:
https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas

Yep, it would be much slower since it's implemented completely in Python, while spark-sas7bdat obviously makes reads highly parallel.

Thanks.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@thesuperzapper @Tagar @printsev iiuc this was pretty clearly demonstrated not to be a parso issue by both @TangoTom and me in #32 quite some time ago... is there some new information here that indicates it's a parso problem?

@PCaff if you want to try parso-only, simple sample code is in #32

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@nelson2005, the main issue here has already been fixed with #44; we are now talking about a new issue relating to a specific file of @PCaff's not working.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@thesuperzapper maybe I'm missing something... by my reading of the traffic, the two demonstrated problems were

@cristamd :

My 27G file didn't fail to load with the pre #44 code (job completed successfully), but it only read the first few pages of data so instead of a million plus rows it was only around 100,000 rows.

and @PCaff :

Sadly, it did not work. The file updated overnight and now the output is missing even more records (roughly 1 million).

Neither of which seems to be what was described in #32

Was there a test of a sas7bdat that completely failed as in #32 and then was fixed in #44?

Am I missing something? Apologies in advance- it wouldn't be the first time.

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@nelson2005, if you go read comments under the PR #44, people have reported that files which previously read the wrong number of rows, or crashed, now work. (And in fact you quote @cristamd saying it now works with #44)

I believe that @PCaff is having an unrelated issue, potentially related to Parso itself. (Which is why we need to have him try with plain parso before we can fix it)

EDIT: yea, we did add a unit test of a previously broken file.

from spark-sas7bdat.

PCaff avatar PCaff commented on July 22, 2024

@thesuperzapper @nelson2005 @cristamd I'll try a read with parso directly hopefully soon. Things have gotten busy on my end. Just wanted to drop this note so that you guys know I'm still involved.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

@thesuperzapper Thanks for setting me straight. I'll test out the jar- v6 is the right one?

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@nelson2005 all good.

Yeah, try V6.

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

Okay, I tried V6 on a few previously broken datasets with ~100MM rows... looks good to me, thanks much!

I'll try it out on a few hundred more files over the next week or so.

One (unrelated) thing that I noticed was that loading sas7bdats with wildcards (filenames like sasfile*.sas7bdat) fails, whereas it does work for CSV, Parquet, etc.

from spark-sas7bdat.

thesuperzapper avatar thesuperzapper commented on July 22, 2024

@nelson2005, sounds good, test as many as you can.

With regard to wildcards, I agree that we need to support multiple SAS files; I will make an issue for it.

EDIT: issue created #45

from spark-sas7bdat.

nelson2005 avatar nelson2005 commented on July 22, 2024

I loaded about a hundred more sas7bdats using V6 from #44 without seeing any errors. The total row counts and data spot checks tie out.

Thanks to all who participated in resolving this issue!

It's @saurfang's call, but I'm fine with considering this item fixed primarily by @thesuperzapper

from spark-sas7bdat.

gitcoinbot avatar gitcoinbot commented on July 22, 2024

Issue Status: 1. Open 2. Cancelled


The funding of 0.4 ETH (plus a crowdfund of 199.84 DAI worth 199.84 USD) (44.52 USD @ $111.31/ETH) attached to this issue has been cancelled by the bounty submitter

from spark-sas7bdat.

saurfang avatar saurfang commented on July 22, 2024

I have arranged the transfer of all pledged bounty with @thesuperzapper offline. Also, shout out to @nelson2005 for the bounty contribution. Thanks, everyone who contributed to this issue for your dataset, code, and time.

from spark-sas7bdat.
