Giter Site home page Giter Site logo

Parso library update about spark-sas7bdat HOT 15 CLOSED

saurfang avatar saurfang commented on August 23, 2024
Parso library update

from spark-sas7bdat.

Comments (15)

printsev avatar printsev commented on August 23, 2024 2

@Tagar as you are working on the PR to update parso to the latest version, could you please update link to parso in README.md in the Related Work section? It should be https://github.com/epam/parso instead of scitouch site. @saurfang hope you are ok with the link change.

from spark-sas7bdat.

saurfang avatar saurfang commented on August 23, 2024 1

Yes please! I forked the implementation originally because their library development went dormant. Now that they are back on Github I am more than happy to push/pull changes to the upstream library.

from spark-sas7bdat.

saurfang avatar saurfang commented on August 23, 2024

Just to give some pointers in case someone wants to help. Here you may find the total diff I made on the original 1.2.1 parso library to support splitting: https://www.diffchecker.com/V6tGyqNC
The main takeaway is we have to make sure we can seek to the position we'd like to read, read fully when reading, and avoid stackoverflow for large dataset.

It looks like the library has been refactored quite a bit in 2.0 so correspondingly touch points need to be found and updated.

from spark-sas7bdat.

Tagar avatar Tagar commented on August 23, 2024

@saurfang could you please give us some guidance how to update saurfang/spark-sas7bdat to latest parso library?

I see https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/java/com/ggasoftware/parso/SasFileParser.java
Can we just overwrite it with latest file from https://github.com/epam/parso/tree/master/src/main/java/com/epam/parso ? I am probably oversimplifying this. Or you expect something will break? If something breaks, we will be glad to help with that too.

We will try to get this resolved. There are a few bugs we ran into today and both of them seems to be could be resolved by getting to latest praso library.

from spark-sas7bdat.

saurfang avatar saurfang commented on August 23, 2024

@Tagar thank you for taking the interests. Like I outlined in my comment above, the whole reason that we have our own copy of SasFileParser.java in the repo is that we need to make modifications to support parallel read from a split location. The diff exactly shows the points where these logics are implemented.

To see #19 to be successful, I would like to see a minimal test data that replicates the failure on 1.2.1. Only then will we able to confidently say that any changes we lay on top of it to upgrade to 2.0 actually solve the problems. That is not to say #19 is incorrect but I just do not have the capacity to perform detailed code review and has to rely on unit tests to convince myself that the change is actually doing and fixing what it claims.

from spark-sas7bdat.

Tagar avatar Tagar commented on August 23, 2024

Thank you @saurfang
A colleague of mine and me are working on this currently.
It may take some more time.

Can we ask why that diff in SasFileParser.java can't be pushed over to Parso library itself?
If it'll be accepted into Parso library itself, this diff doesn't have to be maintained and
re-applied after each Parso library release.

from spark-sas7bdat.

Tagar avatar Tagar commented on August 23, 2024

epam/parso#21

from spark-sas7bdat.

Tagar avatar Tagar commented on August 23, 2024

@saurfang that diff is already in upstream Parso library.
See comments from @printsev in epam/parso#21
We created a fork of spark-sas7bdat and

  • added unit test that you were referencing to
  • get it running with Parso 2.0.7
    We can confirm that all issues we had earlier with reading compressed files and some
    other issues are now resolved.

Please let us know if that's okay with you to create a new PR for this work ?

from spark-sas7bdat.

Tagar avatar Tagar commented on August 23, 2024

@saurfang our manager asked to bounce our contribution at legal department first as it's first time we contribute back to an open-source project. I don't expect this will be an issue, but will just take some time. We delivered this change internally a week ago. Thanks.

@printsev sure - will do. Thanks.

from spark-sas7bdat.

spatil6 avatar spatil6 commented on August 23, 2024

With earlier version of parso, was getting error while reading compressed files. So I have updated parso 2.0.7 and done relevant changes, now able to read compressed file. :)
However while reading big file (2 GB) from HDFS,Getting Error #9

@Tagar Seems like you have successfully implemented. Have you faced this issue? If yes, can you please share your idea/implementation. Please.

@saurfang Any suggestion or inputs, would like to provide on this.

from spark-sas7bdat.

spatil6 avatar spatil6 commented on August 23, 2024

After doing some analysis on #9 with updated parso library 2.0.7. I have found that issue is with below method.

private def getPartialPageLength(pos: Long) = (pos - sasFileProperties.getHeaderLength) % sasFileProperties.getPageLength

When it try to access bytes from previous page, it won't able read subheader.

@saurfang Can you please elabarote this method, on what basis we are calculating partial page length. Any supporting blog/document.

from spark-sas7bdat.

saurfang avatar saurfang commented on August 23, 2024

If I remember correctly, this is based on the reverse engineered http://www2.uaem.mx/r-mirror/web/packages/sas7bdat/vignettes/sas7bdat.pdf @spatil6

from spark-sas7bdat.

printsev avatar printsev commented on August 23, 2024

@spatil6
Do you face the same issue when working with parso directly? If yes, could you please raise a new issue in parso library itself? If you could share the dataset, it would be great.
The line you reference to looks like a part of spark-sas7bdat to me, but there could be something in parso as well.

from spark-sas7bdat.

spatil6 avatar spatil6 commented on August 23, 2024

Issue resolved.
@printsev Thanks for asking. There is issue in spark-sas7bdat, not in parso library.
@saurfang Existing SasRecoder class implementation won't work with latest parso library. It require some modification, to make it work. So make note of it.
Thanks.

from spark-sas7bdat.

saurfang avatar saurfang commented on August 23, 2024

Thanks all for the discussion and contribution. Kudos to @mulya to pull everything through.

from spark-sas7bdat.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.