
Comments (6)

nelson2005 commented on August 23, 2024

SAS column formatters are display-only... I'd strongly suggest not using them to infer numeric datatypes, with the exception of date/timestamp. SAS has only two datatypes: char and double. Trying to layer much magic structure on top of that, given the randomness with which people assign column formats, is bound to devolve into a festering pool of lameness.

In that vein I'd suggest removing the numeric type inference (excepting date/timestamp), which masquerades as adding functionality that doesn't actually exist in SAS. I'm happy to talk about it more, but I feel like this feature will be a gift that keeps on giving without much payoff.


Tagar commented on August 23, 2024

I see your point @nelson2005 and it makes a lot of sense.

Since non-double data type discovery is not on by default currently, I think a developer who leverages formatter-based schema discovery should be aware of its limitations.

So I think it's good to have this functionality in, as long as it's disabled by default and we have a proper disclaimer in the documentation...

Spark, for example, has schema inference for its CSV data source, and while it's not perfect
https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L99
it works very well in many cases.
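
For comparison, CSV inference is opt-in on the Spark side too. A minimal sketch of what that looks like (the file path is just a placeholder):

# Without inference every CSV column comes back as string.
df_plain = spark.read.option("header", "true").csv("/data/example.csv")

# Opting in makes Spark take an extra pass over the data to guess types.
df_inferred = (
    spark.read
        .option("header", "true")
        .option("inferSchema", "true")   # off by default
        .csv("/data/example.csv")
)
df_inferred.printSchema()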


thesuperzapper commented on August 23, 2024

@nelson2005, the main issue is that sas7bdat files sometimes store integer values with precision errors, e.g. 1.999999 instead of 2; SAS itself seems to correct for that on display, but the file still stores 1.999999.

But yea, we chose to disable non-date inference by default for the reasons you provided.
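
To make that concrete, here's a rough sketch of the kind of correction involved (the column name is made up, and this is not the library's actual implementation):

from pyspark.sql import functions as F

# A double column holding 1.999999 where an integer 2 was intended;
# rounding before the cast recovers the intended value.
df = spark.createDataFrame([(1.999999,), (3.0,)], ["qty"])
df = df.withColumn("qty_long", F.round("qty").cast("long"))
df.show()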


nelson2005 commented on August 23, 2024

I'm not sure I would call it an integer precision error, since integer is not a SAS data type.

Agreed that SAS has lots of easter eggs there that don't translate well to the Spark world. Fuzzing numeric comparisons such that 1.999999 == 2 is one of them. Since formats are a display-only concept, I'd think getting the same number that's displayed in SAS to show up in Spark might be a big task. For example, suppose the user has 1.99999 formatted as 8.3. What number would that convert to in Spark? DecimalType(8,3)? Will that display the same thing to the user in Spark and SAS?
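
To make that question concrete, the Spark half of the comparison might look like this (column name is made up); whether its output lines up with what SAS renders through the 8.3 format is exactly the open question:

from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

df = spark.createDataFrame([(1.99999,)], ["amt"])
df.withColumn("amt_dec", F.col("amt").cast(DecimalType(8, 3))).show()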

At any rate, I think schema inference is much easier here than for CSV, since the SAS type system lacks the richness of Spark's datatypes. A simple first cut might be to require all of the globbed files to have identical schemas (char, date, or numeric)... after the user converts them to, say, Parquet, they can use the schema merging that's already available there (sketched below).
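
For the Parquet step, I mean something like this (paths are placeholders):

# Write each converted file to its own Parquet location, then read them back
# with Parquet's built-in schema merging.
merged = (
    spark.read
        .option("mergeSchema", "true")
        .parquet("/converted/file1", "/converted/file2")
)
merged.printSchema()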


nelson2005 commented on August 23, 2024

@saurfang @thesuperzapper @Tagar
I'd be willing to put up a bonus on this feature, say $100 USD.


dmoore247 commented on August 23, 2024

Hmmm, one approach is N writes to the Delta Lake format (delta.io) with .option("mergeSchema", "true").
My use case didn't call for schema merging, but it did call for multiple parallel writes. This is my PySpark code with the mergeSchema option... I'd rather see this built into the reader.

from pyspark.sql.functions import input_file_name

cols = ["ID",..., "FILE_PATH" ]
files = ["x1.sas7bdat","x2.sas7bdat","x3.sas7bdat","x4.sas7bdat",
         "x5.sas7bdat","x6.sas7bdat","x7.sas7bdat","x8.sas7bdat"]

# Load one sas7bdat file with spark-sas7bdat and append it to a Delta table.
# `path` and `target_table` are defined elsewhere.
def sas_read_write_delta(file_name):
  print(file_name)
  (
    spark
      .read
      .format("com.github.saurfang.sas.spark")
      .load(path+file_name,
            forceLowercaseNames=True,
            inferLong=True,
            metadataTimeout=60)
      .withColumn("FILE_PATH", input_file_name())  # track which source file each row came from
      .select(cols)
      .write
        .format("delta")
        .mode("append")
        .option("mergeSchema","true")  # let Delta reconcile schema differences across files
        .saveAsTable(target_table)
  )

# Run 4 loads in parallel; each load itself runs in parallel by splitting the source file.
# Delta Lake tables support concurrent writes.
if __name__ == '__main__':
    from multiprocessing.pool import ThreadPool
    pool = ThreadPool(4)
    pool.map(sas_read_write_delta, files)


