
Fix fst write routines (m4ra, closed)

mpadge commented on August 16, 2024
Fix fst write routines

from m4ra.

Comments (6)

rafapereirabr commented on August 16, 2024

Hi @mpadge. Quick question: why did you choose to use fst instead of parquet?


mpadge commented on August 16, 2024

Hiya Rafa! I chose it just because it's more lightweight, and arrow is overkill for this workflow. Everything here is internally indexed by city name; arrow would enable city to be used as an index, but then you'd lose oversight of what the individual files were. Does that make sense?


rafapereirabr commented on August 16, 2024

Yes, it makes perfect sense! Thanks for clarifying! I'm following your work on m4ra with great interest!


mpadge commented on August 16, 2024

Have to re-open because fst still produces unreliable values. The package is clearly going to have to be dumped, because the unreliability breaks all analyses here. It seems that the row-wise ordering in different columns is randomly rearranged, so that the original rows are not recovered; instead, each row read back contains values mixed in from other rows.
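
The kind of corruption described above can be detected with a write/read round-trip check. The following is a minimal sketch, not code from m4ra: `roundtrip_ok` and the synthetic `dat` are hypothetical names introduced for illustration, and base R's `saveRDS`/`readRDS` stand in for the serialisation pair. Assuming the fst package is installed, `fst::write_fst` and `fst::read_fst` could be passed in the same way.

```r
# Write a data frame to a temporary file, read it back, and report whether
# the result is identical to the input. `write_fn` is called as
# write_fn(x, path) and `read_fn` as read_fn(path).
roundtrip_ok <- function(x, write_fn, read_fn) {
    f <- tempfile()
    on.exit(unlink(f), add = TRUE)
    write_fn(x, f)
    y <- read_fn(f)
    # identical() is strict: any value that ends up in a different row,
    # in any column, makes the check fail.
    identical(x, y)
}

# Synthetic data (an assumption for illustration only): a mix of integer,
# character, and numeric columns.
set.seed(1)
n <- 1000L
dat <- data.frame(
    id = seq_len(n),
    city = sample(c("a", "b", "c"), n, replace = TRUE),
    value = runif(n),
    stringsAsFactors = FALSE
)

ok <- roundtrip_ok(dat, saveRDS, readRDS)
print(ok)
```

Comparing the whole data frame rather than column summaries matters here: if values are shuffled within a column, per-column statistics such as means or ranges are unchanged, and only a row-level comparison reveals the problem.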


mpadge commented on August 16, 2024

Can't use arrow, because even with the current dev version, it errors with:

Error in rawToChar(out): long vectors not supported yet: raw.c:68

That's on a data frame with 6.5 million rows and 21 columns. I guess arrow still primarily envisions numerical data, so it's back to plain old saveRDS. That will slow these routines down somewhat, but at least everything will once again be reliable.


mpadge commented on August 16, 2024

Confirmed that full accuracy has been regained, and load times even for the data.frame above with 6.5M rows are still well under 1 minute, which is okay.
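
For anyone wanting to reproduce this kind of check, a rough sketch of timing a `saveRDS`/`readRDS` round trip follows. The data here are synthetic and far smaller than the 6.5M-row data frame mentioned above, so the timings it prints say nothing about m4ra's actual load times; the column names and size are assumptions made for illustration.

```r
# Build a synthetic data frame (hypothetical columns, not m4ra's data).
set.seed(1)
n <- 100000L
dat <- data.frame(
    id = seq_len(n),
    label = sample(letters, n, replace = TRUE),
    value = runif(n),
    stringsAsFactors = FALSE
)

f <- tempfile(fileext = ".Rds")

# system.time() returns user, system, and elapsed times in seconds.
t_write <- system.time(saveRDS(dat, f))
t_read <- system.time(dat2 <- readRDS(f))

print(t_write[["elapsed"]])
print(t_read[["elapsed"]])

# Confirm the data read back are identical to what was written.
same <- identical(dat, dat2)
print(same)

unlink(f)
```

If write time becomes a bottleneck, `saveRDS` accepts `compress = FALSE`, which usually trades a larger file for a faster write; whether that helps in this workflow would need to be measured.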
