Comments (18)

cigrainger avatar cigrainger commented on June 15, 2024 2

Definitely a great idea. I don't think we'll need to download to a temporary location -- we can just stream the file into memory. If you're keen to have a swing at it, feel free. I won't be able to get to it till after the holidays. But I'll have a go then otherwise. 👍

from explorer.

kimjoaoun avatar kimjoaoun commented on June 15, 2024 1

Ok @cigrainger I've just realised that might be too complex for me. You are free to do it :)

cigrainger avatar cigrainger commented on June 15, 2024 1

No prob. Thanks!

kimjoaoun avatar kimjoaoun commented on June 15, 2024 1

Oh, I've just realised that scidata's get!() implements something similar. It might be worth taking a look at it.

srowley avatar srowley commented on June 15, 2024 1

I have a POC with :httpc.request basically working, but it does rely on copying the file to a temporary directory.

I don't immediately see how to modify the Rust backend CSV reader to accept a string as opposed to a filename. That is not saying much as I don't know much Rust, but I would need some guidance there if the goal is to not copy the file.

josevalim avatar josevalim commented on June 15, 2024 1

Some programming languages have an abstraction like BufferedReader, so you can read from a file or from in-memory contents and the difference is transparent to the calling code. Does Rust/Polars have something similar? If so, could we use it?

cigrainger avatar cigrainger commented on June 15, 2024 1

Thanks @srowley! I agree it does look that way from the documentation and I'd be surprised if it didn't work. I'll take a stab at reading from a buffer and report back shortly.

Edit: I think I've got it. It seems that CsvReader::new() can take a std::io::Cursor as an argument. Just trying it out now.
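
For readers unfamiliar with std::io::Cursor, here is a minimal sketch of why it can stand in for a file. It uses only the Rust standard library, not Polars; count_data_rows is a hypothetical stand-in for a reader that needs Read + Seek, and the sample CSV is made up for illustration.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a CSV reader: counts non-empty lines after the header.
/// It is generic over any `Read + Seek` source, so it does not care where the bytes live.
fn count_data_rows<R: Read + Seek>(mut source: R) -> io::Result<usize> {
    // Rewind first, to show that seeking works on an in-memory buffer too.
    source.seek(SeekFrom::Start(0))?;
    let mut rows = 0;
    for line in BufReader::new(source).lines().skip(1) {
        if !line?.trim().is_empty() {
            rows += 1;
        }
    }
    Ok(rows)
}

fn main() -> io::Result<()> {
    // Made-up sample data, held entirely in memory.
    let csv = "sepal_length,species\n5.1,setosa\n4.9,setosa\n";
    // `Cursor` wraps the string and implements both `Read` and `Seek`.
    let rows = count_data_rows(Cursor::new(csv))?;
    println!("data rows: {}", rows);
    Ok(())
}
```

A std::fs::File also implements Read and Seek, so the same function would accept an opened file unchanged.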

kimjoaoun avatar kimjoaoun commented on June 15, 2024

I'll try to do it :)

srowley avatar srowley commented on June 15, 2024

I would be willing to take a stab at this. I am inclined to use an HTTP client library like Req so that you could pass options for authentication, etc., but if adding a dependency is undesirable I could use :httpc.request, more or less copying the scidata approach. I could also make Req optional, I guess, and fall back to :httpc.request if it isn't loaded.

WDYT?

josevalim avatar josevalim commented on June 15, 2024

@srowley I don't know how Rustler works but maybe you could try passing a tuple {:file, ...}? Or have separate functions in the backend, like read_csv_from_file, read_csv_from_url, and read_csv_from_memory, and then pick the appropriate one from Elixir.
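
On the Rust side, the tagged-tuple idea could map to an enum that the backend matches on. This is only a sketch under assumptions: CsvSource and load_bytes are hypothetical names, it uses plain std rather than Rustler or Polars, and it says nothing about how Rustler would actually decode such a tuple.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical input type mirroring a tagged tuple such as `{:file, path}`.
enum CsvSource {
    File(PathBuf),
    Memory(Vec<u8>),
}

/// Resolve either variant to raw bytes that a parser could then consume.
fn load_bytes(source: CsvSource) -> io::Result<Vec<u8>> {
    match source {
        CsvSource::File(path) => fs::read(path),
        CsvSource::Memory(bytes) => Ok(bytes),
    }
}

fn main() -> io::Result<()> {
    // In-memory contents are passed through untouched.
    let bytes = load_bytes(CsvSource::Memory(b"a,b\n1,2\n".to_vec()))?;
    println!("loaded {} bytes from memory", bytes.len());

    // A missing file surfaces as an ordinary io::Error (the path is made up).
    let missing = load_bytes(CsvSource::File(PathBuf::from("/nonexistent/hypothetical.csv")));
    println!("missing file is an error: {}", missing.is_err());
    Ok(())
}
```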

kimjoaoun avatar kimjoaoun commented on June 15, 2024

I don't know whether the approach @josevalim suggested works. What I do know is that Polars currently only supports file paths as inputs; as far as I can tell, the polars-io engine can't read CSVs from strings.
I think we need a way to trick Polars into thinking our string is a path, but I have no idea how to do that, and all my attempts have failed. I think @cigrainger might have a clue on how to implement this one.

srowley avatar srowley commented on June 15, 2024

From the documentation it kind of looks like it - it appears we could use CsvReader::new instead of CsvReader::from_path, but how to do that exactly is not obvious to me.

cigrainger avatar cigrainger commented on June 15, 2024

Yep! That does it. As a minimal experiment I added the following to ./native/explorer/src/dataframe.rs:

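#[rustler::nif]
pub fn df_read_buf(buf: &str) -> Result<ExDataFrame, ExplorerError> {
    // Wrap the in-memory CSV contents in a Cursor so they can be read like a file.
    let buffer = std::io::Cursor::new(buf);
    // CsvReader::new takes the reader directly; finish() parses it into a DataFrame.
    let df = CsvReader::new(buffer).finish()?;
    Ok(ExDataFrame::new(df))
}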

Then just added df_read_buf to ./native/explorer/src/lib.rs and to ./lib/explorer/polars_backend/native.ex. Then you can try it out with:

iex> Explorer.Datasets.iris() |> Explorer.DataFrame.to_binary() |> Explorer.PolarsBackend.Native.df_read_buf()

You should get polars's native tabular output.

cigrainger avatar cigrainger commented on June 15, 2024

Also sorry not to weigh in on this earlier @srowley -- first, thanks so much for picking this up. Second, yeah I think :httpc.request is preferable to avoid a dependency (as much as I'm a fan of Wojtek's work on Req).

srowley avatar srowley commented on June 15, 2024

No worries! @cigrainger I do want to make sure I understand how you want to proceed from here. Would you prefer to:

  1. Modify the Explorer.Backend.DataFrame behavior so that read_csv can accept a filename or binary, and then modify the Rust backend so that it figures out what to do based on the argument passed to it
  2. Modify the behavior so that there are read_csv_from_file and read_csv_from_memory callbacks (which read_csv could use under the hood, or not) and then implement those in the Rust backend
  3. Something else?

cigrainger avatar cigrainger commented on June 15, 2024

@srowley let's go with 2. I think that's the clearest. You should be able to extract some shared logic from df_read_csv to use in df_read_buf.
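
A rough sketch of what option 2 with shared logic might look like, using a toy comma splitter in place of Polars. The function names follow the ones proposed in this thread, but parse_rows and the return type are assumptions made for illustration; the real df_read_csv and df_read_buf would share Polars reader setup instead.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Cursor, Read};
use std::path::Path;

/// Shared logic: split each line on commas. A toy parser standing in for Polars.
fn parse_rows<R: Read>(reader: R) -> io::Result<Vec<Vec<String>>> {
    let mut rows: Vec<Vec<String>> = Vec::new();
    for line in BufReader::new(reader).lines() {
        let line = line?;
        rows.push(line.split(',').map(String::from).collect());
    }
    Ok(rows)
}

/// Entry point for a path on disk.
fn read_csv_from_file(path: &Path) -> io::Result<Vec<Vec<String>>> {
    parse_rows(File::open(path)?)
}

/// Entry point for contents that are already in memory.
fn read_csv_from_memory(contents: &str) -> io::Result<Vec<Vec<String>>> {
    parse_rows(Cursor::new(contents))
}

fn main() -> io::Result<()> {
    let contents = "a,b\n1,2\n";
    let from_memory = read_csv_from_memory(contents)?;

    // Write the same contents to a temporary file to exercise the file entry point.
    let path = std::env::temp_dir().join("read_csv_sketch.csv");
    std::fs::write(&path, contents)?;
    let from_file = read_csv_from_file(&path)?;

    // Both entry points go through the same parser, so the results match.
    assert_eq!(from_memory, from_file);
    println!("{:?}", from_memory);
    Ok(())
}
```

Keeping the two entry points thin means any parsing option added later only needs to be handled in the shared helper.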

nyo16 avatar nyo16 commented on June 15, 2024

Let me know if I can help in any way, or if you have a branch that you are working on.

josevalim avatar josevalim commented on June 15, 2024

We will support parsing from memory. See #186.
