Comments (18)
Definitely a great idea. I don't think we'll need to download to a temporary location -- we can just stream the file into memory. If you're keen to have a swing at it, feel free. I won't be able to get to it till after the holidays. But I'll have a go then otherwise. 👍
from explorer.
Ok @cigrainger I've just realised that might be too complex for me. You are free to do it :)
from explorer.
No prob. Thanks!
from explorer.
Oh, I've just realised that scidata's `get!()` implements something similar. It might be worth taking a look at it.
from explorer.
I have a POC with `:httpc.request` basically working, but it does rely on copying the file to a temporary directory.
I don't immediately see how to modify the Rust backend CSV reader to accept a string as opposed to a filename. That is not saying much as I don't know much Rust, but I would need some guidance there if the goal is to not copy the file.
from explorer.
Some programming languages have abstractions such as BufferReader, so you can read from a file or from in-memory contents and the difference is transparent to the calling code. Does Rust/Polars have something similar? If so, could we use it?
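For what it's worth, Rust's standard library does have this kind of abstraction in the `std::io::Read` trait. Below is a minimal sketch using only the standard library, not Polars: `count_lines` is a hypothetical helper made up for illustration, and the temporary file name is arbitrary. It shows the same function consuming a file and an in-memory string.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Cursor, Read, Write};

/// Hypothetical helper: counts the lines available from any byte source.
/// It only requires `Read`, so it does not care where the bytes come from.
fn count_lines<R: Read>(source: R) -> io::Result<usize> {
    let mut count = 0;
    for line in BufReader::new(source).lines() {
        line?; // propagate I/O or UTF-8 errors
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let csv = "sepal_length,species\n5.1,setosa\n4.9,setosa\n";

    // In-memory source: a Cursor over the string implements Read.
    let from_memory = count_lines(Cursor::new(csv))?;

    // File source: write the same contents to a temporary file first.
    let path = std::env::temp_dir().join("read_trait_sketch.csv");
    File::create(&path)?.write_all(csv.as_bytes())?;
    let from_file = count_lines(File::open(&path)?)?;
    std::fs::remove_file(&path)?;

    assert_eq!(from_memory, from_file);
    println!("both sources yielded {from_memory} lines");
    Ok(())
}
```

Whether the Polars CSV reader accepts an arbitrary `Read` implementation is a separate question that this sketch does not answer.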
from explorer.
Thanks @srowley! I agree it does look that way from the documentation and I'd be surprised if it didn't work. I'll take a stab at reading from a buffer and report back shortly.
Edit: I think I've got it. It seems that `CsvReader::new()` can take a `std::io::Cursor` as an argument. Just trying it out now.
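As a rough illustration of why a `std::io::Cursor` can stand in for a file, here is a standard-library-only sketch (no Polars involved). `header_and_body` is a hypothetical function written for this example. It reads the first line, rewinds, and then reads everything, which works because a `Cursor` over a string implements `Read`, `BufRead`, and `Seek`.

```rust
use std::io::{self, BufRead, Cursor, Read, Seek, SeekFrom};

/// Hypothetical helper: returns the first line (without the newline)
/// and the full contents, reading both through the same Cursor.
fn header_and_body(buf: &str) -> io::Result<(String, String)> {
    let mut cursor = Cursor::new(buf);

    // Read just the header line, as a parser might do to sniff columns.
    let mut header = String::new();
    cursor.read_line(&mut header)?;

    // Rewind to the start, exactly as one would with a real file.
    cursor.seek(SeekFrom::Start(0))?;

    // Read the whole buffer from the beginning.
    let mut all = String::new();
    cursor.read_to_string(&mut all)?;

    Ok((header.trim_end().to_string(), all))
}

fn main() -> io::Result<()> {
    let (header, all) = header_and_body("a,b\n1,2\n")?;
    println!("header: {header}");
    println!("total bytes: {}", all.len());
    Ok(())
}
```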
from explorer.
I'll try to do it :)
from explorer.
I would be willing to take a stab at this. I am inclined to use an HTTP client library like Req so that you could pass options for authentication, etc., but if adding a dependency is undesirable I could use `:httpc.request`, more or less copying the scidata approach. I could also make Req optional, I guess, and fall back to `:httpc.request` if it isn't loaded.
WDYT?
from explorer.
@srowley I don't know how Rustler works, but maybe you could try passing a tuple such as `{:file, ...}`? Or have separate functions in the backend, like `read_csv_from_file`, `read_csv_from_url`, and `read_csv_from_memory`, and then pick the appropriate one from Elixir.
from explorer.
I don't know if the approach that @josevalim suggested works. What I do know is that Polars currently only supports file paths as inputs: the `polars-io` engine can't read CSVs from strings.
I think we need a way to trick `polars` into thinking our string is a path, but I have no ideas on how to do so, and all my attempts failed. I think @cigrainger might have a clue on how to implement this one.
from explorer.
From the documentation it kind of looks like it - it appears we could use `CsvReader::new` instead of `CsvReader::from_path`, but how to do that exactly is not obvious to me.
from explorer.
Yep! That does it. As a minimal experiment I added the following to `./native/explorer/src/dataframe.rs`:

    #[rustler::nif]
    pub fn df_read_buf(buf: &str) -> Result<ExDataFrame, ExplorerError> {
        // Wrap the in-memory string in a Cursor so it can be read like a file.
        let buffer = std::io::Cursor::new(buf);
        let df = CsvReader::new(buffer).finish()?;
        Ok(ExDataFrame::new(df))
    }

Then I just added `df_read_buf` to `./native/explorer/src/lib.rs` and to `./lib/explorer/polars_backend/native.ex`. Then you can try it out with:

    iex> Explorer.Datasets.iris() |> Explorer.DataFrame.to_binary() |> Explorer.PolarsBackend.Native.df_read_buf()

You should get `polars`'s native tabular output.
from explorer.
Also sorry not to weigh in on this earlier @srowley -- first, thanks so much for picking this up. Second, yeah, I think `:httpc.request` is preferable to avoid a dependency (as much as I'm a fan of Wojtek's work on `Req`).
from explorer.
No worries! @cigrainger I do want to make sure I understand how you want to proceed from here. Would you prefer to:
1. Modify the `Explorer.Backend.DataFrame` behavior so that `read_csv` can accept a filename or binary, and then modify the Rust backend so that it figures out what to do based on the argument passed to it
2. Modify the behavior so that there are `read_csv_from_file` and `read_csv_from_memory` callbacks (which `read_csv` could use under the hood, or not) and then implement those in the Rust backend
3. Something else?
from explorer.
@srowley let's go with 2. I think that's the clearest. You should be able to extract some shared logic from `df_read_csv` to use in `df_read_buf`.
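To make the "shared logic" idea concrete, here is a rough sketch using only the standard library. It is not the actual Explorer code: `parse_csv` is a hypothetical, deliberately naive stand-in for the Polars reader (it splits on commas and ignores quoting), and the two entry points only borrow the `read_csv_from_file` and `read_csv_from_memory` names from the option above.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Cursor, Read, Write};
use std::path::Path;

type Rows = Vec<Vec<String>>;

/// Shared logic (hypothetical): a naive parser that works on any reader.
/// It does not handle quoted fields; it only stands in for the real reader.
fn parse_csv<R: Read>(reader: R) -> io::Result<Rows> {
    let mut rows: Rows = Vec::new();
    for line in BufReader::new(reader).lines() {
        let line = line?;
        rows.push(line.split(',').map(str::to_string).collect());
    }
    Ok(rows)
}

/// Entry point for data on disk: open the file, then reuse the shared parser.
fn read_csv_from_file(path: impl AsRef<Path>) -> io::Result<Rows> {
    parse_csv(File::open(path)?)
}

/// Entry point for data in memory: wrap it in a Cursor, then reuse the parser.
fn read_csv_from_memory(contents: &str) -> io::Result<Rows> {
    parse_csv(Cursor::new(contents))
}

fn main() -> io::Result<()> {
    let csv = "a,b\n1,2\n";

    // Arbitrary temporary file used only for this demonstration.
    let path = std::env::temp_dir().join("shared_logic_sketch.csv");
    File::create(&path)?.write_all(csv.as_bytes())?;

    let from_file = read_csv_from_file(&path)?;
    let from_memory = read_csv_from_memory(csv)?;
    std::fs::remove_file(&path)?;

    assert_eq!(from_file, from_memory);
    println!("parsed {} rows from each source", from_memory.len());
    Ok(())
}
```

The point of the split is that each callback stays a thin wrapper, so any options handling would live in one place.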
from explorer.
Let me know if I can help in any way, or if you have a branch that you are working on.
from explorer.
We will support parsing from memory. See #186.
from explorer.