elixir-explorer / explorer Goto Github PK

View Code? Open in Web Editor NEW

990.0 990.0 111.0 3.06 MB

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir

Home Page: https://hexdocs.pm/explorer

License: MIT License

Elixir 84.93% Rust 14.84% Nix 0.11% Shell 0.12%

data-science dataframes elixir rust

explorer's People

Contributors

Stargazers

Watchers

Forkers

alexjuca sobolevn losvedir drowzy geehlee roberto-xy philss dgreiss ben-dyer amferraboli benjamin-philip dustinfarris ricpruss cristineguadelupe mattias01 srowley brady131313 guigaoliveira nyo16 robinbeekhof isaacsanders brodeuralexis halian-vilela mixwui lostkobrakai cnpryer kimjoaoun mattneel joshuataylor dkuku sabiwara ingvardsen karangejo t-rutten boudra andreyuhai ian-gl uiforks filmor pgeraghty thbar nallwhy xavirodri liveforeverx rovel jeantux tmr08c treble37 sasikumar87 linusdm thmsmlr fargerio jhonatannunessilva ahnlabb layeddie liamdiprose sriehl mlineen akiocode yii-iiy blutack capitalist42 roger120981 pedromss hanspagh jeregrine treebee lostbean axelclark taetrecaumji elias-ba awerment manhvu wojtekmach thehabbos007 qqwy wisdawg guarilha maltekrupa viniciusmuller zero-one-group firefly-cpp deemooneill spatchkaa billylanchantin vishal-h kellyfelkins gungorkocak clm-a rodrigues virviil maennchen costaraphael paulsullivanjr rtvu jongretar niccolox lkarthee iurimateus ricardopadua

explorer's Issues

Does not work with `livebook/livebook` docker image

Image docs: https://github.com/livebook-dev/livebook#docker

Installation with

Mix.install([
  {:explorer, "~> 0.1.0-dev", github: "amplifiedai/explorer"}
])

Fails with:

* Getting explorer (https://github.com/amplifiedai/explorer.git)
remote: Enumerating objects: 642, done.        
remote: Counting objects: 100% (642/642), done.        
remote: Compressing objects: 100% (402/402), done.        
remote: Total 642 (delta 400), reused 410 (delta 210), pack-reused 0        
origin/HEAD set to main
* Getting nx (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 9554, done.        
remote: Counting objects: 100% (1950/1950), done.        
remote: Compressing objects: 100% (922/922), done.        
remote: Total 9554 (delta 1173), reused 1671 (delta 965), pack-reused 7604        
Resolving Hex dependencies...
Dependency resolution completed:
New:
  rustler 0.22.0
  toml 0.5.2
* Getting rustler (Hex package)
* Getting toml (Hex package)
==> nx
Compiling 20 files (.ex)
Generated nx app
==> toml
Compiling 10 files (.ex)
Generated toml app
==> rustler
Compiling 7 files (.ex)
Generated rustler app
==> explorer
Compiling 14 files (.ex)
Compiling crate explorer in release mode (native/explorer)

== Compilation error in file lib/explorer/polars_backend/native.ex ==
** (ErlangError) Erlang error: :enoent
    (elixir 1.12.0) lib/system.ex:1041: System.cmd("cargo", ["rustc", "--release"], [cd: "/home/livebook/.cache/mix/installs/elixir-1.12.0-erts-12.0/b2b65fb8181f59bc8768c99cf9e3e5bc/deps/explorer/native/explorer", stderr_to_stdout: true, env: [{"CARGO_TARGET_DIR", "/home/livebook/.cache/mix/installs/elixir-1.12.0-erts-12.0/b2b65fb8181f59bc8768c99cf9e3e5bc/_build/dev/lib/explorer/native/explorer"}], into: %IO.Stream{device: :standard_io, line_or_bytes: :line, raw: false}])
    (rustler 0.22.0) lib/rustler/compiler.ex:27: Rustler.Compiler.compile_crate/2
    lib/explorer/polars_backend/native.ex:4: (module)
    (stdlib 3.15) erl_eval.erl:685: :erl_eval.do_apply/6
could not compile dependency :explorer, "mix compile" failed. You can recompile this dependency with "mix deps.compile explorer", update it with "mix deps.update explorer" or clean it with "mix deps.clean explorer"

I guess the solution is to provide a custom livebook image with extra dependecies.

Reading data from a web URL

Proposal: Reading a dataset (csv?) from the internet with read_csv.

I think it would be great if we had a way to read data from a URL when using Explorer.read_csv().

In R's library readr package (which uses lib vroom as a backend) you can supply a string that starts with "http", "https", "ftp" or even "ftps" and it will connect to that address and download the data from the link. I think it downloads the data in a temporary location, loads it in R and then deletes it.
I think it would be great to have a similar functionality in Explorer, because it is specially useful when giving lectures/classes.

Implementation idea:

It could be implemented with pattern matching, by matching the beginning of the string with those protocols identifiers and downloading it in a temporary location, loading it in Elixir using the already existing Explorer.read_csv(), and then deleting the file.
I could even try to help implementing if necessary.

Something like the example below would be perfect:

url = "https://raw.githubusercontent.com/dadosgovbr/catalogos-dados-brasil/master/dados/catalogos.csv"

df =
  url
  |> Explorer.read_csv()

Edit: It is also a functionality available in Python's Pandas.

`DataFrame.select/3` raises when attempting to drop > 1 column

Should `concat_rows` concat series with integers and floats?

When creating a new Series one can use Series.from_list/1 that accepts both integers and floats:

iex> Explorer.Series.from_list([1, 2.0, 3])
#Explorer.Series<
  float[3]
  [1.0, 2.0, 3.0]
>

But it is not possible to concat_rows with data frames that have mixed column types:

one = DataFrame.from_map(%{a: [1.0, 2.0, 3.0]})
two = DataFrame.from_map(%{a: [2, 3, 4]})

DataFrame.concat_rows(one, two)

** (ArgumentError) columns and dtypes must be identical for all dataframes

Should we have the same behaviour from Series.from_list/1 for DataFrame.concat_rows/2?

This issue was found by @cristineguadelupe :)

Implement `DataFrame.bind_rows`

See: https://dplyr.tidyverse.org/reference/bind.html

Move datasets out of priv

Files in priv are copied to build and are made part of the release, while likely very unlikely to be used there.

My suggestion is to move them out of priv. We have two options:

Move them to examples alongside moving lib/explorer/datasets.ex to examples/fossil_fuel.exs, etc

Move them to datasets. If we do this, we need to load them like this:

@datasets_dir Path.join(File.cwd!(), "datasets")

def fossil_fuel, do: load(Path.join(@datasets_dir, ...)

Read the default backend from the app environment

As done here: elixir-nx/nx@31b7c92

Ordered return on series distinct

As it was noted in PR #101, currently it is not possible to test the Explorer.Series.distinct() function because of its unpredictable behaviour when returning the data. It shuffles the series and always returns the result in a different order.
It would be great not having this random behaviour.

Example

iex(1)> s = [1, 1, 2, 2, 3, 3] |> Explorer.Series.from_list()
#Explorer.Series<
  integer[6]
  [1, 1, 2, 2, 3, 3]
>
iex(2)> s |> Explorer.Series.distinct()
#Explorer.Series<
  integer[3]
  [3, 2, 1]
>
iex(3)> s |> Explorer.Series.distinct()
#Explorer.Series<
  integer[3]
  [2, 3, 1]
>

Different results in the same operation.

Allow user specified dtypes when creating a DataFrame from map

Similar to how a user can specify the dtypes when reading from a csv, it would be helpful to allow the user to specify the dtypes when calling DataFrame.from_map.

Thanks!

implement some metrics for series?

I would suggest implementing some common metrics for time series:

Mean Squared Error (MSE)
Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE)
Mean Absolute Percentage Error (MAP)
R-squared

The idea comes to use for forecasting applications.

** thinking about it, I thought there might be functions like sqrt (I know I can implement it by pow (..., 0.5) 😅) and log for series too.

** for inspiration https://scikit-learn.org/stable/modules/classes.html#regression-metrics

DataFrame cast string to Time

I am kinda stuck with a simple problem. Since I can't find the solution in the documentation and SO is quiet, maybe this is the place? Even if only this issue serves as documentation for someone else, it might be worth it?

So my simplified scenario:

I try to cast an Explorer.Series row from a string to a Time type.

df = Explorer.DataFrame.from_map(%{a: ["00:30:00", "01:00:00", "05:30:00"]})
    
transform_duration = fn duration ->
  time = Time.from_iso8601!(duration)
  time.hour + (time.minute / 60) + (time.second / 60*60)
end

DataFrame.mutate(df, a: Series.transform(s, fn x -> transform_duration.(x) end))

The DataFrame.mutate function won't allow me to transform to a different type. Should I combine this Series.transform with a Series.cast?

Maybe my whole approach ain't good? It ain't working in any way I'v tried to grasp the idea.

Handle randomness

Some functions depend on random number generators (e.g. random samples). Need to come up with a unified way of setting seeds in Elixir and Rust.

Would it make sense to introduce a DataFrame.to_tensor?

We have a Series.to_tensor method, however, for many ML tasks, we will want more than one column.

Would it make sense to add one for DataFrames?

Right now I am doing the following as a quick hack:

def df_to_tensor(df) do
    df
    |> DataFrame.names()
    |> Enum.map(fn name ->
      df[name]
      |> Series.to_tensor()
      |> Nx.reshape({:auto, 1})
    end)
    |> Nx.concatenate(axis: 1)
  end

x =
  DataFrame.select(df, ["Life", "Country"], :drop)
  |> df_to_tensor()

A sufficient implementation would at minimum include the following:

Control over which columns are selected and in what order
Control over the final data type (not sure if this would require a single type option to be passed to the inner Nx.tensor)

Beyond that, I can't think of any other changes that could not be trivially performed in Explorer before conversion or Nx after.

Why not do this?

Tensor options become non-obvious. Right now, it is not clear how to handle names. What happens if more options are added to Nx.tensor?

Proposal

If this is a good idea:

  @doc """
  Converts a dataframe to a `t:Nx.Tensor.t/0`.

  Can also convert a subset of columns by name

  ## Supported dtypes

    * `:float`
    * `:integer`

  ## Examples

      iex> df = Explorer.DataFrame.from_map(%{floats: [1.0, 2.0], ints: [1, 2]})
      #Explorer.DataFrame<
        [rows: 2, columns: 2]
        floats float [1.0, 2.0]
        ints integer [1, 2]
      >
      iex> Explorer.DataFrame.to_tensor(df)
      #Nx.Tensor<
        f32[2][2]
        [
          [1.0, 1.0],
          [2.0, 2.0]
        ]
      >
  """
  def to_tensor(%DataFrame{} = df, column_names) do
    column_names
    |> Enum.map(fn name ->
      df[name]
      |> Series.to_tensor()
      |> Nx.reshape({:auto, 1})
    end)
    |> Nx.concatenate(axis: 1)
  end

Support `Series.fill_missing/2` also accepting a fixed scalar fill value

I've got a DataFrame loaded with a pile of nils and would like to be able set those to a fixed value per the Series.fill_missing/2 function:

df["X10"] |> Series.fill_missing(0.0)

Currently that function only accepts atoms referencing internal algorithms and raises an exception when passed a scalar. Let me know if you need anything else or if I can help in any way.

Thanks!

Create DataFrame from Series

Currently, the only ways to create a DataFrame is from a map or by reading from a csv file. It could be helpful to create a DataFrame from a Series (or a list of Series), while also providing column names.

Similarly, there's no way to add a Series with the correct length to a DataFrame.

Thanks for considering!

Add tabular print method

Useful influences:
https://github.com/alexhallam/tv
https://pillar.r-lib.org/dev/index.html

Provide Column Name in Mismatched Type Error Message

When DataFrame.from_map throws an error when there is a mismatched type, the error message does not give the column name where the error was found. This is especially needed when the data is large and not easily printed. For example this does not tell me which column is throwing the error:

Explorer.DataFrame.from_map(%{a: [1, 1.0], b: ["a", "b"]})
> * (ArgumentError) Cannot make a series from mismatched types. Type of 1.0 does not match inferred dtype integer.

compilation succeeded, but the on_load function failed on my macbook pro (Big Sur) with apple M1 chip

How can I solve this issue?
Error messages printed on the console where livebook is started:

10:03:43.111 [warn] The on_load function for module Elixir.Explorer.PolarsBackend.Native returned:
{:error,
{:load_failed,
'Failed to load NIF library: 'dlopen(/Users/zhangzh/Library/Caches/mix/installs/elixir-1.12.3-erts-12.1/c8c0b2eb20455bd67cc55fc3c3acd0de/_build/prod/lib/explorer/priv/native/libexplorer.so, 2): no suitable image found. Did find:\n\t/Users/zhangzh/Library/Caches/mix/installs/elixir-1.12.3-erts-12.1/c8c0b2eb20455bd67cc55fc3c3acd0de/_build/prod/lib/explorer/priv/native/libexplorer.so: mach-o, but wrong architecture\n\t/Users/zhangzh/Library/Caches/mix/installs/elixir-1.12.3-erts-12.1/c8c0b2eb20455bd67cc55fc3c3acd0de/_build/prod/lib/explorer/priv/native/libexplorer.so: stat() failed with errno=35''}}

=========

➜  ~ file /Users/zhangzh/test/my_app/_build/dev/lib/explorer/priv/native/libexplorer.so
/Users/zhangzh/test/my_app/_build/dev/lib/explorer/priv/native/libexplorer.so: Mach-O 64-bit dynamically linked shared library x86_64

elixir: 1.12.3-otp-24
erlang: 24.1
rustc: 1.56.1 (59eed8a2a 2021-11-01)

Improve type casting when creating a Series or DataFrame

These 2 examples currently throw an error:

Explorer.Series.from_list([1, 2.0])
Explorer.Series.from_list([nil, nil])

Would it be possible for a mix of integers and floats down cast to floats, similar to how it is done when adding a two series:

s = Series.from_list([1, 2, 3])
s1 = Series.from_list([1.0, 2.0, 3.0])
Series.add(s, s1)

For the list of nils, would it be possible to be casted to a default type. In R, a list of NA is casted to a boolean type tibble::tibble(a = c(NA))

Thanks!

inconsistency when running Explorer.DataFrame.pivot_wider/4

columns "a" and "b" change order when executing:

iex(1)>  df = Explorer.DataFrame.from_map(%{id: [1, 1], variable: ["a", "b"], value: [1, 2]})
#Explorer.DataFrame<
  [rows: 2, columns: 3]
  id integer [1, 1]
  value integer [1, 2]
  variable string ["a", "b"]
>
iex(2)> Explorer.DataFrame.pivot_wider(df, "variable", "value")  
#Explorer.DataFrame<
  [rows: 1, columns: 3]
  id integer [1]
  a integer [1]
  b integer [2]
>
iex(3)> Explorer.DataFrame.pivot_wider(df, "variable", "value")
#Explorer.DataFrame<
  [rows: 1, columns: 3]
  id integer [1]
  b integer [2]
  a integer [1]
>
iex(4)> Explorer.DataFrame.pivot_wider(df, "variable", "value")
#Explorer.DataFrame<
  [rows: 1, columns: 3]
  id integer [1]
  b integer [2]
  a integer [1]
>
iex(5)> Explorer.DataFrame.pivot_wider(df, "variable", "value")
#Explorer.DataFrame<
  [rows: 1, columns: 3]
  id integer [1]
  a integer [1]
  b integer [2]
>

Compilation error in file lib/mix/tasks/rustler.new.e

Elixir 1.13
Erl. 24
Windows 10

Mix deps.get succesded, but running mix phx.serverfails with the following error.

==> rustler

Compiling 7 files (.ex)

== Compilation error in file lib/mix/tasks/rustler.new.ex ==

** (File.Error) could not read file "r:/1.PY/Livebook/explore_df/_build/dev/lib/rustler/priv/templates/basic/.cargo/config": I/O error

(elixir 1.13.1) lib/file.ex:355: File.read!/1

lib/mix/tasks/rustler.new.ex:29: anonymous fn/3 in :elixir_compiler_9.__MODULE__/1

(elixir 1.13.1) lib/enum.ex:2396: Enum."-reduce/3-lists^foldl/2-0-"/3

lib/mix/tasks/rustler.new.ex:26: (module)

could not compile dependency :rustler, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile rustler", update it with "mix deps.update rustler" or clean it with "mix deps.clean rustler"

Settle on strings vs. atoms for column names

One more thing that came up, most noticeable in the dtypes and names test is the inconsistency of referring to columns sometimes as atoms and sometimes as strings. with_columns and names both takes the column names as strings, while dtypes expects a keyword list where they're atoms. Should I add another commit to settle on one or the other? I could also make either work, but that gives me flashbacks to Rails where controller params keys could be either strings or atoms, which caused a lot of confusion, and which Phoenix rightly (IMO) fixed by standardizing on just strings.

On this, I think the entire API is a bit split on this. I've tried to make both work in many cases where I felt it was more ergonomic, but I'm feeling the same as you that they should be standardised. I think the obvious way to do that is to settle on strings. Let's leave this the way it is right now and we can revisit this in a broader context.

Originally posted by @cigrainger in #48 (comment)

read_csv does not support gziped csv's

It is common for large csv's to be compressed. Polars supports compressed csv's and autodecompresses them. Feature request maybe or maybe it was just turn off by mistake?

Implement Kino.Render protocol

create data frame from records

We are trying out explorer in our application to do some last-mile data shaping.

We pull records from the db and it would be nice if DataFrame had a function to ingest those.

Currently doing this:

data = [%{id: 1, name: "John"}, %{id: 2, name: "Jane"}]

df =
  data
  |> Enum.zip_with(
    fn [{key, _value} | _rest] = zipped_column_enum -> 
      %{key => Enum.map(zipped_column_enum, fn {_key, value} -> value end)} 
    end
  )
  |> Enum.reduce(&Map.merge/2)
  |> Explorer.DataFrame.from_map()

from_map -> from_series

Somewhat unrelated: another reason to call this function from_series is that we could receive any enumerable as a series. This could be beneficial because keywords preserve key ordering but maps do not.

Originally posted by @josevalim in #130 (comment)

Proposal: Dynamically generate dataset functions at compile time

Right now, in order to add a dataset, one needs to:

Add the data in datasets
Add a function for that dataset in the Explorer.Datasets module.

The Problem with this is that we are essentially writing the same code repeatedly:

def fossil_fuels,
    do: @datasets_dir |> Path.join("fossil_fuels.csv") |> DataFrame.read_csv!()

def wine,
    do: @datasets_dir |> Path.join("wine.csv") |> DataFrame.read_csv!()

In essence:

def unquote(dataset) do
  @datasets_dir
  |> Path.join(unquote(dataset) <> ".csv")
  |> Dataframe.read_csv!()
end

Note: I'm pretty new to macros, and I haven't tested this code..

I feel that this is unnecessary repetition which can easily be automated.

I propose that we dynamically generate dataset functions at compile time.

To be more specific, I would automate it like this:

Place the data in datasets/:name/data.csv.
Place the docs in datasets/:name/docs.md.

Iterate over all dirs in datasets and write the above function.

Reading "partitioned data"

Hey all,

I'm looking into using this project, and I don't know that there is an API for reading "partitioned data" as defined in the Apache Arrow documentation (source: https://arrow.apache.org/docs/python/dataset.html#reading-partitioned-data).

Both the python and nodejs libraries appear to implement this pattern.

Clarification / Question: Dataframe.filter AND/OR mask?

I see in the docs that one can filter using a mask:

df = Explorer.DataFrame.from_map(%{a: ["a", "b", "c"], b: [1, 2, 3]})
Explorer.DataFrame.filter(df, Explorer.Series.greater(df["b"], 1))

#Explorer.DataFrame<
[rows: 2, columns: 2]
a string ["b", "c"]
b integer [2, 3]
>

How does one create an AND/OR mask? I could compute them manually and use a list, but is there a better way?
Sort of what I'm looking for:

df = Explorer.DataFrame.from_map(%{a: ["a", "b", "c"], b: [1, 2, 3]})
mask = Explorer.Series.or(Explorer.Series.equal(df["b"], 1), Explorer.Series.equal(df["b"], 3))

Explorer.DataFrame.filter(df, mask)

#Explorer.DataFrame<
[rows: 2, columns: 2]
a string ["b", "c"]
b integer [1, 3]
>

caching rust deps locally

Not sure if this is possible, but could you suggest a way to keep from having to recompile polars everytime we update to the latest commit here?

Explorer.Series.pow does not work when supplied integers as exponent

Explorer.Series.pow returns an error when the user supplies an integer as exponent. It only works when exponent is a float, and I think because of that all values in the Series are casted to floating point numbers, which is not a desirable behaviour.

When exponent is a integer:

iex> s1 = [8, 16, 32] |> Explorer.Series.from_list()
#Explorer.Series<
  integer[3] 
  [8, 16, 32]
>
iex> Explorer.Series.pow(s1, 2)
** (ArgumentError) argument error
    (explorer 0.1.0-dev) Explorer.PolarsBackend.Native.s_pow(shape: (3,)
Series: '' [i64]
[
        8
        16
        32
], 1)
    (explorer 0.1.0-dev) lib/explorer/polars_backend/shared.ex:14: Explorer.PolarsBackend.Shared.apply_native/3

When exponent is a float:

iex> s1 = [8, 16, 32] |> Explorer.Series.from_list()
iex> Explorer.Series.pow(s1, 2.0)
#Explorer.Series<
  float[3]
  [64.0, 256.0, 1024.0]
>

CI for rust code?

It is a good practice to run:

cargo fmt --check
cargo clippy
on CI runs.

One more thing, Rust CI should run only when Rust code is changed.

I that's fine - I can send a PR.

Improvements to the Inspect protocol implementation

As discussed in the WG meeting, this is a suggestion to use the built-in Elixir functions for doc algebra to build the inspect representation

Enable joining on columns with different names

Add more datasets

We'd love to have more datasets available. Toy datasets in explorer itself are very welcome. And if there are ideas for linking up to and consuming datasets from https://github.com/elixir-nx/scidata that's fantastic too.

Integrating geospatial data like geopandas does

In python there is the geopandas library that allows working with geospatial data within a dataframe (geodataframe) and series (geoseries), there is some possibility of integrating them in this project like in that library?

Feature request: Capture Stream from write_parquet

Context: Be able to write stream to blob instead of having to roundtrip to disk, to allow for scenarios where you want fewer and bigger files

(Somewhat) silly question: what is a 'dataframe'?

TLDR

I'm just curious 🙂

Too long to read

I think I know what 'dataframe' means. I've been thinking of writing my own 'datacube' Elixir library, basically an 'in-memory database table'.

But I'm still not exactly sure what 'dataframe' means and some other people are similarly (somewhat) confused; see:

Explorer - Dataframes for Elixir : elixir

What's a little annoying/frustrating (in a very specific kind of way that these things sometimes are nowadays) is that doing a web (Google) search for "dataframe" or "dataframe definition" or "dataframe wikipedia" all return results about specific dataframe products/projects and all of them seem to assume you already know what a 'dataframe' is.

Here's an example result that's not quite entirely satisfying:

What are DataFrames? - Databricks

It kinda seems like 'dataframe' is so abstract that regular RDMS DB tables themselves are, in a sense, 'dataframes'. But that doesn't seem quite right to me!

The docs are a little helpful:

The Explorer library is a set of functions and data types to work with tabular data in Elixir.

In what sense is Ecto also (or not) a 'dataframe' library for Elixir?

Install process fails with timeout in `livebook`

This code (inside livebook)

Mix.install([{:explorer, "~> 0.1.0-dev", github: "amplifiedai/explorer"}])

Fails with:

* Updating explorer (https://github.com/amplifiedai/explorer.git)
origin/HEAD set to main
* Updating nx (https://github.com/elixir-nx/nx.git - origin/main)

** (exit) exited in: GenServer.call(Hex.Registry.Server, {:versions, "hexpm", "rustler"}, 60000)
    ** (EXIT) time out
    (elixir 1.12.2) lib/gen_server.ex:1024: GenServer.call/3
    (hex 0.20.6) lib/hex/remote_converger.ex:182: Hex.RemoteConverger.verify_package_req/4
    (elixir 1.12.2) lib/enum.ex:930: Enum."-each/2-lists^foreach/1-0-"/2
    (hex 0.20.6) lib/hex/remote_converger.ex:172: Hex.RemoteConverger.verify_input/2
    (hex 0.20.6) lib/hex/remote_converger.ex:42: Hex.RemoteConverger.converge/2
    (mix 1.12.2) lib/mix/dep/converger.ex:95: Mix.Dep.Converger.all/4
    (mix 1.12.2) lib/mix/dep/converger.ex:51: Mix.Dep.Converger.converge/4
    (mix 1.12.2) lib/mix/dep/fetcher.ex:16: Mix.Dep.Fetcher.all/3

`DataFrame.read_parquet` and `DataFrame.write_parquet`

Docs for Polars are here: reading and writing.

I've held back on implementing these as almost certainly a pure elixir backend won't be able to use it. I do think it's worth getting in though. Not sure what the best practice is here: just raising that it's not supported for a given backend?

Implement lazy by default

In a functional language with immutable data, memory management is important. The current implementation utilises polars's eager mode and computes new dataframes for every function. Because the dataframes are represented as a ResourceArc, they are only dropped from memory when the GC runs. This can be pretty heavy on memory, to say the least. The most efficient approach would be to treat dataframes as lazy by default with 'peeking' for inspect. In R, for example, function arguments are only evaluated when they are needed to show output.

An additional benefit to lazy by default is the opportunity to optimise queries. Why evaluate every function call when you can build up a query that may be executed in a more efficient way all together?

Polars has polars_lazy which permits exactly this. Making this shift will then permit the use of lazy evaluation for other backends -- esp. Datafusion/Ballista and Ecto.

For Explorer, we'll need to do a bit of exploration (pun absolutely intended) for how we can achieve this while maintaining the flexibility of pluggable backends. And when looking to a pure Elixir backend we should consider whether it's unnecessarily onerous compared to the benefits.

I'd really love ideas and feedback for making Explorer lazy by default. Is there a good peeking mechanism in other libraries? For example, something I'm going to be exploring is how tibbles in R minimise computation for print and head.

Read and write JSON

Should be pretty straightforward. https://pola-rs.github.io/polars/polars_io/json/index.html

Implement `Series.concatenate`

Add `group_by` and `summarise`

Add `pivot_wider` and `pivot_longer`

Corrupting data when converting integers after DataFrame.to_map()

While I was using some functions I saw a weird behaviour when converting a DataFrame to a map.
In the fossil_fuels dataset, the bunker_fuels column value that originally was the integer 9 became a '\t'.

iex(1)> df = Explorer.Datasets.fossil_fuels()
iex(2)> df  |> Explorer.DataFrame.slice(0, 1) |> Explorer.DataFrame.to_map()
%{          
  bunker_fuels: '\t',
  cement: [5],
  country: ["AFGHANISTAN"],
  gas_flaring: [0],
  gas_fuel: 'J',
  liquid_fuel: [1601],
  per_capita: [0.08],
  solid_fuel: [627],
  total: [2308],
  year: [2010]
}

And it gets worse, because it is not just a printing problem, if I try to parse the map to a list using Map.to_list(), IEx returns a list with the same '\t' character

iex(3)> df  |> Explorer.DataFrame.slice(0, 1) |> Explorer.DataFrame.to_map() |> Map.to_list()
[
  bunker_fuels: '\t',
  cement: [5],
  country: ["AFGHANISTAN"],
  gas_flaring: [0],
  gas_fuel: 'J',
  liquid_fuel: [1601],
  per_capita: [0.08],
  solid_fuel: [627],
  total: [2308],
  year: [2010]
]

Align comparison function names with `Nx`

Functions like Series.neq/2 should be Series.not_equal/2 to align with Nx.not_equal/2.

Additional backends

Explorer is primarily an API. The idea for pluggable backends was shamelessly stolen from Nx and dplyr. With Rustler precompiled, we can depend in polars but we want additional ones in the future.

So with that said, these are the backends that I think make the most sense to implement. I'm curious to hear if there are others that might make sense. For example, I've mentally written off Spark as being too difficult because I'm unfamiliar with Elixir <> JVM interop, but I'd love to hear if someone has a strategy.

LazyPolars
DataFusion/Ballista
Ecto/SQL

What about something like DuckDB? Does DataFusion have us covered for OLAP?

Implement "vstack"

Right now it doesn't seem there is a way to concatenate DataFrames. There is a function for this: https://docs.rs/polars/0.14.2/polars/frame/struct.DataFrame.html#method.vstack

elixir-explorer / explorer Goto Github PK

explorer's People

Contributors

Stargazers

Watchers

Forkers

explorer's Issues

Why not do this?

Proposal

TLDR

Too long to read

Recommend Projects

Recommend Topics

Recommend Org