The dataframes from enso-org

Add `Std` as an `AggregateFunction`

We need to be able to calculate the standard deviation in rolling-window operations and similar aggregations.

Summary

This provides a set of simple statistical functions, useful in exploratory data analysis.

Value

Fundamental for providing a streamlined flow for EDA.

Specification

This comprises several different functionalities, listed below:

Column.countValues – should return a 2 columns by n rows frame, where n is the number of different values in the column. Each row should be of the shape (value, count), where count is the number of occurrences of value in the column.
Column.countMissing – return the count of missing values in the column.
Table.{min, max, median, mean, std, var, sum, quantile n} – should be self explanatory. Consult documentation of pandas when in doubt. We are not accepting any special options here, just implement the simplest versions of each.
Table.correlations – returns an ncols x ncols Table with pairwise correlation coefficients between columns. Consult pandas's corr documentation for details.

The performance of each operation should be comparable (within 50% margin) to its pandas counterpart.
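
A minimal Python sketch of the intended semantics of countValues and countMissing, for comparison against pandas output. It assumes a column is a plain list with None for missing values, and that countValues skips missing values (as pandas' value_counts does by default); the function names are illustrative only, not the library API.

```python
from collections import Counter


def count_values(column):
    """Return (value, count) rows for the non-missing values of a column."""
    counts = Counter(v for v in column if v is not None)
    # most_common() orders the rows by descending count.
    return counts.most_common()


def count_missing(column):
    """Return the number of missing (None) entries in a column."""
    return sum(1 for v in column if v is None)


if __name__ == "__main__":
    col = ["a", "b", "a", None]
    print(count_values(col))
    print(count_missing(col))
```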

Acceptance Criteria & Test Cases

Tested manually, by comparing to relevant pandas output on the same data source.

shortRep method for all types

As reported:
2) [minor but annoying] most of the DataFrame-related types do not have shortReps (Histogram, Column, etc)

LQuery support for chunked arrays

General Summary

Currently the LQuery interpreter requires columns to be built from a single array. Chunked arrays don't arise "naturally", but a user can create them manually, or they can be the result of some operations (namely the recently added shift method).

Test plot visualizations on Windows (and fix if needed)

So far no one has seen working plots on Windows. Part of the issue is that there was no available Luna Studio build with proper Dataframes support.

Dataframes Tutorial

Pretty self-explanatory. Ideally, we should include sections for people transitioning from other platforms, like:

pandas
R
Alteryx (?)

Investigate what's up with matplotlib-cpp libpython loading on Linux

We have this workaround: https://github.com/luna/Dataframes/blob/master/native_libs/third-party/matplotlib-cpp/matplotlibcpp.h#L129

It is abominable. We should investigate to understand why it solves the problem on Linux. If possible, it should be removed and replaced with a better fix. Otherwise, we should at least document it and try to make it more future-proof (what about Python 3.7?).

Dataframes documentation

Illegal instruction when saving CSV on Mac

I got this reported twice through Discord, once from @sylwiabr and once from @kustosz.
When writing a CSV file, an error mentioning an illegal instruction occurs:

[SUCCESS] column 2: [3, 6, 9, 12] == [3, 6, 9, 12]
Generate case 1
zsh: illegal hardware instruction  LUNA_LIBS_PATH=/Users/marcinkostrzewa/code/luna-core/stdlib  run --target

or

Running in interpreted mode.
Illegal instruction: 4

Apparently it is enough to run the Dataframes Luna tests to reproduce it.
The issue was observed only on Mac.
As I have no Mac (and I can't imagine a VM compiling Luna), I'd like to ask for help in diagnosing the issue:

crashdump
CPU on which it happened
disassembly around the crashing instruction

Support for custom operations on rolling windows

Right now we are restricted to a set of hard-coded operations. Ideally, we would provide some kind of an apply function that allows us to pass any function (or rather: any LQuery expression).

Strange box (empty legend?) in the plot

Consider the following:

import Std.Base
import Std.Foreign.C.Value

import Dataframes.Internal.Utils
import Dataframes.Array
import Dataframes.Column
import Dataframes.Table
import Dataframes.Types
import Dataframes.Internal.Test.Test


def main:
    xCol = Column.fromList "x" Int64Type [1,2,3,4,5]
    yCol = Column.fromList "y" Int64Type [1,2,3,4,5]
    table1 = Table.fromColumns [xCol, yCol]
    chart1 = table1.chart
    plot1 = chart1.plot "x" "y"
    None

In the chart, a small box can be seen next to 5.0 on the Y axis. This is likely an empty legend box.
When there is no legend, there should be no legend box.

Overhaul of the Time-related types

This is in context of the time series:

TimeInterval for expressing difference between two timestamps
some kind of correspondence between Luna's Std.Time and the timestamps we use for time series.

CSV parser: trailing newline should be ignored

Currently a trailing newline creates an invalid record with one empty field. This is not the desired behavior: RFC 4180 explicitly allows a trailing CRLF after the last record.
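
The expected behavior can be sketched in Python as follows. This is only an illustration of the rule, not the C++ parser: it ignores quoting entirely, and the name split_records is made up for the example. It also treats CRLF, bare CR and LF alike, so the same input parses identically on every platform.

```python
def split_records(text):
    """Split CSV text into records of fields (no quoting support)."""
    # Treat CRLF, bare CR and LF uniformly as record terminators.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    # A terminator after the last record leaves one empty trailing piece;
    # drop it instead of reporting a record with a single empty field.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.split(",") for line in lines]


if __name__ == "__main__":
    print(split_records("a,b\r\n1,2\r\n"))
```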

Add docstrings to the dataframes functions

There should be documentation available in the searcher for all the functionality available in the library. Descriptions in the docstrings are enough to achieve that.

Rolling window operations for DataFrames

Pandas provides a great API for rolling window computations.

Ideally, we would have two kinds of windows:

constant time interval window for time series
constant number of samples window for the general case.
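
For reference, a constant-number-of-samples window can be sketched in plain Python as below. The function name rolling and the choice to emit None until the window is full are assumptions for illustration (the latter mirrors pandas' default behavior for fixed-size windows). Passing the aggregate as a parameter also shows how std or custom operations would fit in.

```python
from statistics import mean


def rolling(values, window, func=mean):
    """Apply func to each trailing window of `window` samples.

    Positions that have fewer than `window` samples available yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(func(values[i + 1 - window : i + 1]))
    return out


if __name__ == "__main__":
    print(rolling([1, 2, 3, 4], 2))
    print(rolling([1, 2, 3, 4], 3, func=max))
```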

Join for Dataframes

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html for reference.

reading/writing files

For reading/writing files we should have:

CSV.read and CSV.write functions
general method which will check file extension: Table.read and Table.write
If the extension is txt or there is none it should try to read it with all available methods.
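
A rough Python sketch of the extension-based dispatch described above. The readers mapping and the name pick_readers are hypothetical, and raising an error for an unknown extension is an assumption, since the issue does not say what should happen in that case.

```python
import os


def pick_readers(path, readers):
    """Return the reader functions to try for `path`, in order.

    `readers` maps a lowercase extension (without the dot) to a reader.
    """
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext in ("", "txt"):
        # No usable hint in the name: try every available format.
        return list(readers.values())
    if ext in readers:
        return [readers[ext]]
    raise ValueError("unsupported file extension: " + ext)


if __name__ == "__main__":
    readers = {"csv": "csv-reader", "xlsx": "xlsx-reader"}
    print(pick_readers("data/foo.csv", readers))
    print(pick_readers("data/foo.txt", readers))
```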

Add type information to column header

To obtain the column type name, the following can be used:
column.field.type.toText

Instead of `.chart.plot` we should have `.plot`

Histogram does not display properly on linux while working perfectly on MacOS

download and build Dataframes project
checkout on commit 3d35ac29117a4551fd2281db240ba88a60f2984e
on linux we have:

while on MacOS histogram is generated without any problems.
The dataset used for this issue is the Kaggle Two Sigma dataset.

Review matplotlib-cpp and related code

The matplotlib-cpp code and our code in Plot.cpp / Learn.cpp do not meet our standards in some regards (conventions, error handling, polluting standard output and so on).

The code should be checked, actionable issues should be identified and either fixed or written down to our backlog.

Revise the current API

Some points that have been raised:

max of a column is not a number, but a column (I understand why that is so, but it is not nice to work with)
The map over a column works in a way that gets me totally lost. I need to specify the type of the column: but I don't know it in the first place. That's a big win for pandas, where I can map like I know it.

Basic rolling window on timeseries

A rolling window defined by a time interval, allowing the following functions to be calculated on columns:

mean
median
std
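
A minimal Python sketch of a window defined by a time interval. It assumes timestamps are sorted ascending and that the window for sample i covers the half-open range (t_i - interval, t_i]; that convention is an assumption (it matches pandas' default for time-based windows), as is the name rolling_by_time.

```python
from statistics import mean


def rolling_by_time(timestamps, values, interval, func=mean):
    """For each sample i, apply func to the samples whose timestamp t
    satisfies t_i - interval < t <= t_i.

    timestamps must be sorted ascending; units are up to the caller.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    out = []
    start = 0  # index of the first sample still inside the window
    for i, t in enumerate(timestamps):
        while timestamps[start] <= t - interval:
            start += 1
        out.append(func(values[start : i + 1]))
    return out


if __name__ == "__main__":
    print(rolling_by_time([0, 1, 2, 10], [1, 2, 3, 4], 2))
```

median works the same way via statistics.median; statistics.stdev needs at least two samples per window, so a real implementation would have to decide what to return for single-sample windows.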

Support for timestamp type

Add support for timestamp column field type. Timestamp shall be internally treated as int64_t with nanoseconds count since epoch. In future it is desired to allow other Arrow-specified units, but we ignore this for now.
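
To make the representation concrete, here is a small Python sketch converting between a timezone-aware datetime and an integer nanosecond count since the Unix epoch. The helper names are illustrative only; Python datetimes have microsecond resolution, so the round trip drops sub-microsecond digits.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_nanos(dt):
    """Nanoseconds since the epoch for a timezone-aware datetime."""
    delta = dt - EPOCH
    # Integer arithmetic only, to avoid float rounding on large values.
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000


def from_nanos(ns):
    """Inverse of to_nanos, truncated to microsecond resolution."""
    return EPOCH + timedelta(microseconds=ns // 1_000)


if __name__ == "__main__":
    moment = datetime(2018, 9, 1, 12, 0, tzinfo=timezone.utc)
    print(to_nanos(moment))
    print(from_nanos(to_nanos(moment)))
```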

List of things to make sure that work:

… more to come

`interpolate` method for Table and Column

General Summary

Add interpolate method that will fill missing values:

with linearly interpolated value when there are available values before and after nulls
with first/last non-null value for leading/trailing nulls

If a column does not contain any valid values, then interpolate() does nothing.
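
The rules above can be sketched in Python as follows. This assumes a column is a list with None for nulls and that samples are evenly spaced, so interpolation is done by position; both are assumptions for the example rather than a statement about the actual implementation.

```python
def interpolate(values):
    """Return a copy of `values` with None entries filled in."""
    valid = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    if not valid:
        return out  # no valid values at all: nothing to do
    first, last = valid[0], valid[-1]
    # Leading and trailing nulls take the nearest valid value.
    for i in range(first):
        out[i] = values[first]
    for i in range(last + 1, len(values)):
        out[i] = values[last]
    # Interior nulls: linear interpolation between neighbouring valid values.
    for lo, hi in zip(valid, valid[1:]):
        step = (values[hi] - values[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = values[lo] + step * (i - lo)
    return out


if __name__ == "__main__":
    print(interpolate([None, 1, None, None, 4, None]))
```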

`sort` method for Dataframe

General Summary

We need to allow the user to sort the dataframe by the values in a column.

Motivation

It is a basic feature for a data library

Specification

Table.sort colNames ascending naPosition – sorts the values in a dataframe and returns a new, sorted dataframe.
colNames: a list of column names to sort by.
ascending: whether to sort ascending or descending. This is a list of bools, one per sort column, so its length must match the length of colNames.
naPosition: {‘first’, ‘last’}, default ‘last’. first puts NaNs at the beginning, last puts NaNs at the end.
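
A Python sketch of these semantics, useful as a reference when comparing against pandas. It assumes rows are dicts and missing values are None; sort_rows is a made-up name. As in pandas, missing values are placed according to naPosition regardless of the sort direction.

```python
from functools import cmp_to_key


def sort_rows(rows, col_names, ascending, na_position="last"):
    """Return a new list of rows sorted by several columns."""
    if len(ascending) != len(col_names):
        raise ValueError("ascending must have one entry per column name")
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")

    def compare(a, b):
        for name, asc in zip(col_names, ascending):
            x, y = a[name], b[name]
            if x is None and y is None:
                continue
            if x is None or y is None:
                # Missing values ignore the ascending flag.
                x_first = (x is None) == (na_position == "first")
                return -1 if x_first else 1
            if x != y:
                return -1 if (x < y) == asc else 1
        return 0  # equal on all keys; sorted() keeps the original order

    return sorted(rows, key=cmp_to_key(compare))


if __name__ == "__main__":
    rows = [{"a": 2, "b": 1}, {"a": None, "b": 2}, {"a": 1, "b": 3}]
    print(sort_rows(rows, ["a"], [True]))
```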

Acceptance Criteria & Test Cases

Will be tested manually on large dataframes, by comparing the outputs with equivalent pandas operations.

add `index` column to Table

Method for Table class which adds the extra column with index numbers

Column generators

One example: creating a column with a repeated constant value.

Add new type of plot: violinPlot

Connect a violin plot from Seaborn to Luna's Dataframes library: https://seaborn.pydata.org/generated/seaborn.violinplot.html

Seaborn is a plotting library already connected with Dataframes. The violin plot should be connected just like kde plots.
The specific modification methods should be available for violin plots:

setLabel label
setColor color
setInner
split
setPalette
setLinewidth
setOrientation

The examples and implementation details will be provided when the issue is picked up.

Is C++ required?

Since Luna can do pointer arithmetic and should be pretty fast on its own, why not implement the library in pure Luna, without using C?

Better `shortRep` for `Table` type

The short representation for Table should be just

Table rows x cols

Matplotlib integration

Summary

We should provide a matplotlib binding for the most common classes of plots.

Value

This allows users to create a variety of plots useful in EDA.

Specification

Bind several plots from matplotlib and seaborn:

Scatter
Histogram
Heat matrix
Several distribution plots from seaborn (KDE)

Each of these should be wrapped in a Luna–flavored (immutable) API and provide the most common configuration options. They should all work with Luna Studio and also be exportable to image files.
We should also provide an API for specifying axis/plot names and for combining plots, both by stacking them and by laying them out in a grid.

Acceptance Criteria & Test Cases

Tested by eyeballing the visualizations.

Column type deduction for XLSX files reading

Handling missing values

Summary

We need to allow users to handle (filter/fill) missing values in a dataframe.

Value

It's a fundamental feature for any data library.

Specification

We need a 2x2 matrix of functions for filling/dropping NAs per the whole table or a single column:

Table.dropNa – removes all rows where any of the values is missing.
Table.fillNa x – replaces all NA occurrences in the table with x.
Table.dropNaAt columnName – removes all rows where value in the given column is missing.
Table.fillNaAt colName x – changes all NAs to x inside colName column.

Performance should be comparable to mapping/filtering a simple predicate over the table.
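
The 2x2 matrix can be summarized with a short Python sketch. Rows are assumed to be dicts with None marking a missing value, and the optional column argument stands in for the separate dropNaAt / fillNaAt variants; the names are illustrative only.

```python
def drop_na(rows, column=None):
    """Drop rows with a missing value anywhere, or only in `column` if given."""
    if column is None:
        return [r for r in rows if all(v is not None for v in r.values())]
    return [r for r in rows if r[column] is not None]


def fill_na(rows, x, column=None):
    """Replace missing values with x, in every column or only in `column`."""
    return [
        {k: (x if v is None and column in (None, k) else v) for k, v in r.items()}
        for r in rows
    ]


if __name__ == "__main__":
    rows = [{"a": 1, "b": None}, {"a": None, "b": 2}, {"a": 3, "b": 4}]
    print(drop_na(rows))
    print(fill_na(rows, 0, column="a"))
```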

Acceptance Criteria & Test Cases

Will be tested manually on large dataframes, by comparing the outputs with equivalent pandas operations.

Plotting int(int) chart gives floating axis labels on Y axis

Consider the following:

import Std.Base
import Std.Foreign.C.Value

import Dataframes.Internal.Utils
import Dataframes.Array
import Dataframes.Column
import Dataframes.Table
import Dataframes.Types
import Dataframes.Internal.Test.Test


def main:
    xCol = Column.fromList "x" Int64Type [1,2,3,4,5]
    yCol = Column.fromList "y" Int64Type [1,2,3,4,5]
    table1 = Table.fromColumns [xCol, yCol]
    chart1 = table1.chart
    plot1 = chart1.plot "x" "y"
    None

The Y axis is labeled using floating format (i.e. 1.0 instead of 1). The X axis is for some reason fine.

According to @sylwiabr this does not happen on Linux. Mac remains to be checked.

A true groupBy operation for Dataframes

@mwu-tow knows the details -- this task is just to keep track of the progress.

Short explanation: right now we can only do a groupBy operation followed by some aggregate. This task aims to expose a standalone groupBy functionality.
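
To illustrate the difference, here is a small Python sketch in which grouping is a separate step that returns the groups themselves, and aggregation is applied afterwards. The data layout (rows as dicts) and the name group_by are assumptions for the example, not the planned API.

```python
def group_by(rows, key):
    """Split rows into groups keyed by the value of column `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


if __name__ == "__main__":
    rows = [
        {"city": "A", "sales": 1},
        {"city": "B", "sales": 5},
        {"city": "A", "sales": 3},
    ]
    groups = group_by(rows, "city")
    # The groups can now be inspected, or aggregated as a separate step.
    totals = {k: sum(r["sales"] for r in g) for k, g in groups.items()}
    print(totals)
```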

Building C++ parts of library should not be unoptimized by default

Currently, running cmake without any additional arguments yields an unoptimized build.
As the library relies heavily on compiler optimization to achieve sensible performance, it should either be optimized by default or require the user to specify the build type.

Make sure that new IO works fine with non-ascii paths on windows

Handle non-native line endings in csv parser

General Summary

Reading CSV with Windows-style line endings (CRLF) does not work properly on Mac/Linux, as CR-LF is not properly handled on non-Windows platforms.

Code Formatting

Can we standardise the C++ to four spaces please? ^^

Rolling window operations for DataFrames - constant number of samples window for the general case.

This is the second part of issue #35, which was only partially implemented.

Filtering and mapping facilities using a DSL

Summary

This concerns an implementation of each and filter functions on a Table.

Value

These are the basic functions for querying the frame and data exploration.

Specification

The API–level description is provided in this Gist: https://gist.github.com/kustosz/49e1c588de4c1513cf91b18dd6342c15

This library should use the specified JSON format for exchanging queries between Luna and the C++ engine.

Simple operations such as:

df.each v: v.at "NUM_INSTALMENT_VERSION" * 2 + 4

df.filter v: v.at "NUM_INSTALMENT_VERSION" > 2

df.filter v: v.at "NUM_INSTALMENT_VERSION" > v.at "NUM_INSTALMENT_NUMBER"

Should take no longer than 200ms (pandas takes ~130ms for each of these), where df is the installments_payments.csv file from the Credit Default Risk competition at Kaggle.

Acceptance Criteria & Test Cases

The provided functions need to return correct values. The returned values will be compared to pandas outputs on the same queries.

Dataframe API docstrings

For every public function there needs to be a docstring outlining its general purpose and parameters. For key functions we need a more detailed description of how the functions work and some examples.

As a side-task, we need to come up with a template for a good docstring (the long one). Here is some inspiration: https://www.python.org/dev/peps/pep-0257/ and the pandas API is well documented, with lots of examples.

NOTE: this is a rather cumbersome task, but very important.

Styling Plotly plots for Dataframes

RSI should not return non-normal values

The RSI function can currently return e.g. -Infinity when given only positive values.
RSI should always return either:
a number in the 0 to 100 interval whenever possible
a null value otherwise
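
One possible way to enforce this, sketched in plain Python. Treating a missing side (no gains or no losses) as a mean of 0 is an assumption, chosen so that all-positive input gives 100 and all-negative input gives 0; the issue itself only requires that the result is either within 0 to 100 or null.

```python
import math


def rsi(values):
    """RSI of a window of changes: a float in [0, 100], or None if undefined."""
    gains = [v for v in values if v > 0]
    losses = [-v for v in values if v < 0]
    # Assumption: an empty side contributes a mean of 0 instead of NaN.
    up = sum(gains) / len(gains) if gains else 0.0
    down = sum(losses) / len(losses) if losses else 0.0
    if up + down == 0:
        return None  # empty window or only zeros: RSI is undefined
    result = 100 * up / (up + down)
    # Infinite inputs produce NaN here; report those as null as well.
    return result if math.isfinite(result) else None


if __name__ == "__main__":
    print(rsi([1, 2, 3]))
    print(rsi([1, -1]))
    print(rsi([]))
```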

Hardcoded RSI function on the rolling window

As a workaround for not being able to apply an arbitrary function to a rolling window, we need the RSI to be hardcoded. The (Python) code is as follows:

def rsi(values):
    up = values[values>0].mean()
    down = -1*values[values<0].mean()
    return 100 * up / (up + down)

(see this kernel for more info)

Dataframes 1.0 Epic

Summary

This section should summarise the work we want to accomplish during the epic.

Value

A description of the value this epic brings to users.
The motivation behind this epic.

Specification

The high-level requirements of the epic.
Any performance requirements for the epic.

Acceptance Criteria & Test Cases

The high-level acceptance criteria for the epic.
The test plan for the epic.

Reading `.h5` files

We want to support reading hdf5 type files into Dataframes. Example file: https://drive.google.com/open?id=12dvpSIzt9JbMpcj18bonMUoq5fpNZyyn

Appveyor Builds for C++

Summary

Currently it is proving difficult to know if the C++ components of dataframes are able to build successfully on all of our supported platforms. To ensure that they do, Appveyor CI (for windows support) should be set up to build these components.

Value

Having this set up will allow a faster development cadence as we can rely on the CI infrastructure to detect issues ahead of time, rather than finding the issues later on.

Specification

Set up the dataframes repo with Appveyor CI.
Have appveyor CI build the C++ components of this repo on Linux, MacOS and Windows.

Acceptance Criteria & Test Cases

Appveyor is able to build this code, and reports build success or failure for every commit.

Writing file with relative path is not working correctly

Repro steps:

import Dataframes.Column 
import Dataframes.Types
import Dataframes.Table 
l1 = [1,2,3,4,5]
l2 = [11,12,13,14,15]
col1 = Column.fromList "col1" Int64Type l1
col2 = Column.fromList "col2" Int64Type l2
table = Table.fromColumns [col1 , col2]
CSVGenerator . writeFile "./data/foo.csv" table

This does not save the file to the ./data folder, while giving a full path saves everything correctly.

Required behavior:
We should not expand relative paths magically. We should use the Current Working Directory env variable, which should be set to the project root. We need to check whether it is set that way, or whether the problem is caused by something else.
