enso-org / dataframes
A library for working with tabular data in Luna.
Home Page: https://luna-lang.org
License: MIT License
We need to be able to calculate standard deviation during rolling etc.
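As a reference point for the expected values, here is a minimal Python sketch of a rolling standard deviation over a fixed-size trailing window. The function name `rolling_std` and the choice of the sample standard deviation (which is also what pandas computes by default, as far as we know) are assumptions for illustration, not the library's API.

```python
from statistics import stdev


def rolling_std(values, window):
    """Sample standard deviation over each trailing window of `window` items.

    Positions that do not yet have a full window yield None.
    (Hypothetical helper, for illustration only.)
    """
    if window < 2:
        raise ValueError("window must be at least 2 for a sample standard deviation")
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough observations yet
        else:
            result.append(stdev(values[i + 1 - window : i + 1]))
    return result


out = rolling_std([1.0, 2.0, 3.0, 4.0], window=2)
```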
This provides a set of simple statistical functions, useful in exploratory data analysis.
Fundamental for providing a streamlined flow for EDA.
This comprises several different functionalities, listed below:
Column.countValues – should return a frame with 2 columns and n rows, where n is the number of distinct values in the column. Each row should have the shape (value, count), where count is the number of occurrences of value in the column.
Column.countMissing – returns the count of missing values in the column.
Table.{min, max, median, mean, std, var, sum, quantile n} – should be self-explanatory. Consult the pandas documentation when in doubt. We are not accepting any special options here; just implement the simplest version of each.
Table.correlations – returns an nrows x nrows Table with pairwise correlation coefficients between rows. Consult the documentation of pandas's corr for details.
The performance of each operation should be comparable (within a 50% margin) to its pandas counterpart.
Tested manually, by comparing to relevant pandas output on the same data source.
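The expected semantics of countValues and countMissing can be sketched in plain Python as follows. This is only an illustration: the function names are hypothetical, missing values are modelled as None, and excluding missing values from the value counts is an assumption (the issue does not say either way).

```python
from collections import Counter


def count_values(column):
    """Return (value, count) rows, one per distinct non-missing value."""
    # Assumption: missing values (None) are not counted as a value.
    counts = Counter(v for v in column if v is not None)
    return list(counts.items())


def count_missing(column):
    """Return the number of missing (None) entries in the column."""
    return sum(1 for v in column if v is None)


col = ["a", "b", "a", None, "a"]
rows = count_values(col)
missing = count_missing(col)
```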
As reported:
2) [minor but annoying] most of the DataFrame-related types do not have shortReps (Histogram, Column, etc)
Currently the LQuery interpreter requires columns to be built from a single array. Chunked arrays don't happen "naturally", but the user can create them manually, or they can be the result of some operations (namely the recently added shift method).
So far no one has seen working plots on Windows. Part of the issue is that there was no Luna Studio build available with proper Dataframes support.
Pretty self-explanatory. Ideally, we should include sections for people transitioning from other platforms, like:
We have this workaround: https://github.com/luna/Dataframes/blob/master/native_libs/third-party/matplotlib-cpp/matplotlibcpp.h#L129
It is abominable. We should investigate to understand why it solves the problem on Linux. If possible, it should be removed and the problem fixed in a better way. Otherwise, we should at least document it and try to make it more future-proof (what about Python 3.7?).
I got this reported twice through Discord, once from @sylwiabr and once from @kustosz .
When writing a CSV file, an error mentioning an illegal instruction occurs:
[SUCCESS] column 2: [3, 6, 9, 12] == [3, 6, 9, 12]
Generate case 1
zsh: illegal hardware instruction LUNA_LIBS_PATH=/Users/marcinkostrzewa/code/luna-core/stdlib run --target
or
Running in interpreted mode.
Illegal instruction: 4
Apparently it is enough to run the Dataframes Luna tests to reproduce it.
The issue was observed only on Mac.
As I have no Mac (and I can't imagine a VM compiling Luna), I'd like to ask for help in diagnosing the issue:
Right now we are restricted to a set of hard-coded operations. Ideally, we would provide some kind of an apply function that allows us to pass any function (or rather: any LQuery expression).
Consider the following:
import Std.Base
import Std.Foreign.C.Value
import Dataframes.Internal.Utils
import Dataframes.Array
import Dataframes.Column
import Dataframes.Table
import Dataframes.Types
import Dataframes.Internal.Test.Test
def main:
    xCol = Column.fromList "x" Int64Type [1,2,3,4,5]
    yCol = Column.fromList "y" Int64Type [1,2,3,4,5]
    table1 = Table.fromColumns [xCol, yCol]
    chart1 = table1.chart
    plot1 = chart1.plot "x" "y"
    None
In the chart a small box next to 5.0 on Y axis can be seen. Likely this is an empty legend box.
When there is no legend, there should be no legend box.
This is in the context of time series:
TimeInterval for expressing the difference between two timestamps
Std.Time and the timestamps we use for time series.
Currently a trailing newline creates an invalid record with one empty field. This is not the desired behavior; RFC 4180 explicitly allows a trailing CRLF after the last record.
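The trailing-newline behavior can be illustrated with a small Python sketch. It contrasts a naive split on CRLF, which produces the spurious one-empty-field record described above, with Python's standard csv module, which treats the final CRLF as a record terminator. The sample data is made up for the illustration.

```python
import csv
import io

# Made-up sample: the last record is followed by CRLF, as RFC 4180 permits.
data = "x,y\r\n1,2\r\n3,4\r\n"

# Naive splitting treats CRLF as a separator, so an empty "record" appears
# after the final terminator.
naive = [line.split(",") for line in data.split("\r\n")]

# The csv module treats CRLF as a terminator and yields no extra record.
records = list(csv.reader(io.StringIO(data, newline="")))
```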
There should be documentation available in the searcher for all the functionality available in the library. To achieve that, the descriptions in the docstrings are enough.
Pandas provides a great API for rolling window computations.
Ideally, we would have two kinds of windows:
For reading/writing files we should have:
CSV.read and CSV.write functions
Table.read and Table.write
txt or there is none it should try to read it with all available methods.
To obtain the column type name, the following can be used:
column.field.type.toText
column.field.type.toText
The matplotlib-cpp code and our code in Plot.cpp / Learn.cpp in some regards don't meet our standards (conventions, error handling, polluting standard output and so on).
The code should be checked, actionable issues should be identified and either fixed or written down to our backlog.
Some points that have been raised:
max of a column is not a number, but a column (I understand why that is so, but it is not nice to work with)
The map over a column works in a way that gets me totally lost. I need to specify the type of the column: but I don't know it in the first place. That's a big win for pandas, where I can map like I know it.
Rolling window defined by a time interval, allowing the following functions to be calculated on columns:
Add support for timestamp column field type. Timestamp shall be internally treated as int64_t with nanoseconds count since epoch. In future it is desired to allow other Arrow-specified units, but we ignore this for now.
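The proposed representation (a signed 64-bit count of nanoseconds since the epoch) can be sketched in Python as below. The helper names are hypothetical, the epoch is assumed to be the Unix epoch in UTC, and inputs must be timezone-aware. Python datetimes only carry microseconds, so the last three digits of the nanosecond count are always zero here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "epoch" means the Unix epoch, 1970-01-01T00:00:00 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_nanos(moment):
    """Convert a timezone-aware datetime to integer nanoseconds since the epoch."""
    delta = moment - EPOCH
    # Integer arithmetic only, to avoid floating-point rounding.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def from_epoch_nanos(nanos):
    """Convert nanoseconds since the epoch back to a UTC datetime.

    Sub-microsecond precision is dropped (floor division).
    """
    return EPOCH + timedelta(microseconds=nanos // 1_000)


stamp = datetime(2019, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
nanos = to_epoch_nanos(stamp)
```

An int64 nanosecond count covers roughly 292 years on either side of the epoch, which is worth keeping in mind when other units are added later.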
List of things to make sure they work:
Time type
Time type
… more to come
Add interpolate method that will fill missing values:
If column does not contain any valid values, then interpolate() does nothing.
We need to allow the user to sort the dataframe by the values in a column.
It is a basic feature for a data library.
Table.sort colNames ascending naPosition - sorts the values in a dataframe. Returns a new sorted dataframe.
colNames is a list of column names to sort by;
ascending selects ascending vs. descending order. Specify a list for multiple sort orders. This is a list of bools; it must match the length of colNames;
naPosition is one of {‘first’, ‘last’}, default ‘last’; first puts NaNs at the beginning, last puts NaNs at the end.
Will be tested manually on large dataframes, by comparing the outputs with equivalent pandas operations.
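A minimal Python sketch of the intended sorting semantics is given below, with rows modelled as dicts and missing values as None. The function name `sort_rows` and the row representation are assumptions made for the illustration; the multi-column order is obtained by repeated stable sorts, from the least significant column to the most significant one.

```python
def sort_rows(rows, col_names, ascending, na_position="last"):
    """Return a new list of rows sorted by several columns.

    rows: list of dicts; a missing value is represented by None.
    ascending: list of bools, one per entry in col_names.
    (Hypothetical helper, for illustration only.)
    """
    if len(col_names) != len(ascending):
        raise ValueError("ascending must have one entry per column name")
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last gives the combined ordering.
    for name, asc in reversed(list(zip(col_names, ascending))):
        present = [r for r in result if r[name] is not None]
        missing = [r for r in result if r[name] is None]
        present.sort(key=lambda r: r[name], reverse=not asc)
        result = missing + present if na_position == "first" else present + missing
    return result


rows = [
    {"a": 2, "b": 1},
    {"a": 1, "b": None},
    {"a": 1, "b": 3},
    {"a": None, "b": 0},
]
by_a_then_b_desc = sort_rows(rows, ["a", "b"], [True, False])
```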
A method for the Table class which adds an extra column with index numbers.
One example: creating a column with a repeated constant value.
Connect a violin plot from Seaborn to Luna's Dataframes library: https://seaborn.pydata.org/generated/seaborn.violinplot.html
Seaborn is a plotting library already connected with Dataframes. The violin plot should be connected just like kde plots.
The specific modification methods should be available for violin plots:
setLabel label
setColor color
setInner
split
setPalette
setLinewidth
setOrientation
Examples and implementation details will be provided when the issue is picked up.
Since Luna can do pointer arithmetic and should be pretty fast on its own, why not implement the library in pure Luna, without using C?
The short representation for Table should be just
Table rows x cols
We should provide a matplotlib binding for the most common classes of plots.
This allows users to create a variety of plots useful in EDA.
Bind several plots from matplotlib and seaborn:
Each of these should be wrapped in a Luna-flavored (immutable) API and provide the most common configuration options. They should all work with Luna Studio and also be exportable to image files.
We should also provide an API for specifying axis/plot names and for combining plots, both by stacking them and by laying them out in a grid.
Tested by eyeballing the visualizations.
We need to allow users to handle (filter/fill) missing values in a dataframe.
It's a fundamental feature for any data library.
We need a 2x2 matrix of functions for filling/dropping NAs per the whole table or a single column:
Table.dropNa – removes all rows where any of the values is missing.
Table.fillNa x – replaces all NA occurrences in the table with x.
Table.dropNaAt columnName – removes all rows where the value in the given column is missing.
Table.fillNaAt colName x – changes all NAs to x inside the colName column.
Performance should be comparable to mapping/filtering a simple predicate over the table.
Will be tested manually on large dataframes, by comparing the outputs with equivalent pandas operations.
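The 2x2 matrix of functions can be sketched in Python as follows, with rows modelled as dicts and NA modelled as None. The snake_case names mirror the proposed Luna methods but are hypothetical; every function returns new data rather than mutating its input, in line with the immutable style used elsewhere in the library.

```python
def drop_na(rows):
    """Keep only rows in which no value is missing."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_na(rows, x):
    """Replace every missing value in the table with x."""
    return [{k: (x if v is None else v) for k, v in r.items()} for r in rows]


def drop_na_at(rows, column_name):
    """Keep only rows whose value in column_name is present."""
    return [r for r in rows if r[column_name] is not None]


def fill_na_at(rows, column_name, x):
    """Replace missing values with x, but only inside column_name."""
    return [
        {**r, column_name: x} if r[column_name] is None else dict(r)
        for r in rows
    ]


table = [
    {"a": 1, "b": None},
    {"a": None, "b": 2},
    {"a": 3, "b": 4},
]
```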
Consider the following:
import Std.Base
import Std.Foreign.C.Value
import Dataframes.Internal.Utils
import Dataframes.Array
import Dataframes.Column
import Dataframes.Table
import Dataframes.Types
import Dataframes.Internal.Test.Test
def main:
    xCol = Column.fromList "x" Int64Type [1,2,3,4,5]
    yCol = Column.fromList "y" Int64Type [1,2,3,4,5]
    table1 = Table.fromColumns [xCol, yCol]
    chart1 = table1.chart
    plot1 = chart1.plot "x" "y"
    None
The Y axis is labeled using floating format (i.e. 1.0 instead of 1). The X axis is for some reason fine.
According to @sylwiabr this does not happen on Linux. Mac remains to be checked.
@mwu-tow knows the details -- this task is just to keep track of the progress.
Short explanation: right now we can only do a groupBy operation followed by some aggregate. This task aims to expose a standalone groupBy functionality.
Currently, running cmake without any additional arguments yields an unoptimized build.
As the library relies heavily on compiler optimization to achieve sensible performance, it should either be optimized by default or require the user to specify the build type.
Reading CSV with Windows-style line endings (CRLF) does not work properly on Mac/Linux, as CR-LF is not properly handled on non-Windows platforms.
Can we standardise the C++ to four spaces please? ^^
This is the second part of issue #35, which was partially implemented.
This concerns the implementation of the each and filter functions on a Table.
These are the basic functions for querying the frame and data exploration.
The API–level description is provided in this Gist: https://gist.github.com/kustosz/49e1c588de4c1513cf91b18dd6342c15
This library should use the specified JSON format for exchanging queries between Luna and the C++ engine.
Simple operations such as:
df.each v: v.at "NUM_INSTALMENT_VERSION" * 2 + 4
df.filter v: v.at "NUM_INSTALMENT_VERSION" > 2
df.filter v: v.at "NUM_INSTALMENT_VERSION" > v.at "NUM_INSTALMENT_NUMBER"
should take no longer than 200ms (pandas takes ~130ms for each of these), where df is the installments_payments.csv file from the Credit Default Risk competition at Kaggle.
The provided functions need to return correct values. The returned values will be compared to pandas outputs on the same queries.
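For checking correctness, the three queries above can be mirrored by a small row-wise Python sketch. The helpers `each` and `filter_rows` are hypothetical stand-ins for the Luna functions, rows are modelled as dicts, and the sample values are made up rather than taken from the Kaggle file.

```python
def each(rows, fn):
    """Apply fn to every row and collect the results as a new column."""
    return [fn(r) for r in rows]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate returns True."""
    return [r for r in rows if predicate(r)]


# Made-up sample rows using the column names from the queries above.
df = [
    {"NUM_INSTALMENT_VERSION": 1, "NUM_INSTALMENT_NUMBER": 6},
    {"NUM_INSTALMENT_VERSION": 3, "NUM_INSTALMENT_NUMBER": 2},
    {"NUM_INSTALMENT_VERSION": 4, "NUM_INSTALMENT_NUMBER": 9},
]

doubled = each(df, lambda v: v["NUM_INSTALMENT_VERSION"] * 2 + 4)
above_two = filter_rows(df, lambda v: v["NUM_INSTALMENT_VERSION"] > 2)
version_gt_number = filter_rows(
    df, lambda v: v["NUM_INSTALMENT_VERSION"] > v["NUM_INSTALMENT_NUMBER"]
)
```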
For every public function there needs to be a docstring outlining its general purpose and parameters. For key functions we need a more detailed description of how the functions work and some examples.
As a side-task, we need to come up with a template for a good docstring (the long one). Here is some inspiration: https://www.python.org/dev/peps/pep-0257/ and the pandas API is well documented, with lots of examples.
NOTE: this is a rather cumbersome task, but very important.
The RSI function can currently return e.g. -Infinity when given only positive values.
RSI should always return either:
- a number from the 0-100 interval whenever possible
- a null value otherwise
As a workaround for not being able to apply an arbitrary function to a rolling window, we need the RSI to be hardcoded. The (Python) code is as follows:
def rsi(values):
    up = values[values>0].mean()
    down = -1*values[values<0].mean()
    return 100 * up / (up + down)
(see this kernel for more info)
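One possible way to meet the "0-100 or null" requirement is sketched below in plain Python (lists instead of array masks). Treating an empty set of gains or losses as an average of zero is an assumption, not something the issue specifies; under that assumption an all-positive window gives 100, an all-negative window gives 0, and a window with no movement at all gives a null value.

```python
def rsi(values):
    """RSI over a window of price changes; returns None when it is undefined.

    Assumption: with no gains (or no losses) the corresponding average is 0.
    """
    ups = [v for v in values if v > 0]
    downs = [-v for v in values if v < 0]
    up = sum(ups) / len(ups) if ups else 0.0
    down = sum(downs) / len(downs) if downs else 0.0
    if up + down == 0:
        # No gains and no losses (empty or all-zero window): RSI is undefined.
        return None
    return 100 * up / (up + down)
```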
We want to support reading hdf5 files into Dataframes. Example file: https://drive.google.com/open?id=12dvpSIzt9JbMpcj18bonMUoq5fpNZyyn
Currently it is proving difficult to know whether the C++ components of Dataframes build successfully on all of our supported platforms. To ensure that they do, AppVeyor CI (for Windows support) should be set up to build these components.
Having this set up will allow a faster development cadence as we can rely on the CI infrastructure to detect issues ahead of time, rather than finding the issues later on.
Repro steps:
import Dataframes.Column
import Dataframes.Types
import Dataframes.Table
l1 = [1,2,3,4,5]
l2 = [11,12,13,14,15]
col1 = Column.fromList "col1" Int64Type l1
col2 = Column.fromList "col2" Int64Type l2
table = Table.fromColumns [col1 , col2]
CSVGenerator . writeFile "./data/foo.csv" table
does not save to the ./data folder, while giving the full path saves everything correctly.
Required behavior:
We should not expand relative paths magically. We should use the Current Working Directory env variable, which should be set to the project root. We need to check whether it is set that way, or whether the problem is caused by something else.
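The required behavior amounts to resolving a relative path against the current working directory and leaving absolute paths untouched. A minimal Python sketch follows; the helper name `resolve_output_path` and the `/project` directory are hypothetical, and POSIX-style paths are assumed.

```python
import os
from pathlib import Path


def resolve_output_path(path, cwd=None):
    """Resolve a relative path against the working directory.

    cwd defaults to the process's current working directory; absolute
    paths are returned unchanged. (Hypothetical helper, for illustration.)
    """
    base = Path(cwd) if cwd is not None else Path(os.getcwd())
    candidate = Path(path)
    return candidate if candidate.is_absolute() else base / candidate


# "/project" stands in for the project root; it is a made-up example path.
resolved = resolve_output_path("./data/foo.csv", cwd="/project")
```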