juliadata / DataAPI.jl
A data-focused namespace for packages to share functions
License: Other
Looking at the concrete implementation of `levels` in CategoricalArrays, I see that this function conveniently extracts possible levels from both arrays and individual values:
julia> a = CategoricalArray(["abc", "def", "abc"])
3-element CategoricalArray{String,1,UInt32}:
"abc"
"def"
"abc"
julia> x = a[1]
CategoricalValue{String, UInt32} "abc"
julia> levels(a)
2-element Vector{String}:
"abc"
"def"
julia> levels(x)
2-element Vector{String}:
"abc"
"def"
However, the fallback implemented in DataAPI itself is only correct for collections, not individual values:
# like unique(x), makes sense?
julia> levels(["abc", "def", "abc"])
2-element Vector{String}:
"abc"
"def"
# makes no sense:
julia> levels("abc")
3-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
# especially confusing given that:
julia> x == "abc"
true
Maybe the fallback shouldn't exist at all, and something like `haslevels(x)::Bool` should be added instead?
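A minimal sketch of that idea, using the `haslevels` name proposed here (hypothetical, not part of DataAPI): packages opt in to level semantics explicitly instead of every iterable falling back to `unique`-like behavior.

```julia
# Hypothetical trait, not part of DataAPI: types opt in to level semantics.
haslevels(::Any) = false

function levels(x)
    haslevels(x) || throw(ArgumentError("no levels defined for $(typeof(x))"))
    error("opt-in types must define their own `levels` method")
end

haslevels("abc")  # false: a plain string no longer pretends to have levels
```

A package such as CategoricalArrays would then define `haslevels` and `levels` methods for its own types, and `levels("abc")` would error rather than returning characters.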
Maybe I'm missing something, but there seems to be support for deprecating `All` in favor of `Cols`.
`Cols()`: no columns
`Cols(:)`: all columns
This is so easy with `Cols` that there is no need for `All()`. We should deprecate `All`.
If someone has a special need for a function that disregards types, they can make their own wrapper function. The public function in DataAPI for getting values out of `CategoricalValue`s should only take `CategoricalValue`. It is too easy to accidentally call `unwrap` on a type that is not a `CategoricalValue`, such as `Vector{CategoricalValue}`.
See also JuliaData/CategoricalArrays.jl#399. This change is breaking, so it should be considered for DataAPI 2.0.
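The failure mode can be illustrated with a toy stand-in for `CategoricalValue` (the `CatValue` type below is hypothetical, not CategoricalArrays' real implementation): with a narrowly typed method and no catch-all fallback, mistakes surface as a `MethodError` instead of silently returning the wrong object.

```julia
# Toy stand-in for CategoricalValue; hypothetical, for illustration only.
struct CatValue{T}
    value::T
end

# Narrow method only; deliberately no `unwrap(x::Any) = x` fallback.
unwrap(x::CatValue) = x.value

unwrap(CatValue("abc"))      # "abc"
# unwrap([CatValue("abc")])  # MethodError, instead of returning the vector
```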
The `Between` docstring as given:
"""
Between(first, last)
Select the columns between `first` and `last` from a table.
"""
struct Between{T1 <: Union{Int, Symbol}, T2 <: Union{Int, Symbol}}
first::T1
last::T2
end
If you intend "through", write e.g. "Select the columns from `first` through `last` (including both) from a table."
If you intend "within", write e.g. "Select the columns between `first` and `last` (excluding both) from a table."
I've started implementing `metadata` and `colmetadata` for Parquet2.jl. I have a few thoughts; sorry for not bringing this up when this was being discussed, but there was a lot of conversation and I tuned out at some point.
- There is no way to fetch a metadata value with a default, analogous to `Base.get`. In many situations this means there is no way of fetching data without at least 2 lookups.
- Collecting all metadata requires something like `Dict(k => metadata(x, k) for k in metadatakeys(x))`, which seems a bit awkward, especially considering that in many cases the object is probably just sitting there in the first place and shouldn't have to be reconstructed.
- `colmetadata` is more complicated than this API suggests. For example, in Parquet2 a `Dataset` is a table whose columns are concatenations of sub-columns belonging to sub-tables (which are also Tables.jl tables) called `RowGroup`s. It's therefore not possible to define `colmetadata` on `Dataset`, because it would be ambiguous which column metadata should be used (or whether it would be appropriate to merge them). This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify what `colmetadata` should do.
- The `ArgumentError` fallbacks seem a bit dubious. These clearly should be `MethodError`s if there is no reasonable fallback. The most obvious consequence of this is that error-handling routines might catch the wrong error here. Nothing else immediately comes to mind, though I do vaguely remember somebody writing a blog post at some point describing why this pattern leads to trouble... I'd also be a little worried about it making method ambiguity cases worse.
I realize that opening this issue might seem like more of an annoyance than anything else, since the ship has sailed and now we'd have to deal with breakage. However, there might still be room to add a few methods such as, perhaps:
metadata(x)
metadata(x, k, default)
metadata!(x, k, default)
metadata!(f, x, k)
Perhaps it's already fine for packages to include these, but in that case perhaps it should be documented.
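A generic fallback for the proposed three-argument form could be layered on the existing API, at the cost of the two lookups mentioned above (a sketch, assuming `metadata`/`metadatakeys` as currently defined; this is not an actual DataAPI method):

```julia
# Sketch of the proposed get-with-default method; a store backed by a Dict
# would override this with a single `get(store, key, default)` lookup.
function metadata(x, key, default)
    any(==(key), metadatakeys(x)) ? metadata(x, key) : default
end
```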
With the new DataFrames `String` indexing, `Between(a::String, b::String)` fails due to the type constraint in `Between` here.
My understanding is that DataAPI doesn't need to enforce exactly this type signature. It's enough that the functions live here to avoid type piracy.
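A minimal sketch of a widened constraint (hypothetical; the exact widening would need discussion):

```julia
# Hypothetical widening of the Between type constraint to admit strings.
struct Between{T1 <: Union{Int, Symbol, AbstractString},
               T2 <: Union{Int, Symbol, AbstractString}}
    first::T1
    last::T2
end

Between("a", "b")  # now constructs instead of raising a TypeError
```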
It occurred to me that a statement like `All(Not(:x), :y)` is a bit of an oxymoron. Also, if one were to construct the statement outside a `transform`-like function call, you would see
All(Not(:x), :y)
which raises the question: `All` of what?
Perhaps it could be `Union(Not(:x), :y)`. This reads better, in my opinion.
I'm opening this issue as a place to discuss the path forward for this package and for people to give their feedback. Here's where things are at in my mind:
- The `refX` DataAPI.jl functions: while I understand the basics, I'm not as familiar with the implementations, but I'm willing to take a stab at it if people would like. @nalimilan and @piever are much more aware of how packages like PooledArrays, CategoricalArrays, and StructArrays can take advantage of sharing the common ref functions. I think it was also suggested at some point that we may want a RefArrays.jl package that was home to various sorting/grouping optimization routines that DataFrames/StructArrays could then share. I'm happy to push forward on making those changes, but I'll need to have some discussions with @nalimilan and @piever for guidance.
In terms of steps forward, here's what I think:
- Make sure the `ref*` function API is solid, with draft PRs showing how packages could share these functions.
Please ping anyone else who might be interested or have something useful to add to the discussion here. I don't think there's a super rush on any of this, but I know DataFrames is approaching a 1.0 release in the next few months and it would be good to clean up its dependencies soon.
I thought it would be useful to define `rename` and `rename!` in DataAPI.jl for functions that achieve purposes similar to what they do in DataFrames.jl.
This function is defined in CategoricalArrays.jl:
CategoricalArrays.isordered(x::CategoricalValue) = isordered(x.pool)
However, it is quite generic, and it would be beneficial to have it defined in DataAPI.jl as something like:
DataAPI.isordered(::Any) = false
DataAPI.isordered(obj::AbstractVector) = issorted(obj)
Thoughts?
DimensionalData.jl defines a `Selector` abstract type and selectors `Between`, `Near`, `At`, `Where`, and `Contains`.
AxisKeys.jl also defines `Selector`, `Near`, and `Interval` selectors (ping @mcabbott).
It would be useful to standardise these so we can all use the simplest words without clashes, and this is probably the place for that.
We could define `Selector` here and make `Between <: Selector`. `Near` could also move here to avoid the (maybe less common) clashes between AxisKeys.jl and DimensionalData.jl, if @mcabbott is also interested in sharing these types.
Note that `Between` in DD holds a 2-tuple of any types, while here it has two fields limited to `Union{Int, Symbol}`.
We had a discussion with @nalimilan about how exactly `Cols` should work.
The question is whether `Cols(:x1, :x1)` should be an allowed selection rule.
What I mean is that clearly things like `Cols(r"x", Between(:x, :y))` should be allowed (taking a union), but should we allow taking a union of single-column selectors?
As a corner case, `Cols(:x1, [:x1])` would be allowed, as `[:x1]` selects multiple columns (in this case it just happens to be a single one).
The rationale is that `Cols(:x1, :x1)` is unlikely to be intended (probably it is an error on the user's side).
@piever, do you have any opinion on this?
I was wondering whether we could consider adding the minimal placeholders to implement the Tables interface here. In particular:
istable
rowaccess
columnaccess
rows
columns
schema
Schema
The idea is that this way it is possible to be a Table "source" without depending on Tables. For example, I would like StructArrays to keep being a Tables source but there were concerns both on having it depend on Tables and on using Requires, which is the current solution.
This still makes it harder to be a Table sink without depending on Tables but that would be weird: one should not have a default fallback that depends on a package that maybe was not loaded.
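If they lived here, the placeholders could be zero-method generic functions (a sketch, assuming the ownership question with Tables.jl is resolved):

```julia
# Zero-method function stubs; Tables.jl (and sources such as StructArrays)
# would then add methods to these instead of owning the generic functions.
function istable end
function rowaccess end
function columnaccess end
function rows end
function columns end
function schema end
```

A source package could then declare, without depending on Tables.jl, something like `DataAPI.istable(::Type{<:StructArray}) = true`.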
I think that all column selectors (other than arrays) should guarantee that the column order in the original table is preserved. One would certainly expect that to be the case for `Between`, though it's not explicitly mentioned in the docstring. It would be a bummer if you had `foo(x, y) = 2x .+ y` but `Between(:x, :y) => foo` happened to lower to `[:y, :x] => foo` instead of `[:x, :y] => foo`.
And I think it makes sense to guarantee column order preservation for the other selectors. E.g.
df = DataFrame(a=1, b=2, c=3)
select(df, Not(:b) => foo)
should be guaranteed to lower to
select(df, [:a, :c] => foo)
rather than
select(df, [:c, :a] => foo)
I'm not totally certain of the best way to specify the column-ordering properties of `Cols`, but I think this specification makes sense:
- The arguments of `Cols` are first lowered to (ordered) arrays.
- `Cols` is then lowered as follows: `Cols(A, B, C) ==> [A, B\A, C\(A ∪ B)]` (where the arguments on the right side are splatted into the array).
Since `setdiff` on arrays preserves the order of its first argument, we get the following behavior:
df = DataFrame(a=1, b=2, c=3)
Cols([:c, :b], [:a, :b]) == [:c, :b, :a]
Cols(r"[bc]", r"[ab]") == [:b, :c, :a]
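The proposed lowering can be sketched in plain Julia; `lower_cols` is a hypothetical helper, and its arguments are assumed to already be resolved to ordered `Vector{Symbol}`s:

```julia
# Hypothetical helper sketching the proposed lowering; assumes each
# argument has already been resolved to an ordered Vector{Symbol}.
function lower_cols(args::Vector{Symbol}...)
    out = Symbol[]
    for a in args
        # setdiff preserves the order of its first argument, so each new
        # selector contributes its columns in its own order, minus any
        # columns already selected.
        append!(out, setdiff(a, out))
    end
    return out
end

lower_cols([:c, :b], [:a, :b])  # [:c, :b, :a]
```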
Any type that supports metadata will likely need a way to iterate over metadata, for propagating metadata to new instances or merging metadata. The generic way of iterating over metadata right now is accessing each key from the iterable returned by `metadatakeys`. Depending on how metadata is stored, this may not be the most efficient way to accomplish this. If the method propagating metadata internally manages a single collection of metadata, then a unique implementation can be written for whatever method of iteration is best. However, when combining multiple sets of metadata, it's not practical to have a unique method for every combination of metadata storage patterns.
A couple of things to consider in supporting generic iteration over metadata:
- `metapairs` and `styledmetapairs`
- `metadatakeys`?
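A generic fallback for the suggested `metapairs` could reuse the key-based API, so every store gets it for free while dict-backed stores return their pairs directly (a sketch; `metapairs` and the toy `Meta` store below are hypothetical, not part of DataAPI):

```julia
# Hypothetical generic fallback reusing the key-based iteration API.
metapairs(x) = (k => metadata(x, k) for k in metadatakeys(x))

# Toy dict-backed store for illustration (hypothetical).
struct Meta
    d::Dict{String, Any}
end
metadatakeys(m::Meta) = keys(m.d)
metadata(m::Meta, k) = m.d[k]

# A dict-backed store could override the fallback with a direct method:
# metapairs(m::Meta) = pairs(m.d)
```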
Do we expect anything from #36 to be added to the 1.7 release? If not, I would make a release to avoid having to keep track of it in DataFrames.jl.
Following JuliaData/DataFrames.jl#1866 (comment), we could consider adding `hcat!` to DataAPI.jl (alternatively, we can keep it in DataFrames.jl as we currently do).
CC @nalimilan
When a table does not have `nrow` or `ncol` defined, we could add a description in their docstrings of what they should return. `nothing` seems a reasonable value.
It is often the case that one wants to attach metadata of some sort to an array/graph/etc. How do people feel about adding something basic like `metadata(x) = nothing` that can then be extended by other packages?
Both SplitApplyCombine.jl and DataFrames.jl export `flatten`. I would add it to DataAPI.jl. The question is what docstring it should have. Maybe something like:
Flatten a collection of collections into a single collection.
Is that enough?
@andyferris, after this is established maybe you could add DataAPI.jl to SplitApplyCombine.jl as a dependency and make `innerjoin` and `flatten` implement this interface? Then SplitApplyCombine.jl and DataFrames.jl could be used together more easily.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your `TagBot.yml` to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
Currently the `describe` contract is that it pretty-prints the passed object.
The contract does not say what the function returns. I propose that `describe` should keep printing what it promises, but also return the computed statistics.
The crucial problem is that in the REPL this would "double print" the contents. There is also duplication between `describe` and `summarystats`.
I am not sure what is best, but I leave this issue open to keep track of it.
CC @nalimilan
Maybe this was discussed already, but I would think a style named `:default` would be used if `style` is not provided.
I would like to see `style` as an optional keyword argument so that `metadata!(df, "key", "value")` is allowed. (Then TableMetadataTools.jl could define a way to change what style `metadata!(df, "key", "value")` uses.)
Currently, the `fit`, `fit!`, `predict`, and `predict!` functions live in StatsBase.
What do you think about putting the function stubs for those four functions in DataAPI?
The generic fallback definitions would continue to live in StatsBase.
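The stubs themselves would be one-liners (a sketch of what moving them here could look like; not a decision on ownership):

```julia
# Zero-method stubs; StatsBase would keep the generic fallbacks and simply
# add methods to these functions instead of owning them.
function fit end
function fit! end
function predict end
function predict! end
```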
I was thinking that it may be convenient and intuitive to use IntervalSets.jl's ellipsis notation as an alternative to `Between(:a, :b)`:
df[:, :a..:b]
df[:, :a..end]
Sorry if this has already been discussed before.
Best,
Carlo
DataFrames defines `rownumber` and `row`, and I want to define a similar function for the rows (`TimeStepRow`) of my table type (`TimeStepTable`) to be able to retrieve any other row of the table from one row.
I was wondering if it would be interesting to add these functions here?
Both DataFrames and JuliaDB have `flatten` now. Since both functionalities for `flatten` are pretty far from `Iterators.flatten`, it makes sense for both to extend a `flatten` from DataAPI.