juliadata / DataAPI.jl
A data-focused namespace for packages to share functions
License: Other
Looking at the concrete implementation of `levels` in CategoricalArrays, I see that this function conveniently extracts possible levels from both arrays and individual values:
julia> a = CategoricalArray(["abc", "def", "abc"])
3-element CategoricalArray{String,1,UInt32}:
"abc"
"def"
"abc"
julia> x = a[1]
CategoricalValue{String, UInt32} "abc"
julia> levels(a)
2-element Vector{String}:
"abc"
"def"
julia> levels(x)
2-element Vector{String}:
"abc"
"def"
However, the fallback implemented in DataAPI itself is only correct for collections, not individual values:
# like unique(x), makes sense?
julia> levels(["abc", "def", "abc"])
2-element Vector{String}:
"abc"
"def"
# makes no sense:
julia> levels("abc")
3-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
# especially confusing given that:
julia> x == "abc"
true
Maybe the fallback shouldn't exist at all, and something like `haslevels(x)::Bool` should be added instead?
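A minimal sketch of that idea, using the `haslevels` name proposed here (hypothetical, not part of DataAPI): packages opt in to level semantics explicitly instead of every iterable falling back to `unique`-like behavior.

```julia
# Hypothetical trait, not part of DataAPI: types opt in to level semantics.
haslevels(::Any) = false

function levels(x)
    haslevels(x) || throw(ArgumentError("no levels defined for $(typeof(x))"))
    error("opt-in types must define their own `levels` method")
end

haslevels("abc")  # false: a plain string no longer pretends to have levels
```

A package such as CategoricalArrays would then define `haslevels` and `levels` methods for its own types, and `levels("abc")` would error rather than returning characters.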
Maybe I'm missing something, but there seems to be support for deprecating `All` in favor of `Cols`.
`Cols()`: no columns
`Cols(:)`: all columns
This is so easy with `Cols` that there is no need for `All()`. We should deprecate `All`.
If someone has a special need for a function that disregards types, they can make their own wrapper function. The public function in DataAPI for getting values out of `CategoricalValue`s should only take `CategoricalValue`. It is too easy to accidentally call `unwrap` on a type that is not a `CategoricalValue`, such as `Vector{CategoricalValue}`.
See also JuliaData/CategoricalArrays.jl#399. This change is breaking, so it should be considered for DataAPI 2.0.
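The failure mode can be illustrated with a toy stand-in for `CategoricalValue` (the `CatValue` type below is hypothetical, not CategoricalArrays' real implementation): with a narrowly typed method and no catch-all fallback, mistakes surface as a `MethodError` instead of silently returning the wrong object.

```julia
# Toy stand-in for CategoricalValue; hypothetical, for illustration only.
struct CatValue{T}
    value::T
end

# Narrow method only; deliberately no `unwrap(x::Any) = x` fallback.
unwrap(x::CatValue) = x.value

unwrap(CatValue("abc"))      # "abc"
# unwrap([CatValue("abc")])  # MethodError, instead of returning the vector
```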
The `Between` docstring as given:
"""
Between(first, last)
Select the columns between `first` and `last` from a table.
"""
struct Between{T1 <: Union{Int, Symbol}, T2 <: Union{Int, Symbol}}
first::T1
last::T2
end
If you intend "through", write e.g. "Select the columns from `first` through `last` (including both) from a table."
If you intend "within", write e.g. "Select the columns between `first` and `last` (excluding both) from a table."
I've started implementing `metadata` and `colmetadata` for Parquet2.jl. I have a few thoughts; sorry for not bringing this up when this was being discussed, but there was a lot of conversation and I tuned out at some point.
- There is no way to fetch a metadata value with a default, analogous to `Base.get`. In many situations this means there is no way of fetching data without at least 2 lookups.
- Collecting all metadata requires something like `Dict(k => metadata(x, k) for k in metadatakeys(x))`, which seems a bit awkward, especially considering that in many cases the object is probably just sitting there in the first place and shouldn't have to be reconstructed.
- `colmetadata` is more complicated than this API suggests. For example, in Parquet2 a `Dataset` is a table whose columns are concatenations of sub-columns belonging to sub-tables (which are also Tables.jl tables) called `RowGroup`s. It's therefore not possible to define `colmetadata` on `Dataset`, because it would be ambiguous which column metadata should be used (or whether it would be appropriate to merge them). This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify what `colmetadata` should do.
- The `ArgumentError` fallbacks seem a bit dubious. These clearly should be `MethodError`s if there is no reasonable fallback. The most obvious consequence of this is that error-handling routines might catch the wrong error here. Nothing else immediately comes to mind, though I do vaguely remember somebody writing a blog post at some point describing why this pattern leads to trouble... I'd also be a little worried about it making method ambiguity cases worse.
I realize that opening this issue might seem like more of an annoyance than anything else, since the ship has sailed and now we'd have to deal with breakage. However, there might still be room to add a few methods such as, perhaps:
metadata(x)
metadata(x, k, default)
metadata!(x, k, default)
metadata!(f, x, k)
Perhaps it's already fine for packages to include these, but in that case perhaps it should be documented.
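A generic fallback for the proposed three-argument form could be layered on the existing API, at the cost of the two lookups mentioned above (a sketch, assuming `metadata`/`metadatakeys` as currently defined; this is not an actual DataAPI method):

```julia
# Sketch of the proposed get-with-default method; a store backed by a Dict
# would override this with a single `get(store, key, default)` lookup.
function metadata(x, key, default)
    any(==(key), metadatakeys(x)) ? metadata(x, key) : default
end
```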
With the new DataFrames `String` indexing, `Between(a::String, b::String)` fails due to the type constraint in `Between` here.
My understanding is that DataAPI doesn't need to enforce exactly this type signature. It's enough that the functions live here to avoid type piracy.
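A minimal sketch of a widened constraint (hypothetical; the exact widening would need discussion):

```julia
# Hypothetical widening of the Between type constraint to admit strings.
struct Between{T1 <: Union{Int, Symbol, AbstractString},
               T2 <: Union{Int, Symbol, AbstractString}}
    first::T1
    last::T2
end

Between("a", "b")  # now constructs instead of raising a TypeError
```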
It occurred to me that a statement like `All(Not(:x), :y)` is a bit of an oxymoron. Also, if one were to construct the statement outside a `transform`-like function call, you would see
All(Not(:x), :y)
which raises the question: `All` of what?
Perhaps it could be `Union(Not(:x), :y)`. This reads better, in my opinion.
I'm opening this issue as a place to discuss the path forward for this package and for people to give their feedback. Here's where things are at in my mind:
- The `refX` DataAPI.jl functions: while I understand the basics, I'm not as familiar with the implementations, but I'm willing to take a stab at it if people would like. @nalimilan and @piever are much more aware of how packages like PooledArrays, CategoricalArrays, and StructArrays can take advantage of sharing the common ref functions. I think it was also suggested at some point that we may want a RefArrays.jl package that was home to various sorting/grouping optimization routines that DataFrames/StructArrays could then share. I'm happy to push forward on making those changes, but I'll need to have some discussions with @nalimilan and @piever for guidance.
In terms of steps forward, here's what I think:
- Make sure the `ref*` function API is solid, with draft PRs showing how packages could share these functions.
Please ping anyone else who might be interested or have something useful to add to the discussion here. I don't think there's a super rush on any of this, but I know DataFrames is approaching a 1.0 release in the next few months and it would be good to clean up its dependencies soon.
I thought it would be useful to define `rename` and `rename!` in DataAPI.jl for functions that achieve purposes similar to what they do in DataFrames.jl.
This function is defined in CategoricalArrays.jl:
CategoricalArrays.isordered(x::CategoricalValue) = isordered(x.pool)
However, it is quite generic, and it would be beneficial to have it defined in DataAPI.jl as something like:
DataAPI.isordered(::Any) = false
DataAPI.isordered(obj::AbstractVector) = issorted(obj)
Thoughts?
DimensionalData.jl defines a `Selector` abstract type and selectors `Between`, `Near`, `At`, `Where`, and `Contains`.
AxisKeys.jl also defines `Selector`, `Near`, and `Interval` selectors (ping @mcabbott).
It would be useful to standardise these so we can all use the simplest words without clashes, and this is probably the place for that.
We could define `Selector` here and make `Between <: Selector`. `Near` could also move here to avoid the (maybe less common) clashes between AxisKeys.jl and DimensionalData.jl, if @mcabbott is also interested in sharing these types.
Note that `Between` in DD holds a 2-tuple of any types, while here it has two fields limited to `Union{Int, Symbol}`.
We had a discussion with @nalimilan about how exactly `Cols` should work.
The question is whether `Cols(:x1, :x1)` should be an allowed selection rule.
What I mean is that clearly things like `Cols(r"x", Between(:x, :y))` should be allowed (taking a union), but should we allow taking a union of single-column selectors?
As a corner case, `Cols(:x1, [:x1])` would be allowed, as `[:x1]` selects multiple columns (in this case it just happens to be a single one).
The rationale is that `Cols(:x1, :x1)` is unlikely to be intended (probably it is an error on the user's side).
@piever, do you have any opinion on this?
I was wondering whether we could consider adding the minimal placeholders to implement the Tables interface here. In particular:
istable
rowaccess
columnaccess
rows
columns
schema
Schema
The idea is that this way it is possible to be a Table "source" without depending on Tables. For example, I would like StructArrays to keep being a Tables source but there were concerns both on having it depend on Tables and on using Requires, which is the current solution.
This still makes it harder to be a Table sink without depending on Tables but that would be weird: one should not have a default fallback that depends on a package that maybe was not loaded.
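If they lived here, the placeholders could be zero-method generic functions (a sketch, assuming the ownership question with Tables.jl is resolved):

```julia
# Zero-method function stubs; Tables.jl (and sources such as StructArrays)
# would then add methods to these instead of owning the generic functions.
function istable end
function rowaccess end
function columnaccess end
function rows end
function columns end
function schema end
```

A source package could then declare, without depending on Tables.jl, something like `DataAPI.istable(::Type{<:StructArray}) = true`.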
I think that all column selectors (other than arrays) should guarantee that the column order in the original table is preserved. One would certainly expect that to be the case for `Between`, though it's not explicitly mentioned in the docstring. It would be a bummer if you had `foo(x, y) = 2x .+ y` but `Between(:x, :y) => foo` happened to lower to `[:y, :x] => foo` instead of `[:x, :y] => foo`.
And I think it makes sense to guarantee column order preservation for the other selectors. E.g.
df = DataFrame(a=1, b=2, c=3)
select(df, Not(:b) => foo)
should be guaranteed to lower to
select(df, [:a, :c] => foo)
rather than
select(df, [:c, :a] => foo)
I'm not totally certain of the best way to specify the column-ordering properties of `Cols`, but I think this specification makes sense:
- The arguments of `Cols` are first lowered to (ordered) arrays.
- `Cols` is then lowered as follows: `Cols(A, B, C) ==> [A, B\A, C\(A ∪ B)]` (where the arguments on the right side are splatted into the array).
Since `setdiff` on arrays preserves the order of its first argument, we get the following behavior:
df = DataFrame(a=1, b=2, c=3)
Cols([:c, :b], [:a, :b]) == [:c, :b, :a]
Cols(r"[bc]", r"[ab]") == [:b, :c, :a]
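The proposed lowering can be sketched in plain Julia; `lower_cols` is a hypothetical helper, and its arguments are assumed to already be resolved to ordered `Vector{Symbol}`s:

```julia
# Hypothetical helper sketching the proposed lowering; assumes each
# argument has already been resolved to an ordered Vector{Symbol}.
function lower_cols(args::Vector{Symbol}...)
    out = Symbol[]
    for a in args
        # setdiff preserves the order of its first argument, so each new
        # selector contributes its columns in its own order, minus any
        # columns already selected.
        append!(out, setdiff(a, out))
    end
    return out
end

lower_cols([:c, :b], [:a, :b])  # [:c, :b, :a]
```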
Any type that supports metadata will likely need a way to iterate over metadata, for propagating metadata to new instances or merging metadata. The generic way of iterating over metadata right now is accessing each key from the iterable returned by `metadatakeys`. Depending on how metadata is stored, this may not be the most efficient way to accomplish this. If the method propagating metadata internally manages a single collection of metadata, then a unique implementation can be written for whatever method of iteration is best. However, when combining multiple sets of metadata, it's not practical to have a unique method for every combination of metadata storage patterns.
A couple of things to consider in supporting generic iteration over metadata:
- `metapairs` and `styledmetapairs`
- `metadatakeys`?
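A generic fallback for the suggested `metapairs` could reuse the key-based API, so every store gets it for free while dict-backed stores return their pairs directly (a sketch; `metapairs` and the toy `Meta` store below are hypothetical, not part of DataAPI):

```julia
# Hypothetical generic fallback reusing the key-based iteration API.
metapairs(x) = (k => metadata(x, k) for k in metadatakeys(x))

# Toy dict-backed store for illustration (hypothetical).
struct Meta
    d::Dict{String, Any}
end
metadatakeys(m::Meta) = keys(m.d)
metadata(m::Meta, k) = m.d[k]

# A dict-backed store could override the fallback with a direct method:
# metapairs(m::Meta) = pairs(m.d)
```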
Do we expect anything from #36 to be added to the 1.7 release? If not, I would make a release to avoid having to keep track of it in DataFrames.jl.
Following JuliaData/DataFrames.jl#1866 (comment), we could consider adding `hcat!` to DataAPI.jl (alternatively, we can keep it in DataFrames.jl as we currently do).
CC @nalimilan
When a table does not have `nrow` or `ncol` defined, we could add a description in their docstrings of what they should return. `nothing` seems a reasonable value.
It is often the case that one wants to attach metadata of some sort to an array/graph/etc. How do people feel about adding something basic like `metadata(x) = nothing` that can then be extended by other packages?
Both SplitApplyCombine.jl and DataFrames.jl export `flatten`. I would add it to DataAPI.jl. The question is what docstring it should have. Maybe something like:
Flatten a collection of collections into a single collection.
Is that enough?
@andyferris, after this is established maybe you could add DataAPI.jl to SplitApplyCombine.jl as a dependency and make `innerjoin` and `flatten` implement this interface? Then SplitApplyCombine.jl and DataFrames.jl could be used together more easily.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your `TagBot.yml` to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
Currently the `describe` contract is that it pretty-prints the passed object.
The contract does not say what the function returns. I propose that `describe` should keep printing what it promises, but also return the computed statistics.
The crucial problem is that in the REPL this would "double print" the contents. There is also duplication between `describe` and `summarystats`.
I am not sure what is best, but I leave this issue open to keep track of it.
CC @nalimilan
Maybe this was discussed already, but I would think a style named `:default` would be used if `style` is not provided.
I would like to see `style` as an optional keyword argument so that `metadata!(df, "key", "value")` is allowed. (Then TableMetadataTools.jl could define a way to change what style `metadata!(df, "key", "value")` uses.)
Currently, the `fit`, `fit!`, `predict`, and `predict!` functions live in StatsBase.
What do you think about putting the function stubs for those four functions in DataAPI?
The generic fallback definitions would continue to live in StatsBase.
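The stubs themselves would be one-liners (a sketch of what moving them here could look like; not a decision on ownership):

```julia
# Zero-method stubs; StatsBase would keep the generic fallbacks and simply
# add methods to these functions instead of owning them.
function fit end
function fit! end
function predict end
function predict! end
```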
I was thinking that it may be convenient and intuitive to use IntervalSets.jl's ellipsis notation as an alternative to `Between(:a, :b)`:
df[:, :a..:b]
df[:, :a..end]
Sorry if this has already been discussed before.
Best,
Carlo
DataFrames defines `rownumber` and `row`, and I want to define a similar function for the rows (`TimeStepRow`) of my table type (`TimeStepTable`) to be able to retrieve any other row of the table from one row.
I was wondering if it would be interesting to add these functions here?
Both DataFrames and JuliaDB have `flatten` now. Since both functionalities for `flatten` are pretty far from `Iterators.flatten`, it makes sense for both to extend a `flatten` from DataAPI.