invenia / axissets.jl

Consistent operations over a collection of KeyedArrays

License: MIT License

Julia 100.00%

axissets.jl's Issues

Add auto hash equals

Reduce the number of datastructures

Currently, our container types include the following:

OrderedSets
LittleDicts
Tuples
Patterns
KeyedArrays / NamedDimsArray

This can make it hard to follow which operations return which data types, so maybe we can simplify this in some way.
At the very least, we should probably try to minimize how much of this is exposed through the documentation and avoid too many examples of how they interact with other packages like Tables.jl, DataFrames.jl, etc.

Add an FAQ section

Some design decisions seem to be a bit unintuitive for folks coming from DataFrames.jl. This issue should serve as a list of things to include:

Design:

Why an associative of KeyedArrays?
Why are component paths Tuples?
Why are the wildcards _ and __?
Why is component mutation so hard?

Gotchas:

How do I mutate my component arrays without getting KeyAlignmentErrors?
How do I add more components?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Always run `validate` on construction

We almost always want to run validate when constructing a new dataset, so maybe we should have an inner constructor for that. We'll probably still want a validate=true keyword for specific cases where we don't need to check.
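
A minimal sketch of the inner-constructor idea, using a stand-in type rather than `KeyedDataset` itself. The names `CheckedAxes` and `check_aligned` are hypothetical, and the check is reduced to "all key vectors are equal", which is far simpler than the real constraint validation.

```julia
# Hypothetical stand-in for `validate`: all axis-key vectors must agree.
function check_aligned(axiskeys)
    all(==(first(axiskeys)), axiskeys) || throw(ArgumentError("axis keys are misaligned"))
    return nothing
end

struct CheckedAxes
    axiskeys::Vector{Vector{Int}}

    # Inner constructor: validation runs by default and can be skipped explicitly.
    function CheckedAxes(axiskeys; validate::Bool=true)
        validate && check_aligned(axiskeys)
        return new(axiskeys)
    end
end

CheckedAxes([[1, 2], [1, 2]])                   # validated on construction
CheckedAxes([[1, 2], [1, 3]]; validate=false)   # check skipped on request
```

Because the check lives in the inner constructor, no outer constructor can bypass it by accident; skipping it requires passing the keyword.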

Support FeatureTransforms.jl

Should be able to apply arbitrary feature transforms to some subselection of a dataset.

Does `flatten` even need to live in this package?

Unless we're performing flatten with a KeyedDataset then maybe it should live elsewhere?

Loosen key restriction to just `Tuple`

Currently, all "paths" are just tuples of symbols. This is pretty efficient because we can do === in the pattern matching code, but we may have a use case for custom types that folks may want to dispatch on. In that case, loosening the constraint may be worth it.

Replace most instances of Tuple{Vararg{Symbol}} with Tuple
Update the code in patterns.jl to use ==

The bright side is that this makes me happy we chose Tuple over Vector :)

setindex! introducing a constraint isn't always desirable

This came up in an internal implementation.

When we introduce a new element to a KeyedDataset, the setindex! method assigns a constraint if one isn't already associated with its dimpath.

AxisSets.jl/src/indexing.jl

Lines 122 to 130 in eb33c8c

    
    for d in dimnames(val)
        dimpath = (key..., d)
        # Similar to construction, if our dimpath isn't present in any existing constraints
        # then we introduce a new one that's `Pattern(:__, dim)`
        if all(c -> !in(dimpath, c), ds.constraints)
            push!(ds.constraints, Pattern(:__, d))
        end
    end

However, this isn't necessarily desirable, as the new element might share a dimension name with an existing element and so introduce a conflicting constraint.

MWE

julia> using AxisSets, AxisKeys
julia> using AxisSets: Pattern

# assume the following
julia> ds = KeyedDataset(
           (:train, :input) => KeyedArray(ones(5, 5); time=1:5, id='a':'e'),
           (:predict, :input) => KeyedArray(ones(5); id='a':'e'),
           (:train, :output) => KeyedArray(ones(5); time=1:5);
           constraints=Pattern[
               (:train, :_, :time),
               (:predict, :_, :time),
               (:__, :input, :id),  # offending constraint
           ]
       )

# assign a new element with `:id` dimname - introduces `(:__, :id)` constraint
julia> ds[(:train, :weights)] = KeyedArray(ones(5), id='a':'e')

julia> ds.constraints
OrderedCollections.OrderedSet{Pattern} with 4 elements:
  Pattern((:train, :_, :time))
  Pattern((:predict, :_, :time))
  Pattern((:train, :input, :id))
  Pattern((:__, :id))

julia> ds
Error showing value of type KeyedDataset:
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element
Stacktrace:
  [1] only
    @ ./iterators.jl:1327 [inlined]
  [2] _only(x::Vector{Int64})
    @ AxisSets ~/.julia/packages/AxisSets/27klG/src/AxisSets.jl:27
  [3] (::AxisSets.var"#20#22"{Vector{Pattern}, Tuple{Symbol, Symbol}})(dimname::Symbol)
    @ AxisSets ~/.julia/packages/AxisSets/27klG/src/dataset.jl:97
  [4] map(f::AxisSets.var"#20#22"{Vector{Pattern}, Tuple{Symbol, Symbol}}, t::Tuple{Symbol, Symbol})
    @ Base ./tuple.jl:214
  [5] show(io::IOContext{Base.TTY}, ds::KeyedDataset)
    @ AxisSets ~/.julia/packages/AxisSets/27klG/src/dataset.jl:95
   ...

In this instance one would be better off redefining the offending constraint as (:__, :id) but this might not always be possible/desirable.

`map` errors if function returns `nothing`

Bit of an edge case, but seems valid.

If you use map on a KeyedDataset and the block returns nothing, it errors because map! tries to assign nothing as a KeyedArray (see the setindex! frame in the stacktrace).

MWE:

julia> ds = KeyedDataset(:input => KeyedArray([1, 2]; a=[1, 2]));

julia> map(ds) do A
           println(typeof(A))
       end
KeyedArray{Int64, 1, NamedDimsArray{(:a,), Int64, 1, Vector{Int64}}, Base.RefValue{Vector{Int64}}}
ERROR: MethodError: Cannot `convert` an object of type 
  Nothing to an object of type 
  KeyedArray
Closest candidates are:
  convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/factorization.jl:58
  convert(::Type{T}, ::T) where T<:AbstractArray at abstractarray.jl:14
  convert(::Type{T}, ::T) where T at essentials.jl:205
Stacktrace:
 [1] setindex!(dd::OrderedCollections.LittleDict{Tuple, KeyedArray, Vector{Tuple}, Vector{KeyedArray}}, value::Nothing, key::Tuple{Symbol})
   @ OrderedCollections ~/.julia/packages/OrderedCollections/cP9uu/src/little_dict.jl:219
 [2] map!(f::var"#9#10", dest::KeyedDataset, src::KeyedDataset)
   @ AxisSets ~/.julia/packages/AxisSets/IddYk/src/functions.jl:49
 [3] map(::Function, ::KeyedDataset)
   @ AxisSets ~/.julia/packages/AxisSets/IddYk/src/functions.jl:41
 [4] top-level scope
   @ REPL[16]:1

If the point of map(::KeyedDataset) is only to modify the dataset, that may be fair, but I think it should either have a fallback, or a clearer error.
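
One way to get a clearer error is to check the result before storing it. The sketch below works on a plain `Dict` rather than a `KeyedDataset`; `checked_map` is a hypothetical helper, not the package's `map`.

```julia
# Hypothetical helper: apply `f` to every value and fail early with a
# descriptive message when the result cannot be stored in the container.
function checked_map(f, data::AbstractDict)
    out = empty(data)
    for (key, value) in data
        result = f(value)
        if !(result isa valtype(data))
            throw(ArgumentError(
                "function returned $(typeof(result)) for component $key; " *
                "expected a $(valtype(data))"
            ))
        end
        out[key] = result
    end
    return out
end

data = Dict{Tuple,Vector{Int}}((:input,) => [1, 2])
checked_map(x -> x .+ 1, data)    # fine: every result is a Vector{Int}
```

Calling `checked_map(x -> nothing, data)` would then raise an `ArgumentError` that names the component and the offending return type, instead of a `convert` failure deep inside the container.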

Breakup `example.md` into smaller clearer examples

We currently have a big jump between the docstring examples in the API docs and the one very large example that assumes you're already familiar with the API. This documentation structure results in a lot of cognitive load when learning the package, particularly since it pulls in syntax from half a dozen other packages (e.g., Tables, DataFrames, AxisKeys).

Support Tables interface

Currently, KeyedArrays support the tables interface and wrapdims can be used to create a KeyedArray from an existing table. I'm not entirely sure if we have a use-case for it yet, but it might be nice if we could ingest and produce tables. Some tricky parts of this include:

How do we generalize the extraction of keys and values from a table?
Do we need to introduce a constraint where all components must have at least 1 shared dimension order to construct a single table from the dataset?
How do we generalize the table construction? Are we basically implementing our own join logic to avoid depending on a specific table implementation?

Fuzzy constraints

In some datasets, we may want to define a custom alignment function other than simply ==. One use case is when we have both hourly and daily data which aren't the same size (obviously), but we want to perform the same operations over both (i.e. filtering out a day should also remove the corresponding hours). If we restrict ourselves to Interval queries, then we should be able to define a looser alignment function which simply checks that the days in the hourly axis correspond to the days in the daily data.

NOTE: I'm not sure how much of a priority this is, as in most cases we could probably just convert the daily data to hourly... either through interpolation or a sparse matrix.
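
As a rough illustration of such a looser alignment function, the sketch below treats an hourly axis as aligned with a daily axis when the distinct days it covers are exactly the days on the daily axis. `fuzzy_aligned` is a hypothetical name, and how it would plug into constraint validation is left open.

```julia
using Dates

# Hypothetical alignment check: compare the distinct days covered by the
# hourly axis against the daily axis, instead of comparing the axes with `==`.
fuzzy_aligned(hourly, daily) = unique(Date.(hourly)) == daily

hourly = DateTime(2021, 1, 1):Hour(1):DateTime(2021, 1, 2, 23)
daily = Date(2021, 1, 1):Day(1):Date(2021, 1, 2)

fuzzy_aligned(hourly, daily)         # both days are covered
fuzzy_aligned(hourly[1:24], daily)   # only the first day is covered
```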

`KeyedDataset` subtype of `Dictionary`

There are a few benefits to having KeyedDataset subtype a Dictionary:

Extracting singular KeyedArrays should be easier with only(values(ds(pattern...).data)) becoming only(ds(pattern...)).
Iterating over data should become relatively intuitive.
Extracting paths with keys, pairs would be better than directly accessing the data field.
We can use custom indices to improve lookup performance.

Removing a dim does not violate the constraint on that dim

I thought this would throw an error.

MWE:

julia> ds = KeyedDataset(
           :train => KeyedArray(zeros(5, 2); target=1:5, id=[:a, :b]),
           :predict => KeyedArray(zeros(3, 2); target=6:8, id=[:a, :b]),
           constraints=Pattern[(:_, :id)]
       )
KeyedDataset with:
  2 components
    (:train,) => 5x2 KeyedArray{Float64} with dimension target, id[1]
    (:predict,) => 3x2 KeyedArray{Float64} with dimension target, id[1]
  1 constraints
    [1] (:_, :id) ∈ 2-element Vector{Symbol}

julia> map(A -> A(id=:a), ds, (:predict, :_))
KeyedDataset with:
  2 components
    (:train,) => 5x2 KeyedArray{Float64} with dimension target, id[1]
    (:predict,) => 3 KeyedArray{Float64} with dimension target
  1 constraints
    [1] (:_, :id) ∈ 2-element Vector{Symbol}

Whereas it throws an error if the dim is preserved:

julia> map(A -> A(id=AxisKeys.Interval(:a, :a)), ds, (:predict, :_))
ERROR: KeyAlignmentError: Misaligned dimension keys on constraint Pattern((:_, :id))
  Tuple[(:predict, :id)] ∈ 1-element view(::Vector{Symbol}, [1]) with eltype Symbol
  Tuple[(:train, :id)] ∈ 2-element Vector{Symbol}

Stacktrace:
 [1] validate(ds::KeyedDataset, constraint::Pattern{Tuple{Symbol, Symbol}}, paths::Set{Tuple})
   @ AxisSets ~/.julia/packages/AxisSets/YTn0q/src/dataset.jl:293
 [2] validate(ds::KeyedDataset)
   @ AxisSets ~/.julia/packages/AxisSets/YTn0q/src/dataset.jl:267
 [3] map!(f::var"#5#6", dest::KeyedDataset, src::KeyedDataset)
   @ AxisSets ~/.julia/packages/AxisSets/YTn0q/src/functions.jl:52
 [4] map(f::Function, ds::KeyedDataset, keys::Tuple{Symbol, Symbol})
   @ AxisSets ~/.julia/packages/AxisSets/YTn0q/src/functions.jl:41
 [5] top-level scope
   @ REPL[9]:1

Doctests failing due to printing changes in 1.7

https://github.com/invenia/AxisSets.jl/runs/4876807034?check_suite_focus=true

Support NamedDimsArray wrapping KeyedArray

In AxisKeys.jl, NamedDimsArray{KeyedArray} and KeyedArray{NamedDimsArray} are equivalent. From the README:

A nested pair of wrappers can be constructed with keywords for names, and everything should work the same way in either order

This is not the case for AxisSets. MWE:

julia> A = NamedDimsArray(rand(2, 3), row=[:a, :b], col=1:3);

julia> KeyedDataset(:x => A)
ERROR: MethodError: Cannot `convert` an object of type 
  NamedDimsArray{(:row, :col), Float64, 2, KeyedArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Symbol}, UnitRange{Int64}}}} to an object of type 
  KeyedArray
Closest candidates are:
  convert(::Type{T}, ::Intervals.AnchoredInterval{P, T, L, R} where {L<:Intervals.Bounded, R<:Intervals.Bounded}) where {P, T} at /Users/bencottier/.julia/packages/Intervals/ua9cq/src/anchoredinterval.jl:181
  convert(::Type{T}, ::Intervals.Interval{T, L, R} where {L<:Intervals.Bound, R<:Intervals.Bound}) where T at /Users/bencottier/.julia/packages/Intervals/ua9cq/src/interval.jl:253
  convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/factorization.jl:58
  ...
Stacktrace:
 [1] push!(a::Vector{KeyedArray}, item::NamedDimsArray{(:row, :col), Float64, 2, KeyedArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Symbol}, UnitRange{Int64}}}})
   @ Base ./array.jl:928
 [2] (OrderedCollections.LittleDict{Tuple, KeyedArray, KS, VS} where {KS<:(Union{var"#s4", var"#s3"} where {var"#s4"<:Tuple, var"#s3"<:(Vector{T} where T)}), VS<:(Union{var"#s4", var"#s3"} where {var"#s4"<:Tuple, var"#s3"<:(Vector{T} where T)})})(itr::OrderedCollections.LittleDict{Tuple{Symbol}, NamedDimsArray{(:row, :col), Float64, 2, KeyedArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Symbol}, UnitRange{Int64}}}}, Vector{Tuple{Symbol}}, Vector{NamedDimsArray{(:row, :col), Float64, 2, KeyedArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Symbol}, UnitRange{Int64}}}}}})
   @ OrderedCollections ~/.julia/packages/OrderedCollections/PRayh/src/little_dict.jl:73
 [3] convert
   @ ./abstractdict.jl:523 [inlined]
 [4] KeyedDataset (repeats 2 times)
   @ ~/.julia/packages/AxisSets/ullT8/src/dataset.jl:49 [inlined]
 [5] KeyedDataset(pairs::Pair{Symbol, NamedDimsArray{(:row, :col), Float64, 2, KeyedArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Symbol}, UnitRange{Int64}}}}}; constraints::Vector{AxisSets.Pattern})
   @ AxisSets ~/.julia/packages/AxisSets/ullT8/src/dataset.jl:72
 [6] KeyedDataset(pairs::Pair{Symbol, NamedDimsArray{(:row, :col), Float64, 2, KeyedArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Symbol}, UnitRange{Int64}}}}})
   @ AxisSets ~/.julia/packages/AxisSets/ullT8/src/dataset.jl:57
 [7] top-level scope
   @ REPL[63]:1

Add examples of realistic Transforms to the docs

We should work a couple of the more complicated transforms into the example docs. It might be helpful for folks to see a few of the different combinations in practice.

Originally posted by @rofinn in #50 (comment)

Pattern reductions

Currently, you can create a KeyedDataset in which a component dimension can match multiple constraint patterns. However, if dimension x must align with all axes in pattern a and b then that must mean that all axes in a and b must align with each other and can be described by a single, more general, Pattern.

Readme

Per recent discussions, we want to have READMEs as well as docs.

Support `occursin` matching on segments

For example, we should be able to define patterns like:

Pattern(:train, :input, (:foo, :bar))

which would match (:train, :input, :foo) or (:train, :input, :bar), but not (:train, :input, :baz)

Similarly, I could see an argument for something like:

Pattern(:train, :input, r"foo.*")

which would match (:train, :input, "foo.1") and (:train, :input, "foo.2"), but not (:train, :input, "bar.1")

NOTE: These should both be fallbacks, so that (:train, :input, (:foo, :bar)) or (:train, :input, r"foo.*") would take priority if they matched first.
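
A small sketch of how these fallbacks could be expressed with dispatch. This is not the package's matching code: `alternative`, `segment_matches` and `path_matches` are hypothetical names, and wildcards are left out to keep the example short.

```julia
# Hypothetical segment matching. Exact equality is tried first; the tuple and
# regex forms only act as fallbacks when the segment is not literally equal.
alternative(p::Tuple, x) = x in p
alternative(p::Regex, x) = occursin(p, string(x))
alternative(p, x) = false

segment_matches(p, x) = p == x || alternative(p, x)

# Patterns and paths are compared segment by segment (no wildcards here).
path_matches(pattern::Tuple, path::Tuple) =
    length(pattern) == length(path) && all(map(segment_matches, pattern, path))

path_matches((:train, :input, (:foo, :bar)), (:train, :input, :foo))
path_matches((:train, :input, r"foo.*"), (:train, :input, "foo.1"))
```

Checking `p == x` before the fallback is what gives a literal `(:foo, :bar)` or `r"foo.*"` segment priority over the alternative-matching interpretation.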

Support Impute.jl

Should be able to apply filters, validation and imputation to subsets of a dataset.

Better dimension mismatch error

Since it's an error we should probably include more details about what the alignment error actually is.

Custom exception type
List:
- Matched constraint
- Expected dimension
- Input dimensions
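
A sketch of what such an exception could look like, carrying the three items listed above. The type name `AlignmentMismatch` and its field names are hypothetical and chosen only for illustration.

```julia
# Hypothetical exception type holding the details listed above.
struct AlignmentMismatch <: Exception
    constraint::Tuple                 # matched constraint
    expected::Any                     # expected dimension keys
    inputs::Vector{Pair{Tuple,Any}}   # offending dimension paths and their keys
end

function Base.showerror(io::IO, err::AlignmentMismatch)
    println(io, "AlignmentMismatch: misaligned dimension keys on constraint ", err.constraint)
    println(io, "  expected: ", err.expected)
    for (path, dimkeys) in err.inputs
        println(io, "  ", path, " has ", dimkeys)
    end
end

err = AlignmentMismatch((:_, :id), [:a, :b], Pair{Tuple,Any}[(:predict, :id) => [:a]])
message = sprint(showerror, err)
```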

Multi-value wildcard doesn't work when dimpaths contain repeated names

A filtered dataset had inconsistent constrained values across tables. The problem turned out to be that the constraint ignored one of the tables because it was using a multi-value wildcard. Switching to a single-value wildcard fixed it.

julia> using AxisSets: Pattern

julia> (:x, :a, :a) in Pattern(:x, :a, :a)
true

julia> (:x, :a, :a) in Pattern(:x, :_, :a)
true

julia> (:x, :a, :a) in Pattern(:x, :__, :a)
false

julia> (:x, :a, :a) in Pattern(:__, :a)
false
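
For comparison, here is a small backtracking matcher that does handle repeated names. It is a hypothetical sketch, not the package's implementation, and it assumes `:_` matches exactly one segment while `:__` matches one or more.

```julia
# Hypothetical matcher: `:_` matches exactly one segment, `:__` one or more.
function matches(pattern::Tuple, path::Tuple)
    isempty(pattern) && return isempty(path)
    isempty(path) && return false
    head, rest = first(pattern), Base.tail(pattern)
    if head === :__
        # Try every possible split point instead of stopping at the first
        # segment that equals the next pattern element.
        return any(n -> matches(rest, path[n+1:end]), 1:length(path))
    end
    return (head === :_ || head == first(path)) && matches(rest, Base.tail(path))
end

matches((:x, :__, :a), (:x, :a, :a))
matches((:__, :a), (:x, :a, :a))
```

The key point is that the multi-value wildcard is allowed to consume a segment even when that segment also equals the next literal in the pattern.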

Example

Include an example in the docs which demonstrates how to use the Dataset type to perform batched / constrained operations over a collection of KeyedArrays. This example should include:

Construction from multiple tables
Filtering via multiple dimensions
Imputing values
Flattening dimensions
Concatenating multiple components together to produce X, y, X̂ and ŷ values

Implement `cat`, `hcat` and `vcat`

I don't think this is necessary for an initial release, but it'd be nice if we could support batched concatenation of components.

Merge fails on >2 datasets

merge works fine with 2 datasets, but breaks with 3. The docstring claims merge should work on multiple datasets.

julia> ds1 = KeyedDataset(
                    :a => KeyedArray(zeros(3); time=1:3),
                    :b => KeyedArray(ones(3, 2); time=1:3, loc=[:x, :y]),
                );

julia> ds2 = KeyedDataset(
                    :c => KeyedArray(ones(3); time=1:3),
                    :d => KeyedArray(zeros(3, 2); time=1:3, loc=[:x, :y]),
                );

julia> ds3 = KeyedDataset(
                    :e => KeyedArray(ones(3); time=1:3),
                    :f => KeyedArray(zeros(3, 2); time=1:3, loc=[:x, :y]),
                );

julia> merge(ds1, ds2)
KeyedDataset with:
  4 components
    (:a,) => 3 KeyedArray{Float64} with dimension time[1]
    (:b,) => 3x2 KeyedArray{Float64} with dimension time[1], loc[2]
    (:c,) => 3 KeyedArray{Float64} with dimension time[1]
    (:d,) => 3x2 KeyedArray{Float64} with dimension time[1], loc[2]
  2 constraints
    [1] (:__, :time) ∈ 3-element UnitRange{Int64}
    [2] (:__, :loc) ∈ 2-element Vector{Symbol}

julia> merge(ds1, ds2, ds3)
ERROR: MethodError: no method matching KeyedDataset(::OrderedCollections.OrderedSet{AxisSets.Pattern}, ::Dict{Tuple, KeyedArray})
Closest candidates are:
  KeyedDataset(::OrderedCollections.OrderedSet{AxisSets.Pattern}, ::OrderedCollections.LittleDict) at /Users/sam/.julia/packages/AxisSets/ullT8/src/dataset.jl:44
  KeyedDataset(::OrderedCollections.OrderedSet{AxisSets.Pattern}, ::OrderedCollections.LittleDict, ::Any) at /Users/sam/.julia/packages/AxisSets/ullT8/src/dataset.jl:44
Stacktrace:
 [1] merge(::KeyedDataset, ::KeyedDataset, ::KeyedDataset)
   @ AxisSets ~/.julia/packages/AxisSets/ullT8/src/functions.jl:116
 [2] top-level scope
   @ REPL[22]:1

More constructors

We should be able to make an empty KeyedDataset with a set of constraints.

julia> expected = KeyedDataset(
           constraints=Pattern[
                   (:_, :input, :id),
                   (:_, :output, :id),
                   (:train, :_, :target),
                   (:predict, :_, :target),
               ],
           )
ERROR: StackOverflowError:
Stacktrace:
 [1] KeyedDataset(; constraints::Array{Pattern,1}, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/rory/.julia/packages/AxisSets/3RRAg/src/dataset.jl:58 (repeats 13779 times)
 [2] top-level scope at REPL[6]:1

We should be able to construct a KeyedDataset with variable key lengths.

julia> expected = KeyedDataset(
           (:a,) => KeyedArray(reshape(1:8, (4, 2)); target=1:4, id=[:a, :b]),
           (:a, :b) => KeyedArray(reshape(1:8, (4, 2)); target=1:4, id=[:a, :b]),
       )
ERROR: MethodError: no method matching KeyedDataset(::Pair{Tuple{Symbol},KeyedArray{Int64,2,NamedDimsArray{(:target, :id),Int64,2,Base.ReshapedArray{Int64,2,UnitRange{Int64},Tuple{}}},Tuple{UnitRange{Int64},Array{Symbol,1}}}}, ::Pair{Tuple{Symbol,Symbol},KeyedArray{Int64,2,NamedDimsArray{(:target, :id),Int64,2,Base.ReshapedArray{Int64,2,UnitRange{Int64},Tuple{}}},Tuple{UnitRange{Int64},Array{Symbol,1}}}})
Closest candidates are:
  KeyedDataset(::Pair{T,B} where B...; constraints) where T<:Tuple at /Users/rory/.julia/packages/AxisSets/3RRAg/src/dataset.jl:29
Stacktrace:
 [1] top-level scope at REPL[9]:1

List of functions/utilities we don't seem to use

This issue is to track any utilities / functions we don't seem to use that could potentially be deprecated at some point.

Docs delim section

Since we're aiming for explicit delimiters we should probably have a section explaining some of the decisions.

Why do we need delimiters if keys should be tuples?

NamedTuples need names to be symbols
Dimension names are also represented by symbols

Why do we need multiple delimiters?

To identify slightly different types of operations that may be important for code that automatically parses / tokenizes the results.
- ⁻: Common identifier for flattening arbitrary nested structures
- ˣ: Flattening dimensions of an array (new dimensions and keys are the product of the flattened dimensions)
- ⁺: Arrays have been concatenated along that dimension

Add apply! and apply_append methods

Following on from #50 which just implemented the minimum required interface. It might be useful to also support apply! and apply_append.

Wildcard for 0-or-more in Pattern

Somewhat related to #48.
I have data containing training sets for several features. Some have a single KeyedArray and others have multiple sub-components. It would be nice to be able to get all :train data using a wildcard instead of merging (:_, :train, :__) and (:_, :train).

julia> data=KeyedDataset(
           (:f1, :train)=>KeyedArray([1], a=[1]), 
           (:f2, :train, :x)=>KeyedArray([1], b=[1]), 
           (:f2, :train, :y)=>KeyedArray([1], c=[1]),
       )
KeyedDataset with:
  3 components
    (:f1, :train) => 1 KeyedArray{Int64} with dimension a[1]
    (:f2, :train, :x) => 1 KeyedArray{Int64} with dimension b[2]
    (:f2, :train, :y) => 1 KeyedArray{Int64} with dimension c[3]
  3 constraints
    [1] (:__, :a) ∈ 1-element Vector{Int64}
    [2] (:__, :b) ∈ 1-element Vector{Int64}
    [3] (:__, :c) ∈ 1-element Vector{Int64}

julia> data(:_, :train, :__)
KeyedDataset with:
  2 components
    (:f2, :train, :x) => 1 KeyedArray{Int64} with dimension b[1]
    (:f2, :train, :y) => 1 KeyedArray{Int64} with dimension c[2]
  2 constraints
    [1] (:__, :b) ∈ 1-element Vector{Int64}
    [2] (:__, :c) ∈ 1-element Vector{Int64}

julia> data(:_, :train)
KeyedDataset with:
  1 components
    (:f1, :train) => 1 KeyedArray{Int64} with dimension a[1]
  1 constraints
    [1] (:__, :a) ∈ 1-element Vector{Int64}

Solidify the `KeyedDataset` interface

Related to #25: it's unclear which common Julia operations are applicable to a KeyedDataset (e.g., first(ds), values(ds), etc). Maybe it's worth making this type an AbstractDict with some extra functions?
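
To give a feel for how little is needed, the sketch below wraps a plain `Dict` in a hypothetical `TinyDataset` type. Defining `length`, `iterate` and `get` is enough for Base's generic `AbstractDict` methods to supply `keys`, `values`, `first`, `haskey` and indexing; constraints are ignored entirely here.

```julia
# Hypothetical stand-in for a dataset type that subtypes AbstractDict.
struct TinyDataset{V} <: AbstractDict{Tuple,V}
    data::Dict{Tuple,V}
end

# The minimal methods that the generic AbstractDict fallbacks build on.
Base.length(ds::TinyDataset) = length(ds.data)
Base.iterate(ds::TinyDataset, state...) = iterate(ds.data, state...)
Base.get(ds::TinyDataset, key, default) = get(ds.data, key, default)

ds = TinyDataset(Dict{Tuple,Vector{Int}}((:a,) => [1, 2]))

only(values(ds))     # the single component, no need to reach into `ds.data`
collect(keys(ds))    # component paths
```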

Smaller `Pattern`s examples

The current comments in example.md are a bit confusing and should probably be made clearer. For example, we need better clarification of what "matching" means in terms of dimension paths.

Consider revising separators?

We currently use unicode separators in a few places where tuples are possible. This may contribute to the cognitive load of the API and documentation.

Consider a different glob (`:_`/`:__`) syntax

While aligning with glob (*/**) and NamedDims.jl (:_ wildcard dim) is nice, :_ and :__ are a bit hard to distinguish depending on the font setting in an editor or terminal. We might want to consider using a different symbol or maybe our keys should just be path strings reusing the glob syntax?

Look into lazy loading functionality for large datasets

I'm not yet sure if this should be handled by AxisSets.jl directly, but it's a workflow we should look into more.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset
http://shashi.biz/FileTrees.jl/

Easy way to insert new key and value

For many operations we'll want to both mutate the data components and rename the keys.

For example, maybe we want to one-hot-encode a time feature for both train/predict inputs.

for (k, v) in pairs(ds(:_, :input, :temp))
    # Rename our component from temp to hod (keys are Tuples, so convert back)
    _k = Tuple(replace(collect(k), :temp => :hod))
    # Insert our one-hot-encoded hour-of-day feature from the temperature times
    # under the renamed key. `ohe` and `hod` are user-defined helpers.
    ds[_k] = ohe(hod(v.time))
end

Unfortunately, this has two issues:

It's a bit verbose and ideally we could write this with something like a map.
We're calling validate each time we insert into the dataset.
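
A map-style alternative could build all renamed pairs first, so that a dataset constructed from the result only needs validating once. The sketch below shows the idea on a plain `Dict`; `rename_key` and `map_pairs` are hypothetical helpers and the values are placeholder data.

```julia
# Hypothetical helpers: rename one segment of a tuple key, and transform every
# value while renaming its key in a single pass.
rename_key(key::Tuple, rule::Pair) = map(s -> s == first(rule) ? last(rule) : s, key)

function map_pairs(f, rule::Pair, data::AbstractDict)
    return Dict(rename_key(k, rule) => f(v) for (k, v) in data)
end

data = Dict((:train, :input, :temp) => [1, 2], (:predict, :input, :temp) => [3])
renamed = map_pairs(v -> v .* 10, :temp => :hod, data)
```

Mapping over the tuple directly keeps the key a `Tuple`, which avoids the `collect` and conversion round trip.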

Indexing tests throw `CanonicalIndexError` instead of `ErrorException` when run in Julia 1.8

Julia 1.8 introduced a new error type, CanonicalIndexError, in abstractarray.jl. It is thrown when indexing into an array in these tests, and it didn't exist in Julia 1.6.

This error type is also standalone: it isn't associated with any other error struct. I'm wondering whether abstractarray.jl in Base should be changed so that CanonicalIndexError is a subtype of ErrorException and these tests pass, or whether the tests themselves should change to reflect the new error type.

Support simple syntax for extracting sub-datasets

It would be nice to have a convenient syntax for creating a new dataset with a subset of the variables. Currently this can be done with the following:

julia> using AxisKeys, AxisSets

julia> ds = KeyedDataset(
           :val1 => KeyedArray(zeros(3, 2); time=1:3, foo=[:a, :b]),
           :val2 => KeyedArray(ones(3, 2); time=1:3, bar=[:x, :y]),
           :val3 => KeyedArray(ones(3, 2); time=1:3, baz=[:z, :w]),
       );

julia> ds(in([(:val1,), (:val2,)]))
KeyedDataset with:
  2 components
    (:val1,) => 3x2 KeyedArray{Float64} with dimension time[1], foo[2]
    (:val2,) => 3x2 KeyedArray{Float64} with dimension time[1], bar[3]
  3 constraints
    [1] (:__, :time) ∈ 3-element UnitRange{Int64}
    [2] (:__, :foo) ∈ 2-element Vector{Symbol}
    [3] (:__, :bar) ∈ 2-element Vector{Symbol}

I propose a simple overload like the following:

julia> (ds::KeyedDataset)(i::AbstractVector{Symbol}) = ds(in([(s,) for s in i]))

julia> ds([:val1, :val2])
KeyedDataset with:
  2 components
    (:val1,) => 3x2 KeyedArray{Float64} with dimension time[1], foo[2]
    (:val2,) => 3x2 KeyedArray{Float64} with dimension time[1], bar[3]
  3 constraints
    [1] (:__, :time) ∈ 3-element UnitRange{Int64}
    [2] (:__, :foo) ∈ 2-element Vector{Symbol}
    [3] (:__, :bar) ∈ 2-element Vector{Symbol}

Use Requires.jl for external packages

We're slowly adding support for more external packages like Impute.jl and FeatureTransforms.jl. It might be a good idea to preemptively start using Requires.jl to avoid exposing ourselves to too many dependencies. Currently, both packages are pretty minimal, but that might not always be the case.

Less verbose `show` method

We currently reuse the show method for each component KeyedArray. This looks nice for small datasets, but if you have more than a few components this can be very verbose to output by default. Also, if you want to inspect a specific component you can always access it with getindex.

Things we usually want to know:

What are the constraints?
How many components are there?
What's the dimensionality of each component?
Which dimensions are aligned to each constraint?
What are the eltypes for each component and corresponding axiskeys?

Sample:

KeyedDataset with:
    7 constraints:
        (:train, :input, :_, :time) ∈ 145-element Vector{Dates.DateTime}
        (:train, :output, :_, :time) ∈ 145-element Vector{Dates.DateTime}
        (:predict, :input, :_, :time) ∈ 25-element Vector{Dates.DateTime}
        (:predict, :output, :_, :time) ∈ 25-element Vector{Dates.DateTime}
        (:__, :prices, :id) ∈ 4-element Vector{Symbol}
        (:__, :temp, :id) ∈ 4-element Vector{Symbol}
        (:__, :load, :id) ∈ 2-element Vector{Symbol}
    8 components:
        (:train, :input, :prices) => 145x4x4 KeyedArray{Union{Missing, Float64}, 3} with dimension names: (:time, :id, :lag)
        ...
        (:predict, :output, :prices) => 25x4 KeyedArray{Union{Missing, Float64}, 2} with dimension names: (:time, :id)

In theory, I guess we could also generate random colors and use those to visually align the component dimension names to the corresponding constraint?

Utility function for retrieving component names

Currently the only way I can find to retrieve the names of the components of ds::KeyedDataset is with keys(ds.data). As noted on slack, it's not safe to access ds.data directly, as its implementation might change. Is there another way to get the components list with an API function? Alternatively, an API function like components(ds::KeyedDataset) = collect(keys(ds.data)) would be useful.

Similarly, am I correct that given a component name k::Symbol, getproperty(ds::KeyedDataset, k) is the intended way to access the corresponding KeyedArray?

CI doctest failures

https://github.com/invenia/AxisSets.jl/runs/2198579777?check_suite_focus=true#step:6:137

