Comments (15)

bkamins commented on June 14, 2024

If we go for Cols then I think Cols() should not select any columns.

Now that I think about it more generally: if JuliaDB.jl wants to keep All, that is OK. We can just allow Cols as a new selector here and use it in DataFrames.jl. Then there is the separate question of whether we deprecate All or keep it for legacy use.

from dataapi.jl.

bkamins commented on June 14, 2024

So in summary - do we introduce Cols? (it does not require removing All as this can be a separate step, though in DataFrames.jl I will then deprecate All and allow Cols if we go this way).

In DataFrames.jl the functionality of Cols will be exactly like All except that Cols() will select no columns.

If we go for it I will make a PR here.

pdeffebach commented on June 14, 2024

Thoughts on Cols? It can be thought of as a container for Columns.

Cols(Not(:id), r"^t", [:income, :consumption])

Makes it pretty clear that we are building a container of columns.

bkamins commented on June 14, 2024

Yes - I assume that Cols would do unnesting.

Now we need Cols because, as you note, [Not(:x), :y] .=> fun should behave differently from Cols(Not(:x), :y) .=> fun.

piever commented on June 14, 2024

One issue is that All() is used to select all columns, I think Union() would be very confusing in that sense (one would expect the empty set with Union()).

pdeffebach commented on June 14, 2024

We could deprecate that in favor of Union(:). This is just an idea, I'm not wedded to it but think it might be helpful.

bkamins commented on June 14, 2024

I agree that Union describes the behavior better. In fact, we recently fixed a bug in DataFrames.jl caused by All() actually being All(:). However, Union is already used by Base, so it is not a good name here, and All should probably stay.

CameronBieganek commented on June 14, 2024

I think that Cols is the best choice. The docstring for All is actually incorrect:

All(cols...)
Select the union of the selections in cols.

All creates a tuple of sets of columns, not the union of the sets. For example,

All(:x, :y) != All(:y, :x)
All(:y, :) != All(:)

In DataFrames.jl the ordering produced by All is used to choose the order of the columns when indexing.

The actual behavior of All is this:

  • Let A, B, and C be sets of columns.
  • All(A, B, C) creates a tuple t = (A, B\A, C\(A ∪ B)).
  • Then df[:, All(A, B, C)] becomes df[:, vcat(t...)].

I think this behavior is much better captured by the Cols name. Then Cols() would select zero columns and Cols(:) would select every column.
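To make the ordering rule above concrete, here is a minimal sketch in plain Julia. It is not the DataAPI.jl or DataFrames.jl implementation: `ordered_selection` is a hypothetical helper, each selection is assumed to be already resolved to a vector of column names, and `C` stands in for `:` on a table with columns r, x1, x2, y.

```julia
# Hypothetical helper, not part of DataAPI.jl or DataFrames.jl.
# Concatenates the selections in order, dropping any column that an
# earlier selection already picked: (A, B\A, C\(A ∪ B)) flattened.
function ordered_selection(selections::Vector{Symbol}...)
    seen = Set{Symbol}()
    result = Symbol[]
    for sel in selections
        for col in sel
            if !(col in seen)
                push!(seen, col)
                push!(result, col)
            end
        end
    end
    return result
end

A = [:x1, :x2]
B = [:x2, :y]
C = [:r, :x1, :x2, :y]   # stand-in for `:` (every column of the table)

ordered_selection(A, B, C)   # [:x1, :x2, :y, :r]
ordered_selection(C, A)      # [:r, :x1, :x2, :y]  (order of arguments matters)
ordered_selection()          # Symbol[]            (no selections, no columns)
```

Under these assumptions the result depends on the argument order, which is why it behaves like an ordered container of columns rather than a set union, and the zero-argument call naturally yields no columns.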

Tagging @bkamins, since this would be a breaking change.

EDIT:
Of course maybe the docstring could just be updated. The lack of context of the word All shouldn't generally be a problem, since it pretty much always appears inside an indexing or select expression. And if All is changed to Cols, then it might make sense to change Not to NotCols, because Not can occur without an All/Cols and is subject to the same criticism of "not what?".

Here's another idea. Use Cols instead of All and write a Base.:!(::Cols) method to replace Not. Example:

df = DataFrame(r=1, x1=2, x2=3, y=4)
df[:, Cols(!Cols(r"x"), :)]

I've also thought that it might make sense to just use a literal tuple instead of All. However, I think if I saw df[:, (:x, :x)], then I would expect column :x to be repeated.

bkamins commented on June 14, 2024

@piever - what do you think about it? I think that in DataFrames.jl it is not a problem to make this change now if it were preferred (but after 1.0 release this will be a problem), but this is a DataAPI.jl change that would have a larger impact.

CC @nalimilan

nalimilan commented on June 14, 2024

Cols isn't bad, and it would make more sense with broadcasting, e.g. Cols(Not(:x)) .=> sum (since Not(:x) .=> sum cannot work). But should Cols() still be equivalent to Cols(:)? That's less explicit than All(), though probably not the end of the world.

@CameronBieganek Not exists outside of DataAPI and can be used with any array, so I think we should keep that name.

piever commented on June 14, 2024

I confess my main gripe with the status quo is that All() meaning all columns and All(args...) meaning more or less the union is a bit inconsistent (an empty union should be no columns), but it is also entirely my fault :) (JuliaData/IndexedTables.jl#123). Before that, it used to be called Join.

These references may be relevant: JuliaData/IndexedTables.jl#112 and JuliaData/IndexedTables.jl#119. There is a bit of a discussion comparing with special selectors from dplyr.

Just to make sure I understand: the behavior would remain unchanged except for Cols(), so, say, Cols([:x, :y], :z) would be the same as Cols(:x, :y, :z)? From the point of view of JuliaDB that is an important use: the main use of the All selector is "unnesting", because (Not(:x), :y) would otherwise produce a nested result (with two columns, the first one in turn being a StructArray that corresponds to Not(:x)).

As an aside, as there is no nesting issue, I thought maybe DataFrames could just use a list, say select(df, [Not(:x), :y]), but maybe one wants to wrap things in a special type to overload broadcasting or things like that.

CameronBieganek commented on June 14, 2024

> Not exists outside of DataAPI and can be used with any array, so I think we should keep that name.

Yeah, I agree that the name Not is fine. NotCols is kind of unwieldy and df[:, Cols(!Cols(r"x"), :)] looks goofy.

> Cols isn't bad, and it would make more sense with broadcasting, e.g. Cols(Not(:x)) .=> sum (since Not(:x) .=> sum cannot work).

Is the reason Not(:x) .=> sum cannot work that you would have to commit type piracy on ::Not in order to add broadcasting? It seems unfortunate to have to rely on InvertedIndices.jl for the definition of Not, since then you can't make it broadcastable. Would it make sense to add Not to DataAPI.jl and just let InvertedIndices.Not be its own thing?

On the other hand, given that Cols is needed to make r"x" and : broadcastable (along with the future curried versions of startswith and endswith), maybe it makes sense for Cols to be broadcastable and for all the other column selectors (including Not and Between) to not be broadcastable.

bkamins commented on June 14, 2024

Regarding adding broadcasting to Not: JuliaData/InvertedIndices.jl#15 is open, so maybe comment there as well? (I guess if it gets more support it might be added.)

CameronBieganek commented on June 14, 2024

Related comment here:

JuliaData/DataFrames.jl#2171 (comment)

bkamins commented on June 14, 2024

Just as an additional comment: although : .=> fun does not work, names(df) .=> fun does (and the same holds for all other selectors, once they are resolved to names). It is a bit more cumbersome but works without any magic.
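As a rough illustration of why the names-based form needs no special support, here is a sketch in plain Julia. The vector `column_names` is a made-up stand-in for what names(df) would return, so no table package is needed; it only shows how ordinary broadcasting of `=>` over a vector behaves.

```julia
# Stand-in for names(df); these column names are invented for illustration.
column_names = ["x1", "x2", "y"]

# Broadcasting `=>` over the vector pairs every name with the function.
# `sum` is a Function, which broadcasting treats as a scalar.
specs = column_names .=> sum
# specs == ["x1" => sum, "x2" => sum, "y" => sum]
```

Because the left-hand side is an ordinary vector, the result is an ordinary vector of pairs, one per column.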
