Comments (15)
If we go for `Cols`, then I think `Cols()` should not select any columns.
Now that I think of it in general: if JuliaDB.jl wants to keep `All`, that is OK. We can just allow `Cols` as a new selector here and use it in DataFrames.jl. Then there is another question: do we deprecate `All` or keep it as legacy?
from dataapi.jl.
So in summary: do we introduce `Cols`? (It does not require removing `All`, as this can be a separate step, though in DataFrames.jl I will then deprecate `All` and allow `Cols` if we go this way.)
In DataFrames.jl the functionality of `Cols` will be exactly like `All`, except that `Cols()` will select no columns.
If we go for it I will make a PR here.
from dataapi.jl.
Thoughts on `Cols`? It can be thought of as a container for columns.
Cols(Not(:id), r"^t", [:income, :consumption])
This makes it pretty clear that we are building a container of columns.
from dataapi.jl.
Yes, I assume that `Cols` would do unnesting.
Now we need `Cols` because, as you note, `[Not(:x), :y] .=> fun` should have a different behavior than `Cols(Not(:x), :y) .=> fun`.
from dataapi.jl.
One issue is that `All()` is used to select all columns. I think `Union()` would be very confusing in that sense (one would expect the empty set with `Union()`).
from dataapi.jl.
We could deprecate that in favor of `Union(:)`. This is just an idea; I'm not wedded to it but think it might be helpful.
from dataapi.jl.
I agree that `Union` is a better name. Actually, we had a bug in DataFrames.jl, recently fixed, that came from `All()` actually being `All(:)`. However, `Union` is not the best name as it is used by Base, so probably `All` should stay.
from dataapi.jl.
I think that `Cols` is the best choice. The docstring for `All` is actually incorrect:
All(cols...)
Select the union of the selections in cols.
`All` creates a tuple of sets of columns, not the union of the sets. For example,
All(:x, :y) != All(:y, :x)
All(:y, :) != All(:)
In DataFrames.jl the ordering produced by `All` is used to choose the order of the columns when indexing.
The actual behavior of `All` is this:
- Let `A`, `B`, and `C` be sets of columns. `All(A, B, C)` creates a tuple `t = (A, B\A, C\(A ∪ B))`.
- Then `df[:, All(A, B, C)]` becomes `df[:, vcat(t...)]`.
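The two steps above can be sketched in plain Julia. This is only an illustration under stated assumptions: `ordered_selection` is a hypothetical helper name, not a function from DataAPI.jl or DataFrames.jl, and each selection is assumed to be already resolved to a vector of column names.

```julia
# Hypothetical helper (not part of DataAPI.jl or DataFrames.jl): combine
# several column selections into one ordered list, keeping only the first
# occurrence of each column.
function ordered_selection(selections::AbstractVector{Symbol}...)
    seen = Symbol[]
    parts = Vector{Symbol}[]
    for sel in selections
        fresh = setdiff(sel, seen)   # B\A, then C\(A ∪ B), and so on
        push!(parts, fresh)
        append!(seen, fresh)
    end
    # Flatten the tuple t; `init` keeps the element type when nothing is selected.
    return reduce(vcat, parts; init = Symbol[])
end

A = [:x, :y]
B = [:y, :z]
C = [:z, :x, :w]
combined = ordered_selection(A, B, C)
```

Under this sketch, swapping the argument order changes the column order, which is the point made about `All(:x, :y) != All(:y, :x)`, and calling the helper with no arguments yields an empty list, matching the proposed meaning of `Cols()`.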
I think this behavior is much better captured by the `Cols` name. Then `Cols()` would select zero columns and `Cols(:)` would select every column.
Tagging @bkamins, since this would be a breaking change.
EDIT:
Of course, maybe the docstring could just be updated. The lack of context of the word `All` shouldn't generally be a problem, since it pretty much always appears inside an indexing or `select` expression. And if `All` is changed to `Cols`, then it might make sense to change `Not` to `NotCols`, because `Not` can occur without an `All`/`Cols` and is subject to the same criticism of "not what?".
Here's another idea. Use `Cols` instead of `All` and write a `Base.:!(::Cols)` method to replace `Not`. Example:
df = DataFrame(r=1, x1=2, x2=3, y=4)
df[:, Cols(!Cols(r"x"), :)]
I've also thought that it might make sense to just use a literal tuple instead of `All`. However, I think if I saw `df[:, (:x, :x)]`, then I would expect column `:x` to be repeated.
from dataapi.jl.
@piever - what do you think about it? I think that in DataFrames.jl it is not a problem to make this change now if it were preferred (but after 1.0 release this will be a problem), but this is a DataAPI.jl change that would have a larger impact.
CC @nalimilan
from dataapi.jl.
`Cols` isn't bad, and it would make more sense with broadcasting, e.g. `Cols(Not(:x)) .=> sum` (since `Not(:x) .=> sum` cannot work). But should `Cols()` still be equivalent to `Cols(:)`? That's less explicit than `All()`, though probably not the end of the world.
@CameronBieganek `Not` exists outside of DataAPI and can be used with any array, so I think we should keep that name.
from dataapi.jl.
I confess my main gripe with the status quo is that `All()` meaning all columns and `All(args...)` meaning more or less the union is a bit inconsistent (empty union should be no columns), but also, entirely my fault :) JuliaData/IndexedTables.jl#123. Before, it used to be called `Join`.
These references may be relevant: JuliaData/IndexedTables.jl#112 and JuliaData/IndexedTables.jl#119. There is a bit of a discussion comparing with special selectors from dplyr.
For me to understand, the behavior would remain unchanged other than for `Cols()`, so say `Cols([:x, :y], :z)` would be the same as `Cols(:x, :y, :z)`? From the point of view of JuliaDB that's an important use: the main use of the `All` selector is "unnesting", because `(Not(:x), :y)` would get a nested thing otherwise (with two columns, the first one in turn being a `StructArray` that corresponds to `Not(:x)`).
As an aside, as there is no nesting issue, I thought maybe DataFrames could just use a list, say `select(df, [Not(:x), :y])`, but maybe one wants to wrap things in a special type to overload broadcasting or things like that.
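To make the "unnesting" question concrete, here is a minimal sketch in plain Julia. `flatten_selection` is a hypothetical name used only for illustration; it is not how DataAPI.jl, JuliaDB.jl, or DataFrames.jl implement `All` or `Cols`, and it only handles symbols and collections of symbols.

```julia
# Hypothetical helper, not the DataAPI.jl implementation: splice collections
# of column names into one flat list, leaving single names as they are.
function flatten_selection(args...)
    flat = Symbol[]
    for arg in args
        if arg isa Symbol
            push!(flat, arg)
        else
            append!(flat, arg)   # assumes `arg` is a collection of Symbols
        end
    end
    return flat
end

nested = flatten_selection([:x, :y], :z)
spelled_out = flatten_selection(:x, :y, :z)
```

Both calls give the same flat list of three columns, which is the equivalence asked about for `Cols([:x, :y], :z)` and `Cols(:x, :y, :z)`.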
from dataapi.jl.
> `Not` exists outside of DataAPI and can be used with any array, so I think we should keep that name.

Yeah, I agree that the name `Not` is fine. `NotCols` is kind of unwieldy and `df[:, Cols(!Cols(r"x"), :)]` looks goofy.

> `Cols` isn't bad, and it would make more sense with broadcasting, e.g. `Cols(Not(:x)) .=> sum` (since `Not(:x) .=> sum` cannot work).

Is the reason that `Not(:x) .=> sum` cannot work that you would have to pirate `::Not` in order to add broadcasting? It seems unfortunate to have to rely on InvertedIndices.jl for the definition of `Not`, since then you can't make it broadcastable. Would it make sense to add `Not` to DataAPI.jl and just let `InvertedIndices.Not` be its own thing?
On the other hand, given that `Cols` is needed to make `r"x"` and `:` broadcastable (along with the future curried versions of `startswith` and `endswith`), maybe it makes sense for `Cols` to be broadcastable and for all the other column selectors (including `Not` and `Between`) to not be broadcastable.
from dataapi.jl.
Regarding adding broadcasting to `Not`, there is the open JuliaData/InvertedIndices.jl#15, so maybe comment there as well? (I guess if it gets more support it might be added.)
from dataapi.jl.
Related comment here:
JuliaData/DataFrames.jl#2171 (comment)
from dataapi.jl.
Just as an additional comment: although `: .=> fun` does not work, `names(df) .=> fun` works (similarly for all other selectors). It is a bit more cumbersome but works without any magic.
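A minimal sketch of why broadcasting over a vector of names needs no extra machinery. To stay free of package dependencies, a NamedTuple of vectors stands in for a data frame here, and `colnames` plays the role of `names(df)`; both are assumptions for illustration only.

```julia
# A NamedTuple of vectors stands in for a data frame (an assumption made to
# keep the sketch free of package dependencies).
table = (x = [1, 2], y = [3, 4])

# With DataFrames.jl this would be `names(df)`; here the names are the keys.
colnames = collect(keys(table))

# Broadcasting over a plain vector of names works out of the box: the
# function on the right-hand side is treated as a scalar.
transforms = colnames .=> sum

# Apply each `name => fun` pair by hand to show what the pairs describe.
results = [name => fun(table[name]) for (name, fun) in transforms]
```

The result of `colnames .=> sum` is an ordinary vector with one `name => sum` pair per column, which is the shape that `[Not(:x), :y] .=> fun` would also produce, in contrast to what a dedicated `Cols` wrapper could do.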
from dataapi.jl.