Comments (5)
Duplicate of #2097. It would make sense to define `rand` on both `GroupedDataFrame` and `DataFrame`, as we implement `shuffle` for it (#3010).
For `DataFrame`, we could also allow specifying a number of rows to draw. That wouldn't work for `GroupedDataFrame`, but we could print an error message with a hint about what to do. Or we could automatically create a new integer grouping column that would allow repeating a group multiple times if it's been drawn more than once. FWIW, this feature has been requested in dplyr but hasn't been implemented: tidyverse/dplyr#361, tidyverse/dplyr#6518.
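For the `DataFrame` case, drawing a fixed number of rows can already be approximated with plain row indexing. A minimal sketch, assuming sampling with replacement is what is wanted; the toy `df`, the seed, and `n` are made up for illustration:

```julia
using DataFrames, Random

# Hypothetical toy data, only for illustration.
df = DataFrame(x1 = [1, 1, 2, 2, 3], x2 = ["a", "a", "b", "b", "c"], y = 1:5)

rng = MersenneTwister(42)   # seeded only to make the sketch reproducible
n = 3

# Draw n row indices with replacement, then materialize them as a new DataFrame.
sampled = df[rand(rng, 1:nrow(df), n), :]
```

The same row may appear more than once in `sampled`, which is fine for a `DataFrame` but is exactly what a `GroupedDataFrame` cannot represent without an extra grouping column.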
from dataframes.jl.
We could add it. @nalimilan, what do you think about adding:
Random.rand(rng::Random.AbstractRNG, sp::Random.SamplerTrivial{<:GroupedDataFrame}) = sp[][rand(rng, 1:length(sp[]))]
?
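To see what such a method would do in practice, here is a sketch that defines it locally and draws one group. This assumes the installed DataFrames.jl version does not already define `rand` for `GroupedDataFrame`; defining it outside the package is type piracy and is done here only for experimentation. The sampler argument is named `sp`, and `sp[]` unwraps the `GroupedDataFrame` held by the `SamplerTrivial`.

```julia
using DataFrames, Random

# Assumption: not part of DataFrames.jl; defined here only to try the idea out.
Random.rand(rng::Random.AbstractRNG, sp::Random.SamplerTrivial{<:GroupedDataFrame}) =
    sp[][rand(rng, 1:length(sp[]))]

# Hypothetical toy data.
df = DataFrame(x1 = [1, 1, 2, 3], x2 = ["a", "a", "b", "c"], y = 1:4)
gdf = groupby(df, [:x1, :x2])

g = rand(gdf)   # one randomly chosen group, returned as a SubDataFrame
```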
from dataframes.jl.
Ah - good catch.
So - now I responded positively as `rand`'s API specifies:
> Pick a random element or array of random elements
And the key word is "array". Which means that with `rand(gdf, 10, 10)` we would return a 10x10 array of `SubDataFrame`.
If we also added `rand` for data frame then writing `rand(df, 10, 10)` would return a 10x10 array of `DataFrameRow`.
I am not sure this is useful, but this could work. This is different from `shuffle` as `shuffle` does not promise to return an array, but a permuted copy. While `rand` promises to return a single element or an array.
The question is if users would find it intuitive and useful?
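The contrast between the two contracts can be seen with functions that exist today. A small sketch, using a plain vector for `rand` (since `rand` on data frames is only a proposal here) and `shuffle` on a `DataFrame`; the data is made up:

```julia
using DataFrames, Random

v = [10, 20, 30]
m = rand(v, 2, 2)        # a 2×2 Matrix{Int} whose entries are drawn from v

df = DataFrame(id = 1:4, y = [0.1, 0.2, 0.3, 0.4])
s = shuffle(df)          # a new DataFrame holding the same rows in random order
```

`rand` changes the container (a vector becomes a matrix of elements), while `shuffle` keeps it (a `DataFrame` stays a `DataFrame`).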
from dataframes.jl.
Thanks for all the answers! Sorry about the missed duplicate issue.
> The question is if users would find it intuitive and useful?
AFAIK the only interfaces that `DataFrames.jl` provides for `Random` are `shuffle` and `shuffle!`, which both return a permuted `DataFrame`. Since `DataFrame` does not support `rand` either, I was probably wrong to expect `GroupedDataFrame` to behave like an `Array`.
As for usefulness, in my case, I was looking to sample groups of data (hence the `groupby`), and it did feel jarring that I couldn't just sample the `GroupedDataFrame`. I am not sure it is strictly useful, but it is certainly more straightforward than the following:
N = 100
tdf = transform(df, [:x1, :x2] => ByRow(string))
keys = unique(tdf[!, :x1_x2_string])
subset(tdf, :x1_x2_string => ByRow(in(rand(keys, N)))) # DataFrame, have to drop :x1_x2_string
VS
N = 100
gdf = groupby(df, [:x1, :x2])
rand(gdf, N) # Array of GroupedDataFrame? GroupedDataFrame?
I don't think it's intuitive for `rand(gdf, 10, 10)` to return an array. If `shuffle` returns a permuted copy, I would expect `rand` to always return a (Grouped)DataFrame (although that sounds like a lot of work for not much).
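For reference, sampling groups with replacement and still ending up with a grouped result can be sketched with the existing API. This follows the "new integer grouping column" idea mentioned earlier in the thread; the column name `:draw` and the toy data are hypothetical choices, not anything DataFrames.jl provides:

```julia
using DataFrames, Random

# Hypothetical toy data.
df = DataFrame(x1 = [1, 1, 2, 2, 3], x2 = ["a", "a", "b", "b", "c"], y = 1:5)
gdf = groupby(df, [:x1, :x2])

N = 4
drawn = rand(1:length(gdf), N)   # group indices, sampled with replacement

# Copy each drawn group and tag it with its draw number so repeats stay distinct.
parts = map(enumerate(drawn)) do (k, i)
    part = DataFrame(gdf[i])
    part[!, :draw] .= k
    part
end

sampled = reduce(vcat, parts)          # one DataFrame with all drawn groups
regrouped = groupby(sampled, :draw)    # N groups, even if a source group repeats
```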
P.S.: I did not go into the implementation of `GroupedDataFrame` in detail, but is there a reason why `getindex(gd, idxs)` does not support duplicate `idxs`?
from dataframes.jl.
> but is there a reason why `getindex(gd, idxs)` does not support duplicate `idxs`?
This is the same reason why `Dict` does not allow for duplicate keys. Group ids must be unique.
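A quick way to observe this behavior, as a sketch on made-up data. Only the fact that an exception is raised is checked, since the exact exception type and message may differ between DataFrames.jl versions:

```julia
using DataFrames

# Hypothetical toy data with three groups.
df = DataFrame(x1 = [1, 1, 2, 3], y = 1:4)
gdf = groupby(df, :x1)

sub = gdf[[3, 1]]    # fine: a GroupedDataFrame with two distinct groups, reordered

# Duplicate group indices are rejected.
dup_failed = try
    gdf[[1, 1]]
    false
catch
    true
end
```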
Adding `shuffle` and `shuffle!` to `GroupedDataFrame` is easy to do - we could add it if you would find it useful.
from dataframes.jl.