something like index!(df, "colname") creates an index

Some speed tests on the column-based indexing: <div class="snippet-clipboard-conte

I just updated indexing.jl , renamed <code class="notr

Thanks, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-

allow columns to be indexed for fast access/split by value about dataframes.jl HOT 16 CLOSED

juliadata commented on August 20, 2024

allow columns to be indexed for fast access/split by value

from dataframes.jl.

Comments (16)

tshort commented on August 20, 2024

In addition to the commit just made for column-based indexes, another option for this is pandas/data.table style indexing where the whole DataFrame is sorted based on keys. In pandas, the indexing is separate from the DataFrame. In data.table, the key columns are columns in the DataFrame. Either way, we would need to add extra fields to the definition of a DataFrame, and both would complicate indexing.

Keeping indexing with columns has advantages:

Column ordering is less likely to be messed up inadvertantly.
Multiple columns can be indexed; there's no main key grouping.
More flexible logic combinations are possible.
You can mix and match keyed and non-keyed comparisons.
The DataFrame structure and indexing are not disturbed.
It's probably less coding and maintenance than the approach below.

The advantages of an overall DataFrame index are:

Lookups are somewhat faster than the approach above.
The ordering and priority of keys is kept. This makes things like pivoting easier.
You could do data.table/pandas shortcuts like df["A"] for df[:( keycol .== "A" )]. But, df["A"] is less descriptive if you don't know what df's keys are, and it's hard to separate keyed lookups along rows with column indexing.

from dataframes.jl.

HarlanH commented on August 20, 2024

Tom, thanks for writing that experimental code! That's definitely the direction I'm currently thinking. And I think you nail the pros and cons of the various options.

I'd like to defer this decision until we have a design for distributed DataFrames, which I think will be a uniquely Julian and very valuable feature. What we end up wanting to do for indexes will, I think, be driven in part by what's a good option there. Does that make sense?

from dataframes.jl.

tshort commented on August 20, 2024

Definitely. No need to hurry on this. I especially don't want to clutter the DataFrame definition unless we really know what we want.

For now, it was an easy option to try and may give us some pieces to use later.

from dataframes.jl.

tshort commented on August 20, 2024

Some speed tests on the column-based indexing:

N = 100000000
srand(1)
# big grouping:
x = randi(5, N)
xi = IndexedVec(x)

@time x[x .== 3]
# elapsed time: 1.309391975402832 seconds
@time x[xi .== 3]
# elapsed time: 0.37038278579711914 seconds

# smaller grouping:
x = randi(50000, N)
xi = IndexedVec(x)

@time x[x .== 3]
# elapsed time: 0.9579911231994629 seconds
@time x[xi .== 3]
# elapsed time: 0.00017404556274414062 seconds
@time x[xi .<= 3]
# elapsed time: 0.0005171298980712891 seconds

from dataframes.jl.

tshort commented on August 20, 2024

Bitmap indexing is an interesting option. It's fast and flexible. The main downside is that it's hard/slow to update for changes to the data.

http://en.wikipedia.org/wiki/Bitmap_index
https://sdm.lbl.gov/fastbit/
http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/

from dataframes.jl.

tshort commented on August 20, 2024

I just updated indexing.jl, renamed IndexedVec to IndexedVector, got rid of warnings, and added a few tests. It is also now live, meaning it is loaded with everything else. In updating, I came across a couple of things:

DataFrame indexing isn't as general as it used to be, probably due to changes needed to avoid warnings. Not a big deal, but specific ref entries were needed to make this work.
Assignment of an IndexedVector to a DataFrame column wasn't straightforward, probably because this type is an AbstractVector, not an AbstractDataVector. I added methods to make it happen, but I don't know if I did it "the right way".

The timings above as well as the advantages and disadvantages still apply.

from dataframes.jl.

johnmyleswhite commented on August 20, 2024

Thanks, @tshort! To wrap my head around the indexing strategy, should I just read through indexing.jl? I'm really excited to have this available.

from dataframes.jl.

tshort commented on August 20, 2024

The short answer is yes. I'll add some more comments. The idea is that an
IndexedVector is a pointer to the original column along with an indexing
vector equal to order(orig). A comparison operation like idv .> 3
returns an Indexer type. The Indexer type includes a pointer to the
IndexedVector along with a vector of Range1's. DataVecs and DataFrames can
be indexed with Indexers. It's fast because you're using a slice of the
already indexed vector.

Indexer's can be combined with | and &. In the case where the
IndexedVector is the same, the Indexer is just reduced or expanded as
appropriate. This includes: 0 .< idv .< 3 or idv .== 1 | idv .== 4. If
Indexers have different IndexedVectors (like idv1 .== 1 | idv2 .== 1),
then the result is converted to a BitVector.

On Wed, Jan 2, 2013 at 11:35 AM, John Myles White
[email protected]:

Thanks, @tshort https://github.com/tshort! To wrap my head around the
indexing strategy, should I just read through indexing.jl? I'm really
excited to have this available.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/24#issuecomment-11814185.

from dataframes.jl.

tshort commented on August 20, 2024

Also, NA support in DataVecs is spotty. Here is an example:

julia> a = DataVector[1,2,3,NA]
4-element Int64 DataArray
 1
 2
 3
 NA

julia> ia = IndexedVector(a) 
4-element Int64 DataArray
 1
 2
 3
 NA

julia> a[ia .< 3]
3-element Int64 DataArray
 NA                                           # PROBABLY NOT WANTED
 1
 2

julia> a[ia .> 1]
2-element Int64 DataArray
 2
 3

julia> a[ia .== 1]
1-element Int64 DataArray
 1

julia> a = DataVector[NA,1,2,3,NA]
5-element Int64 DataArray
 NA
 1
 2
 3
 NA

julia> ia = IndexedVector(a) 
no method isless(NAtype,NAtype)
 in insertionsort_perm! at sort.jl:163
 in mergesort_perm! at sort.jl:270
 in mergesort_perm at sort.jl:315
 in IndexedVector at /home/tshort/.julia/DataFrames/src/indexing.jl:45

julia> order(a)
no method isless(NAtype,NAtype)
 in insertionsort_perm! at sort.jl:163
 in mergesort_perm! at sort.jl:270
 in mergesort_perm at sort.jl:315
 in order at sort.jl:557

from dataframes.jl.

johnmyleswhite commented on August 20, 2024

Have you seen whether just adding isless(NAtype, NAtype) = true to operators.jl makes this better?

from dataframes.jl.

johnmyleswhite commented on August 20, 2024

It's possibly worth noting that these types of operations do work correctly for a pure DataArray:

julia> require("DataFrames"); using DataFrames

julia> a = DataVector[1, 2, 3, NA]
4-element Int64 DataArray
1
2
3
NA

julia> a[a .< 3]
2-element Int64 DataArray
1
2

from dataframes.jl.

tshort commented on August 20, 2024

Correct. The only thing I would call a bug with the existing DataArray is
the result of order(DataVector[NA, 1,2,3, NA]).

On Wed, Jan 2, 2013 at 12:50 PM, John Myles White
[email protected]:

It's possibly worth noting that these types of operations do work
correctly for a pure DataArray:

julia> require("DataFrames"); using DataFrames

julia> a = DataVector[1, 2, 3, NA]
4-element Int64 DataArray
1
2
3
NA

julia> a[a .< 3]
2-element Int64 DataArray
1
2

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/24#issuecomment-11817077.

from dataframes.jl.

johnmyleswhite commented on August 20, 2024

I just fixed a bug in operators. Now we have:

julia> require("DataFrames"); using DataFrames

julia> sort(DataVector[NA, 1, 2, 3, NA])
5-element Int64 DataArray
NA
NA
1
2
3

julia> order(DataVector[NA, 1, 2, 3, NA])
5-element Int64 Array:
1
5
2
3
4

julia> sortperm(DataVector[NA, 1, 2, 3, NA])
([NA,NA,1,2,3],[1, 5, 2, 3, 4])

I think this isn't a bug, but maybe I'm missing something.

from dataframes.jl.

tshort commented on August 20, 2024

Works for me now after your fix.

On Wed, Jan 2, 2013 at 1:03 PM, John Myles White
[email protected] wrote:

order(DataVector[NA, 1, 2, 3, NA])

from dataframes.jl.

tshort commented on August 20, 2024

I just pushed some changes, that make DataVectors with NA's work better with IndexedVectors. Bugs may still lurk, of course.

from dataframes.jl.

tshort commented on August 20, 2024

I'm going to close this now and remove the experimental tag on IndexedVectors. IndexedVectors support fast look-up and sorting, and they don't really conflict with anything else. Multiple approaches to indexing can be supported.

Please open a new issue if you'd like a different type of index or more advanced features..

from dataframes.jl.

allow columns to be indexed for fast access/split by value about dataframes.jl HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent