Giter Site home page Giter Site logo

Comments (16)

tshort avatar tshort commented on August 20, 2024

In addition to the commit just made for column-based indexes, another option for this is pandas/data.table style indexing where the whole DataFrame is sorted based on keys. In pandas, the indexing is separate from the DataFrame. In data.table, the key columns are columns in the DataFrame. Either way, we would need to add extra fields to the definition of a DataFrame, and both would complicate indexing.

Keeping indexing with columns has advantages:

  • Column ordering is less likely to be messed up inadvertantly.
  • Multiple columns can be indexed; there's no main key grouping.
  • More flexible logic combinations are possible.
  • You can mix and match keyed and non-keyed comparisons.
  • The DataFrame structure and indexing are not disturbed.
  • It's probably less coding and maintenance than the approach below.

The advantages of an overall DataFrame index are:

  • Lookups are somewhat faster than the approach above.
  • The ordering and priority of keys is kept. This makes things like pivoting easier.
  • You could do data.table/pandas shortcuts like df["A"] for df[:( keycol .== "A" )]. But, df["A"] is less descriptive if you don't know what df's keys are, and it's hard to separate keyed lookups along rows with column indexing.

from dataframes.jl.

HarlanH avatar HarlanH commented on August 20, 2024

Tom, thanks for writing that experimental code! That's definitely the direction I'm currently thinking. And I think you nail the pros and cons of the various options.

I'd like to defer this decision until we have a design for distributed DataFrames, which I think will be a uniquely Julian and very valuable feature. What we end up wanting to do for indexes will, I think, be driven in part by what's a good option there. Does that make sense?

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

Definitely. No need to hurry on this. I especially don't want to clutter the DataFrame definition unless we really know what we want.

For now, it was an easy option to try and may give us some pieces to use later.

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

Some speed tests on the column-based indexing:

N = 100000000
srand(1)
# big grouping:
x = randi(5, N)
xi = IndexedVec(x)

@time x[x .== 3]
# elapsed time: 1.309391975402832 seconds
@time x[xi .== 3]
# elapsed time: 0.37038278579711914 seconds

# smaller grouping:
x = randi(50000, N)
xi = IndexedVec(x)

@time x[x .== 3]
# elapsed time: 0.9579911231994629 seconds
@time x[xi .== 3]
# elapsed time: 0.00017404556274414062 seconds
@time x[xi .<= 3]
# elapsed time: 0.0005171298980712891 seconds

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

Bitmap indexing is an interesting option. It's fast and flexible. The main downside is that it's hard/slow to update for changes to the data.

http://en.wikipedia.org/wiki/Bitmap_index
https://sdm.lbl.gov/fastbit/
http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

I just updated indexing.jl, renamed IndexedVec to IndexedVector, got rid of warnings, and added a few tests. It is also now live, meaning it is loaded with everything else. In updating, I came across a couple of things:

  • DataFrame indexing isn't as general as it used to be, probably due to changes needed to avoid warnings. Not a big deal, but specific ref entries were needed to make this work.
  • Assignment of an IndexedVector to a DataFrame column wasn't straightforward, probably because this type is an AbstractVector, not an AbstractDataVector. I added methods to make it happen, but I don't know if I did it "the right way".

The timings above as well as the advantages and disadvantages still apply.

from dataframes.jl.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

Thanks, @tshort! To wrap my head around the indexing strategy, should I just read through indexing.jl? I'm really excited to have this available.

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

The short answer is yes. I'll add some more comments. The idea is that an
IndexedVector is a pointer to the original column along with an indexing
vector equal to order(orig). A comparison operation like idv .> 3
returns an Indexer type. The Indexer type includes a pointer to the
IndexedVector along with a vector of Range1's. DataVecs and DataFrames can
be indexed with Indexers. It's fast because you're using a slice of the
already indexed vector.

Indexer's can be combined with | and &. In the case where the
IndexedVector is the same, the Indexer is just reduced or expanded as
appropriate. This includes: 0 .< idv .< 3 or idv .== 1 | idv .== 4. If
Indexers have different IndexedVectors (like idv1 .== 1 | idv2 .== 1),
then the result is converted to a BitVector.

On Wed, Jan 2, 2013 at 11:35 AM, John Myles White
[email protected]:

Thanks, @tshort https://github.com/tshort! To wrap my head around the
indexing strategy, should I just read through indexing.jl? I'm really
excited to have this available.


Reply to this email directly or view it on GitHubhttps://github.com//issues/24#issuecomment-11814185.

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

Also, NA support in DataVecs is spotty. Here is an example:

julia> a = DataVector[1,2,3,NA]
4-element Int64 DataArray
 1
 2
 3
 NA

julia> ia = IndexedVector(a) 
4-element Int64 DataArray
 1
 2
 3
 NA

julia> a[ia .< 3]
3-element Int64 DataArray
 NA                                           # PROBABLY NOT WANTED
 1
 2

julia> a[ia .> 1]
2-element Int64 DataArray
 2
 3

julia> a[ia .== 1]
1-element Int64 DataArray
 1

julia> a = DataVector[NA,1,2,3,NA]
5-element Int64 DataArray
 NA
 1
 2
 3
 NA

julia> ia = IndexedVector(a) 
no method isless(NAtype,NAtype)
 in insertionsort_perm! at sort.jl:163
 in mergesort_perm! at sort.jl:270
 in mergesort_perm at sort.jl:315
 in IndexedVector at /home/tshort/.julia/DataFrames/src/indexing.jl:45

julia> order(a)
no method isless(NAtype,NAtype)
 in insertionsort_perm! at sort.jl:163
 in mergesort_perm! at sort.jl:270
 in mergesort_perm at sort.jl:315
 in order at sort.jl:557

from dataframes.jl.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

Have you seen whether just adding isless(NAtype, NAtype) = true to operators.jl makes this better?

from dataframes.jl.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

It's possibly worth noting that these types of operations do work correctly for a pure DataArray:

julia> require("DataFrames"); using DataFrames

julia> a = DataVector[1, 2, 3, NA]
4-element Int64 DataArray
1
2
3
NA

julia> a[a .< 3]
2-element Int64 DataArray
1
2

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

Correct. The only thing I would call a bug with the existing DataArray is
the result of order(DataVector[NA, 1,2,3, NA]).

On Wed, Jan 2, 2013 at 12:50 PM, John Myles White
[email protected]:

It's possibly worth noting that these types of operations do work
correctly for a pure DataArray:

julia> require("DataFrames"); using DataFrames

julia> a = DataVector[1, 2, 3, NA]
4-element Int64 DataArray
1
2
3
NA

julia> a[a .< 3]
2-element Int64 DataArray
1
2


Reply to this email directly or view it on GitHubhttps://github.com//issues/24#issuecomment-11817077.

from dataframes.jl.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

I just fixed a bug in operators. Now we have:

julia> require("DataFrames"); using DataFrames

julia> sort(DataVector[NA, 1, 2, 3, NA])
5-element Int64 DataArray
NA
NA
1
2
3

julia> order(DataVector[NA, 1, 2, 3, NA])
5-element Int64 Array:
1
5
2
3
4

julia> sortperm(DataVector[NA, 1, 2, 3, NA])
([NA,NA,1,2,3],[1, 5, 2, 3, 4])

I think this isn't a bug, but maybe I'm missing something.

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

Works for me now after your fix.

On Wed, Jan 2, 2013 at 1:03 PM, John Myles White
[email protected] wrote:

order(DataVector[NA, 1, 2, 3, NA])

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

I just pushed some changes, that make DataVectors with NA's work better with IndexedVectors. Bugs may still lurk, of course.

from dataframes.jl.

tshort avatar tshort commented on August 20, 2024

I'm going to close this now and remove the experimental tag on IndexedVectors. IndexedVectors support fast look-up and sorting, and they don't really conflict with anything else. Multiple approaches to indexing can be supported.

Please open a new issue if you'd like a different type of index or more advanced features..

from dataframes.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.