This request has been shot down before when we were working on Harlan's branch. I'm repeating it here "for the record".
I think we should add NA support to floating-point vectors using NaNs. It only takes the ten lines of code below. Those ten lines let us use regular vectors, and most floating-point columns will immediately "just work" (arithmetic, mean, quantile, ...). That will save us a lot of work trying to support every floating-point function folks create for DataVecs. I don't think it's a problem to have multiple NA representations as long as they appear consistent from the user's point of view.
NA_Float64 = NaN
NA_Float32 = NaN32
na(::Type{Float64}) = NA_Float64
na(::Type{Float32}) = NA_Float32
convert{T <: Float}(::Type{T}, x::NAtype) = na(T)
promote_rule{T <: Float}(::Type{T}, ::Type{NAtype}) = T
isna{T <: Float}(x::T) = isnan(x)
isna{T <: Float}(x::AbstractVector{T}) = x .!= x  # NaN is the only value with x != x
nafilter{T <: Float}(v::AbstractVector{T}) = v[!isna(v)]
nareplace{T <: Float}(v::AbstractVector{T}, r::T) = [isna(v[i]) ? r : v[i] for i = 1:length(v)]
What I'm proposing is to add NA support for Vector{Float} by using NaNs as NAs, for use as DataFrame columns. This is along the lines of what R and pandas do, and it's one of the options Numpy is moving toward (the other is a masking approach like DataVec's).
Arrays are Julia's fundamental container type, so if we can support Array{Float, 1} directly, our DataFrame columns will be more robust.
Advantages
Support
The #1 reason (by far) for using Vectors with NaNs as NAs is the reduced support and development burden on the DataFrame team. With just these ten lines of code, Vectors get pretty good NA support while still working with every Julia function that operates on Arrays. Here are some examples:
julia> srand(1)
julia> v = randn(10)
10-element Float64 Array:
0.00422471
0.0636925
1.41376
-1.09879
0.503439
1.75336
-0.202676
-0.458741
0.526426
1.60172
julia> dv = DataVec(v)
[0.004224711539662927,0.0636925153577793,1.413764097398493,-1.0987858271983026,0.5034390675674981,1.753360709001194,-0.20267565554863,-0.4587414680524865,0.5264259022023546,1.601723433618217]
julia> quartile(v)
3-element Float64 Array:
-0.150951
0.283566
1.19193
julia> quartile(dv) # This is what I worry will be too common.
no method sort(DataVec{Float64},)
in method_missing at base.jl:60
in quantile at statistics.jl:356
in quartile at statistics.jl:369
julia> mean(v)
0.41064274858857797
julia> mean(dv)
no method mean(DataVec{Float64},)
in method_missing at base.jl:60
julia> v[[1,5]] = NA
NA
julia> dv[[1,5]] = NA
NA
julia> mean(v)
NaN
julia> mean(dv)
no method mean(DataVec{Float64},)
in method_missing at base.jl:60
julia> mean(nafilter(v))
0.44984546334732733
julia> mean(nafilter(dv))
0.44984546334732733
julia> v + dv
no method +(Array{Float64,1},DataVec{Float64})
in method_missing at base.jl:60
julia> v * 2
10-element Float64 Array:
NaN
0.127385
2.82753
-2.19757
NaN
3.50672
-0.405351
-0.917483
1.05285
3.20345
julia> dv * 2
no method *(DataVec{Float64},Int64)
in method_missing at base.jl:60
With DataVecs, the most common FAQ will likely be: why doesn't function xyz() work on DataFrame columns? The answer will be: wrap the column in nafilter() or write a method for xyz() to support DataVecs.
Because DataVecs cannot inherit from AbstractArray, there is a lot of work to be done to support even the functions in Base. Once packages proliferate, it'll be even worse. Most of these will be fairly simple wrappers, but it's still work to do initially and to maintain going forward.
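To make that work concrete, here's roughly what each forwarding method would look like (a hypothetical sketch only: the `data` field name and the drop-the-NAs semantics are my assumptions, not settled DataVec API):

mean(dv::DataVec) = mean(nafilter(dv))
sort(dv::DataVec) = sort(nafilter(dv))
quantile(dv::DataVec, p) = quantile(nafilter(dv), p)
*(dv::DataVec, x::Number) = DataVec(dv.data * x)  # plus the mirror-image method for x * dv
# ...and again for sum, var, cumsum, diff, and every new package function

Each one is trivial, but someone has to write and maintain every one of them.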
I also worry that forcing DataVecs will deepen the division between users coming from R and Matlab backgrounds. If DataFrames support plain Vectors, it'll be easier for the Matlab folks to use them, and easier for R folks to use functions from the Matlab folks that operate on Arrays.
Performance
In the groupby testing I did, the bare array columns were faster.
Automatic support of new Julia data types
By supporting AbstractVectors, new Julia data types that are AbstractVectors can automatically be used in DataFrames.
Here is an example of using mmap_array-backed vectors as columns of a DataFrame:
julia> s = open("bigdataframe.bin","w+")
IOStream(<file bigdataframe.bin>)
julia> N = 1000000000
1000000000
julia> v1 = mmap_array(Float64,(N,),s)
julia> v2 = mmap_array(Float64,(N,),s, numel(v1)*sizeof(eltype(v1)))
julia> d = DataFrame({v1, v2})
DataFrame (1000000000,2)
x1 x2
[1,] 0.0 0.0
[2,] 0.0 0.0
[3,] 0.0 0.0
[4,] 0.0 0.0
[5,] 0.0 0.0
[6,] 0.0 0.0
[7,] 0.0 0.0
[8,] 0.0 0.0
[9,] 0.0 0.0
[10,] 0.0 0.0
[11,] 0.0 0.0
[12,] 0.0 0.0
[13,] 0.0 0.0
[14,] 0.0 0.0
[15,] 0.0 0.0
[16,] 0.0 0.0
[17,] 0.0 0.0
[18,] 0.0 0.0
[19,] 0.0 0.0
[20,] 0.0 0.0
:
[999999981,] 0.0 0.0
[999999982,] 0.0 0.0
[999999983,] 0.0 0.0
[999999984,] 0.0 0.0
[999999985,] 0.0 0.0
[999999986,] 0.0 0.0
[999999987,] 0.0 0.0
[999999988,] 0.0 0.0
[999999989,] 0.0 0.0
[999999990,] 0.0 0.0
[999999991,] 0.0 0.0
[999999992,] 0.0 0.0
[999999993,] 0.0 0.0
[999999994,] 0.0 0.0
[999999995,] 0.0 0.0
[999999996,] 0.0 0.0
[999999997,] 0.0 0.0
[999999998,] 0.0 0.0
[999999999,] 0.0 0.0
[1000000000,] 0.0 0.0
julia> d["x1"][[2,5,N]] = NA
NA
julia> d["x2"][[1,3,6,9]] = pi
3.141592653589793
julia> head(d)
DataFrame (6,2)
x1 x2
[1,] 0.0 3.14159
[2,] NaN 0.0
[3,] 0.0 3.14159
[4,] 0.0 0.0
[5,] NaN 0.0
[6,] 0.0 3.14159
julia> sum(d[11:15, "x1"])
0.0
julia> sum(d[1:5, "x1"])
NaN
Disadvantages
No built-in filter/replace
The biggest downside to using Vectors is that they won't have built-in filtering/replacing. The use of filter and replace Bools in DataVecs is a cool idea. It's much better than R's options()$na.action; I never use that because it makes code less portable. By attaching the na.action flag to the data, code stays portable while letting the user avoid wrapping things in nafilter.
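For the record, here's a rough sketch of how flags attached to the data could drive behavior (the field names naFilter/naReplace/replaceVal are hypothetical, not the actual DataVec internals):

function mean(dv::DataVec)
    if dv.naFilter
        mean(nafilter(dv))                        # skip the NAs
    elseif dv.naReplace
        mean(nareplace(dv.data, dv.replaceVal))   # substitute first
    else
        error("NAs present: set naFilter or naReplace")
    end
end

This is exactly the convenience a plain Vector column gives up.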
I think the advantages of using Vectors outweigh losing built-in filter/replace for most applications of floating-point columns.
Confusing output
As it stands, NAs become NaNs and are displayed that way. It should be possible to modify the show() functions to overcome this, and to distinguish NaNs from NAs by using a different bit pattern for each. With support from Julia core, the output could be fixed for all arrays.
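One way to do the bit-pattern trick, along the lines of R's NA_real_ (the 0x7A2 payload below, i.e. 1954, is just an example choice; note that arithmetic is not guaranteed to preserve a NaN's payload, so this mainly helps display, not propagation):

const NA_Float64 = reinterpret(Float64, 0x7FF00000000007A2)  # a NaN carrying payload 1954
isna_exact(x::Float64) = reinterpret(Uint64, x) == 0x7FF00000000007A2
# isnan(NA_Float64) is still true, so the ten lines above keep working;
# show() could test isna_exact() and print "NA" instead of "NaN".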
No universal type for columns
New features like indexing cannot be added automatically to all column types. For many features, this could be handled by API's rather than by type inheritance. For features where this is not possible, DataVecs can be used.
No Integer/Bool/String support
This approach cannot be used for these types because they don't have a NaN. R's approach of reserving a bit pattern in each type as the NA would not work unless it were integrated into Julia's core (unlikely, and probably not a good idea). In any case, DataVecs or PooledDataVecs are more appropriate for these types, and the universe of functions that needs to work on them is smaller than the set of functions operating on Vector{Float}.
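For comparison, here's why R's sentinel approach for integers only works with core support (a sketch; typemin(Int32) is the bit pattern R uses for NA_integer_):

const NA_Int32 = typemin(Int32)   # -2147483648, R's NA_integer_
isna(x::Int32) = x == NA_Int32
# Without core support, ordinary arithmetic silently destroys the sentinel:
# NA_Int32 + 1 wraps to typemin(Int32) + 1, an ordinary integer, not NA.

NaN "just works" for floats because the hardware propagates it; an integer sentinel propagates nothing.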
Next steps
I'm not proposing that we ditch DataVecs. I'm proposing that we have an alternative: one may work better than the other in some situations.
More work is needed to sort all of this out, including deciding the defaults for column assignment and for reading from CSVs. Here's what I would choose:
- Bool -> PooledDataVec
- String -> PooledDataVec
- Integer -> DataVec (but Pooled may be used a lot here)
- Float -> Vector
We probably need an AsIs type for overriding defaults.
Another open area is promotion: what type should x::DataVec{Float64} + v::Vector{Float64} produce?
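My inclination would be to promote to the NA-aware type so no information is lost. A sketch, assuming DataVec + DataVec arithmetic is already defined:

promote_rule{T <: Float}(::Type{DataVec{T}}, ::Type{Vector{T}}) = DataVec{T}
+(v::Vector{Float64}, dv::DataVec{Float64}) = DataVec(v) + dv  # plus the mirror-image method

The alternative, converting the DataVec down to a Vector of NaNs, would silently drop the filter/replace flags.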
As for DataVec{Float}, I think we should continue to support it, but the list of functions supported at the start could be a lot smaller. If it gains wide use, folks will add methods to support it.