This request has been shot down before when we were working on Harlan's branch. I'm repeating it here "for the record".
I think we should add NA support to floating-point vectors using NaNs. It only takes the ten lines of code below. Those ten lines let us use regular vectors, and most floating-point columns will immediately "just work" (arithmetic, mean, quantile, ...). That will save us a lot of work trying to support every floating-point function folks create for DataVecs. I don't think it's a problem to have multiple NA representations as long as they appear consistent from the user's point of view.
NA_Float64 = NaN
NA_Float32 = NaN32
na(::Type{Float64}) = NA_Float64
na(::Type{Float32}) = NA_Float32
convert{T <: Float}(::Type{T}, x::NAtype) = na(T)
promote_rule{T <: Float}(::Type{T}, ::Type{NAtype}) = T
isna{T <: Float}(x::T) = isnan(x)
isna{T <: Float}(x::AbstractVector{T}) = x .!= x  # NaN is the only value with x != x
nafilter{T <: Float}(v::AbstractVector{T}) = v[!isna(v)]
nareplace{T <: Float}(v::AbstractVector{T}, r::T) = [isna(v[i]) ? r : v[i] for i = 1:length(v)]
What I'm proposing is to add NA support for Vector{Float} by using NaNs as NAs, for use as DataFrame columns. This is along the lines of what R and pandas do, and it's one of the options Numpy is moving toward (the other is a masking approach like DataVec's).
Arrays are Julia's fundamental container type, so if we can support Array{Float, 1} directly, our DataFrame columns will be more robust.
Advantages
Support
The #1 reason (by far) for using Vectors with NaNs as NAs is the reduced support and development burden on the DataFrame team. With just these ten lines of code, Vectors get pretty good NA support while still working with every Julia function that operates on Arrays. Here are some examples:
julia> srand(1)
julia> v = randn(10)
10-element Float64 Array:
0.00422471
0.0636925
1.41376
-1.09879
0.503439
1.75336
-0.202676
-0.458741
0.526426
1.60172
julia> dv = DataVec(v)
[0.004224711539662927,0.0636925153577793,1.413764097398493,-1.0987858271983026,0.5034390675674981,1.753360709001194,-0.20267565554863,-0.4587414680524865,0.5264259022023546,1.601723433618217]
julia> quartile(v)
3-element Float64 Array:
-0.150951
0.283566
1.19193
julia> quartile(dv) # This is what I worry will be too common.
no method sort(DataVec{Float64},)
in method_missing at base.jl:60
in quantile at statistics.jl:356
in quartile at statistics.jl:369
julia> mean(v)
0.41064274858857797
julia> mean(dv)
no method mean(DataVec{Float64},)
in method_missing at base.jl:60
julia> v[[1,5]] = NA
NA
julia> dv[[1,5]] = NA
NA
julia> mean(v)
NaN
julia> mean(dv)
no method mean(DataVec{Float64},)
in method_missing at base.jl:60
julia> mean(nafilter(v))
0.44984546334732733
julia> mean(nafilter(dv))
0.44984546334732733
julia> v + dv
no method +(Array{Float64,1},DataVec{Float64})
in method_missing at base.jl:60
julia> v * 2
10-element Float64 Array:
NaN
0.127385
2.82753
-2.19757
NaN
3.50672
-0.405351
-0.917483
1.05285
3.20345
julia> dv * 2
no method *(DataVec{Float64},Int64)
in method_missing at base.jl:60
With DataVecs, the most common FAQ will likely be: why doesn't function xyz() work on DataFrame columns? The answer will be: wrap the column in nafilter() or write a method for xyz() to support DataVecs.
Because DataVecs cannot inherit from AbstractArray, there is a lot of work to be done to support even the functions in Base. Once packages proliferate, it'll be even worse. Most of these will be fairly simple wrappers, but it's still work to do initially and to maintain going forward.
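To make that work concrete, here's roughly what each forwarding method would look like (a hypothetical sketch only: the `data` field name and the drop-the-NAs semantics are my assumptions, not settled DataVec API):

mean(dv::DataVec) = mean(nafilter(dv))
sort(dv::DataVec) = sort(nafilter(dv))
quantile(dv::DataVec, p) = quantile(nafilter(dv), p)
*(dv::DataVec, x::Number) = DataVec(dv.data * x)  # plus the mirror-image method for x * dv
# ...and again for sum, var, cumsum, diff, and every new package function

Each one is trivial, but someone has to write and maintain every one of them.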
I also worry that forcing DataVecs will deepen the division between users coming from R and Matlab backgrounds. If DataFrames support plain Vectors, it'll be easier for the Matlab folks to use them, and easier for R folks to use functions from the Matlab folks that operate on Arrays.
Performance
In the groupby testing I did, the bare array columns were faster.
Automatic support of new Julia data types
By supporting AbstractVectors, new Julia data types that are AbstractVectors can automatically be used in DataFrames.
Here is an example of using mmap_array-backed vectors as columns of a DataFrame:
julia> s = open("bigdataframe.bin","w+")
IOStream(<file bigdataframe.bin>)
julia> N = 1000000000
1000000000
julia> v1 = mmap_array(Float64,(N,),s)
julia> v2 = mmap_array(Float64,(N,),s, numel(v1)*sizeof(eltype(v1)))
julia> d = DataFrame({v1, v2})
DataFrame (1000000000,2)
x1 x2
[1,] 0.0 0.0
[2,] 0.0 0.0
[3,] 0.0 0.0
[4,] 0.0 0.0
[5,] 0.0 0.0
[6,] 0.0 0.0
[7,] 0.0 0.0
[8,] 0.0 0.0
[9,] 0.0 0.0
[10,] 0.0 0.0
[11,] 0.0 0.0
[12,] 0.0 0.0
[13,] 0.0 0.0
[14,] 0.0 0.0
[15,] 0.0 0.0
[16,] 0.0 0.0
[17,] 0.0 0.0
[18,] 0.0 0.0
[19,] 0.0 0.0
[20,] 0.0 0.0
:
[999999981,] 0.0 0.0
[999999982,] 0.0 0.0
[999999983,] 0.0 0.0
[999999984,] 0.0 0.0
[999999985,] 0.0 0.0
[999999986,] 0.0 0.0
[999999987,] 0.0 0.0
[999999988,] 0.0 0.0
[999999989,] 0.0 0.0
[999999990,] 0.0 0.0
[999999991,] 0.0 0.0
[999999992,] 0.0 0.0
[999999993,] 0.0 0.0
[999999994,] 0.0 0.0
[999999995,] 0.0 0.0
[999999996,] 0.0 0.0
[999999997,] 0.0 0.0
[999999998,] 0.0 0.0
[999999999,] 0.0 0.0
[1000000000,] 0.0 0.0
julia> d["x1"][[2,5,N]] = NA
NA
julia> d["x2"][[1,3,6,9]] = pi
3.141592653589793
julia> head(d)
DataFrame (6,2)
x1 x2
[1,] 0.0 3.14159
[2,] NaN 0.0
[3,] 0.0 3.14159
[4,] 0.0 0.0
[5,] NaN 0.0
[6,] 0.0 3.14159
julia> sum(d[11:15, "x1"])
0.0
julia> sum(d[1:5, "x1"])
NaN
Disadvantages
No built-in filter/replace
The biggest downside to using Vectors is that they won't have built-in filtering/replacing. The use of filter and replace Bools in DataVecs is a cool idea. It's much better than R's options()$na.action; I never use that because it makes code less portable. By attaching the na.action flag to the data, code stays portable while letting the user avoid wrapping things in nafilter.
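For the record, here's a rough sketch of how flags attached to the data could drive behavior (the field names naFilter/naReplace/replaceVal are hypothetical, not the actual DataVec internals):

function mean(dv::DataVec)
    if dv.naFilter
        mean(nafilter(dv))                        # skip the NAs
    elseif dv.naReplace
        mean(nareplace(dv.data, dv.replaceVal))   # substitute first
    else
        error("NAs present: set naFilter or naReplace")
    end
end

This is exactly the convenience a plain Vector column gives up.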
I think the advantages of using Vectors outweigh losing built-in filter/replace for most applications of floating-point columns.
Confusing output
As it stands, NAs become NaNs and are displayed that way. It should be possible to modify the show() functions to overcome this, and to distinguish NaNs from NAs by using a different bit pattern for each. With support from Julia core, the output could be fixed for all arrays.
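One way to do the bit-pattern trick, along the lines of R's NA_real_ (the 0x7A2 payload below, i.e. 1954, is just an example choice; note that arithmetic is not guaranteed to preserve a NaN's payload, so this mainly helps display, not propagation):

const NA_Float64 = reinterpret(Float64, 0x7FF00000000007A2)  # a NaN carrying payload 1954
isna_exact(x::Float64) = reinterpret(Uint64, x) == 0x7FF00000000007A2
# isnan(NA_Float64) is still true, so the ten lines above keep working;
# show() could test isna_exact() and print "NA" instead of "NaN".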
No universal type for columns
New features like indexing cannot be added automatically to all column types. For many features, this could be handled by API's rather than by type inheritance. For features where this is not possible, DataVecs can be used.
No Integer/Bool/String support
This approach cannot be used for these types because they don't have a NaN. R's approach of reserving a bit pattern in each type as the NA would not work unless it were integrated into Julia's core (unlikely, and probably not a good idea). In any case, DataVecs or PooledDataVecs are more appropriate for these types, and the universe of functions that needs to work on them is smaller than the set of functions operating on Vector{Float}.
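For comparison, here's why R's sentinel approach for integers only works with core support (a sketch; typemin(Int32) is the bit pattern R uses for NA_integer_):

const NA_Int32 = typemin(Int32)   # -2147483648, R's NA_integer_
isna(x::Int32) = x == NA_Int32
# Without core support, ordinary arithmetic silently destroys the sentinel:
# NA_Int32 + 1 wraps to typemin(Int32) + 1, an ordinary integer, not NA.

NaN "just works" for floats because the hardware propagates it; an integer sentinel propagates nothing.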
Next steps
I'm not proposing that we ditch DataVecs. I'm proposing that we have an alternative: one may work better than the other in some situations.
More work is needed to sort all of this out, including deciding the defaults for column assignment and for reading from CSVs. Here's what I would choose:
- Bool -> PooledDataVec
- String -> PooledDataVec
- Integer -> DataVec (but Pooled may be used a lot here)
- Float -> Vector
We probably need an AsIs type for overriding defaults.
Another open area is promotion: what type should x::DataVec{Float64} + v::Vector{Float64} produce?
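My inclination would be to promote to the NA-aware type so no information is lost. A sketch, assuming DataVec + DataVec arithmetic is already defined:

promote_rule{T <: Float}(::Type{DataVec{T}}, ::Type{Vector{T}}) = DataVec{T}
+(v::Vector{Float64}, dv::DataVec{Float64}) = DataVec(v) + dv  # plus the mirror-image method

The alternative, converting the DataVec down to a Vector of NaNs, would silently drop the filter/replace flags.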
As for DataVec{Float}, I think we should continue to support it, but the list of functions supported at the start could be a lot smaller. If it gains wide use, folks will add methods to support it.