There should be a way to enforce a fixed set of pool items in a DV, and to optionally


optionally make AbstractDataVec be like an R factor, about juliadata/dataframes.jl

Comments (20)

nalimilan commented on August 20, 2024

Yes, it's so old that I'm not even sure what this issue was about.

from dataframes.jl.

doobwa commented on August 20, 2024

Allowing a predetermined set of pool items is implemented in 2711fd8

I agree that ordering flags and contrast options are still needed.


HarlanH commented on August 20, 2024

Some valuable discussion in #58. This story should be expanded -- define methods for AbstractDataVec to allow categorical/factor-like behavior, with varying performance trade-off depending on Pooled or non-pooled implementations.


HarlanH commented on August 20, 2024

OK, here's my interface-level proposal. Implementation details would differ between DataVecs and PooledDataVecs.

Each ADV would have a field called datatype::DataType, probably implemented as:

bitstype 8 DataType
@enum DataType NOMINAL ORDINAL INTERVAL RATIO

An ADV{T<:Number} would default to a RATIO type (except maybe Bool, which might default to NOMINAL). Any other type would default to NOMINAL. Non-numeric types can only be NOMINAL or ORDINAL. Numeric types can be set to any of the four types, which would give proper categorical behavior for, e.g., UIDs.

Each ADV would have an optional Domain, which can be a Set or Range. If present, elements would be checked for membership against the Domain, and an error thrown if an element is not in the Domain. A common use case would be an ASCIIString DV with NOMINAL type and a Set of possible values, which would be equivalent to an R factor.

Each ADV of ORDINAL type may have an Ordering specified, which is a function that provides an ordering of the elements, à la isless(a,b). By default, isless is used, which gives alphanumeric ordering for strings, numeric ordering for numbers, chronological ordering for dates (if we had a date type), etc. A common use case would be an ASCIIString DV with ORDINAL type and a Domain with Ordering "Completely Agree", "Agree", "Neither Agree nor Disagree", etc.

Methods might look like:

# maximally verbose way -- there would be shortcuts
x = DataVec(["Low", "Medium", "High"])
setType(x, ORDINAL)
setDomain(x, ["High", "Medium", "Low"])
orderingDict = {"High" => 3, "Medium" => 2, "Low" => 1}
setOrdering(x, (a,b) -> isless(orderingDict[a], orderingDict[b]))

# or probably something like this could be made to do the same:
x = DataVec(["Low", "Medium", "High"], @options datatype=ORDINAL)

# use the obvious ways
if getType(x) == ORDINAL
  ...
end

push(x, "Medium") # OK
push(x, "Tiny") # error!

Statistical routines would read this meta-data and act appropriately when building model matrices and similar.
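A minimal sketch of how the interface proposed above could look in present-day Julia. Every name here (SketchVec, MeasurementLevel, set_level!, set_domain!, set_ordering!) is hypothetical and not part of DataFrames.jl. The enum is called MeasurementLevel rather than DataType only because DataType is already a built-in Julia type, and plain fields stand in for the getters and setters.

```julia
# Hypothetical names throughout; this is a sketch, not the DataFrames.jl API.
@enum MeasurementLevel NOMINAL ORDINAL INTERVAL RATIO

mutable struct SketchVec{T}
    data::Vector{T}
    level::MeasurementLevel
    domain::Union{Nothing, Set{T}}   # nothing means "no Domain set"
    ordering::Function               # used like isless(a, b)
end

# Defaults described in the proposal: numbers are RATIO, Bool and the rest NOMINAL.
default_level(::Type{Bool}) = NOMINAL
default_level(::Type{<:Number}) = RATIO
default_level(::Type) = NOMINAL

SketchVec(data::Vector{T}) where {T} =
    SketchVec{T}(data, default_level(T), nothing, isless)

function set_level!(v::SketchVec{T}, level::MeasurementLevel) where {T}
    if !(T <: Number) && level in (INTERVAL, RATIO)
        throw(ArgumentError("non-numeric data can only be NOMINAL or ORDINAL"))
    end
    v.level = level
    return v
end

function set_domain!(v::SketchVec{T}, values) where {T}
    domain = Set{T}(values)
    all(in(domain), v.data) ||
        throw(ArgumentError("existing elements fall outside the domain"))
    v.domain = domain
    return v
end

function set_ordering!(v::SketchVec, lt::Function)
    v.level == ORDINAL ||
        throw(ArgumentError("an ordering only applies to ORDINAL data"))
    v.ordering = lt
    return v
end

# Appending checks membership in the Domain when one is set.
function Base.push!(v::SketchVec{T}, item) where {T}
    x = convert(T, item)
    if v.domain !== nothing && !(x in v.domain)
        throw(DomainError(x, "value is not in the domain"))
    end
    push!(v.data, x)
    return v
end

x = SketchVec(["Low", "Medium", "High"])
set_level!(x, ORDINAL)
set_domain!(x, ["High", "Medium", "Low"])
rank = Dict("High" => 3, "Medium" => 2, "Low" => 1)
set_ordering!(x, (a, b) -> isless(rank[a], rank[b]))

push!(x, "Medium")    # accepted
# push!(x, "Tiny") would throw a DomainError
```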

Notes:

x[1] < x[3] will give false, because with x as defined above this evaluates as the plain string comparison "Low" < "High", which ignores the Ordering. Not sure if there's any way around this. But x .< "Medium" should give the expected answer.

Because implementation details would differ for a DataVec vs PooledDataVec, probably want to use getters and setters instead of fields.

The Domain for PDVs would presumably double as the pool.
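To make the first note concrete, here is a small sketch in present-day Julia that contrasts plain element comparison with comparison through an explicit ordering. The rank dictionary and the ordinal_lt helper are assumptions made for illustration, using a plain Vector in place of a DataVec.

```julia
# Hypothetical ranking for the ordinal levels used in the example above.
rank = Dict("Low" => 1, "Medium" => 2, "High" => 3)
ordinal_lt(a, b) = isless(rank[a], rank[b])

x = ["Low", "Medium", "High"]

# Elements are ordinary strings, so < compares them alphabetically.
plain = x[1] < x[3]

# Going through the ordering compares the ranks instead.
ordered = ordinal_lt(x[1], x[3])

# Element-wise comparison against a level, also through the ordering.
below_medium = [ordinal_lt(v, "Medium") for v in x]
```

Here plain is false while ordered is true, which is the mismatch the note describes.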

Thoughts, @doobwa, @johnmyleswhite, @tshort?


doobwa commented on August 20, 2024

Thanks for digging into this.

One concern is that some methods will behave differently depending on the type. Would it be cleaner to have NominalDataVec, OrdinalDataVec, etc? Seems like it might get a bit crowded if we go that direction. (One might view this as an implementation detail, but since we're talking about having a DataType field I figured it was fair game.) For example, how would mean(dv) work?
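One way to picture the separate-types option is to let dispatch decide what mean(dv) does. The following is only a sketch in present-day Julia, assuming hypothetical types NominalVec and RatioVec under a hypothetical AbstractLevelVec; nothing here is confirmed to match what the package did.

```julia
using Statistics

# Hypothetical type names, one per level of measurement.
abstract type AbstractLevelVec{T} end

struct NominalVec{T} <: AbstractLevelVec{T}
    data::Vector{T}
end

struct RatioVec{T<:Number} <: AbstractLevelVec{T}
    data::Vector{T}
end

# mean is meaningful for ratio data and rejected for nominal data,
# even when the nominal values happen to be numbers (e.g. user IDs).
Statistics.mean(v::RatioVec) = mean(v.data)
Statistics.mean(v::NominalVec) =
    throw(ArgumentError("mean is undefined for nominal data"))

heights = RatioVec([1.0, 2.0, 3.0])
user_ids = NominalVec([101, 102, 103])

mean(heights)     # works
# mean(user_ids) would throw an ArgumentError
```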


tshort commented on August 20, 2024

I'm not very proficient here, but it looks well thought out. As far as function names, I'd prefer settype, setdomain, and so on. Or, use underscores (set_type, set_domain). camelCase doesn't seem to be used much in Julia code.

Concatenation or other combining may get tricky for some combinations.


HarlanH commented on August 20, 2024

Chris, that's an interesting idea. It might well be more Julian to use the
type system and multiple dispatch here. I wonder then how pooled data might
work? Without multiple inheritance, we'd either need NominalDataVec and
NominalPooledDataVec <: AbstractNominalDataVec, or NominalPooledDataVec and
OrdinalPooledDataVec <: AbstractPooledDataVec. Or, we could just have
Nominal and Ordinal be Pooled by implementation, and Interval and Ratio be
non-pooled? That would be closer to what we have now.

Tom, yes, combinations are an interesting point that I hadn't thought about
yet.

More to ponder...!


tshort commented on August 20, 2024

I had to do some googling just to figure out what each of these meant. I'm inclined to think that your "by implementation" idea is the best:

"Or, we could just have Nominal and Ordinal be Pooled by implementation, and Interval and Ratio be non-pooled?"

It might make sense to have Nominal and Ordinal share an abstract type because they will share some functions.
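A sketch of that arrangement in present-day Julia: Nominal and Ordinal are both pooled and share an abstract supertype, so common functions are written once and only the ordinal type gets a comparison. The names AbstractFactorVec, NominalPooledVec, OrdinalPooledVec, levels and precedes are all hypothetical, and using the pool order as the ordering is an assumption of this sketch.

```julia
# Hypothetical names; both concrete types are pooled "by implementation".
abstract type AbstractFactorVec{T} end

struct NominalPooledVec{T} <: AbstractFactorVec{T}
    refs::Vector{UInt16}   # index of each element in the pool
    pool::Vector{T}
end

struct OrdinalPooledVec{T} <: AbstractFactorVec{T}
    refs::Vector{UInt16}
    pool::Vector{T}        # assumption: pool order doubles as the ordering
end

# Functions shared by both types are defined once on the abstract type.
levels(v::AbstractFactorVec) = v.pool
Base.length(v::AbstractFactorVec) = length(v.refs)
Base.getindex(v::AbstractFactorVec, i::Integer) = v.pool[v.refs[i]]

# Only ordinal data supports "element i comes before element j".
precedes(v::OrdinalPooledVec, i::Integer, j::Integer) = v.refs[i] < v.refs[j]

o = OrdinalPooledVec(UInt16[1, 3, 2], ["Low", "Medium", "High"])
n = NominalPooledVec(UInt16[1, 2], ["x", "y"])

o[2]                # "High"
precedes(o, 1, 2)   # "Low" comes before "High"
```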

Is it really worth it to separate out Ratio and Interval types? I haven't run across that in R before.


HarlanH commented on August 20, 2024

It may or may not be useful to have Interval. It's not supported by R --
you're right. See this Wikipedia page:
http://en.wikipedia.org/wiki/Level_of_measurement I don't know whether
making this distinction would be useful, or annoying in practice.

The only question about the R-like solution, with Nominal and Ordinal being
Pooled, is what to do with things like the "Categorical User ID" case. We
certainly can allow that to be part of a NominalDataVec{UInt64, UInt64}
(both the pool and the data are UInt64s), but it's going to be quite
space-inefficient, with a fairly major performance hit relative to a
non-pooled implementation. The question is whether that's enough motivation
to have both the NominalDataVec and NominalPooledDataVec cases, and making
everything that much more complex.
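The space cost can be made concrete with a little arithmetic in present-day Julia. This sketch assumes every ID is unique and that the pooled layout is a UInt64 reference vector plus a UInt64 pool, as in the NominalDataVec{UInt64, UInt64} case; the variable names are illustrative only.

```julia
# Assumption: n distinct user IDs, so the pool is as long as the data.
n = 1_000
ids = UInt64.(1:n)

# Non-pooled storage: one UInt64 per element.
plain_bytes = sizeof(ids)

# Pooled storage: a pool of distinct values plus one UInt64 reference per element.
pool = unique(ids)
refs = UInt64.(indexin(ids, pool))
pooled_bytes = sizeof(refs) + sizeof(pool)

ratio = pooled_bytes / plain_bytes
```

With all-unique values the pooled layout stores every ID once in the pool and once more as a reference, so it takes twice the space before any lookup overhead is counted.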


tshort commented on August 20, 2024

For the "Categorical User ID" case, my first thought is to just use NominalDataVecs. If there's demand, a non-pooled type could be added as another type that shares an abstract type with Nominal and Ordinal.

On naming, what do you think of Factor as the name for NominalDataVec or as the abstract type that covers Nominal and Ordinal? That might help R users.


doobwa commented on August 20, 2024

Or how about CategoricalVec? Didn't pandas end up using Categorical as a name for this?

I agree that Factor would be good for R converts like me, but I never really liked that name in the first place.


HarlanH commented on August 20, 2024

I'm OK with us eventually ending up with R's solution (although I agree
with Chris -- Categorical or Nominal is better than Factor), I just want us
to make sure it's the most reasonable option...

Here's another random thought. What if we make a distinction between is-a
and has-a relationships in the type hierarchy. That is, what if the is-a
relationships (and user-visible types) are Nominal/Categorical, Ordinal,
Interval (maybe) and Ratio. But objects of these types have (rather than
are) VectorDataVec (what we now call DataVec) or PooledDataVec objects in
a (Union?) slot. Then, as long as VDV and PDV objects have a consistent
interface (via an abstract type above them), the N/O/I/R objects can use
either one. Then, we'd set things up so that Interval and Ratio always use
VDVs, while Nominal and Ordinal start with PDVs but can convert to VDVs if
they overflow a 16-bit pool. The Pooled/Vector distinction would be
entirely invisible to the user.
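A rough sketch of this has-a arrangement in present-day Julia, to show how the storage switch could stay hidden behind the user-visible type. All names (AbstractStore, VectorStore, PooledStore, Nominal, store_push!) are hypothetical, and only the Nominal wrapper is shown.

```julia
# Hypothetical storage types with a common interface.
abstract type AbstractStore{T} end

struct VectorStore{T} <: AbstractStore{T}
    data::Vector{T}
end

struct PooledStore{T} <: AbstractStore{T}
    refs::Vector{UInt16}
    pool::Vector{T}
    index::Dict{T, UInt16}   # value => position in the pool
end

PooledStore{T}() where {T} = PooledStore{T}(UInt16[], T[], Dict{T, UInt16}())

Base.length(s::VectorStore) = length(s.data)
Base.length(s::PooledStore) = length(s.refs)
Base.getindex(s::VectorStore, i::Integer) = s.data[i]
Base.getindex(s::PooledStore, i::Integer) = s.pool[s.refs[i]]

pool_is_full(s::PooledStore) = length(s.pool) >= typemax(UInt16)

function store_push!(s::VectorStore, x)
    push!(s.data, x)
    return s
end

function store_push!(s::PooledStore, x)
    # Reuse the existing reference, or add x to the pool and reference it.
    ref = get!(s.index, x) do
        push!(s.pool, x)
        UInt16(length(s.pool))
    end
    push!(s.refs, ref)
    return s
end

to_vector_store(s::PooledStore{T}) where {T} =
    VectorStore{T}(T[s.pool[r] for r in s.refs])

# The user-visible type has a store rather than being one.
mutable struct Nominal{T}
    store::AbstractStore{T}
end

Nominal{T}() where {T} = Nominal{T}(PooledStore{T}())

function Base.push!(v::Nominal{T}, item) where {T}
    x = convert(T, item)
    s = v.store
    # Switch to plain storage when a new value would overflow the 16-bit pool.
    if s isa PooledStore && !haskey(s.index, x) && pool_is_full(s)
        v.store = to_vector_store(s)
    end
    store_push!(v.store, x)
    return v
end

small = Nominal{String}()
foreach(s -> push!(small, s), ["a", "b", "a"])   # stays pooled

big = Nominal{Int}()
foreach(i -> push!(big, i), 1:70_000)            # overflows and converts
```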

In the long run, this might make additional sense when we start thinking
about optimizations for memory-mapped and indexed data, where instead of a
single underlying vector in the VDV case, you probably want a blocked data
structure instead.

(I started writing this proposal as an unlikely brainstorm, but now I sorta
like it...!)


tshort commented on August 20, 2024

It sounds like it's worth implementing or trying out some test code to see how it feels. On the conversion when overflowing a 16-bit pool, I think we still need provisions for a larger pool. This is especially important for strings; it doesn't take many repeats to justify having a pool.


HarlanH commented on August 20, 2024

OK, I'll plan on starting a "newdatavec" branch soon and playing with some of
these ideas...


johnmyleswhite commented on August 20, 2024

I've been thinking about this lately as model matrices are close to being the only major hole for me left in DataFrames. I'm starting to think that R's factor type is an error: it conflates the storage properties of our PooledDataVec with the modeling properties of a categorical variable.

Put another way: there's no reason why the categoricalness of a variable needs to depend upon the way in which it's stored. If I want to store a categorical variable as a Float64, that shouldn't be a problem.

This line of argument leads to thinking of Factor as a property of Formula and not a new data type. The only trouble with that is the absence of pre-specified levels for that factor.

But that's actually a serious problem for DataStreams as well, because you don't want to assume that you will read the entire data set just to learn about the levels of a factor. You want to have a natural way of specifying the levels manually.


HarlanH commented on August 20, 2024

Yes. I agree that conflating storage and types of data is a problem. That said, R has its global string pool, which minimizes some of the issues, at least for strings.

Do you have any thoughts about the Nominal/Ordinal/Interval/Ratio property idea? Would that address your concerns?

Is Factor a property of Formula, or an operation you can apply to a DataVec (of whatever type) to form contrasts?

I don't see why DataStreams can't use a Nominal type and just grow the set of levels as they're seen.


johnmyleswhite commented on August 20, 2024

I like the idea of distinguishing all of the classical levels of measurement. I think that Factor might need to be split into Ordinal, etc. if we do that.

I would think that Factor could be both a keyword for the Formula DSL and an operation you can do inside of Julia to produce dummy variables like pandas' get_dummies() function.

The trouble with DataStreams is that growing the set of levels could be a nightmare for things like fitting a logistic regression online using SGD. Suddenly you need to insert a new value/column/matrix section into all of your parameter estimates. It's doable, but a hassle. It gets much worse when you have things like online estimation of a Hessian that's derived from the parameters, which are derived from the dummy columns. In that case a new dummy column has to send signals to all of the other data structures that they need to be enlarged.
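The column-count problem can be illustrated with a small dummy-coding sketch in present-day Julia. The helper dummy_columns is hypothetical (it is neither the Formula DSL nor pandas' function); it builds one indicator column per level, in the order the levels are given.

```julia
# Hypothetical helper: one Bool column per level, in the order given.
function dummy_columns(values::AbstractVector, levels::AbstractVector)
    index = Dict(level => j for (j, level) in enumerate(levels))
    out = falses(length(values), length(levels))
    for (i, v) in enumerate(values)
        haskey(index, v) || throw(ArgumentError("unseen level: $v"))
        out[i, index[v]] = true
    end
    return out
end

chunk1 = ["a", "b", "a"]
chunk2 = ["c", "a"]

# Levels inferred per chunk: the number of columns depends on what was seen.
inferred1 = dummy_columns(chunk1, unique(chunk1))
inferred2 = dummy_columns(chunk2, unique(chunk2))

# Levels specified up front: every chunk yields the same columns.
levels = ["a", "b", "c"]
fixed1 = dummy_columns(chunk1, levels)
fixed2 = dummy_columns(chunk2, levels)
```

With inferred levels the two chunks disagree on what each column means, whereas with levels given up front every chunk maps onto the same three columns, so downstream parameter vectors never need to grow.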


HarlanH commented on August 20, 2024

Yes, I like having both implicit and explicit control over dummy variables.

That DataStream problem seems like an inherent problem that we're not going
to be able to fix with better data structures. It needs an algorithmic
solution...


johnmyleswhite commented on August 20, 2024

I agree: we need an algorithmic solution. My sense is that you need to specify in advance all of the levels for a DataStream's factors, possibly using a PooledDataVec that has unseen levels pre-allocated. My thinking on this is still pretty hazy, but I'm probably only a week or two away from releasing general purpose SGD code for simple linear models fit to arbitrary DataStreams as long as there are no categorical variables involved.
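As a toy illustration of pre-allocating unseen levels, the sketch below keeps one slot per declared level and updates it chunk by chunk. It is written in present-day Julia, uses per-level running means as a stand-in for the SGD estimates mentioned above, and the names LevelMeans, update! and level_means are hypothetical.

```julia
# Hypothetical accumulator: one slot per pre-declared level, seen or not.
struct LevelMeans{T}
    index::Dict{T, Int}
    counts::Vector{Int}
    sums::Vector{Float64}
end

function LevelMeans(levels::Vector{T}) where {T}
    index = Dict{T, Int}(level => j for (j, level) in enumerate(levels))
    k = length(levels)
    return LevelMeans{T}(index, zeros(Int, k), zeros(Float64, k))
end

# Consume one chunk of the stream; the accumulator never changes size.
function update!(acc::LevelMeans, groups, values)
    for (g, y) in zip(groups, values)
        j = acc.index[g]    # KeyError if the level was not declared up front
        acc.counts[j] += 1
        acc.sums[j] += y
    end
    return acc
end

# Levels that have not been seen yet come out as NaN (0.0 / 0).
level_means(acc::LevelMeans) = acc.sums ./ acc.counts

acc = LevelMeans(["a", "b", "c"])
update!(acc, ["a", "b"], [1.0, 2.0])
after_first = level_means(acc)      # "c" is declared but not yet seen
update!(acc, ["a", "c"], [3.0, 5.0])
after_second = level_means(acc)
```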


quinnj commented on August 20, 2024

@nalimilan, is this covered by your work in CategoricalArrays.jl now?

