Currently, start/next/done iterates over columns of an AbstractDataFrame. It seems to

should start/next/done iterate over rows or cols? (dataframes.jl issue, closed)

juliadata commented on August 20, 2024

should start/next/done iterate over rows or cols?

from dataframes.jl.

Comments (16)

johnmyleswhite commented on August 20, 2024

Having iteration for AbstractDataFrames work over rows does seem like the right approach, but there's definitely a use for easy-to-use tools to perform column iteration. So much behavior on DataFrames follows a design pattern where you perform operations column-wise and then combine the results. The mean of a DataFrame, for example, is a mapping where you aggregate the means of each column. I imagine all of that can be achieved with map, but maybe there's something customized to be added here eventually.
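The column-wise pattern described here (compute one summary per column, then combine the results) can be sketched in plain Julia. The Dict of vectors is only a stand-in for a DataFrame and the column names are made up for illustration; this is not the package's own implementation.

```julia
using Statistics  # standard library; provides mean

# Hypothetical stand-in for a DataFrame: column name => column values.
columns = Dict(:x => [1.0, 2.0, 3.0], :y => [10.0, 20.0, 30.0])

# Map each column to its mean, then combine the per-column results.
column_means = Dict(name => mean(col) for (name, col) in columns)
```

Here column_means[:x] is the mean of the :x column, and likewise for :y.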

HarlanH commented on August 20, 2024

Yes, I agree. colwise is a good start, of course. Maybe we need another
wrapper type that implicitly flips the axes, so you can do [f(col) for col in flip(df)] or something. Or maybe it should be called itcols(). Dunno.

It'd be good to be able to easily iterate over a subset of columns too, of
course. Especially as we're currently sans row names.

In reply to John Myles White's comment of Wed, Aug 8, 2012, 8:48 AM.

tshort commented on August 20, 2024

This one is debatable. I don't think there are any functions that depend on this. It's easy enough to fix if I'm forgetting something.

The reason I defaulted to iterating over columns is to keep a DataFrame in line with Julia's Associative methods and also sort-of to match the equivalent of R's lapply. Which way you normally iterate over really depends on the type of analysis you have. For me, I'd say iterating over columns is more common (especially for zoo-like operations). Array comprehension syntax is one I'd rather have work over columns.

Doing anything with a subset of columns is easy: just index it. That operation is cheap.

HarlanH commented on August 20, 2024

Hm. I just don't think of a DataFrame as an Associative structure as
strongly as you do. The types of data I work with on a day-to-day basis
simply don't make sense to iterate over columnwise, or at least not all
of the columns. I tend to more frequently deal with relational data, not
sequential data. I also don't like the behavior of lapply on data.frames,
at all. Hadley's adply(df, 1, ...) is the more useful (to me) operation,
although I don't like having to give axes by numerical index.

I think something like the solution outlined above, where the default
iterator over a DF is row-wise (for my use cases), but there's some
relatively easy way to change it to col-wise (for your use cases), is the
way we should go.

In reply to Tom Short's comment of Wed, Aug 8, 2012, 9:28 AM.

tshort commented on August 20, 2024

I'm fine with that.

StefanKarpinski commented on August 20, 2024

One thing to consider: iteration and indexing should probably be consistent. That is, doing for x in df should probably give you the same thing as successively indexing into df. You can provide other iteration schemes with a wrapper, e.g. an EachRow type that you can use as for r in EachRow(df), or possibly for r in each_row(df) (see EachLine/each_line for comparison).
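A minimal sketch of that consistency property, assuming a toy column-oriented type (ColumnTable is hypothetical and is not the DataFrames type). It uses the iterate function of current Julia, which replaced start/next/done: iterating the table yields exactly what successive integer indexing yields.

```julia
# Hypothetical toy table that stores a list of columns.
struct ColumnTable
    columns::Vector{Any}
end

# A single integer index returns one column.
Base.getindex(t::ColumnTable, i::Int) = t.columns[i]
Base.length(t::ColumnTable) = length(t.columns)

# Iteration is defined through indexing, so the two cannot disagree.
Base.iterate(t::ColumnTable, i::Int = 1) =
    i > length(t) ? nothing : (t[i], i + 1)

t = ColumnTable(Any[[1, 2, 3], ["x", "y", "z"]])
by_iteration = collect(t)
by_indexing = [t[i] for i in 1:length(t)]
```

Both by_iteration and by_indexing hold the same two columns in the same order.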

HarlanH commented on August 20, 2024

Hm, perhaps. We do have reference dispatch set up so that df[1:2] gets the first two columns, but df[1:2,1] gets the first two rows of the first column. So a single index into a df does return one or more columns.

The each_line/EachLine technique does seem applicable here.

So, we keep iteration of dfs as columns by default for DataFrames (and by rows necessarily for DataStreams), and define each_row/EachRow for DataFrames? It feels like a bit more typing for me, and not entirely consistent, but reasonable.

StefanKarpinski commented on August 20, 2024

I dunno. Just giving points of reference. This is your call :-)

HarlanH commented on August 20, 2024

A reasonable compromise allowing for future changes would be to define both each_row and each_col, make one of them a no-op, and declare in the documentation that the default (unwrapped) iterator behavior might change in the future, and the safe thing to do would be to use the appropriate function.

johnmyleswhite commented on August 20, 2024

Ok, I find consistency very compelling in general.

But I think there's a very strong argument here for iteration over rows rather than entries or columns: rows are the smallest unit of a DataFrame whose values are of a consistent type. Column A may have a different type from Column B, so you can't blindly loop over the columns and apply consistent processing. But each row consists of exactly the same number of columns, and the type of the entry at position i is the same in every row. So you can blindly treat rows as equivalent.
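The type argument can be checked on a small example. This is only an illustration with made-up data, using two plain vectors as columns and tuples as rows rather than an actual DataFrame.

```julia
# Two columns with different element types (made-up data).
ages   = [31, 45, 27]
labels = ["ann", "bob", "cy"]

# The columns do not share a type, so one method cannot process both blindly.
column_types = (typeof(ages), typeof(labels))

# Each row pairs one entry from every column; all rows have the same type.
rows = collect(zip(ages, labels))
row_types = unique(typeof.(rows))
```

column_types holds two different vector types, while row_types collapses to a single tuple type shared by every row.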

Also, in statistical theory the row is almost always the fundamental object. That's the thing you usually assume is IID: the columns are almost never independent from each other and the entries are definitely not IID in any interesting model. (If they were, you'd have a vector, not a matrix or DataFrame.)

That leaves me with the idea that we should either iterate by rows or provide no iteration at all through start, next, or done. I'm totally happy with the latter approach: create EachRow and EachCol methods, or insist that people use a DataStream, not a DataFrame.

StefanKarpinski commented on August 20, 2024

That's a pretty solid argument, John.

johnmyleswhite commented on August 20, 2024

We should make some decisions here.

HarlanH commented on August 20, 2024

I remain OK with the final consensus suggestion: for row in EachRow(df) and for col in EachCol(df) both work, but for x in df throws an error.
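One way this consensus could look, as a sketch only. ToyFrame and nrows are hypothetical stand-ins, the EachRow and EachCol definitions are illustrative rather than the package's code, and iterate is the current Julia replacement for start/next/done.

```julia
# Hypothetical toy frame that stores a list of equal-length columns.
struct ToyFrame
    columns::Vector{Any}
end

nrows(f::ToyFrame) = isempty(f.columns) ? 0 : length(f.columns[1])

# Wrapper types that select the iteration axis explicitly.
struct EachRow; frame::ToyFrame; end
struct EachCol; frame::ToyFrame; end

Base.length(r::EachRow) = nrows(r.frame)
Base.length(c::EachCol) = length(c.frame.columns)

# Row iteration: gather the i-th entry of every column.
function Base.iterate(r::EachRow, i::Int = 1)
    i > length(r) && return nothing
    return ([col[i] for col in r.frame.columns], i + 1)
end

# Column iteration: hand back each stored column in order.
Base.iterate(c::EachCol, i::Int = 1) =
    i > length(c) ? nothing : (c.frame.columns[i], i + 1)

# Bare iteration is rejected so callers must choose an axis.
Base.iterate(::ToyFrame, state...) =
    throw(ArgumentError("iterate with EachRow(frame) or EachCol(frame)"))

f = ToyFrame(Any[[1, 2, 3], ["x", "y", "z"]])
rows = collect(EachRow(f))
cols = collect(EachCol(f))

# Record whether a plain `for x in f` loop fails as intended.
bare_iteration_failed = try
    for x in f; end
    false
catch err
    err isa ArgumentError
end
```

Making the bare loop an error keeps the default undecided, so either axis could become the default later without silently changing existing code.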

DataVector and DataArray iteration should definitely follow standard Julia vector and array behavior...

johnmyleswhite commented on August 20, 2024

Ok. I'll write some tests for that. I believe those already work.

johnmyleswhite commented on August 20, 2024

Turns out that EachRow and EachCol don't exist. I'm a little hesitant to create them, because they'd occupy a space that should arguably be left to Base to fill. I'll make drafts for now, but we should discuss this issue on the main Julia mailing list.

johnmyleswhite commented on August 20, 2024

About to e-mail the main Julia list about iterating over rows and columns. For now, this is closed by a137666.
