Comments (16)

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

Having iteration for AbstractDataFrames work over rows does seem like the right approach, but there's definitely a use for easy-to-use tools to perform column iteration. So much behavior on DataFrames follows a design pattern where you perform operations column-wise and then combine results. The mean of a DataFrame, for example, is a mapping where you aggregate the means of each column. I imagine all of that can be achieved with map, but maybe there's something customized to be added here eventually.
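
The column-wise pattern described above (apply an aggregate to each column, then combine the results) can be sketched with plain map. This is only an illustration: the `table` below is a hypothetical stand-in for a DataFrame, built as a NamedTuple of column vectors, and is not the DataFrames API.

```julia
using Statistics  # standard library; provides mean

# Hypothetical stand-in for a DataFrame: a NamedTuple of equal-length columns.
table = (a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])

# Apply the aggregate to each column; map over a NamedTuple keeps the column names.
col_means = map(mean, table)
```

Here `col_means.a` and `col_means.b` hold the per-column means, keyed by column name.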

from dataframes.jl.

HarlanH avatar HarlanH commented on August 20, 2024

Yes, I agree. colwise is a good start, of course. Maybe we need another wrapper type that implicitly flips the axes, so you can do [f(col) for col in flip(df)] or something. Or maybe it should be called itcols(). Dunno.

It'd be good to be able to easily iterate over a subset of columns too, of course. Especially as we're currently sans row names.

tshort avatar tshort commented on August 20, 2024

This one is debatable. I don't think there are any functions that depend on this. It's easy enough to fix if I'm forgetting something.

The reason I defaulted to iterating over columns is to keep a DataFrame in line with Julia's Associative methods and also, sort of, to match the equivalent of R's lapply. Which way you normally iterate really depends on the type of analysis you do. For me, I'd say iterating over columns is more common (especially for zoo-like operations). Array comprehension syntax is one I'd rather have work over columns.

Doing anything with a subset of columns is easy: just index it. That operation is cheap.

HarlanH avatar HarlanH commented on August 20, 2024

Hm. I just don't think of a DataFrame as an Associative structure as strongly as you do. The types of data I work with on a day-to-day basis simply don't make sense to iterate over columnwise, or at least not all of the columns. I tend to more frequently deal with relational data, not sequential data. I also don't like the behavior of lapply on data.frames, at all. Hadley's adply(df, 1, ...) is the more useful (to me) operation, although I don't like having to give axes by numerical index.

I think something like the solution outlined above, where the default iterator over a DF is row-wise (for my use cases), but there's some relatively easy way to change it to col-wise (for your use cases), is the way we should go.

tshort avatar tshort commented on August 20, 2024

I'm fine with that.

StefanKarpinski avatar StefanKarpinski commented on August 20, 2024

One thing to consider: iteration and indexing should probably be consistent. That is, doing for x in df should probably give you the same thing as successively indexing into df. You can provide other iteration schemes with a wrapper, e.g. an EachRow type that you can use as for r in EachRow(df) or possibly for r in each_row(df) (see EachLine/each_line for comparison).
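
A minimal sketch of the wrapper idea, under stated assumptions: `ToyTable` and `nrows` are hypothetical stand-ins for a DataFrame, and the sketch uses the `Base.iterate` protocol of current Julia rather than the start/next/done functions that existed when this thread was written. It is not the implementation that DataFrames adopted.

```julia
# Hypothetical minimal table: a list of equal-length columns.
struct ToyTable
    columns::Vector{AbstractVector}
end

nrows(t::ToyTable) = isempty(t.columns) ? 0 : length(first(t.columns))

# Wrapper types that make the iteration axis explicit.
struct EachRow
    table::ToyTable
end

struct EachCol
    table::ToyTable
end

# Row iteration: the state is the index of the next row to yield.
function Base.iterate(rows::EachRow, i::Int = 1)
    i > nrows(rows.table) && return nothing
    row = Any[col[i] for col in rows.table.columns]
    return (row, i + 1)
end
Base.length(rows::EachRow) = nrows(rows.table)

# Column iteration: the state is the index of the next column to yield.
function Base.iterate(cols::EachCol, j::Int = 1)
    j > length(cols.table.columns) && return nothing
    return (cols.table.columns[j], j + 1)
end
Base.length(cols::EachCol) = length(cols.table.columns)

df = ToyTable(AbstractVector[[1, 2, 3], ["x", "y", "z"]])
row_list = collect(EachRow(df))  # one entry per row
col_list = collect(EachCol(df))  # one entry per column
```

With wrappers like these, the choice of axis is visible at the call site (for r in EachRow(df) versus for c in EachCol(df)) and does not depend on what bare iteration over the table does.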

HarlanH avatar HarlanH commented on August 20, 2024

Hm, perhaps. We do have reference dispatch set up so that df[1:2] gets the first two columns, but df[1:2,1] gets the first two rows of the first column. So a single index into a df does return one or more columns.
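
The indexing convention described above can be sketched on a toy type. `IndexedFrame` is a hypothetical stand-in, and these two getindex methods only mimic the behavior described in this comment (a single range selects columns, a row selector plus a column number selects entries of that column); they are not the package's actual dispatch.

```julia
# Hypothetical minimal table: a list of equal-length columns.
struct IndexedFrame
    columns::Vector{AbstractVector}
end

# Single range index: select whole columns, returning a smaller table.
Base.getindex(f::IndexedFrame, cols::AbstractUnitRange) = IndexedFrame(f.columns[cols])

# Row selector plus column number: select entries of that one column.
Base.getindex(f::IndexedFrame, rows, col::Integer) = f.columns[col][rows]

f = IndexedFrame(AbstractVector[[1, 2, 3], ["x", "y", "z"], [1.5, 2.5, 3.5]])
first_two_cols = f[1:2]     # a table holding columns 1 and 2
first_two_rows = f[1:2, 1]  # rows 1 and 2 of column 1
```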

The each_line/EachLine technique does seem applicable here.

So, we keep iteration of dfs as columns by default for DataFrames (and by rows necessarily for DataStreams), and define each_row/EachRow for DataFrames? It feels like a bit more typing for me, and not entirely consistent, but reasonable.

StefanKarpinski avatar StefanKarpinski commented on August 20, 2024

I dunno. Just giving points of reference. This is your call :-)

HarlanH avatar HarlanH commented on August 20, 2024

A reasonable compromise allowing for future changes would be to define both each_row and each_col, make one of them a no-op, and declare in the documentation that the default (unwrapped) iterator behavior might change in the future, and the safe thing to do would be to use the appropriate function.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

Ok, I find consistency very compelling in general.

But I think there's a very strong argument here for iteration over rows rather than entries or columns: rows are the smallest unit of a DataFrame whose values are of a consistent type. Column A may have a different type from Column B, so you can't blindly loop over the columns and apply consistent processing. But each row consists of exactly the same number of columns, and the type of the entry at position i is the same in every row. So you can blindly treat rows as equivalent.

Also, in statistical theory the row is almost always the fundamental object. That's the thing you usually assume is IID: the columns are almost never independent from each other and the entries are definitely not IID in any interesting model. (If they were, you'd have a vector, not a matrix or DataFrame.)

That leaves me with the idea that we should do iteration by rows or that there should be no iteration at all using start, next or done. I'm totally happy with that latter approach: create EachRow and EachCol methods or insist that people use a DataStream, not a DataFrame.

StefanKarpinski avatar StefanKarpinski commented on August 20, 2024

That's a pretty solid argument, John.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

We should make some decisions here.

HarlanH avatar HarlanH commented on August 20, 2024

I remain OK with the final consensus suggestion: for row in EachRow(df) or for col in EachCol(df) both work, while a bare for x in df throws an error.

DataVector and DataArray iteration should definitely follow standard Julia vector and array behavior...
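
The "bare iteration throws an error" part of that consensus could look roughly like this. `StrictFrame` and `bare_iteration_fails` are hypothetical names, and the sketch again assumes the `Base.iterate` protocol of current Julia. A type with no iterate method already fails with a MethodError; defining the method explicitly only serves to give a more helpful message.

```julia
# Hypothetical minimal table: a list of equal-length columns.
struct StrictFrame
    columns::Vector{AbstractVector}
end

# Reject bare iteration so that callers have to pick an axis explicitly.
Base.iterate(::StrictFrame) =
    error("iterate with EachRow(df) or EachCol(df); bare iteration is not defined")

# Helper: returns true if looping directly over df raises the error above.
function bare_iteration_fails(df)
    try
        for x in df
        end
        return false
    catch err
        return err isa ErrorException
    end
end

result = bare_iteration_fails(StrictFrame(AbstractVector[[1, 2], ["a", "b"]]))
```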

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

Ok. I'll write some tests for that. I believe those already work.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

Turns out that EachRow and EachCol don't exist. I'm a little hesitant to create them, because they'd occupy a space that should arguably be left to Base to fill. I'll make drafts for now, but we should discuss this issue on the main Julia mailing list.

johnmyleswhite avatar johnmyleswhite commented on August 20, 2024

About to e-mail the main Julia list about iterating over rows and columns. For now, this is closed by a137666.
