`metadata` method, about juliadata/dataapi.jl

Comments (70)

bkamins commented on June 13, 2024 2

One will probably need to reserve key names anyway. In particular I do not think that metadata(df)[col] to return a metadata for column col is a good API (if we allowed this then there would be no way to specify global metadata for a table as a whole).

I think this is such a major thing that we should wait for other JuliaData members to comment before moving forward.

from dataapi.jl.

quinnj commented on June 13, 2024 2

Sorry, I'm still not following the concern. Why/where would type information be important? The discussion has revolved around metadata(x) returning any kind of object that implements the AbstractDict interface, so in practice, you would use metadata like:

meta = metadata(x)

# see metadata keys
keys(meta)

# iterate over metadata key-value pairs
for (k, v) in meta

end

# check if a specific metadata key is present
haskey(meta, :specific_key)

So depending on whether metadata(x) returned a Dict, or NamedTuple, you would have different implementations of these methods, but the interface is still the same.
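
A minimal sketch of this point, using only Base Julia: a Dict and a NamedTuple both answer keys and haskey, but iterating a NamedTuple directly yields only its values, so generic code that wants key-value pairs is safer going through pairs. The helper name `collect_meta` is made up for illustration and is not part of DataAPI.jl.

```julia
# Two possible return values of a hypothetical `metadata(x)`.
dict_meta = Dict{Symbol,Any}(:source => "survey", :year => 2020)
nt_meta = (source = "survey", year = 2020)

# Hypothetical helper: gather key-value pairs from any dict-like object.
# `pairs` works for both AbstractDict and NamedTuple.
function collect_meta(meta)
    out = Pair{Symbol,Any}[]
    for (k, v) in pairs(meta)
        push!(out, k => v)
    end
    return sort!(out; by = first)   # sort by key so results are comparable
end

same_pairs = collect_meta(dict_meta) == collect_meta(nt_meta)
both_have_source = haskey(dict_meta, :source) && haskey(nt_meta, :source)
```

With this shape, calling code does not need to know which container a given type chose to return.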

We should probably require that the object returned be AbstractDictLike{Symbol, Any}, i.e. require that metadata keys be Symbols; does that sound reasonable or too restrictive?

nalimilan commented on June 13, 2024 2

Metadata.jl is interesting! Though for DataAPI/Tables.jl we don't necessarily have to choose a particular implementation: all we need to do is define an API that particular table types can implement, using Metadata.jl or other solutions.

What about starting with the following minimal API in Tables.jl:

Table-level metadata: metadata(tbl) has to return Union{Nothing, AbstractDict{String, String}}. We can make this more general (now or later) by allowing any dict-like object, but that doesn't change things radically. The default implementation in Tables.jl returns nothing.
Column-level metadata: metadata(tbl, col::Union{Integer, Symbol}) also has to return Union{Nothing, AbstractDict{String, String}}. Tables can implement this by storing column-level metadata either in the table or in vectors themselves (e.g. using Metadata.jl). The default implementation in Tables.jl returns nothing.

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)). The constraint with this is that it would require that we agree on a common system for metadata that works for vectors (like Metadata.jl) and that tables cannot store metadata at the table level if they want. The advantage would be that people are guaranteed to be able to retrieve metadata when they only have vectors (without the table).

Once we agree on a minimal common API, we can discuss what should happen e.g. when concatenating, joining or transforming data frames at JuliaData/DataFrames.jl#2276.

bkamins commented on June 13, 2024 1

Initially I wanted to write that metadata(::DataFrame, ::ColumnIndex) could also return a Dict{Symbol, Any} - for me it would be OK. In this case there should also be some namespace of reserved key names for internal use.

So personally I would prefer the "single function that returns a metadata dict" approach and later the user can just work on the Dict.

Ah - and now I see we could support metadata(::DataFrame, ::ColumnIndex) that would return a NamedTuple of dictionaries associated with columns.

I agree with @Tokazama that different people will want different things from metadata therefore I believe the API we provide should be maximally simple and flexible.
Therefore I would prefer to think that metadata is just a Dict and there is one global dict for a data frame as a whole and then each column can have column specific dicts. Then the rest - how to work with it - would be delegated to a decision of the user.

quinnj commented on June 13, 2024 1

I'm a little slow/late to the discussion here, but have thought a bit about this. I agree with the idea that this is a way that Julia/DataFrames can really stand apart/improve on the situation from R/pandas; having useful metadata integrated w/ a DataFrame could be really powerful when used in the right contexts.

That said, I worry about some of the suggestions around metadata use because they start to become so fundamental or logic-driven. IMO, if some kind of data starts to become so critical we're changing how things are computed/etc. then it probably deserves a more structured solution than just a metadata entry in a DataFrame.

IMO, metadata should be primarily "descriptive" about the object; give context, explain values and cardinality thereof; tweaking printing/showing seems fine to me. I just worry about packages starting to abuse metadata when they should really be creating a new AbstractArray type or something (I mean, you could imagine someone trying to implement CategoricalArrays by just using metadata).

My other thought is that while I agree that DataFrames can do a tight integration w/ metadata, I do think we should allow/encourage metadata to be attached/used generically on objects, including columns. There are going to be a lot of cases across the ecosystem where you're not dealing w/ a DataFrame, and it will be useful to support metadata in a variety of ways on columns, rows, etc. But yes, DataFrames can choose how it approaches its use/integration w/ metadata, either at the table-level or column level.

quinnj commented on June 13, 2024 1

So I'm not sure what exactly the proposed API is? Is it just that metadata(x) returns Union{Nothing, AbstractDict}? Here are a couple thoughts/ideas:

I'm not sure we should require AbstractDict specifically vs. "an object that supports AbstractDict methods" (or as I like to call it, AbstractDictLike); namely it'd probably be nice to allow NamedTuple to be returned from metadata, which isn't an AbstractDict, but does support the interface; it'd be good to be very clear about what exactly is required of the object returned
If we're thinking of requiring Union{Nothing, AbstractDictLike}, I wonder if we should just require returning an "AbstractDictLike" and we can return an empty one by default; we could then provide convenience get/put methods. Alternatively, we could not require a specific object type to be returned and just have the interface be metadata(x) and metadata!(x, meta). I kind of like the idea of requiring AbstractDictLike and returning a NamedTuple() by default
In terms of implementation, I've been looking a lot at how @doc is implemented in Base and I think it could make a lot of sense to do something similar for metadata; that is, instead of modifying DataFrame to have a metadata field, there'd be a global (or per module) metadata IdDict that could store metadata per object. That would allow attaching metadata to all kinds of objects w/o needing wrappers. I think it also helps reinforce the idea that it's metadata, or somewhat detached from the object and not to be too relied upon for program logic. Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.Docs implementation and could be used by packages everywhere. It would be pretty lightweight, but could provide a lot of flexibility and a clean, standard API that other packages can integrate with. If we want to go that route, we probably don't need a definition in DataAPI.jl
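
A rough sketch of the "global store" idea, assuming nothing beyond Base Julia. The names `METADATA_STORE`, `metadata!` and `metadata` here are illustrative only; this is not how Base.Docs or any released package is implemented. Note that an IdDict keeps its keys alive, so a real implementation would probably want weak references.

```julia
# Hypothetical global store; an IdDict compares keys by identity (===).
const METADATA_STORE = IdDict{Any,Any}()

# Attach `meta` (any dict-like object) to `x` and return `x`.
function metadata!(x, meta)
    METADATA_STORE[x] = meta
    return x
end

# Look up metadata for `x`; objects without metadata get an empty NamedTuple.
metadata(x) = get(METADATA_STORE, x, NamedTuple())

income = [1200.0, 950.0, 1800.0]
metadata!(income, (label = "Personal Income", unit = "USD"))

label = metadata(income).label          # "Personal Income"
untouched = isempty(metadata([1, 2]))   # true: nothing was attached to this new array
```

The array type itself is unchanged, which is the appeal of this approach: no wrapper and no extra field.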

bkamins commented on June 13, 2024 1

I think having Tables.metadata(tbl, col) is not a problem to have.

The default implementations could be:

Tables.metadata(tbl, col) = Tables.metadata(Tables.getcolumn(tbl, col))
Tables.metadata(tbl) = nothing

which would also cover the case of a default table-level metadata.

Now Tables.metadata(tbl, col) can be left as is, or if a table type has some way of keeping metadata for columns on a table level then simply Tables.metadata(tbl, col) can have a special method added.

In particular:

Tables.metadata(tbl::AbstractDataFrame, col) = # some custom implementation
Tables.metadata(tbl::AbstractDataFrame) = # some custom implementation

can use a completely different code path.
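
To make the two code paths concrete, here is a self-contained sketch. `getcolumn` and `metadata` are local stand-ins for the proposed Tables.jl functions, and `NotedVector` and `MetaTable` are hypothetical types invented for this example.

```julia
# Stand-ins for Tables.getcolumn and the proposed Tables.metadata.
getcolumn(tbl, col::Symbol) = getproperty(tbl, col)
metadata(x) = nothing                                  # default: no metadata
metadata(tbl, col) = metadata(getcolumn(tbl, col))     # default: ask the column itself

# Hypothetical vector type that carries its own metadata.
struct NotedVector{T} <: AbstractVector{T}
    data::Vector{T}
    note::Dict{Symbol,Any}
end
Base.size(v::NotedVector) = size(v.data)
Base.getindex(v::NotedVector, i::Int) = v.data[i]
metadata(v::NotedVector) = v.note

# Hypothetical table type that keeps column metadata at the table level instead.
struct MetaTable
    cols::NamedTuple
    colmeta::Dict{Symbol,Dict{Symbol,Any}}
end
getcolumn(t::MetaTable, col::Symbol) = t.cols[col]
metadata(t::MetaTable, col::Symbol) = get(t.colmeta, col, nothing)

# Path 1: a plain NamedTuple table falls back to vector-level metadata.
nt = (income = NotedVector([1.0, 2.0], Dict{Symbol,Any}(:label => "Income")), age = [30, 40])
# Path 2: MetaTable answers from its own storage without touching the vectors.
mt = MetaTable((age = [30, 40],), Dict(:age => Dict{Symbol,Any}(:label => "Age in years")))
```

Which source wins when both the vector and the table define metadata is left open here, as in the comment.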

The only problem to solve is if both vector and table define metadata for column which should take the precedence, but this should be solved at AbstractDataFrame implementation level.

Tokazama commented on June 13, 2024 1

I can add it back in. I'm still ironing out some details before releasing the next version. The new ability to set variables in modules makes it easier to do this sort of thing without macros.

Tokazama commented on June 13, 2024 1

@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.

I pulled out the globally stored metadata stuff into a new package and I'm registering it now JuliaRegistries/General#63519.

bkamins commented on June 13, 2024

@pdeffebach had some good design ideas about it in DataFrames.jl in the past.
Now, finally after 0.21.0 release, we are planning to add this functionality to DataFrames.jl.

As this is raised on a higher level let me give the API I envision for DataFrames.jl for now:

metadata(::DataFrame) returns a Union{Nothing, Dict{Symbol,Any}} that - if filled - gives a DataFrame-level metadata (this can be arbitrary metadata). The restriction would be that symbols starting with DF_ in their name would be reserved for internal use of DataFrames.jl (as a convention)
metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES) = metadata(parent(obj))
metadata(::DataFrame, ::ColumnIndex) returns a String (by default nothing) - which would indicate just a verbose name of the column, with default being just column name
metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES, ::ColumnIndex) similar to the above if the column is present in the other type.
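
One way the DF_ naming convention could be enforced is sketched below. The helper names `isreserved` and `setusermeta!` are hypothetical and are not DataFrames.jl functions; the proposal above only describes the prefix as a convention.

```julia
# Prefix reserved for internal use, as proposed in the list above.
const RESERVED_PREFIX = "DF_"

# True when a metadata key falls in the reserved namespace.
isreserved(key::Symbol) = startswith(String(key), RESERVED_PREFIX)

# Hypothetical user-facing setter that refuses reserved keys.
function setusermeta!(meta::Dict{Symbol,Any}, key::Symbol, value)
    if isreserved(key)
        throw(ArgumentError("metadata keys starting with $(RESERVED_PREFIX) are reserved"))
    end
    meta[key] = value
    return meta
end

meta = Dict{Symbol,Any}()
setusermeta!(meta, :source, "household survey")
```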

If we agree to this design then I can implement it. The key challenge is rules of propagation of metadata, but this is not DataAPI.jl related thing so I leave this discussion for later.

CC @pdeffebach, @nalimilan

nalimilan commented on June 13, 2024

See JuliaData/DataFrames.jl#1458 for the last attempt at implementing this in DataFrames. Two points:

I think we need something more general than just having a custom label/verbose name for columns. For example it could be useful to store units, information about measurement, etc. Label can just be a standard field among others.
We also need an API to set metadata and to retrieve the list of fields that have been set.

In general a choice has to be made between having 1) a single function in the API which would return a metadata dict which would have to implement specific methods (getindex, setindex! and keys notably); or 2) several functions in the API that would allow doing these operations directly. See the table in my first comment at JuliaData/DataFrames.jl#1458. I think returning an object is simpler since it allows reusing the standard dict API.

Tokazama commented on June 13, 2024

Metadata could technically be stored at any level of something like a table. For example, each column could be a MetadataArray (i.e. from MetadataArrays.jl) and the table itself could have metadata. I worry that if we started trying to design this around column based indexing it would needlessly complicate and potentially limit its wider usability. Even the definition of what "metadata" is to different people is likely to vary so I'm not sure we should even guarantee it returns a certain type.

Tokazama commented on June 13, 2024

Allowing the user to decide what to do with whatever metadata returns also provides the freedom to further specialize on this later. For example you could always do something like colmeta(df, col) = metadata(df)[col] and then you wouldn't have to worry about reserving key names.

Would a simple PR to DataAPI.jl on this be a good next step right now?

nalimilan commented on June 13, 2024

Maybe we can say that metadata(tbl) and metadata(tbl, col) have to return objects implementing the AbstractDictAPI, giving respectively the table-wise and column-wise metadata? That should be flexible enough for all implementations.

(In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column as JuliaData/DataFrames.jl#1458 does but that can be exposed to users via a lazy AbstractDict object per column. What could be useful to provide in addition is a way to access these vectors for convenience/efficiency.)
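
A sketch of what such a lazy per-column view might look like, assuming the "dict of vectors" storage. `ColumnMeta` and `activekeys` are hypothetical names, and treating `nothing` as "unset" is an assumption of this example, not something the thread settled.

```julia
# Hypothetical lazy view of one column's metadata. Each metadata field maps to
# a vector with one entry per column; `nothing` marks a field as unset.
struct ColumnMeta <: AbstractDict{Symbol,Any}
    fields::Dict{Symbol,Vector{Any}}
    col::Int
    ncols::Int
end

# Fields that are actually set for this column.
activekeys(m::ColumnMeta) = [k for (k, vals) in m.fields if vals[m.col] !== nothing]

Base.length(m::ColumnMeta) = length(activekeys(m))

# `getindex` and `haskey` for AbstractDict fall back to `get`.
function Base.get(m::ColumnMeta, key, default)
    vals = get(m.fields, key, nothing)
    return (vals === nothing || vals[m.col] === nothing) ? default : vals[m.col]
end

function Base.iterate(m::ColumnMeta, state = 1)
    ks = activekeys(m)
    state > length(ks) && return nothing
    return (ks[state] => m.fields[ks[state]][m.col], state + 1)
end

# Writing through the view updates the shared storage, creating the field if needed.
function Base.setindex!(m::ColumnMeta, value, key::Symbol)
    vals = get!(() -> Any[nothing for _ in 1:m.ncols], m.fields, key)
    vals[m.col] = value
    return m
end

fields = Dict{Symbol,Vector{Any}}(:label => Any["Income", nothing], :unit => Any["USD", "years"])
col1 = ColumnMeta(fields, 1, 2)
col2 = ColumnMeta(fields, 2, 2)
col2[:label] = "Age"
```

The underlying `fields` vectors remain directly accessible, which is the convenience/efficiency escape hatch the comment mentions.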

bkamins commented on June 13, 2024

have to return objects implementing the AbstractDictAPI

Agreed

In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column

We can discuss what is best in the PR for DataFrames.jl when it is done (essentially we have two options: dict of vectors or vector of dicts).

bkamins commented on June 13, 2024

As I have been thinking about this issue and #1458 I came to the conclusion that we should go back to the fundamentals. And the core issue is:

It is often the case that one wants to attach metadata of some sort to an array/graph/etc.

What I mean is that while we all seem to agree that adding metadata to tables is needed, I would first discuss what kind of metadata we really think people would store in practice. This is a relevant question, as I think we should not create functionality that would later be very rarely used. Conversely - if we know exactly what we actually want to use we can design an API that supports the required use-cases cleanly.

My two concerns are:

persistence; most storage formats will not allow to save and load this metadata; which means that, at least in my understanding, the use cases, where people will use metadata will be situations of non-persistent metadata (i.e. something you attach to your table temporarily for programming convenience)
performance; we do not want to kill the performance of basic operations on tables, because the "table processing engine" would constantly check if metadata needs updating, or if the cost of updating the metadata would be large, or if the memory footprint of allowing to store metadata would be non-negligible (ideally if there is no metadata then the performance should not be affected)

So now let me go down to the starting question - what metadata we see that would be actually used (this is not a comprehensive list - please comment what you think would be really used - not just potentially used):

metadata for handling how data frame is shown (things like overriding show defaults, maybe custom decimal delimiter, maybe - if in the future we add integration with PrettyTables.jl, some settings of all the options that package provides)
custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice we see that people will use this functionality?
custom row labels - the same situation
setting flags that some columns should be treated in a special way, e.g. for geospatial or time series analysis of data frames - this is tempting, but myself I am not 100% convinced it is a good idea, as it will be hard to ensure the metadata is consistent with a parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)

Tokazama commented on June 13, 2024

In addition to what you've mentioned here are some types of metadata that I think would be useful for me personally to be able to store:

Source and collection information: I often have many tables that have some acquisition metadata pertinent to them that is not row or column specific but describes important aspects of all the data in one table.
Column tracking: When performing semi-automated feature creation I like to keep track of certain operations/parameters/weights that resulted in the formation of a column of measures.
Attaching metadata to a column that changes how it dispatches later on

it will be hard to ensure the metadata is consistent with a parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)

I think it depends on how much you care to take ownership of handling all metadata. I would prefer handling metadata be given a minimal interface. It could potentially have methods for things like joins so that something like join_metadata would also join dictionaries but could be taken advantage of for custom metadata types.

I also think that I/O on metadata should be entirely dependent on the package supporting I/O. There aren't many file types equipped to flexibly handle metadata and it seems like the best thing for DataAPI.jl is to just make it simple to extract metadata.

pdeffebach commented on June 13, 2024

I agree with all that is said here. As the author of one of the previous attempts I think that meta-data is important and people coming from R and Python often don't fully appreciate how useful metadata is for Stata users and how it has hurt the adoption of R in applied economics, especially household surveys.

custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice we see that people will use this functionality?

My use of metadata in Stata was twofold

Pretty printing of columns and keeping track of data. For example, the table below would not have been possible to make programmatically without extensive use of column labels. I couldn't imagine trying to write this in R because column metadata in R doesn't persist after joins.

Keeping track of the data-cleaning process. In the above table, the variable "Standardized income index" is composed of the 3 variables below it. The note for that variable will tell us as much, and was automatically generated. If you were to type

note list standardized_index

you would see a note that said something along the lines of

A standardized index of 3 variables: net_earnings, consumption, durable_assets

Stata also has metadata about a table, which is often used to denote a source or author. I never used that feature.

With regards to IO, I don't see a huge problem with saving a data frame to two CSVs and providing a convenience method for adding metadata to a DataFrame when the metadata is stored as a Table. Maybe it's a bit heavy handed but it's robust.

Tokazama commented on June 13, 2024

I think these are all great use cases that I've wanted at some point. As someone who deals with lots of different types of metadata I'd really like to emphasize that less is more as this is implemented. It's easy to get stuck in the weeds on every little implementation detail because you have the combination of situations that arise from row specific, column specific, and general table metadata and all the different types of metadata.

This is loosely the kind of structure I'm considering using...

struct Table{T<:AbstractVector,M<:Union{Nothing,AbstractDict{Symbol,Any}}}
    data::Vector{T}
    index::Dict{Symbol,Int}
    meta::M
end

metadata(x::Table) = getfield(x, :meta)

Users don't have to ever worry about metadata unless they decide they want it and developers can create whatever type of fancy metadata that changes dispatch as long as it is a subtype of AbstractDict{Symbol,Any}.

I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do metadata(table.column_name). Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. Seems more simple for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector).
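
As an illustration of pushing the merge rule down to the vector level, here is a sketch with a hypothetical `MetaVector` type. The rule used (keep only the entries on which both inputs agree) is just one possible choice, picked for this example; it is not taken from MetadataArrays.jl or any other package.

```julia
# Hypothetical vector wrapper that carries a metadata dictionary.
struct MetaVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]
metadata(v::MetaVector) = v.meta

# Assumed merge rule: keep only the entries present and equal in both inputs.
function Base.vcat(a::MetaVector{T}, b::MetaVector{T}) where {T}
    shared = Dict{Symbol,Any}(
        k => v for (k, v) in a.meta if haskey(b.meta, k) && isequal(b.meta[k], v)
    )
    return MetaVector(vcat(a.data, b.data), shared)
end

a = MetaVector([1, 2], Dict{Symbol,Any}(:unit => "USD", :source => "wave A"))
b = MetaVector([3], Dict{Symbol,Any}(:unit => "USD", :source => "wave B"))
c = vcat(a, b)   # data is concatenated; only the agreeing :unit entry survives
```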

nalimilan commented on June 13, 2024

I agree with most of what has been said. Just one point:

I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do metadata(table.column_name). Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. Seems more simple for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector).

I'm afraid this wouldn't be workable, as it would require users to deal with another new kind of vector just to store metadata. That would force recompiling all functions for that type, and it wouldn't be easy to deal with e.g. CategoricalArray. It would also make things appear more complex for users who load files with metadata (e.g. Stata files with column labels), while one of the strengths of DataFrames is that they just wrap standard arrays.

We can say the table type is responsible for preserving metadata across concatenations/joins. DataAPI itself doesn't have to know anything about that.

pdeffebach commented on June 13, 2024

With regards to spatial data, which is a natural use case of this, is there anyone in the Julia Data community who has a really detailed knowledge of R's sf package?

It's the best thing ever, being able to use all of dplyr while also maintaining spatial metadata and using spatial joins etc. is incredible.

Perhaps someone who has worked on that project could provide some insights.

bkamins commented on June 13, 2024

The project is going to be done during JSoC this year. And one of the reasons I am pressing to decide on metadata now is to have a clear guidance how this extra package should integrate with DataFrames.jl.

bkamins commented on June 13, 2024

I have discussed:

I just worry about packages starting to abuse metadata when they should really be creating a new AbstractArray type or something

with @visr in the context of geospatial data (temporal data is the same I think) and we came to the same conclusion. The logic in packages using tables should primarily be based either on type or a trait of a column (trait is probably preferable as currently Julia does not allow for multiple inheritance), but not metadata attached to it.

So given this - are there any more comments how the reference API should look like?

bkamins commented on June 13, 2024

Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.

Actually I would prefer this idea as it would be much more composable. The consequence for DataFrames.jl users would be:

metadata will not propagate when the object is copied.
still it will propagate when it is just passed, not copied.

So: df.col will keep the metadata of col and similarly df.col = x will make col to have the metadata of x.

nalimilan commented on June 13, 2024

Ah yes that's interesting. Indeed it's quite convenient in R to be able to attach metadata to any object, and yet in Julia we don't want to have to wrap any object in a special type just to add metadata.

Though losing the metadata on copy would be annoying. That could easily be fixed in DataFrames by ensuring we copy/readd the metadata when copying the columns (this would be needed for important cases like getindex but also select without transformations). But it's not easy to fix when the user calls copy on an arbitrary object: doing this may be too costly for small objects, and for large ones it would require support from all packages (including Base...). Maybe it's not the end of the world though if one has to do copywithmetadata(x) when needed?
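
A small sketch of the trade-off, assuming an identity-keyed global store. `METADATA_STORE`, `metadata!`, `metadata` and `copywithmetadata` are all hypothetical names used only for this illustration.

```julia
# Hypothetical identity-keyed store, as in the global-metadata idea.
const METADATA_STORE = IdDict{Any,Any}()
metadata!(x, meta) = (METADATA_STORE[x] = meta; x)
metadata(x) = get(METADATA_STORE, x, NamedTuple())

# Hypothetical helper: copy the object, then re-attach the same metadata to the copy.
function copywithmetadata(x)
    y = copy(x)
    haskey(METADATA_STORE, x) && metadata!(y, METADATA_STORE[x])
    return y
end

v = metadata!([1.0, 2.0], (label = "Personal Income",))
alias = v                    # same object, so the metadata is still found
plain = copy(v)              # new object, metadata is not carried over
kept = copywithmetadata(v)   # new object, metadata explicitly re-attached
```

The explicit helper keeps plain `copy` cheap and leaves the decision to carry metadata along with the caller.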

(Otherwise, returning an empty NamedTuple by default (instead of nothing) sounds fine. We really need traits in Base!)

bkamins commented on June 13, 2024

Though losing the metadata on copy would be annoying.

Personally I would feel safer if we worked this way. I would prefer to have a function that copies metadata explicitly that can be called if someone needs it.

Tokazama commented on June 13, 2024

There are a lot of packages that use the term "metadata" (e.g, ImageMetadata.jl, MetadataArrays.jl, MetaGraph.jl, FieldMetadata.jl, FieldProperties.jl, etc.). I don't think an interface like Base.Docs is flexible enough to fit many of the potential uses of metadata.

quinnj commented on June 13, 2024

@Tokazama can you explain a little more why you think the Base.Docs approach wouldn't be flexible enough? In terms of approach, it's more of an implementation detail: the user interface would still be metadata(x), it would just retrieve the metadata from a per-module store instead of retrieving it from the object itself.

Tokazama commented on June 13, 2024

It wouldn't carry any type information so if someone did use something like a NamedTuple it wouldn't really help any.

bkamins commented on June 13, 2024

Actually I prefer metadata to be flexible and type unstable.

Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to do this efficiently.

Metadata, as I think about it now (but my opinions evolve based on the comments we get here as the design here is not an easy decision) should be for lightweight things like descriptive strings or maybe some hints how output should be formatted (as working with IOContext is hard for most users and in some cases it is not flexible enough as IOContext is not always usable - e.g. you cannot replace stdout with a custom IOContext AFAIK).

Tokazama commented on June 13, 2024

Actually I prefer metadata to be flexible and type unstable.

I'm not against this being the case for specific implementations like what might be done in DataFrames but I don't think it should be the only option.

pdeffebach commented on June 13, 2024

Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to do this efficiently.

I agree. You don't want too many interfaces relying on specifically named metadata fields to create unnecessarily complicated features.

So: df.col will keep the metadata of col and similarly df.col = x will make col to have the metadata of x.

I don't fully understand this. IMO metadata should be attached to a data frame and df.col should always return a vector without anything else attached to them.

I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as

df.income = clean_vec(df.income)

clean_vec takes in a Vector{Float64} and for whatever reason has a concrete type signature. So no meta-data is added.
If I have a label for df.income, say "Personal Income", I don't want this to disappear. Keeping track of these operations would get tiresome.

This sort of global Dict that contains metadata is basically how R's metadata system works, and I've always found it very useless, partly because the metadata disappears.

quinnj commented on June 13, 2024

@pdeffebach, I think there's a lot more flexibility in the Base.Docs system than just thinking of it as a "global Dict". With Julia's rich type system, macros, etc. I think we could easily accommodate scenarios where you want to attach metadata to a DataFrame column, and not the Vector itself, but to a named column of the DataFrame, which would "stick" beyond transformations. And as has been mentioned, there are a number of scenarios where you don't want metadata to stick around too much, if you're creating new objects and such.

As I've played around with ideas/implementations, I just don't see a realistic way to make a system that is general enough to be widely used that relies on either wrapper objects or requiring metadata fields. It just doesn't scale. The doc system, however, is extremely rich and accomplishes its goal/job very well, IMO; attaching extra information to types, variables, fields, etc.
Part of my experience/opinion here is coming from thinking through the entire data ecosystem, not just DataFrames. While I think DF is one of the primary targets for a metadata system, I also want to ensure that other table types, formats, and objects can also take advantage of a metadata system to enhance objects.

nalimilan commented on June 13, 2024

I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as

df.income = clean_vec(df.income)

1. `clean_vec` takes in a `Vector{Float64}` and for whatever reason has a concrete type signature. So no meta-data is added.

2. If I have a label for `df.income`, say `"Personal Income"`, I don't want this to disappear. Keeping track of these operations would get tiresome.

@pdeffebach What kind of operations would be performed within clean_vec? Apart from copy, which you could replace with (say) copywithmetadata, I'm not sure many operations should/could preserve it. In general I don't see how we could find a system in which both 1) metadata is preserved on copy and 2) metadata can be added to any object (e.g. Vector). What we can do, though, is to have DataFrames operations copy metadata automatically where it makes sense -- but that doesn't include custom functions since we have no way of knowing if you are just cleaning the income value or creating a completely new thing.

Or maybe in your example you meant that assigning a new vector to an existing column via df.income = v should preserve the metadata of the column? That makes some sense but could be problematic if you really want to replace the column (and you may even not know some previous column existed with that name).

pdeffebach commented on June 13, 2024

@pdeffebach What kind of operations would be performed within clean_vec? Apart from copy, which you could replace with (say) copywithmetadata, I'm not sure many operations should/could preserve it.

Yes, even replace performs a copy. From a user's perspective, this kind of behavior would require a lot of defensive programming and reasoning just to get persistence in metadata that imo is the most intuitive.

we have no way of knowing if you are just cleaning the income value or creating a completely new thing.

In my view, metadata isn't a property of Vectors, it's a property of named columns in a data frame. Exactly what object df.income points to in memory is an implementation detail, the point is that because I assigned it to the column :income, I want metadata(df, :income) to return "Personal Income".

Ultimately, if the user wanted different metadata for df.income = clean_vec(df.income), they would have assigned it to a new column.
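
A sketch of these semantics, where metadata belongs to the column name rather than to the vector object. `NamedColumns`, `getcolumn`, `setcolumn!` and this `metadata` method are hypothetical and only illustrate the behavior described, not how DataFrames.jl works.

```julia
# Hypothetical table type: column metadata is keyed by column name, not by the vector object.
struct NamedColumns
    cols::Dict{Symbol,Vector}
    colmeta::Dict{Symbol,Dict{Symbol,Any}}
end

getcolumn(t::NamedColumns, name::Symbol) = t.cols[name]   # plain vector, nothing attached
metadata(t::NamedColumns, name::Symbol) = get(t.colmeta, name, nothing)

# Replacing a column's data leaves the metadata stored under that name untouched.
function setcolumn!(t::NamedColumns, name::Symbol, v::Vector)
    t.cols[name] = v
    return t
end

# Stand-in for the cleaning step; it only sees a bare Vector{Float64}.
clean_vec(v::Vector{Float64}) = max.(v, 0.0)

t = NamedColumns(
    Dict{Symbol,Vector}(:income => [100.0, -5.0]),
    Dict(:income => Dict{Symbol,Any}(:label => "Personal Income")),
)
setcolumn!(t, :income, clean_vec(getcolumn(t, :income)))
label = metadata(t, :income)[:label]   # still "Personal Income" after reassignment
```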

I understand the appeal for a solution that the whole data ecosystem can use, but I hope it allows for the intuitive use of metadata that Stata has, which persists across reassignment, copies, joins, etc.

nalimilan commented on June 13, 2024

OK. Whatever the chosen implementation, DataFrames could certainly preserve the metadata of already existing columns in setindex!/setproperty! if we wanted. It's more a matter of deciding whether it's a good idea, but better discuss this elsewhere.

quinnj commented on June 13, 2024

And perhaps I wasn't super clear in my previous comment, but I was thinking along the lines of being able to do:

@meta "Personal Income" df.income

which would do a nesting metadata attachment of "Personal Income" to the income column of the df object. i.e. the metadata would be attached to the df.income array itself, but the df object would also have a link to its "children" metadata attachments.

Obviously there are some details there to work out, but I think it'd be cool to allow this kind of "metadata" rollup from children to parents.

from dataapi.jl.

bkamins commented on June 13, 2024

I think we are getting close to the design on two levels (general here and DataFrames.jl specific).

@pdeffebach - it would be great, as @nalimilan suggested, if you could comment in the DataFrames.jl (in a new issue or in the old PR related to metadata) what functionality you would like to have. I get a feeling that you have a good set of rules in mind, but we need to make them very precise (like what exactly should happen after what operations). What I mean here is that I want to avoid going into implementation of anything before there is a consensus how things should work.

For example - if I understood you correctly (but please comment on this in DataFrames.jl not here to keep this issue general) - if you have a dataframe df with a column :col then you want:

- the result of df.col not to have any metadata attached,
- but on the other hand, after doing df.col = some_new_value the metadata should be kept.

Given the two rules above, it was not clear to me what you wanted to happen in, for example, the following cases:
- df.col2 = df.col (I guess you do not want col2 to have any metadata)
- if you then do select!(df, :col => :col2, :col2 => :col), then :col should still have metadata and :col2 should not have metadata

from dataapi.jl.

pdeffebach commented on June 13, 2024

Created a thread for DataFrames-specific discussion here.

With f(df.col), the function f doesn't know that the vector it's being passed has the name :col. The same rule should apply with metadata I think.

from dataapi.jl.

nalimilan commented on June 13, 2024

I've also filed JuliaData/Tables.jl#176 to discuss how Tables.jl could use the general mechanism defined in this issue to allow exchanging per-column metadata between Table types. That way we can concentrate on the most general interface here.

from dataapi.jl.

bkamins commented on June 13, 2024

Arrow.jl supports Dict{String, String} on a table level and on column level. Given this I would re-surface the discussion about how metadata system for tabular data in Julia should be defined.

There are different opinions on this, so let me give my take (but I am open to other opinions).

I personally would be OK with only supporting Dict{String, String} on a table level and delegate to AbstractVector subtypes to define metadata for column vectors if needed. What are the benefits of this:

we can store the metadata in Arrow.jl without losing the information.
we do not enforce any semantic meaning to the metadata, so you can do whatever you like with it

What are the cons of this approach:

only String=>String mappings would be supported, but the question is do we really need other kind of metadata in practice (especially given that when written by Arrow.jl it would be lost)
you have to manually manage the metadata (e.g. if you stored on a table level the mapping column_name => column_description then when e.g. renaming columns it will not get automatically updated); however, I am not sure it is that useful to do it automatically, but probably Stata users can chime in here (CC @pdeffebach)

So in summary - my proposal is to be very minimal, while mentioning that this metadata system might be extended in the future (i.e. that something more than Dict{String, String} on table level might be supported, so relying on the exact type of metadata is discouraged). I believe that this way we could quickly have "some metadata", and extend it later.
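
As a rough illustration of the minimal proposal and of the manual-management drawback listed above, the sketch below keeps all metadata in one table-level `Dict{String,String}`. The `"column:"` key prefix, the sample values, and the helper `rename_column_key!` are assumptions made for this example; nothing in the thread prescribes them.

```julia
# Table-level metadata restricted to String => String pairs.
# The keys and values here are made-up sample data.
meta = Dict{String,String}(
    "source" => "survey 2020",
    "column:income" => "Personal Income",
)

# Hypothetical helper: the table does not know that "column:income" refers to
# a column, so a rename has to move the description by hand.
function rename_column_key!(meta::Dict{String,String}, old::String, new::String)
    key = "column:" * old
    if haskey(meta, key)
        meta["column:" * new] = pop!(meta, key)
    end
    return meta
end

rename_column_key!(meta, "income", "earnings")
```

Without the explicit call, a column rename would leave a stale `"column:income"` entry behind, which is the cost of not attaching semantics to the metadata.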

I am not very attached to this idea, but what I would love to have is something we can agree on and is easy enough, so that we can provide the functionality in a reasonable time frame. However, if someone has a superior proposal that is consistent (and clear how metadata should be handled under transformations) I would love to hear what it would be (I know that we already had many discussions about it - I think that the way forward is just to put some "end to end" proposals on the table and discuss their pros and cons).

from dataapi.jl.

Tokazama commented on June 13, 2024

I made the Metadata.jl package for this. It provides syntax for binding metadata directly through a struct or in a global variable.

from dataapi.jl.

bkamins commented on June 13, 2024

So e.g. for DataFrame (it is immutable) you would use the attach_metadata function - right?

from dataapi.jl.

Tokazama commented on June 13, 2024

You can use either because each DataFrame has a unique objectid. I do have some support for dimension-specific metadata, but I haven't done anything that is specific to tables b/c I wanted it to be as generic and flexible as possible.

Here's a quick example demonstrating globally stored metadata for a DataFrame.

julia> df = DataFrame(x = 1:2, y = 3:4)
2×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> @attach_metadata(df, Dict(:types => [Int, Int]))
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Int64, Int64]

julia> df
2×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> @metadata(df)
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Int64, Int64]

julia> df2 = DataFrame(y = 1.0:2.0, z = 3.0:4.0)
2×2 DataFrame
│ Row │ y       │ z       │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 1.0     │ 3.0     │
│ 2   │ 2.0     │ 4.0     │

julia> @attach_metadata(df2, Dict(:types => [Float64, Float64]))
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Float64, Float64]

julia> @metadata(df)
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Int64, Int64]

julia> @metadata(df2)
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Float64, Float64]

from dataapi.jl.

bkamins commented on June 13, 2024

Union{Nothing, AbstractDict{String, String}} makes sense and we could say that in the future Union{Nothing, AbstractDict} might be supported, so people should not rely on the parameters of AbstractDict in dispatch (or maybe we could already assume AbstractDict only and just Arrow.jl would convert what it gets to String when serializing).

metadata(tbl) - I think this is uncontroversial. And in general it does not have to be a Tables.jl table but any object (just define metadata(::Any) = nothing). This would provide a nice fallback for the second method in case a vector defined metadata (so the table could fetch it).

metadata(tbl, col::Union{Integer, Symbol}) - here the question is if AbstractString should be also allowed? I think it is non-problematic to have this method in general, as till we support it we could just return nothing.
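
A sketch of how the two proposed methods and the `nothing` fallback could fit together is shown below. The `metadata` function here is a local stand-in written for this example, not the definition in DataAPI.jl or Tables.jl, and `LabeledTable` is a hypothetical table type.

```julia
# Stand-in generic function with the proposed fallbacks: no metadata by default.
metadata(::Any) = nothing
metadata(::Any, ::Union{Integer,Symbol}) = nothing

# Hypothetical table type that opts in by storing both levels of metadata.
struct LabeledTable
    data::NamedTuple
    tablemeta::Dict{String,String}
    colmeta::Dict{Symbol,Dict{String,String}}
end

metadata(t::LabeledTable) = t.tablemeta
metadata(t::LabeledTable, col::Symbol) = get(t.colmeta, col, nothing)
# Integer positions are translated to names using the column order.
metadata(t::LabeledTable, col::Integer) = metadata(t, keys(t.data)[col])

t = LabeledTable(
    (name = ["a", "b"], income = [1, 2]),
    Dict("name" => "toy survey"),
    Dict(:income => Dict("label" => "Personal Income")),
)

metadata(t)["name"]     # table-level attribute called "name"
metadata(t, :name)      # column :name has no metadata, so `nothing`
metadata(t, 2)          # metadata of the second column, :income
metadata(1:3)           # any other object falls back to `nothing`
```

Because table-level keys and column names are reached through different methods, a table attribute called "name" and a column called :name do not collide.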

from dataapi.jl.

Tokazama commented on June 13, 2024

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)).

Unless metadata(::T, ::Symbol) is specifically defined for T, the default is to just grab all the metadata and, if it is an AbstractDict, use getindex and otherwise use getproperty (as defined here).

This was the discourse announcement, in case that's any help.

Once we agree on a minimal common API, we can discuss what should happen e.g. when concatenating, joining or transforming data frames at JuliaData/DataFrames.jl#2276.

I have some internal traits for copying/sharing/dropping metadata that might be useful for this sort of thing when the time comes.

from dataapi.jl.

quinnj commented on June 13, 2024

Metadata.jl is indeed interesting and in the direction I had in mind w/ the Arrow.jl methods (which I really just threw together in order to support the arrow specification, leaving the "generalizing" to a later project). I don't love the use of macros because they're not really doing anything? I feel like just having Metadata.get(obj) and Metadata.set(x, meta) would be simpler/clearer? I can open an issue at Metadata.jl repo to discuss the details there more.

@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?

from dataapi.jl.

bkamins commented on June 13, 2024

Having Tables.jl-level API generates less dependencies, e.g. for Arrow.jl I think.

from dataapi.jl.

nalimilan commented on June 13, 2024

metadata(tbl, col::Union{Integer, Symbol}) - here the question is if AbstractString should be also allowed? I think it is non-problematic to have this method in general, as till we support it we could just return nothing.

@bkamins That's a minor point I'd say. We should be consistent across Tables.jl, so better discuss getcolumn, etc. at the same time, and separately from this issue which is already complex enough.

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)).

Unless metadata(::T, ::Symbol) is specifically defined for T, the default is to just grab all the metadata and, if it is an AbstractDict, use getindex and otherwise use getproperty (as defined here).

@Tokazama Note that in my proposal Tables.metadata would be a different (and unexported) function from Metadata.metadata. Tables.metadata(tbl, col) would retrieve metadata for column col, while metadata(tbl)[key] would access table-level attribute key. Otherwise there could be conflicts e.g. if a column is called name and you want to store a table-level attribute called name.

@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?

@quinnj The reason is that we need to define an API to access column-level metadata. I agree something like Metadata.jl is enough if we decide that column-level metadata should be attached to vector objects themselves rather than stored in the table (my second proposal). But I think @pdeffebach had arguments against it.

from dataapi.jl.

pdeffebach commented on June 13, 2024

Having metadata be persistent across joins and reassignment is a crucial feature.

If there was a Tables.jl-level API, it could make assurances about the persistence of metadata. @quinnj if you aren't too familiar with Stata, this is basically the model for the behavior I would like metadata to have. In Stata, it's all about persistence.

My argument against having metadata attached to vector objects is that

We don't want df.a to give you some special vector type that has metadata attached to it
We copy a lot of places, i.e. after every select and transform. So the notion of meta-data being attached to a particular object gets complicated.

I'm going to cc @matthieugomez here, since he is someone familiar with Stata who has probably also thought about this in Julia.

from dataapi.jl.

Tokazama commented on June 13, 2024

I don't love the use of macros because they're not really doing anything?

Similar to @doc, they point to the module where @attach_metadata was called. If you have a type that will always store metadata in the same module you could hard code that in and use metadata.

from dataapi.jl.

nalimilan commented on June 13, 2024

1. We don't want `df.a` to give you some special vector type that has metadata attached to it

@pdeffebach With @attach_metadata it wouldn't be a special type, just a plain Vector with metadata stored in a global dict.

2. We copy a lot of places, i.e. after every `select` and `transform`. So the notion of meta-data being attached to a particular object gets complicated.

We could copy metadata in select and transform. Overall I think the question of persistence should be addressed by particular implementations (e.g. DataFrames). Tables.jl doesn't care about that, it just has to allow you to pass metadata along with tables.

from dataapi.jl.

pdeffebach commented on June 13, 2024

Let's say I have

df = DataFrame(a = [1, 2], b = [3, 4])
metadata!(df, :a => "A", :b => "B")
@pipe df |>
    transform(_, [:a, :b] => ByRow(+) => :c) |>
    select(_, :b, :c)

With the metadata in a global Dict, how would this work? I'm confused about what the keys and the values are.

What happens when the columns are copied inside the transform? You could imagine this global dict getting very, very large if we have thousands of columns and a lot of transform calls in a pipe.

from dataapi.jl.

bkamins commented on June 13, 2024

Yes, GC of this dict is an issue I think.

from dataapi.jl.

nalimilan commented on June 13, 2024

Yes, using a global dict will certainly be slower and less memory-efficient than storing column-level metadata in the table (especially since in that case we can store metadata using a vector with one entry for each column, and use the data frame index to map names to positions, like at JuliaData/DataFrames.jl#1458). But I wonder whether it really matters in practice: if you need to copy the column vector anyway, copying the metadata should be cheap in comparison.

Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object is destroyed? @Tokazama Have you considered that?

from dataapi.jl.

Tokazama commented on June 13, 2024

Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object is destroyed?

I don't think that's possible without type piracy. There is no unique type provided by "Metadata" that wraps an instance that is attached to global metadata when using @attach_metadata(x, meta). The most useful part of using global metadata is that it has absolutely no effect on dispatch so it can't possibly slow down your code, increase latency through codegen, etc. This also means you can't propagate metadata by dispatching on a type that binds metadata to your table. If that's what you want, use attach_metadata, so that metadata is directly bound to your table with a wrapper that stores metadata in its structure.

If you want to do what @quinnj is suggesting, something like this should work...

function Metadata.global_metadata(tbl::MyTableType, column_name, module_name)
    # forward the lookup to the column vector, keeping the same module
    return Metadata.global_metadata(getproperty(tbl, column_name), module_name)
end

...redirecting @metadata(tbl, column_name) to the relevant column.

Similarly, you can do this if you expect your columns to wrap metadata.

Metadata.metadata(tbl::MyTableType, column_name) = metadata(getproperty(tbl, column_name))

Handling persistence of metadata without a wrapper type (i.e. global metadata) would just require actively using @share_metadata/@copy_metadata. You might want this to be optional (e.g., some_method(args...; share_metadata)) so that you don't hurt performance by searching for metadata in every instance in every method.

from dataapi.jl.

nalimilan commented on June 13, 2024

I don't think that's possible without type piracy. There is no unique type provided by "Metadata" that wraps an instance that is attached to global metadata when using @attach_metadata(x, meta). The most useful part of using global metadata is that it has absolutely no effect on dispatch so it can't possibly slow down your code, increase latency through codegen, etc..

Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj).

The concern about performance is that in DataFrames transform copies all columns, so if we want to preserve metadata we would also have to attach it to the new vectors, adding them to the global dict. If we don't remove from the dict the metadata attached to objects that have been destroyed by GC, it will grow indefinitely, which can be a problem.
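
The finalizer idea can be sketched with Base functions only. This is an illustration under stated assumptions, not how Metadata.jl is implemented: `GLOBAL_META`, `attach_global!`, and `global_meta` are hypothetical names, and entries are keyed by `objectid` so the store itself does not keep the object alive.

```julia
# Hypothetical global store keyed by object identity.
const GLOBAL_META = Dict{UInt,Dict{Symbol,Any}}()

function attach_global!(obj, meta::Dict{Symbol,Any})
    id = objectid(obj)
    GLOBAL_META[id] = meta
    # Drop the entry when `obj` is finalized. `finalizer` throws for
    # immutable objects, which is the caveat raised in the thread.
    finalizer(_ -> delete!(GLOBAL_META, id), obj)
    return obj
end

global_meta(obj) = get(GLOBAL_META, objectid(obj), nothing)

v = [1, 2, 3]
attach_global!(v, Dict{Symbol,Any}(:label => "Personal Income"))
global_meta(v)      # the attached Dict

finalize(v)         # run the finalizer now instead of waiting for GC
global_meta(v)      # `nothing`: the entry has been removed
```

`finalize` is called here only to make the cleanup observable; in normal use the garbage collector would trigger it. A more careful version would probably also need to guard against a finalizer running while the dictionary is being modified elsewhere.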

from dataapi.jl.

Tokazama commented on June 13, 2024

Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj).

Well, that's extremely good to know. I'm trying to add that now but it has the caveat that it only works with mutable structs. Any suggestions on getting this to work with something like DataFrame?

from dataapi.jl.

pdeffebach commented on June 13, 2024

I think this discussion has surpassed my technical knowledge, but as the co-author of JuliaData/DataFrames.jl#1458 (Milan made the design), I like its implementation. It's super transparent, I could make PRs to it, and users can understand it with a conceptual model. It's just a bit scary for me to have metadata be implemented by a global dict that is invisible to the user, but that could just be my lack of technical knowledge.

from dataapi.jl.

bkamins commented on June 13, 2024

I do not think in DataFrames.jl we have to use a default metadata mechanism - we can do whatever we like. That is why we are discussing it in DataAPI.jl as I would like first to agree on the API of getting metadata and, if possible for setting metadata (but this is less crucial as I believe different table types might provide custom mechanisms for setting the metadata).

I think it is important to keep the API and the implementation separate, as otherwise we might run into problems in the future that might be hard to envision currently. Metadata.jl is very nice but it should be an opt-in I think, i.e. if some table type likes Metadata.jl it can start depending on it; but it should not be enforced.

from dataapi.jl.

nalimilan commented on June 13, 2024

Yes, keeping API and implementation separate is usually a good thing. But a difficulty here is that if we don't add Tables.metadata(tbl, col) to the API, then the only way to support column-level metadata is to attach them to column vectors themselves. And that's only possible if we either 1) use a global dict like Metadata.jl, or 2) wrap vectors in a custom type (which is a no-go IMO). Only Tables.metadata(tbl, col) allows storing per-column metadata in the DataFrame itself (with the drawback that it cannot be retrieved if you only have the vector; not sure whether it's a problem).

from dataapi.jl.

pdeffebach commented on June 13, 2024

In my opinion, metadata only makes sense at the level of table. Arrays should not have metadata themselves. i.e.

x = df.x

x should have no metadata attached to it, since any metadata can only be understood in the context of the table which it came from. Since x now lives on its own, separated from the DataFrame, it's not worth having any metadata attached to it.

from dataapi.jl.

bkamins commented on June 13, 2024

In my opinion, metadata only makes sense at the level of table.

I agree that it is also my use case. However, we should design a flexible system that would fit different use cases. I can imagine that people might want to attach metadata to anything in general (this is what Metadata.jl provides now).

Note that in order to have column-level metadata you would have to opt in to this (normal Vectors do not have metadata). So why disallow this if someone wants to do it?

Recently we had a similar discussion related to AbstractMatrix being or not being a table. We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one (and I am OK with this, although I have used DataFrames.jl for years and always converted matrices to tables in a way that preserved shape).

from dataapi.jl.

Tokazama commented on June 13, 2024

We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one

This is definitely the way to go. I'm currently using this for graphs, tables, and arrays. Performance and storage needs are different for each of these, but it's nice to be able to use a predictable interface for accomplishing this.

For example, you could do this in DataFrames.jl

struct DataFrameColumnMetadata{T<:AbstractDataFrame} <: AbstractDict{Symbol,Any}
    tbl::T
end

function metadata(x::DataFrameColumnMetadata, k)
    c = getcolumn(x.tbl, k)
    if has_metadata(c)
        return metadata(c)
    else
        # indicates the metadata was not found without throwing an error or interfering
        # with metadata that may use `nothing` or `missing` as a meaningful value.
        return Metadata.no_metadata
    end
end


function metadata(tbl::AbstractDataFrame)
    if has_metadata(tbl)
        return metadata(tbl)
    else
        return DataFrameColumnMetadata(tbl)
    end
end

from dataapi.jl.

pdeffebach commented on June 13, 2024

But now every call to copy, join, select, etc. needs to look up a global dictionary about metadata, right? It's hard to imagine this scaling well.

from dataapi.jl.

Tokazama commented on June 13, 2024

needs to look up a global dictionary about metadata

There's a lot of flexibility here so that this doesn't need to be decided here.

struct DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end

Metadata.metadata(df::DataFrame) = Metadata.global_metadata(df, Main)


struct MetaDataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
    metadata::Dict{Symbol,Any}
end

Metadata.metadata(df::MetaDataFrame) = getfield(df, :metadata)

from dataapi.jl.

nalimilan commented on June 13, 2024

@Tokazama I don't understand how your last proposal stored column-level metadata. That's the main decision to make when designing a general API I think. Attaching metadata to the data frame itself is quite easy (either using Metadata.jl or a custom field in the struct).

from dataapi.jl.

Tokazama commented on June 13, 2024

It wasn't intended to illustrate any more than that you could store metadata in an instance or in global metadata. In reality you would want to ensure that the keys in the metadata correspond to columns (e.g., k in metadata(tbl, k) corresponds to a column).

from dataapi.jl.

nalimilan commented on June 13, 2024

@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.

from dataapi.jl.

Tokazama commented on June 13, 2024

We can usually map metadata to the data's self, values, indices/axes, or dimensions.
In terms of propagating metadata, I think we've mainly discussed copy, share, and drop as options.
The first two options only work if the metadata refers to the data's self or when following some method that copies the entirety of the data as is.
Indexing, dropping dimensions, permutation, and reduction all change how indices/axes metadata should be propagated (often also dimensional metadata).
This isn't even addressing how two datasets' metadata interact (e.g., cat, merge).
However, I think these provide enough context for most situations that some form of the following would be useful

is_selfmeta(m): is m metadata that corresponds to the entirety of its corresponding data?
is_axesmeta(m): is m metadata that corresponds to the axes of its corresponding data?
is_valmeta(m): is m metadata that corresponds to the values of its corresponding data?
is_dimmeta(m): is m metadata that corresponds to dimensions of its corresponding data?
should_copy_meta(m): should m be copied on propagation?
should_share_meta(m): should m be shared on propagation?
should_drop_meta(m): should m be dropped?

This can make the ever branching set of possibilities with metadata far more manageable.

function index_metadata(m, inds...)
    if should_drop_meta(m)
        return nothing
    else
        f = should_copy_meta(m) ? copy : identity
        if is_axesmeta(m)
            f(map(getindex, m, inds))
        elseif is_dimmeta(m)
            f(dropints(m, inds))
        elseif is_valmeta(m)
            f(m[inds...])
        else
            f(m)
        end
    end
end

There are certainly plenty of details that remain to make this into a robust generic interface, but I thought it might at least provide some helpful thoughts on how to proceed.
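
One way the copy/share/drop choice could be expressed is with trait dispatch, sketched below in plain Julia. The trait types and the functions `propagation` and `propagate` are hypothetical and are not part of Metadata.jl; the mapping of `Dict` to copying and of `Provenance` to dropping is an arbitrary choice made for the example.

```julia
# Hypothetical propagation traits.
abstract type MetaPropagation end
struct CopyMeta  <: MetaPropagation end
struct ShareMeta <: MetaPropagation end
struct DropMeta  <: MetaPropagation end

# Default: share the metadata object. Types opt in to other behavior.
propagation(::Any) = ShareMeta()
propagation(::Dict) = CopyMeta()

# Example metadata type that should not follow derived data.
struct Provenance
    note::String
end
propagation(::Provenance) = DropMeta()

# Dispatch on the trait to decide what the new object receives.
propagate(m) = propagate(propagation(m), m)
propagate(::CopyMeta, m)  = copy(m)
propagate(::ShareMeta, m) = m
propagate(::DropMeta, m)  = nothing

labels = Dict(:income => "Personal Income")
units = ["USD"]

propagate(labels)                    # an independent copy
propagate(units)                     # the same object, shared
propagate(Provenance("raw import"))  # dropped: `nothing`
```

A function such as the `index_metadata` sketch above could call `propagate` in place of choosing between `copy` and `identity` by hand.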

from dataapi.jl.
