Comments (70)
One will probably need to reserve key names anyway. In particular I do not think that metadata(df)[col] returning the metadata for column col is a good API (if we allowed this then there would be no way to specify global metadata for a table as a whole).
I think this is such a major thing that we should wait for other JuliaData members to comment before moving forward.
from dataapi.jl.
Sorry, I'm still not following the concern. Why/where would type information be important? The discussion has revolved around metadata(x) returning any kind of object that implements the AbstractDict interface, so in practice, you would use metadata like:
meta = metadata(x)
# see metadata keys
keys(meta)
# iterate over metadata key-value pairs
for (k, v) in meta
end
# check if a specific metadata key is present
haskey(meta, :specific_key)
So depending on whether metadata(x) returned a Dict or a NamedTuple, you would have different implementations of these methods, but the interface is still the same.
We should probably require that the object returned be AbstractDictLike{Symbol, Any}, i.e. require that metadata keys be Symbols; does that sound reasonable or too restrictive?
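As a rough sketch of the point above (the helper name is hypothetical, not part of any package), a Dict and a NamedTuple can both serve the same dict-like interface:

```julia
# Hypothetical sketch: two metadata carriers, one generic consumer.
# Both Dict and NamedTuple answer keys() and haskey(), so code written
# against the dict-like interface works with either.
# (Note: pair iteration over a NamedTuple needs pairs(meta), not meta itself.)
meta_dict = Dict(:source => "survey2020", :note => "cleaned")
meta_nt = (source = "survey2020", note = "cleaned")

# A generic consumer that relies only on the shared interface.
metadata_keys(meta) = sort!([String(k) for k in keys(meta)])

@assert metadata_keys(meta_dict) == metadata_keys(meta_nt)
@assert haskey(meta_dict, :source) && haskey(meta_nt, :source)
```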
from dataapi.jl.
Metadata.jl is interesting! Though for DataAPI/Tables.jl we don't necessarily have to choose a particular implementation: all we need to do is define an API that particular table types can implement, using Metadata.jl or other solutions.
What about starting with the following minimal API in Tables.jl:
- Table-level metadata: metadata(tbl) has to return Union{Nothing, AbstractDict{String, String}}. We can make this more general (now or later) by allowing any dict-like object, but that doesn't change things radically. The default implementation in Tables.jl returns nothing.
- Column-level metadata: metadata(tbl, col::Union{Integer, Symbol}) also has to return Union{Nothing, AbstractDict{String, String}}. Tables can implement this by storing column-level metadata either in the table or in the vectors themselves (e.g. using Metadata.jl). The default implementation in Tables.jl returns nothing.
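A minimal sketch of what implementing this proposal could look like (MiniTable and its fields are hypothetical, not part of Tables.jl):

```julia
# Hypothetical sketch of the minimal API above. Defaults return nothing;
# a table type opts in by adding its own methods.
metadata(tbl) = nothing
metadata(tbl, col::Union{Integer,Symbol}) = nothing

# An illustrative table type storing both levels of metadata itself.
struct MiniTable
    cols::Dict{Symbol,Vector}
    meta::Dict{String,String}                  # table-level metadata
    colmeta::Dict{Symbol,Dict{String,String}}  # column-level metadata
end

metadata(tbl::MiniTable) = tbl.meta
metadata(tbl::MiniTable, col::Symbol) = get(tbl.colmeta, col, nothing)

t = MiniTable(Dict{Symbol,Vector}(:income => [1.0, 2.0]),
              Dict("source" => "household survey"),
              Dict(:income => Dict("unit" => "USD")))
```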
An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col)
attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col))
. The constraint with this is that it would require that we agree on a common system for metadata that works for vectors (like Metadata.jl) and that tables cannot store metadata at the table level if they want. The advantage would be that people are guaranteed to be able to retrieve metadata when they only have vectors (without the table).
Once we agree on a minimal common API, we can discuss what should happen e.g. when concatenating, joining or transforming data frames at JuliaData/DataFrames.jl#2276.
from dataapi.jl.
Initially I wanted to write that metadata(::DataFrame, ::ColumnIndex) could also return a Dict{Symbol, Any} - for me that would be OK. In this case there should also be some namespace of reserved key names for internal use.
So personally I would prefer the "single function that returns a metadata dict" approach; later the user can just work on the Dict.
Ah - and now I see we could support metadata(::DataFrame, ::ColumnIndex)
that would return a NamedTuple
of dictionaries associated with columns.
I agree with @Tokazama that different people will want different things from metadata, therefore I believe the API we provide should be maximally simple and flexible.
Therefore I would prefer to think that metadata is just a Dict: there is one global dict for a data frame as a whole, and then each column can have a column-specific dict. The rest - how to work with it - would be left to the user.
from dataapi.jl.
I'm a little slow/late to the discussion here, but have thought a bit about this. I agree with the idea that this is a way that Julia/DataFrames can really stand apart/improve on the situation from R/pandas; having useful metadata integrated w/ a DataFrame could be really powerful when used in the right contexts.
That said, I worry about some of the suggestions around metadata use because they start to become so fundamental or logic-driven. IMO, if some kind of data starts to become so critical that we're changing how things are computed/etc., then it probably deserves a more structured solution than just a metadata entry in a DataFrame.
IMO, metadata should be primarily "descriptive" about the object; give context, explain values and cardinality thereof; tweaking printing/showing seems fine to me. I just worry about packages starting to abuse metadata
when they should really be creating a new AbstractArray
type or something (I mean, you could imagine someone trying to implement CategoricalArrays by just using metadata).
My other thought is that while I agree that DataFrames can do a tight integration w/ metadata, I do think we should allow/encourage metadata
to be attached/used generically on objects, including columns. There are going to be a lot of cases across the ecosystem where you're not dealing w/ a DataFrame, and it will be useful to support metadata in a variety of ways on columns, rows, etc. But yes, DataFrames can choose how it approaches its use/integration w/ metadata, either at the table-level or column level.
from dataapi.jl.
So I'm not sure what exactly the proposed API is? Is it just that metadata(x)
returns Union{Nothing, AbstractDict}
? Here are a couple thoughts/ideas:
- I'm not sure we should require AbstractDict specifically vs. "an object that supports AbstractDict methods" (or as I like to call it, AbstractDictLike); namely it'd probably be nice to allow NamedTuple to be returned from metadata, which isn't an AbstractDict, but does support the interface; it'd be good to be very clear about what exactly is required of the object returned
- If we're thinking of requiring Union{Nothing, AbstractDictLike}, I wonder if we should just require returning an "AbstractDictLike" and we can return an empty one by default; we could then provide convenience get/put methods. Alternatively, we could not require a specific object type to be returned and just have the interface be metadata(x) and metadata!(x, meta). I kind of like the idea of requiring AbstractDictLike and returning a NamedTuple() by default
- In terms of implementation, I've been looking a lot at how @doc is implemented in Base and I think it could make a lot of sense to do something similar for metadata; that is, instead of modifying DataFrame to have a metadata field, there'd be a global (or per-module) metadata IdDict that could store metadata per object. That would allow attaching metadata to all kinds of objects w/o needing wrappers. I think it also helps reinforce the idea that it's metadata, i.e. somewhat detached from the object and not to be too relied upon for program logic. Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.Docs implementation and could be used by packages everywhere. It would be pretty lightweight, but could provide a lot of flexibility and a clean, standard API that other packages can integrate with. If we want to go that route, we probably don't need a definition in DataAPI.jl
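A rough sketch of the Docs-style idea (all names hypothetical): metadata lives in an external IdDict keyed by object identity, so no wrapper type or extra struct field is needed:

```julia
# Hypothetical sketch of a Base.Docs-style external metadata store.
const METADATA_STORE = IdDict{Any,Any}()

metadata!(x, meta) = (METADATA_STORE[x] = meta; x)
metadata(x) = get(METADATA_STORE, x, NamedTuple())  # empty NamedTuple default

v = [1.0, 2.0, 3.0]
metadata!(v, (label = "Personal Income", unit = "USD"))
```

Because the store is keyed by identity, a copy of v starts with no metadata, which fits the "detached from the object" framing.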
from dataapi.jl.
I think having Tables.metadata(tbl, col) is not a problem.
The default implementations could be:
Tables.metadata(tbl, col) = Tables.metadata(Tables.getcolumn(tbl, col))
Tables.metadata(tbl) = nothing
which would also cover the case of a default table-level metadata.
Now Tables.metadata(tbl, col)
can be left as is, or if a table type has some way of keeping metadata for columns on a table level then simply Tables.metadata(tbl, col)
can have a special method added.
In particular:
Tables.metadata(tbl::AbstractDataFrame, col) = # some custom implementation
Tables.metadata(tbl::AbstractDataFrame) = # some custom implementation
can use a completely different code path.
The only problem to solve is, if both the vector and the table define metadata for a column, which should take precedence; but this should be solved at the AbstractDataFrame implementation level.
from dataapi.jl.
I can add it back in. I'm still ironing out some details before releasing the next version. The new ability to set variables in modules makes it easier to do this sort of thing without macros.
from dataapi.jl.
@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.
I pulled out the globally stored metadata stuff into a new package and I'm registering it now JuliaRegistries/General#63519.
from dataapi.jl.
@pdeffebach had some good design ideas about it in DataFrames.jl in the past.
Now, finally after 0.21.0 release, we are planning to add this functionality to DataFrames.jl.
As this is raised on a higher level let me give the API I envision for DataFrames.jl for now:
- metadata(::DataFrame) returns a Union{Nothing, Dict{Symbol,Any}} that - if filled - gives DataFrame-level metadata (this can be arbitrary metadata). The restriction would be that symbols starting with DF_ in their name would be reserved for internal use of DataFrames.jl (as a convention)
- metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES) = metadata(parent(obj))
- metadata(::DataFrame, ::ColumnIndex) returns a String (by default nothing) - which would indicate just a verbose name of the column, with the default being just the column name
- metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES, ::ColumnIndex) similar to the above if the column is present in the other type.
If we agree to this design then I can implement it. The key challenge is the rules of propagation of metadata, but this is not a DataAPI.jl-related thing so I leave this discussion for later.
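A small sketch of the reserved-prefix convention (setmeta! and is_reserved are hypothetical helpers, not a proposed API):

```julia
# Hypothetical sketch: keys starting with DF_ are reserved for
# internal use by DataFrames.jl, per the convention described above.
is_reserved(key::Symbol) = startswith(String(key), "DF_")

function setmeta!(meta::Dict{Symbol,Any}, key::Symbol, value)
    is_reserved(key) && throw(ArgumentError("$key: the DF_ prefix is reserved"))
    meta[key] = value
    return meta
end

m = Dict{Symbol,Any}()
setmeta!(m, :label, "Personal Income")
```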
from dataapi.jl.
See JuliaData/DataFrames.jl#1458 for the last attempt at implementing this in DataFrames. Two points:
- I think we need something more general than just having a custom label/verbose name for columns. For example it could be useful to store units, information about measurements, etc. Label can just be a standard field among others.
- We also need an API to set metadata and to retrieve the list of fields that have been set.
In general a choice has to be made between having 1) a single function in the API which would return a metadata dict which would have to implement specific methods (getindex
, setindex!
and keys
notably); or 2) several functions in the API that would allow doing these operations directly. See the table in my first comment at JuliaData/DataFrames.jl#1458. I think returning an object is simpler since it allows reusing the standard dict API.
from dataapi.jl.
Metadata could technically be stored at any level of something like a table. For example, each column could be a MetadataArray (i.e. from MetadataArrays.jl) and the table itself could have metadata. I worry that if we started trying to design this around column based indexing it would needlessly complicate and potentially limit its wider usability. Even the definition of what "metadata" is to different people is likely to vary so I'm not sure we should even guarantee it returns a certain type.
from dataapi.jl.
Allowing the user to decide what to do with whatever metadata returns also provides the freedom to further specialize on this later. For example you could always do something like colmeta(df, col) = metadata(df)[col] and then you wouldn't have to worry about reserving key names.
Would a simple PR to DataAPI.jl on this be a good next step right now?
from dataapi.jl.
Maybe we can say that metadata(tbl)
and metadata(tbl, col)
have to return objects implementing the AbstractDict
API, giving respectively the table-wise and column-wise metadata? That should be flexible enough for all implementations.
(In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column as JuliaData/DataFrames.jl#1458 does but that can be exposed to users via a lazy AbstractDict
object per column. What could be useful to provide in addition is a way to access these vectors for convenience/efficiency.)
from dataapi.jl.
have to return objects implementing the AbstractDict API
Agreed
In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column
We can discuss what is best in the PR for DataFrames.jl when it is done (essentially we have two options: dict of vectors or vector of dicts).
from dataapi.jl.
As I have been thinking about this issue and #1458 I came to the conclusion that we should go back to the fundamentals. And the core issue is:
It is often the case that one wants to attach metadata of some sort to an array/graph/etc.
What I mean is that while we all seem to agree that adding metadata to tables is needed, I would first discuss what kind of metadata we really think people would store in practice. This is a relevant question, as I think we should not create functionality that would later be very rarely used. Conversely - if we know exactly what we actually want to use, we can design an API that supports the required use-cases cleanly.
My two concerns are:
- persistence; most storage formats will not allow saving and loading this metadata; which means that, at least in my understanding, the use cases where people will use metadata will be situations of non-persistent metadata (i.e. something you attach to your table temporarily for programming convenience)
- performance; we do not want to kill the performance of basic operations on tables because the "table processing engine" would constantly check if metadata needs updating, or because the cost of updating the metadata would be large, or because the memory footprint of allowing metadata to be stored would be non-negligible (ideally, if there is no metadata then performance should not be affected)
So now let me go down to the starting question - what metadata do we see that would actually be used (this is not a comprehensive list - please comment on what you think would really be used, not just potentially used):
- metadata for handling how a data frame is shown (things like overriding show defaults, maybe a custom decimal delimiter, maybe - if in the future we add integration with PrettyTables.jl - some settings of all the options that package provides)
- custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice do we see that people will use this functionality?
- custom row labels - the same situation
- setting flags that some columns should be treated in a special way, e.g. for geospatial or time series analysis of data frames - this is tempting, but myself I am not 100% convinced it is a good idea, as it will be hard to ensure the metadata is consistent with the parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)
from dataapi.jl.
In addition to what you've mentioned here are some types of metadata that I think would be useful for me personally to be able to store:
- Source and collection information: I often have many tables that have some acquisition metadata pertinent to them that is not row or column specific but describes important aspects of all the data in one table.
- Column tracking: When performing semi-automated feature creation I like to keep track of certain operations/parameters/weights that resulted in the formation of a column of measures.
- Attaching metadata to a column that changes how it dispatches later on
it will be hard to ensure the metadata is consistent with a parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)
I think it depends on how much you care to take ownership of handling all metadata. I would prefer handling metadata be given a minimal interface. It could potentially have methods for things like joins, so that something like join_metadata would also join dictionaries but could be taken advantage of for custom metadata types.
I also think that I/O on metadata should be entirely dependent on the package supporting I/O. There aren't many file types equipped to flexibly handle metadata and it seems like the best thing for DataAPI.jl is to just make it simple to extract metadata.
from dataapi.jl.
I agree with all that is said here. As the author of one of the previous attempts I think that meta-data is important and people coming from R and Python often don't fully appreciate how useful metadata is for Stata users and how it has hurt the adoption of R in applied economics, especially household surveys.
- custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice we see that people will use this functionality?
My use of metadata in Stata was twofold
- Pretty printing of columns and keeping track of data. For example, the table below would not have been possible to make programmatically without extensive use of column labels. I couldn't imagine trying to write this in R because column metadata in R doesn't persist after joins.
- Keeping track of the data-cleaning process. In the above table, the variable "Standardized income index" is composed of the 3 variables below it. The
note
for that variable will tell us as much, and was automatically generated. If you were to type
note list standardized_index
you would see a note that said something along the lines of
A standardized index of 3 variables: net_earnings, consumption, durable_assets
Stata also has metadata about a table, which is often used to denote a source or author. I never used that feature.
With regards to IO, I don't see a huge problem with saving a data frame to two CSVs and providing a convenience method for adding metadata to a DataFrame when the metadata is stored as a Table. Maybe it's a bit heavy handed but it's robust.
from dataapi.jl.
I think these are all great use cases that I've wanted at some point. As someone who deals with lots of different types of metadata I'd really like to emphasize that less is more as this is implemented. It's easy to get stuck in the weeds on every little implementation detail because you have the combination of situations that arise from row specific, column specific, and general table metadata and all the different types of metadata.
This is loosely the kind of structure I'm considering using...
struct Table{T<:AbstractVector,M<:Union{Nothing,AbstractDict{Symbol,Any}}}
data::Vector{T}
index::Dict{Symbol,Int}
meta::M
end
metadata(x::Table) = getfield(x, :meta)
Users don't have to ever worry about metadata unless they decide they want it and developers can create whatever type of fancy metadata that changes dispatch as long as it is a subtype of AbstractDict{Symbol,Any}
.
I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do metadata(table.column_name)
. Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. Seems more simple for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector)
.
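A sketch of that last point (MetaVector is hypothetical, much simpler than MetadataArrays.jl): the wrapper type itself decides how metadata combines under vcat:

```julia
# Hypothetical sketch: a wrapper vector carrying a metadata dict,
# with vcat merging the two sides' metadata (later keys win on conflict).
struct MetaVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]

metadata(v::MetaVector) = v.meta

Base.vcat(a::MetaVector, b::MetaVector) =
    MetaVector(vcat(a.data, b.data), merge(a.meta, b.meta))

a = MetaVector([1, 2], Dict{Symbol,Any}(:unit => "cm"))
b = MetaVector([3], Dict{Symbol,Any}(:source => "lab"))
c = vcat(a, b)
```

This keeps concatenation behavior with the vector type rather than with DataAPI.jl, which is exactly the division of responsibility being suggested.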
from dataapi.jl.
I agree with most of what has been said. Just one point:
I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do
metadata(table.column_name)
. Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. Seems more simple for this to be implemented at the level of something likevcat(::MetadataVector, ::MetadataVector)
.
I'm afraid this wouldn't be workable, as it would require users to deal with another new kind of vector just to store metadata. That would force recompiling all functions for that type, and it wouldn't be easy to deal with e.g. CategoricalArray
. It would also make things appear more complex for users who load files with metadata (e.g. Stata files with column labels), while one of the strengths of DataFrames is that they just wrap standard arrays.
We can say the table type is responsible for preserving metadata across concatenations/joins. DataAPI itself doesn't have to know anything about that.
from dataapi.jl.
With regards to spatial data, which is a natural use case of this, is there anyone in the Julia Data community who has a really detailed knowledge of R's sf
package?
It's the best thing ever, being able to use all of dplyr
while also maintaining spatial metadata and using spatial joins etc. is incredible.
Perhaps someone who has worked on that project could provide some insights.
from dataapi.jl.
The project is going to be done during JSoC this year. And one of the reasons I am pressing to decide on metadata now is to have a clear guidance how this extra package should integrate with DataFrames.jl.
from dataapi.jl.
I have discussed:
I just worry about packages starting to abuse
metadata
when they should really be creating a newAbstractArray
type or something
with @visr in the context of geospatial data (temporal data is the same, I think) and we came to the same conclusion. The logic in packages using tables should primarily be based either on the type or on a trait of a column (a trait is probably preferable as currently Julia does not allow multiple inheritance), but not on metadata attached to it.
So given this - are there any more comments how the reference API should look like?
from dataapi.jl.
Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.Docs implementation.
Actually I would prefer this idea as it would be much more composable. The consequence for DataFrames.jl users would be:
- metadata will not propagate when the object is copied.
- still it will propagate when it is just passed, not copied.
So: df.col will keep the metadata of col, and similarly df.col = x will make col have the metadata of x.
from dataapi.jl.
Ah yes that's interesting. Indeed it's quite convenient in R to be able to attach metadata to any object, and yet in Julia we don't want to have to wrap any object in a special type just to add metadata.
Though losing the metadata on copy would be annoying. That could easily be fixed in DataFrames by ensuring we copy/readd the metadata when copying the columns (this would be needed for important cases like getindex
but also select
without transformations). But it's not easy to fix when the user calls copy
on an arbitrary object: doing this may be too costly for small objects, and for large ones it would require support from all packages (including Base...). Maybe it's not the end of the world though if one has to do copywithmetadata(x)
when needed?
(Otherwise, returning an empty NamedTuple
by default (instead of nothing
) sounds fine. We really need traits in Base!)
from dataapi.jl.
Though losing the metadata on copy would be annoying.
Personally I would feel safer if we worked this way. I would prefer to have a function that copies metadata explicitly, which can be called if someone needs it.
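One way to sketch such an explicit-copy function (names are hypothetical; this assumes an external metadata store along the lines discussed earlier in the thread):

```julia
# Hypothetical sketch: plain copy loses externally stored metadata,
# while an explicit copywithmetadata carries it over.
const META = IdDict{Any,Any}()
metadata!(x, m) = (META[x] = m; x)
metadata(x) = get(META, x, nothing)

copywithmetadata(x) = metadata!(copy(x), metadata(x))

v = metadata!([1, 2], Dict(:label => "income"))
w = copy(v)              # metadata is not carried over
z = copywithmetadata(v)  # metadata is carried over explicitly
```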
from dataapi.jl.
There are a lot of packages that use the term "metadata" (e.g, ImageMetadata.jl, MetadataArrays.jl, MetaGraph.jl, FieldMetadata.jl, FieldProperties.jl, etc.). I don't think an interface like Base.Docs
is flexible enough to fit many of the potential uses of metadata.
from dataapi.jl.
@Tokazama can you explain a little more why you think the Base.Docs
approach wouldn't be flexible enough? In terms of approach, it's more of an implementation detail: the user interface would still be metadata(x)
, it would just retrieve the metadata from a per-module store instead of retrieving it from the object itself.
from dataapi.jl.
It wouldn't carry any type information so if someone did use something like a NamedTuple
it wouldn't really help any.
from dataapi.jl.
Actually I prefer metadata
to be flexible and type unstable.
Apart from convenience it is a clear signal for developers not to use metadata to encode program logic - Julia provides other means to do this efficiently.
Metadata, as I think about it now (but my opinions evolve based on the comments we get here, as the design is not an easy decision), should be for lightweight things like descriptive strings or maybe some hints on how output should be formatted (as working with IOContext is hard for most users, and in some cases it is not flexible enough, as IOContext is not always usable - e.g. you cannot replace stdout with a custom IOContext AFAIK).
from dataapi.jl.
Actually I prefer metadata to be flexible and type unstable.
I'm not against this being the case for specific implementations like what might be done in DataFrames but I don't think it should be the only option.
from dataapi.jl.
Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to to this efficiently.
I agree. You don't want too many interfaces relying on specifically named metadata
fields to create unnecessarily complicated features.
So:
df.col
will keep the metadata ofcol
and similarlydf.col = x
will makecol
to have the metadata ofx
.
I don't fully understand this. IMO metadata should be attached to a data frame and df.col
should always return a vector without anything else attached to them.
I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as
df.income = clean_vec(df.income)
- clean_vec takes in a Vector{Float64} and for whatever reason has a concrete type signature. So no metadata is added.
- If I have a label for df.income, say "Personal Income", I don't want this to disappear. Keeping track of these operations would get tiresome.
This sort of global Dict that contains metadata is basically how R's metadata system works, and I've always found it very useless, partly because the metadata disappears.
from dataapi.jl.
@pdeffebach, I think there's a lot more flexibility in the Base.Docs
system than just thinking of it as a "global Dict". With Julia's rich type system, macros, etc. I think we could easily accommodate scenarios where you want to attach metadata to a DataFrame column, and not the Vector
itself, but to a named column of the DataFrame, which would "stick" beyond transformations. And as has been mentioned, there are a number of scenarios where you don't want metadata to stick around too much, if you're creating new objects and such.
As I've played around with ideas/implementations, I just don't see a realistic way to make a system that is general enough to be widely used that relies on either wrapper objects or requiring metadata fields. It just doesn't scale. The doc system, however, is extremely rich and accomplishes its goal/job very well, IMO; attaching extra information to types, variables, fields, etc.
Part of my experience/opinion here is coming from thinking through the entire data ecosystem, not just DataFrames. While I think DF is one of the primary targets for a metadata system, I also want to ensure that other table types, formats, and objects can also take advantage of a metadata system to enhance objects.
from dataapi.jl.
I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as
df.income = clean_vec(df.income)
1. `clean_vec` takes in a `Vector{Float64}` and for whatever reason has a concrete type signature. So no metadata is added. 2. If I have a label for `df.income`, say `"Personal Income"`, I don't want this to disappear. Keeping track of these operations would get tiresome.
@pdeffebach What kind of operations would be performed within clean_vec
? Apart from copy
, which you could replace with (say) copywithmetadata
, I'm not sure many operations should/could preserve it. In general I don't see what solution we could find for a system in which both 1) metadata is preserved on copy
and 2) metadata can be added to any object (e.g. Vector
). What we can do, though, is to have DataFrames operations copy metadata automatically where it makes sense -- but that doesn't include custom functions since we have no way of knowing if you are just cleaning the income value or creating a completely new thing.
Or maybe in your example you meant that assigning a new vector to an existing column via df.income = v
should preserve the metadata of the column? That makes some sense but could be problematic if you really want to replace the column (and you may even not know some previous column existed with that name).
from dataapi.jl.
@pdeffebach What kind of operations would be performed within
clean_vec
? Apart fromcopy
, which you could replace with (say)copywithmetadata
, I'm not sure many operations should/could preserve it.
Yes, even replace
performs a copy. From a user's perspective, this kind of behavior would require a lot of defensive programming and reasoning just to get the kind of metadata persistence that imo is the most intuitive.
we have no way of knowing if you are just cleaning the income value or creating a completely new thing.
In my view, metadata isn't a property of Vectors
, it's a property of named columns in a data frame. Exactly what object df.income
points to in memory is an implementation detail, the point is that because I assigned it to the column :income
, I want metadata(df, :income)
to return "Personal Income"
.
Ultimately, if the user wanted different metadata for df.income = clean_vec(df.income)
, they would have assigned it to a new column.
I understand the appeal for a solution that the whole data ecosystem can use, but I hope it allows for the intuitive use of metadata that Stata has, which persists across reassignment, copies, joins, etc.
from dataapi.jl.
OK. Whatever the chosen implementation, DataFrames could certainly preserve the metadata of already existing columns in setindex!
/setproperty!
if we wanted. It's more a matter of deciding whether it's a good idea, but better discuss this elsewhere.
from dataapi.jl.
And perhaps I wasn't super clear in my previous comment, but I was thinking along the lines of being able to do:
@meta "Personal Income" df.income
which would do a nested metadata attachment of "Personal Income" to the income
column of the df
object. i.e. the metadata would be attached to the df.income
array itself, but the df
object would also have a link to its "children" metadata attachments.
Obviously there are some details there to work out, but I think it'd be cool to allow this kind of "metadata" rollup from children to parents.
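A function-based sketch of the rollup idea (the @meta macro above is not implemented here; attachmeta! and childmeta are hypothetical names):

```julia
# Hypothetical sketch: attaching metadata to a child object also records
# a parent -> children link, so metadata "rolls up" to the parent.
const META = IdDict{Any,Any}()
const CHILDREN = IdDict{Any,Vector{Any}}()

function attachmeta!(parent, child, meta)
    META[child] = meta                          # metadata on the child itself
    push!(get!(CHILDREN, parent, Any[]), child) # link on the parent
    return meta
end

# Collect the metadata of all children attached under a parent.
childmeta(parent) = [META[c] for c in get(CHILDREN, parent, Any[])]

df = Dict(:income => [1.0, 2.0])  # stand-in for a data frame
attachmeta!(df, df[:income], "Personal Income")
```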
from dataapi.jl.
I think we are getting close to the design on two levels (general here and DataFrames.jl specific).
@pdeffebach - it would be great, as @nalimilan suggested, if you could comment in DataFrames.jl (in a new issue or in the old PR related to metadata) on what functionality you would like to have. I get the feeling that you have a good set of rules in mind, but we need to make them very precise (like what exactly should happen after which operations). What I mean here is that I want to avoid going into the implementation of anything before there is a consensus on how things should work.
For example - if I understood you correctly (but please comment on this in DataFrames.jl not here to keep this issue general) - if you have a dataframe df
with a column :col
then you want:
- the result of df.col not to have any metadata attached
- but on the other hand, after doing df.col = some_new_value, the metadata should be kept
- given the two rules above, it was not clear to me, for example, what you wanted to happen in the following cases:
- df.col2 = df.col (I guess you do not want col2 to have any metadata)
- if you then do select!(df, :col => :col2, :col2 => :col) - then still :col should have metadata and :col2 should not have metadata
from dataapi.jl.
Created a thread for DataFrames-specific discussion here.
With `f(df.col)`, the function `f` doesn't know that the vector it's being passed has the name `:col`. The same rule should apply with metadata, I think.
from dataapi.jl.
I've also filed JuliaData/Tables.jl#176 to discuss how Tables.jl could use the general mechanism defined in this issue to allow exchanging per-column metadata between table types. That way we can concentrate on the most general interface here.
from dataapi.jl.
Arrow.jl supports `Dict{String, String}` metadata at the table level and at the column level. Given this I would re-surface the discussion about how the metadata system for tabular data in Julia should be defined.
There are different opinions on this, so let me give my take (but I am open to other opinions).
I personally would be OK with only supporting `Dict{String, String}` at the table level and delegating to `AbstractVector` subtypes to define metadata for column vectors if needed. The benefits of this:
- we can store the metadata in Arrow.jl without losing information
- we do not enforce any semantic meaning on the metadata, so you can do whatever you like with it

The cons of this approach:
- only `String => String` mappings would be supported, but the question is whether we really need other kinds of metadata in practice (especially given that when written by Arrow.jl they would be lost)
- you have to manually manage the metadata (e.g. if you stored at the table level the mapping `column_name => column_description`, then when renaming columns it will not get automatically updated); however, I am not sure it is that useful to do it automatically, but probably Stata users can chime in here (CC @pdeffebach)

So in summary - my proposal is to be very minimal, potentially noting that this metadata system might be extended in the future (i.e. that something more than `Dict{String, String}` at the table level might be supported, so relying on the exact type of metadata is discouraged). However, I believe that this way we could quickly have "some metadata", and extend it in the future.
I am not very attached to this idea, but what I would love to have is something we can agree on that is easy enough, so that we can provide the functionality in a reasonable time frame. However, if someone has a superior proposal that is consistent (and clear about how metadata should be handled under transformations) I would love to hear it (I know that we already had many discussions about it - I think that the way forward is just to put some "end to end" proposals on the table and discuss their pros and cons).
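To make the shape of this minimal proposal concrete, here is a sketch. All names here (`SimpleTable`, `metadata`) are assumptions for illustration, not an existing API:

```julia
# Hypothetical sketch of the minimal proposal: table-level metadata as a
# plain Dict{String, String}, with `nothing` meaning "no metadata".
metadata(::Any) = nothing            # generic fallback: no metadata

struct SimpleTable                   # stand-in for a real table type
    columns::Dict{Symbol, Vector}
    meta::Union{Nothing, Dict{String, String}}
end

# Returns Union{Nothing, AbstractDict{String, String}} as proposed above.
metadata(t::SimpleTable) = t.meta
```

With this, a table carrying metadata answers `metadata(t)["source"]`, while any other object (or a table constructed with `meta = nothing`) simply returns `nothing`.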
from dataapi.jl.
I made the Metadata.jl package for this. It provides syntax for binding metadata directly through a struct or in a global variable.
from dataapi.jl.
So e.g. for `DataFrame` (it is immutable) you would use the `attach_metadata` function - right?
from dataapi.jl.
You can use either because each `DataFrame` has a unique `objectid`. I do have some support for dimension-specific metadata; I haven't done anything specific for tables b/c I wanted it to be as generic and flexible as possible.
Here's a quick example demonstrating globally stored metadata for a `DataFrame`:
julia> df = DataFrame(x = 1:2, y = 3:4)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> @attach_metadata(df, Dict(:types => [Int, Int]))
Dict{Symbol,Any} with 1 entry:
:types => DataType[Int64, Int64]
julia> df
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> @metadata(df)
Dict{Symbol,Any} with 1 entry:
:types => DataType[Int64, Int64]
julia> df2 = DataFrame(y = 1.0:2.0, z = 3.0:4.0)
2×2 DataFrame
│ Row │ y │ z │
│ │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 1.0 │ 3.0 │
│ 2 │ 2.0 │ 4.0 │
julia> @attach_metadata(df2, Dict(:types => [Float64, Float64]))
Dict{Symbol,Any} with 1 entry:
:types => DataType[Float64, Float64]
julia> @metadata(df)
Dict{Symbol,Any} with 1 entry:
:types => DataType[Int64, Int64]
julia> @metadata(df2)
Dict{Symbol,Any} with 1 entry:
:types => DataType[Float64, Float64]
from dataapi.jl.
`Union{Nothing, AbstractDict{String, String}}` makes sense, and we could say that in the future `Union{Nothing, AbstractDict}` might be supported, so people should not rely on the parameters of `AbstractDict` in dispatch (or maybe we could already assume `AbstractDict` only and just have Arrow.jl convert what it gets to `String` when serializing).
`metadata(tbl)` - I think this is uncontroversial. And in general it does not have to be a Tables.jl table but any object (just define `metadata(::Any) = nothing`). This would provide a nice fallback for the second method in case a vector defined metadata (so the table could fetch it).
`metadata(tbl, col::Union{Integer, Symbol})` - here the question is whether `AbstractString` should also be allowed? I think it is non-problematic to have this method in general, as until we support it we could just return `nothing`.
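The fallback chain described here can be sketched as follows (a sketch only; `colmeta` and `ToyTable` are hypothetical names standing in for the proposed `metadata` methods):

```julia
# Hypothetical sketch: the column-level method delegates to whatever metadata
# the column vector itself carries, and ultimately falls back to `nothing`.
colmeta(::Any) = nothing                        # default: no metadata anywhere

struct ToyTable                                 # minimal stand-in for a table
    cols::Dict{Symbol, AbstractVector}
end

# Delegate to the column vector; plain Vectors fall through to `nothing`.
colmeta(t::ToyTable, col::Symbol) = colmeta(t.cols[col])
```

A vector type that opts in to metadata (e.g. via Metadata.jl) would define its own `colmeta` method, and the table would pick it up for free through this delegation.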
from dataapi.jl.
An alternative proposal would be to completely skip the second point, and instead of `metadata(tbl, col)` attach column-level metadata to vectors themselves and require using `metadata(Tables.getcolumn(tbl, col))`.
Unless `metadata(::T, ::Symbol)` is specifically defined for `T`, the default is to just grab all the metadata and, if it is an `AbstractDict`, use `getindex`, and otherwise use `getproperty` (as defined here).
This was the discourse announcement, in case that's any help.
Once we agree on a minimal common API, we can discuss what should happen e.g. when concatenating, joining or transforming data frames at JuliaData/DataFrames.jl#2276.
I have some internal traits for copying/sharing/dropping metadata that might be useful for this sort of thing when the time comes.
from dataapi.jl.
Metadata.jl is indeed interesting and in the direction I had in mind w/ the Arrow.jl methods (which I really just threw together in order to support the Arrow specification, leaving the "generalizing" to a later project). I don't love the use of macros because they're not really doing anything? I feel like just having `Metadata.get(obj)` and `Metadata.set(x, meta)` would be simpler/clearer? I can open an issue at the Metadata.jl repo to discuss the details there more.
@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?
from dataapi.jl.
Having a Tables.jl-level API means fewer dependencies, e.g. for Arrow.jl I think.
from dataapi.jl.
`metadata(tbl, col::Union{Integer, Symbol})` - here the question is whether `AbstractString` should also be allowed? I think it is non-problematic to have this method in general, as until we support it we could just return `nothing`.

@bkamins That's a minor point I'd say. We should be consistent across Tables.jl, so better to discuss `getcolumn`, etc. at the same time, and separately from this issue, which is already complex enough.
An alternative proposal would be to completely skip the second point, and instead of `metadata(tbl, col)` attach column-level metadata to vectors themselves and require using `metadata(Tables.getcolumn(tbl, col))`.

Unless `metadata(::T, ::Symbol)` is specifically defined for `T`, the default is to just grab all the metadata and, if it is an `AbstractDict`, use `getindex`, and otherwise use `getproperty` (as defined here).
@Tokazama Note that in my proposal `Tables.metadata` would be a different (and unexported) function from `Metadata.metadata`. `Tables.metadata(tbl, col)` would retrieve metadata for column `col`, while `metadata(tbl)[key]` would access table-level attribute `key`. Otherwise there could be conflicts, e.g. if a column is called `name` and you want to store a table-level attribute called `name`.
@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?
@quinnj The reason is that we need to define an API to access column-level metadata. I agree something like Metadata.jl is enough if we decide that column-level metadata should be attached to vector objects themselves rather than stored in the table (my second proposal). But I think @pdeffebach had arguments against it.
from dataapi.jl.
Having metadata be persistent across joins and reassignment is a crucial feature.
If there were a Tables.jl-level API, we could make assurances about the persistence of metadata. @quinnj if you aren't too familiar with Stata, this is basically the model for the behavior I would like metadata to have. In Stata, it's all about persistence.
My argument against having metadata attached to vector objects is that
- we don't want `df.a` to give you some special vector type that has metadata attached to it
- we copy in a lot of places, i.e. after every `select` and `transform`, so the notion of metadata being attached to a particular object gets complicated

I'm going to cc @matthieugomez here, since he is someone familiar with Stata who has probably also thought about this in Julia.
from dataapi.jl.
I don't love the use of macros because they're not really doing anything?

Similar to `@doc`, they point to the module where `@attach_metadata` was called. If you have a type that will always store metadata in the same module, you could hard-code that and use `metadata`.
from dataapi.jl.
1. We don't want `df.a` to give you some special vector type that has metadata attached to it
@pdeffebach With `@attach_metadata` it wouldn't be a special type, just a plain `Vector` with metadata stored in a global dict.
2. We copy a lot of places, i.e. after every `select` and `transform`. So the notion of meta-data being attached to a particular object gets complicated.
We could copy metadata in `select` and `transform`. Overall I think the question of persistence should be addressed by particular implementations (e.g. DataFrames). Tables.jl doesn't care about that; it just has to allow you to pass metadata along with tables.
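One way such copying could look, as a sketch only, under the assumption that column metadata lives in a global dict keyed by `objectid` (both `COLMETA` and `copy_with_meta` are hypothetical names, not part of any existing package):

```julia
# Hypothetical sketch: when a transform copies a column, also duplicate its
# metadata entry so the new vector stays associated with the same metadata.
const COLMETA = Dict{UInt, Dict{String, String}}()

function copy_with_meta(col::AbstractVector)
    newcol = copy(col)
    m = get(COLMETA, objectid(col), nothing)
    # register a copy of the metadata under the new vector's objectid
    m !== nothing && (COLMETA[objectid(newcol)] = copy(m))
    return newcol
end
```

This is the per-copy bookkeeping cost the thread worries about: cheap relative to copying the column data itself, but it does grow the global dict unless stale entries are removed.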
from dataapi.jl.
Lets say I have
df = DataFrame(a = [1, 2], b = [3, 4])
metadata!(df, :a => "A", :b => "B")
@pipe df |>
transform(_, [:a, :b] => ByRow(+) => :c) |>
select(_, :b, :c)
with the meadata in a global Dict
, how would this work? I'm confused by what the keys and what the values are.
What happens when the columns are copied inside the transform
? You could imagine this global dict getting very ver large if we have thousands of columns and a lot of transform
calls in a pipe.
from dataapi.jl.
Yes, GC of this dict is an issue I think.
from dataapi.jl.
Yes, using a global dict will certainly be slower and less memory-efficient than storing column-level metadata in the table (especially since in that case we can store metadata using a vector with one entry for each column, and use the data frame index to map names to positions, like at JuliaData/DataFrames.jl#1458). But I wonder whether it really matters in practice: if you need to copy the column vector anyway, copying the metadata should be cheap in comparison.
Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object is destroyed? @Tokazama Have you considered that?
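A sketch of how that could work, with two caveats: `finalizer` only accepts mutable objects, and `objectid` values can in principle be reused after GC, so this is not collision-safe. The names `GLOBAL_META`, `attach_meta!`, and `get_meta` are assumptions for illustration:

```julia
# Hypothetical sketch: global metadata store cleaned up via finalizers.
const GLOBAL_META = Dict{UInt, Dict{Symbol, Any}}()

function attach_meta!(x, meta::Dict{Symbol, Any})
    GLOBAL_META[objectid(x)] = meta
    # delete the entry once `x` becomes unreachable (mutable objects only)
    finalizer(obj -> delete!(GLOBAL_META, objectid(obj)), x)
    return x
end

get_meta(x) = get(GLOBAL_META, objectid(x), nothing)
```

This keeps dispatch untouched (no wrapper type) while bounding the dict's growth to live objects.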
from dataapi.jl.
Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object it destroyed?
I don't think that's possible without type piracy. There is no unique type provided by Metadata.jl that wraps an instance attached to global metadata when using `@attach_metadata(x, meta)`. The most useful part of using global metadata is that it has absolutely no effect on dispatch, so it can't possibly slow down your code, increase latency through codegen, etc. This also means you can't propagate metadata by dispatching on a type that binds metadata to your table. If that's what you want, use `attach_metadata`, so that metadata is directly bound to your table with a wrapper that stores metadata in its structure.
If you want to do what @quinnj is suggesting, something like this should work...
function Metadata.global_metadata(tbl::MyTableType, column_name, module_name)
    return Metadata.global_metadata(getproperty(tbl, column_name), module_name)
end
...redirecting `@metadata(tbl, column_name)` to the relevant column.
Similarly, you can do this if you expect your columns to wrap metadata:
Metadata.metadata(tbl::MyTableType, column_name) = metadata(getproperty(tbl, column_name))
Handling persistence of metadata without a wrapper type (i.e. global metadata) would just require actively using `@share_metadata`/`@copy_metadata`. You might want this to be optional (e.g., `some_method(args...; share_metadata)`) so that you don't hurt performance by searching for metadata in every instance in every method.
from dataapi.jl.
I don't think that's possible without type piracy. There is no unique type provided by Metadata.jl that wraps an instance attached to global metadata when using `@attach_metadata(x, meta)`. The most useful part of using global metadata is that it has absolutely no effect on dispatch, so it can't possibly slow down your code, increase latency through codegen, etc.
Adding a finalizer doesn't require having a special type AFAICT. You can just call `finalizer(f, obj)`.
The concern about performance is that in DataFrames `transform` copies all columns, so if we want to preserve metadata we would have to also attach it to the new vectors, adding them to the global dict. If we don't remove metadata attached to objects that have been destroyed by GC, the dict will grow indefinitely, which can be a problem.
from dataapi.jl.
Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj).
Well, that's extremely good to know. I'm trying to add that now, but it has the caveat that it only works with mutable structs. Any suggestions on getting this to work with something like `DataFrame`?
from dataapi.jl.
I think this discussion has surpassed my technical knowledge, but as the co-author of JuliaData/DataFrames.jl#1458 (Milan made the design), I like its implementation. It's super transparent, I could make PRs to it, and users can understand it with a conceptual model. It's just a bit scary for me to have metadata implemented by a global dict that is invisible to the user, but that could just be my lack of technical knowledge.
from dataapi.jl.
I do not think that in DataFrames.jl we have to use a default metadata mechanism - we can do whatever we like. That is why we are discussing it in DataAPI.jl, as I would like first to agree on the API for getting metadata and, if possible, for setting metadata (but this is less crucial, as I believe different table types might provide custom mechanisms for setting metadata).
I think it is important to keep the API and the implementation separate, as otherwise we might run into problems in the future that might be hard to envision currently. Metadata.jl is very nice but it should be an opt-in I think, i.e. if some table type likes Metadata.jl it can start depending on it; but it should not be enforced.
from dataapi.jl.
Yes, keeping API and implementation separate is usually a good thing. But a difficulty here is that if we don't add `Tables.metadata(tbl, col)` to the API, then the only way to support column-level metadata is to attach it to column vectors themselves. And that's only possible if we either 1) use a global dict like Metadata.jl, or 2) wrap vectors in a custom type (which is a no-go IMO). Only `Tables.metadata(tbl, col)` allows storing per-column metadata in the DataFrame itself (with the drawback that it cannot be retrieved if you only have the vector; not sure whether that's a problem).
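For comparison, a sketch of the in-table option (all names here - `MetaTable`, `colmetadata` - are assumptions, not an existing API): one metadata slot per column, positionally aligned with the columns, so the table's name index maps to both:

```julia
# Hypothetical sketch: per-column metadata stored inside the table struct,
# positionally aligned with the column vectors themselves.
struct MetaTable
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
    colmeta::Vector{Union{Nothing, Dict{String, String}}}
end

function colmetadata(t::MetaTable, col::Symbol)
    i = findfirst(==(col), t.names)
    i === nothing && throw(ArgumentError("no column $col"))
    return t.colmeta[i]
end
```

Here renaming or reordering columns can update the metadata in lockstep, which is exactly what vector-attached metadata cannot guarantee; the trade-off is that a bare column vector carries no metadata of its own.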
from dataapi.jl.
In my opinion, metadata only makes sense at the level of a table. Arrays should not have metadata themselves, i.e.
x = df.x
should give an `x` with no metadata attached to it, since any metadata can only be understood in the context of the table it came from. Since `x` now lives on its own, separated from the DataFrame, it's not worth having any metadata attached to it.
from dataapi.jl.
In my opinion, metadata only makes sense at the level of table.
I agree - that is also my use case. However, we should design a flexible system that fits different use cases. I can imagine that people might want to attach metadata to anything in general (this is what Metadata.jl provides now).
Note that in order to have column-level metadata you would have to opt in to it (normal `Vector`s do not have metadata). So why disallow it if someone wants to do it?
Recently we had a similar discussion related to whether an `AbstractMatrix` is a table. We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one (and I am OK with this, although I have used DataFrames.jl for years and always converted matrices to tables in a way that preserved shape).
from dataapi.jl.
We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one
This is definitely the way to go. I'm currently using this for graphs, tables, and arrays. Performance and storage needs are different for each of these, but it's nice to be able to use a predictable interface for accomplishing this.
For example, you could do this in DataFrames.jl
struct DataFrameColumnMetadata{T<:AbstractDataFrame} <: AbstractDict{Symbol,Any}
    tbl::T
end

function metadata(x::DataFrameColumnMetadata, k)
    c = getcolumn(x.tbl, k)
    if has_metadata(c)
        return metadata(c)
    else
        # `no_metadata` indicates the metadata was not found, without throwing an
        # error or interfering with metadata that may use `nothing` or `missing`
        # as a meaningful value.
        return Metadata.no_metadata
    end
end

function metadata(tbl::AbstractDataFrame)
    if has_metadata(tbl)
        return Metadata.metadata(tbl)  # table-level metadata, if any was attached
    else
        return DataFrameColumnMetadata(tbl)
    end
end
from dataapi.jl.
But now every call to `copy`, `join`, `select`, etc. needs to look up a global dictionary of metadata, right? It's hard to imagine this scaling well.
from dataapi.jl.
needs to look up a global dictionary about metadata
There's a lot of flexibility here, so this doesn't need to be decided at this point.
struct DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end
Metadata.metadata(df::DataFrame) = Metadata.global_metadata(df, Main)

struct MetaDataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
    metadata::Dict{Symbol,Any}
end
Metadata.metadata(df::MetaDataFrame) = getfield(df, :metadata)
from dataapi.jl.
@Tokazama I don't understand how your last proposal stores column-level metadata. That's the main decision to make when designing a general API, I think. Attaching metadata to the data frame itself is quite easy (either using Metadata.jl or a custom field in the struct).
from dataapi.jl.
It wasn't intended to illustrate any more than that you could store metadata in an instance or in global metadata. In reality you would want to ensure that the keys in the metadata correspond to columns (e.g., `k` in `metadata(tbl, k)` corresponds to a column).
from dataapi.jl.
@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.
from dataapi.jl.
We can usually map metadata to the data's self, values, indices/axes, or dimensions.
In terms of propagating metadata, I think we've mainly discussed copy, share, and drop as options.
The first two options only work if the metadata refers to the data's self, or when following some method that copies the entirety of the data as is.
Indexing, dropping dimensions, permutation, and reduction all change how indices/axes metadata should be propagated (often also dimensional metadata).
This isn't even addressing how two datasets' metadata interact (e.g., `cat`, `merge`).
However, I think these provide enough context for most situations that some form of the following would be useful:
- `is_selfmeta(m)`: is `m` metadata that corresponds to the entirety of its corresponding data?
- `is_axesmeta(m)`: is `m` metadata that corresponds to the axes of its corresponding data?
- `is_valmeta(m)`: is `m` metadata that corresponds to the values of its corresponding data?
- `is_dimmeta(m)`: is `m` metadata that corresponds to the dimensions of its corresponding data?
- `should_copy_meta(m)`: should `m` be copied on propagation?
- `should_share_meta(m)`: should `m` be shared on propagation?
- `should_drop_meta(m)`: should `m` be dropped?

This can make the ever-branching set of possibilities with metadata far more manageable.
function index_metadata(m, inds...)
    if should_drop_meta(m)
        return nothing
    else
        f = should_copy_meta(m) ? copy : identity
        if is_axesmeta(m)
            f(map(getindex, m, inds))
        elseif is_dimmeta(m)
            f(dropints(m, inds))
        elseif is_valmeta(m)
            f(m[inds...])
        else
            f(m)
        end
    end
end
There are certainly plenty of details that remain to make this into a robust generic interface, but I thought it might at least provide some helpful thoughts on how to proceed.
from dataapi.jl.