w3c / microdata
Moved to https://html.spec.whatwg.org/multipage/microdata.html
From https://w3c.github.io/microdata/#the-microdata-model
5.1 The microdata model
The microdata model consists of groups of name-value pairs known as items.
Each group is known as an item. Each item can have item types, a global identifier (if the vocabulary specified by the item types support global identifiers for items), and a list of name-value pairs. Each name in the name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item). The names are unordered relative to each other, but if a particular name has multiple values, they do have a relative order.
Q: What does "they do have a relative order" vs "are unordered" actually mean? Did anyone implement against this distinction?
A test case to explore could be based on something like:
<div itemscope itemtype="http://schema.org/Book">
<meta itemprop="bookFormat" content="EBook/DAISY3"/>
<meta itemprop="accessibilityFeature" content="largePrint/CSSEnabled"/>
<meta itemprop="accessibilityFeature" content="highContrast/CSSEnabled"/>
<span itemprop="author">
<div itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Alice Aardvark</span>
</div>
</span>
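<span itemprop="author">
<div itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Zac Zebedee</span>
</div>
</span>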
</div>
The spec seems to say that the relative order of accessibilityFeature vs author on this Book is unimportant, whereas the values for accessibilityFeature are ordered relative to each other, and the ordering of the two authors listed is also considered in some sense significant. For example, perhaps a later accessibilityFeature declaration overrides an earlier one; or perhaps a first-listed author is implicitly said to be a more significant contributor. Microdata delegates such details to vocabularies such as Schema.org. Schema.org says that it does not attach meaning at this level. Does anyone else?
So - I would like to explore clarifications in this area. Neither Schema.org nor the earlier datavocabulary.org vocabulary assigns semantics to this kind of property ordering. At Google we extract schema.org and datavocabulary Microdata into re-order-able triples / graphs; our parser currently assumes other uses of Microdata follow this pattern. I suspect @gkellogg and other parser writers may have implemented structures that represent the property ordering, but I do not know of anyone making use of such facilities.
I suggest that "but if a particular name has multiple values, they do have a relative order" may lack implementations beyond parsers, i.e. in vocabularies and the publisher/consumer ecosystem. Is "parsers can handle this distinction" enough of an argument to preserve this aspect of Microdata, or can the spec be simplified in the light of experience here?
We might consider clarifying that the entire Microdata structure can be viewed as fully ordered as HTML, considered in the context of its life within a larger HTML document. This can be very important for use cases such as editors. However we might choose to say that order is not significant / meaningful when considering Microdata as a carrier of factual claims.
One way to state this idea would be to try to agree that any circumstances that are captured by the above test case ought to also be equally accurately described by the following test case (in which I have reordered everything):
<div itemscope itemtype="http://schema.org/Book">
<span itemprop="author">
<div itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Zac Zebedee</span>
</div>
</span>
<span itemprop="author">
<div itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Alice Aardvark</span>
</div>
</span>
<meta itemprop="accessibilityFeature" content="highContrast/CSSEnabled"/>
<meta itemprop="accessibilityFeature" content="largePrint/CSSEnabled"/>
<meta itemprop="bookFormat" content="EBook/DAISY3"/>
</div>
These distinctions are a bit easier to state for languages that explicitly extract into atomic triples, but I think we can find a way.
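One way to make the "order is not significant" reading concrete is to compare two items after flattening them into order-free structures. The following is a minimal Python sketch, not anything defined by the spec: it assumes the items have already been parsed into a hypothetical dictionary shape with `type` and `properties` keys, and the function names `canonical_item` and `same_claims` are made up for illustration.

```python
from collections import Counter


def canonical(value):
    """Return a hashable, order-free form of a single property value."""
    # Nested items are dictionaries; plain values are strings.
    if isinstance(value, dict):
        return canonical_item(value)
    return value


def canonical_item(item):
    """Flatten an item so that neither names nor values carry an order."""
    types = frozenset(item.get("type", []))
    props = frozenset(
        # Counter keeps repeated values distinct while discarding their order.
        (name, frozenset(Counter(canonical(v) for v in values).items()))
        for name, values in item.get("properties", {}).items()
    )
    return (types, props)


def same_claims(a, b):
    """True if two items make the same claims once ordering is ignored."""
    return canonical_item(a) == canonical_item(b)
```

Under this reading, the two Book test cases would compare as equal; a consumer that does rely on value order would need a stricter comparison than this one.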
Does anyone know of a use of Microdata which depends upon "but if a particular name has multiple values, they do have a relative order."?
/cc @tmarshbing @nicolastorzec @chaals, @betehess (and @pmika for old time's sake) for Bing, Yahoo, Yandex, Apple perspective on this.
<div itemscope itemtype="http://schema.org/CreativeWork">
<data itemprop="name" value="data value attribute" content="data content attribute" >data element text content</data>
<data itemprop="name" value="lone data value attribute" >missed data element' value attribute [BUG]</data>
<data itemprop="name">data used its element's textContent</data>
<meter itemprop="name" content="meter content attribute" value="meter value attribute">meter element text content</meter>
<meter itemprop="name" value="lone meter value attribute" >missed meter element's value attribute [BUG]</meter>
<meter itemprop="name">meter used its element's textContent</meter>
</div>
in the SDL generates:
data content attribute
meter used its element's textContent
lone meter value attribute
data used its element's textContent
lone data value attribute
meter content attribute
and in Google's SDTT gives
@type: CreativeWork
name: data content attribute
name: missed data element' value attribute [BUG]
name: data used its element's textContent
name: meter content attribute
name: missed meter element's value attribute [BUG]
name: meter used its element's textContent
I'll do more testing, but modulo the apparent bug in Google of not reading the value attribute at all, I think we should align the value algorithm to match this behaviour. See also #20, #38
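The precedence these results suggest, once the value attribute is actually read, can be sketched as a small function. This is only an illustration of the proposal, not the spec's algorithm: the name `property_value` and its argument shape are hypothetical, and the ordering (content attribute, then value attribute on data and meter, then text content) is an assumption drawn from the test output in this issue.

```python
def property_value(tag, attrs, text):
    """Pick the value of a property element.

    tag   -- lower-case element name, e.g. "data" or "meter"
    attrs -- dictionary of the element's attributes
    text  -- the element's text content
    """
    # A content attribute wins wherever it appears.
    if "content" in attrs:
        return attrs["content"]
    # data and meter fall back to their value attribute.
    if tag in ("data", "meter") and "value" in attrs:
        return attrs["value"]
    # Everything else uses the text content.
    return text
```

Other elements that have their own value-carrying attributes are deliberately left out of this sketch.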
Thanks to Nick Doty and Christine Runnegar for comments leading to this issue
The document uses "microdata" instead of "Microdata", except in these cases:
Title:
HTML Microdata
ToC:
Converting Microdata to other formats
Body:
Vocabulary specifications must not define property names for Microdata that contain […]
The original specification for Microdata was developed by Ian Hickson.
Whichever variant gets used (I would prefer "Microdata"), I think it should be consistent.
The Values section discusses getting a value from elements including data and meter, but this is returned as textContent. However, the JSON Serialization section specifically says to serialize JSON using "no unnecessary zero digits in numbers", implying that values may be numbers which could only come from these elements. Certainly the intention of the data and meter elements is that the content is machine readable, and descriptive text in HTML 5.2 does suggest that this is numeric (at least for the meter element).
The Microdata to RDF spec treats this content as numeric if it is either valid xsd:integer or xsd:double, and as text otherwise.
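As a rough illustration of that kind of typing rule, the Python sketch below classifies a lexical value as integer, double, or plain text. The regular expressions are simplified approximations of the XSD lexical forms (for instance, INF and NaN are not handled), and the function name `classify` is hypothetical, so treat this as a sketch rather than what any processor actually does.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified lexical patterns; the real XSD definitions allow a little more.
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DOUBLE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")


def classify(lexical):
    """Return (lexical, datatype URL), or (lexical, None) for plain text."""
    # Integers are checked first because every integer also looks like a double.
    if _INTEGER.fullmatch(lexical):
        return (lexical, XSD + "integer")
    if _DOUBLE.fullmatch(lexical):
        return (lexical, XSD + "double")
    return (lexical, None)
```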
This doesn't seem to be used much, nor to be very useful - its claimed purpose is to define the data carried in a drag and drop operation, but that isn't implemented anywhere I could find.
[…] lang and XML xml:lang language attributes where appropriate, rather than creating a new attribute or mechanism.
[…] auto. This means that the base direction will be determined by examining the content itself.
[…] auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.
[…] span-like element or construct.
[…] before and after line locations.
[…] vertical- values in CSS (only) should use UTR50 to apply default text orientation of characters. (This does not apply to writing modes that are equivalent to sideways- in CSS.)
[…] sideways-lr and sideways-rl in CSS to allow for vertical rotation of lines of horizontal script text. UTR50 is not applicable for these cases.
[…] rb rb rt rt).
[…] rb tag for ruby bases.
[…] b for bold, and i for italic.
Note that Microdata to RDF makes no such restriction, and includes a process for crafting URIs for @itemprop values based on the document location. Either this restriction should be removed (requiring only @itemscope), or we’ll need to remove that mechanism from a future Microdata to RDF update. In the absence of either @itemid or @itemtype, a Microdata to RDF processor will generate triples using blank node identifiers. Either such a processor should not ever generate triples without seeing an @itemtype, or @itemid should be allowed without @itemtype.
At the moment, microdata-rdf is listed as a dependency. As we are adding a JSON-LD and an RDF conversion into the document (normatively), I wonder what the fate of that note, and therefore the dependency, should be.
In my view, the cleanest option will be to rescind (or something similar) the microdata-rdf document. Indeed, JSON-LD and RDF are RDF serializations, therefore it seems to be unnecessary to keep that document. A note in the new microdata spec may want to make that clear.
(In view of the limited, but nevertheless existing, deployment of microdata-rdf, it may be worth comparing and making sure that we define the same mappings in terms of RDF…)
Section 7.1 says the data attribute must be present when there is an itemprop attribute on the iframe and embed elements. I think it should be the src attribute...
As noted in #7 the ToC is broken.
There are no properties in the DOM for microdata - a parser in JS needs to use document.querySelectorAll('[itemscope]') to find items, and getAttribute() and friends to process them.
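The same attribute-scanning approach applies outside the browser. As a minimal sketch, the Python below uses the standard library's html.parser to collect every element that carries itemscope. The class name `ItemFinder`, the helper `find_items`, and the shape of the returned dictionaries are hypothetical; it only locates items and does not walk their properties.

```python
from html.parser import HTMLParser


class ItemFinder(HTMLParser):
    """Collect start tags that carry an itemscope attribute."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Boolean attributes such as itemscope arrive with a value of None.
        if "itemscope" in attr_map:
            self.items.append({
                "tag": tag,
                # itemtype is a space-separated list of URLs.
                "itemtype": (attr_map.get("itemtype") or "").split(),
                "itemid": attr_map.get("itemid"),
            })


def find_items(html):
    """Return one dictionary per itemscope element, in document order."""
    finder = ItemFinder()
    finder.feed(html)
    finder.close()
    return finder.items
```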
It's apparently unimplemented.
A quick messy manual test that might not show much more than the original demo I adapted it from
From the W3C security and privacy questionnaire
An @content attribute is used elsewhere in the HTML universe (at least RDFa).
It appears that at schema.org we have mistakenly assumed it was part of Microdata or HTML proper. If you grep for @content appearing alongside @itemprop in the schema.org examples, there are lots of examples which use it. This idiom is intended to allow a more machine-friendly property value to be parsed out, while something more appropriate to human audiences is also accessible for non-machines. It may also help with i18n where schema designs contain e.g. English-language strings but the markup is otherwise in another natural language.
If you specify an itemtype, then there is an idea that its specification describes "relevant types", and so you don't need to use a full URL to parse them as part of the same specification.
What defines this? Does
<div itemscope itemtype="http://schema.org/Thing">
<p itemprop="name">My thing</p>
</div>
mean that the Thing has a schema.org name? That is my understanding of what happens in reality, but as far as I can understand the specification, if that code is at "http://example.org/some/page" the property should be "http://example.org/some/pagename".
Which means that the interpretation of the property as a schema name is happening by some undocumented magic - parsing according to the "Microdata to RDF" note, or just by deciding that this is how to parse schema.org typed items because that makes sense.
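To make the "magic" explicit, here is a minimal Python sketch of one plausible rule: derive a vocabulary prefix from the item type by cutting after its last "/" or "#", and prepend it to short property names. This is an assumption modelled loosely on how the Microdata to RDF note appears to work, not a statement of what the Microdata spec requires, and the function name `expand_property` is hypothetical.

```python
from urllib.parse import urlsplit


def expand_property(name, itemtype):
    """Expand an itemprop name against the vocabulary implied by itemtype."""
    # A name that already has a URL scheme is treated as absolute and kept.
    if urlsplit(name).scheme:
        return name
    # Assumed rule: the vocabulary is the item type up to and including
    # its last "/" or "#".
    cut = max(itemtype.rfind("/"), itemtype.rfind("#"))
    return itemtype[:cut + 1] + name
```

With this rule, the name property on a http://schema.org/Thing item expands to http://schema.org/name regardless of where the page is hosted.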
RDFa and JSON-LD are both serializations of RDF. This means that, when converted to RDF, both conversion results should produce equivalent graphs.
However... this does not seem to be the case. At least the way I read it:
[…] items property, which yields, in RDF, one subject (a blank node, actually) which has a number of <items> _:XYZ pairs, where _:XYZ are blank nodes with the content coming from a specific itemscope
[…] _:XYZ triplets without any common subjects binding them together.
This can be easily solved. Either
[…] @graph construct, which can be used to specify a number of more or less independent groups of triples with common subjects
[…] items
I am more in favour of the first approach to solve this, but the second one is also a solution.
(As an aside, the JSON-LD example is incomplete, there is no @context.)
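For what the @graph approach could look like, here is a minimal Python sketch that turns microdata JSON (a top-level items array whose entries have type, id and properties) into a JSON-LD document with a @context and a @graph. It is only an illustration under stated assumptions: the function names `to_node` and `to_jsonld` are hypothetical, and the `vocab` argument is supplied by the caller rather than derived from the item types.

```python
import json


def to_node(item):
    """Convert one microdata JSON item into a JSON-LD node object."""
    node = {}
    if item.get("type"):
        node["@type"] = list(item["type"])
    if item.get("id"):
        node["@id"] = item["id"]
    for name, values in item.get("properties", {}).items():
        # Nested items become nested nodes; strings are kept as they are.
        node[name] = [to_node(v) if isinstance(v, dict) else v for v in values]
    return node


def to_jsonld(microdata, vocab):
    """Wrap the top-level items in @graph instead of an items property."""
    return {
        "@context": {"@vocab": vocab},
        "@graph": [to_node(item) for item in microdata.get("items", [])],
    }


example = {"items": [
    {"type": ["http://schema.org/Book"], "properties": {"name": ["First"]}},
    {"type": ["http://schema.org/Book"], "properties": {"name": ["Second"]}},
]}
print(json.dumps(to_jsonld(example, "http://schema.org/"), indent=2))
```

Because the top-level items sit directly in @graph, no extra subject is introduced to bind them together.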
Cc: @gkellogg
Section 4 uses the term "global identifier", but does not reference it. Additionally, this section would seem to be about @itemid, however it is not discussed in this context. It looks like there is some missing text.
Since we extend HTML, we should make sure that it is listed for addition. http://w3c.github.io/html-extensions
Would you like time at TPAC to update the WG on progress and/or bring issues up for discussion?
Please let me know by Friday 16th June. If yes, please also let me know how much time you think you'll need.
Google and SDL both do this already:
<div itemscope itemtype="http://schema.org/CreativeWork">
<time itemprop="name" content="time content attribute"
datetime="2017-05-19T02:59">time element text content</time>
<time itemprop="name" datetime="2017-05-19T02:59">time element text content</time>
<time itemprop="name">time element only has text content</time>
</div>
gives 3 names:
time content attribute
2017-05-19T02:59
time element only has text content
I'm proposing to match this behaviour in the algorithm for determining values. @iherman ?
The specification mentions vocabularies, and vocabulary specifications, dozens of times. It makes assertions about vocabulary design, and about constraints that are imposed by vocabularies. But it never actually says what a vocabulary is.
I think that a lot of the fixing needed is editorial, but given that there is no formal way of processing a vocabulary, we might end up making some substantive changes like removing constraints, or instead of saying "only if it is allowed by a vocabulary" provide the more actionable "unless invalid according to a machine-readable specification of the item type", or some such.
This Call For Consensus (CFC) is to move the current Microdata Editors Draft (ED) to First Public Working Draft (FPWD).
Changes between the 2013 W3C Note and the current ED:
Please respond to this CFC by the end of day on Monday 17 April 2017. Positive responses are encouraged, in the form of a +1 or -1 on this thread, or by posting a message to [email protected]. Silence will be taken as consent with the proposal to move to a FPWD.
Hopefully this will be closed by adding it to specref
The spec talks about whether or not a vocabulary supports itemid - but provides no explanation of how to make it clear whether this happens, nor what it means if it is not supported.
I suggest that we remove the question of whether itemid is supported by a vocabulary, and just state that if present it represents an identifier for the element it is on.
The primary use of Microdata on the web is to contain schema.org-based metadata. From the recent common crawl 2.5/3 million pages include Microdata. All of this data has an RDF interpretation, which is in fact critical to extracting information from the pages.
The microdata spec should integrate the work from the Microdata to RDF Note.
See also #2, #5, #15 and https://lists.w3.org/Archives/Public/public-whatwg-archive/2012Aug/0101.html (which gives a sense of what is required as changes).
microdata makes it hard to have inverses, unlike RDFa and JSON-LD. This means that any vocabulary which wants to work with all three has to add a whole set of inverse properties to make microdata useful.
A reverse property, or similar, as per w3c/microdata-rdf#24 would be useful.
See also https://www.w3.org/wiki/WebSchemas/InverseProperties (notes from schema.org discussions a few years ago).
If not, should we remove that bit?
jar, Jason, zip
If an HTML document is hosted somewhere other than http://example.org, and it has <base href="http:example.org">, do parsers resolve the URL relative to the base element, or not?
And likewise for XML…
I believe it would be better to use the same example for JSON-LD, RDFa (and JSON). It is better for the readability of the document...
(There may be several examples; the current RDFa example contains the itemref trick, which is great because RDFa can indeed reproduce that...)
For e.g. serious internationalisation microdata has some pretty fundamental issues - see e.g. #21, #22. It is possible to work around most of these by converting to another format. Since most tools seem to work with both, instead of trying to rewrite microdata which would make it more complex, I think we should just suggest that people use RDFa or JSON-LD if they need their capabilities.
There seems to be something wrong with the syntax highlighting of a few examples in section 4.2.
Colors are missing in these parts (each list item represents one example):
<div itemscope itemtype
<div itemscope itemtype
<div itemscope>
<figure>
<span itemscope>
and <figure>
(btw., it would be helpful to number the examples and/or give them ids)
We say 'property' a lot, without saying what it is.
A sentence fragment says "User agents are " (including the trailing space) and nothing more. Either it needs more or it should be deleted. It's in the W3C Working Draft of June 26, 2017, section 5.2, Note (green box), at the end.
itemId is a great addition, but it would be great if there was an example of it being used. The current description is minimal.
"This is an absolute URL that provides a global identifier for an item. The itemid attribute must not be specified on elements that do not have both an itemscope attribute and an itemtype attribute specified." - https://www.w3.org/TR/2017/WD-microdata-20170626/#dfn-itemid
Simply including the attribute on one of the page's examples would be very helpful to new microdata users.
The specification claims it is only processed in HTML. Is that true?
See the Internationalisation and Localisation section ...
The title is "Values: the content attribute", however the section generally describes getting a value for a property, which includes, but is not limited to, the content attribute.
For various reasons - internationalisation, accessibility, … - it is helpful to have rich / marked up text for content, but microdata currently strips everything back to raw text. Is it possible to change that?
See also #4
Clarify that as it stands, microdata loses accessibility information like alt attributes, aria annotations, etc., and that if people need to preserve this they should consider RDFa with its greater expressivity, or use the markup really explicitly.
At present, the RDFa generation does not generate RDFa Lite. I also believe that changing, in step 5, the reference to the about attribute to resource would do the trick. The way microdata uses itemid will lead, I believe, to identical results.
(I admit my RDFa becomes a bit rusty, so I rely on @gkellogg to watch over my shoulders…)
Describe the similarities and differences. Related to #3
Section 7.1 currently extends content models by making various attributes required in circumstances where the microdata processing won't otherwise work.
I suggest that we state that it is a microdata error if something is missing leading to broken parsing. It seems at first glance reasonable to add the content model constraints, which basically amount to defining authoring errors, but we should think about this.
Incomplete sentence in a Note in section 5.2:
User agents are
From the W3C security and privacy questionnaire
In the list of editors, the link to "Dan Brickley" returns a 404.
The Values section does not make use of the language of the element (as established using @lang or @xml:lang on an ancestor or self).
This could certainly pertain to the textContent of an element and potentially the value of the @content attribute. RDFa uses the current language when creating a literal from @content, but it could be argued either way.
Of course, the JSON expression cannot make use of the language, but it is useful to have in an abstract model for the purposes of generating RDF or JSON-LD.