Comments (12)

ArtemSokolov commented on August 14, 2024

Thanks, @jimmymathews. I think having a Frictionless Data representation would be extremely valuable, but my vote would be for "in addition" rather than "instead".

In general, I completely agree that having a human-readable representation simplifies schema maintenance. This is actually what motivated us to use YAML instead of the more popular JSON format. (As you know, YAML is often used for configuration specifications, e.g., in Kubernetes deployments, because YAML files are intended to be both machine- and human-readable.) The challenge for a flat-table representation is that the specifications are ragged. For example, it is not obvious how to represent

MITI/03-file.yaml

Lines 87 to 95 in 5311594

Immersion Medium:
  description: the imaging medium affects the working NA of the objective
  type: string
  valid-values:
    - Air
    - Water
    - Glycerin
    - Oil
  significance: recommended

and

MITI/03-file.yaml

Lines 100 to 105 in 5311594

Frame_ Averaging:
  description: Number of frames averaged together (if no averaging, set to 1)
  type: integer
  valid-values:
    min: 1.0
  significance: recommended

in the same table, in part because the format and length of valid-values: depend on type:. The standard approach (and, it sounds like, the approach you're taking) would be to use multiple tables that cross-reference each other. To me, it's not immediately obvious that a collection of tables offers much of an advantage over a single self-contained YAML file, when it comes to by-hand editing. But this could just come down to personal preference.
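To make the raggedness concrete, here is a minimal Python sketch (not part of the MITI repository) that flattens the two fields quoted above into one long-format table with a row per constraint. The `SPEC` dictionary is hand-copied from the quoted lines rather than parsed from 03-file.yaml, and the column names `field`, `constraint`, and `value`, as well as the `enum` label, are assumptions made purely for illustration.

```python
import csv
import io

# Hand-copied from the two quoted snippets; a real script would parse the YAML file.
SPEC = {
    "Immersion Medium": {
        "description": "the imaging medium affects the working NA of the objective",
        "type": "string",
        "valid-values": ["Air", "Water", "Glycerin", "Oil"],
        "significance": "recommended",
    },
    "Frame_ Averaging": {
        "description": "Number of frames averaged together (if no averaging, set to 1)",
        "type": "integer",
        "valid-values": {"min": 1.0},
        "significance": "recommended",
    },
}


def constraint_rows(spec):
    """Yield one (field, constraint, value) tuple per entry under valid-values."""
    for field, props in spec.items():
        valid = props.get("valid-values")
        if isinstance(valid, list):
            # Enumerated strings: one row per allowed value.
            for value in valid:
                yield (field, "enum", value)
        elif isinstance(valid, dict):
            # Keyed constraints such as min: one row per key.
            for key, value in valid.items():
                yield (field, key, value)


def to_csv(rows, header):
    """Serialize rows to CSV text with the given header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(constraint_rows(SPEC), ["field", "constraint", "value"]))
```

The cost of squeezing both shapes into one table is that the `value` column mixes strings and numbers, which is one reason multi-table layouts are the usual alternative.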

My personal vote would be to continue maintaining the "source" specification in YAML format, but begin accumulating scripts that automatically translate the "source" spec into other representations, including flat tables and Frictionless Data. In that sense, I fully welcome the proposed PR, which we can wrap into another GitHub Action to keep all representations automatically synchronized.

I am curious to hear what @DenisSch, @jmuhlich and @adamjtaylor think.

from miti.

ArtemSokolov commented on August 14, 2024

There is always going to be some discrepancy in the level of expert knowledge when it comes to specific fields and values. I don't necessarily think that introducing additional complexity to the schema is the best way to close that knowledge gap. What I propose is that we make better use of the MITI website (https://www.miti-consortium.org/) and create additional pages with links to existing microscopy resources and/or precise definitions of Water, etc. This would avoid potentially unnecessary expansion of the schema, while providing a reference resource for non-experts.

So, maybe to summarize the action items:

  1. Implement scripts that can convert between the various representations, including but not limited to YAML, flat tables and Frictionless Data.
  2. Expand the MITI website with references to standard microscopy resources and/or field/value definitions.
  3. Decide on a representation to be the canonical reference, and wrap all scripts that convert it to other representations into GitHub Actions that will trigger on new PRs and git merges.
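As a rough illustration of item 1, the sketch below converts an already-parsed field specification into a Frictionless Table Schema descriptor. It is an assumption-laden example, not the script from the proposed PR: the `SPEC` dictionary is hand-copied from the Immersion Medium and Frame Averaging snippets quoted earlier in this thread, the function name `to_table_schema` is hypothetical, and carrying `significance` over as an extra, non-standard property is just one possible choice. The `enum` and `minimum` constraint names are the ones Table Schema defines.

```python
import json

# Hand-copied from the quoted YAML; a real converter would load the spec files.
SPEC = {
    "Immersion Medium": {
        "description": "the imaging medium affects the working NA of the objective",
        "type": "string",
        "valid-values": ["Air", "Water", "Glycerin", "Oil"],
        "significance": "recommended",
    },
    "Frame_ Averaging": {
        "description": "Number of frames averaged together (if no averaging, set to 1)",
        "type": "integer",
        "valid-values": {"min": 1.0},
        "significance": "recommended",
    },
}


def to_table_schema(spec):
    """Build a Table Schema style descriptor (a plain dict) from the spec."""
    fields = []
    for name, props in spec.items():
        field = {
            "name": name,
            "type": props["type"],
            "description": props["description"],
            # Assumption: kept as an extra property, since Table Schema has
            # no built-in notion of "recommended".
            "significance": props["significance"],
        }
        constraints = {}
        valid = props.get("valid-values")
        if isinstance(valid, list):
            constraints["enum"] = list(valid)
        elif isinstance(valid, dict) and "min" in valid:
            # The YAML stores 1.0 even for integer fields, so cast to match the type.
            cast = int if props["type"] == "integer" else float
            constraints["minimum"] = cast(valid["min"])
        if constraints:
            field["constraints"] = constraints
        fields.append(field)
    return {"fields": fields}


if __name__ == "__main__":
    print(json.dumps(to_table_schema(SPEC), indent=2))
```

Because the converter is a pure function of the parsed spec, it could run unchanged inside a GitHub Action of the kind described in item 3.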

I think 3. will likely come down to the preference of MITI maintainers. To me, it makes sense to make their jobs as easy as possible and automate everything else.

We will probably hear some more thoughts from maintainers and the governing board next week; I believe a number of folks are still on vacation.

DenisSch commented on August 14, 2024

I am happy to organise a call with @jimmymathews after our next governance meeting (next week) to discuss this topic in more detail efficiently.

jimmymathews commented on August 14, 2024

This makes sense... A synchronized multi-format schema would be very useful across multiple domains. Although, of course, one must be chosen as the canonical reference, and synchronization is non-trivial work.

You're right that my approach in your example would be to separate out the "valid values" into a separate file. But the reason for this is not to deal with the raggedness in 03-file.yaml. It is rather because I would regard Air, Water, Glycerin, Oil as first-class data of its own, not schema information. Different datasets may involve different "Immersion medium" values, and some may even have additional information about specific media. I would think that hard coding data values in the schema specification would lead to habitual non-conformity to the spec.

jimmymathews commented on August 14, 2024

You write that: "To me, it's not immediately obvious that a collection of tables offers much of an advantage over a single self-contained YAML file, when it comes to by-hand editing."

This sounds right, but this comment probably means I miscommunicated a bit.

To clarify: I'm suggesting that the schema be located in (essentially) one file -- the fields table. This one file would be hand-edited by schema designers. The "collection of tables", multiple files, refers to the data bundles, not the schema/spec. As things stand, there are 8 YAML files comprising the spec, not "a single self-contained YAML file".

ArtemSokolov commented on August 14, 2024

"As things stand, there are 8 YAML files comprising the spec, not 'a single self-contained YAML file'."

@jimmymathews Sorry, I meant that replacing any one YAML file with multiple tables does not confer an advantage for by-hand editing (in my opinion). My interpretation of your original post was that the 8 YAML files (each of which is self-contained) would be replaced by a collection of tables, with each table adhering to a particular convention, such as Tidy Data for example. However, it sounds like I am misinterpreting your proposal?

To assist with future discussion, can you maybe share a small example of what you envision as a "one file, multiple table" format for by-hand maintenance?

"It is rather because I would regard Air, Water, Glycerin, Oil as first-class data of its own, not schema information."

I'm probably not the best person to comment on valid-values, which were defined by domain experts, but my understanding is that it is preferable to have these enforced by the schema instead of allowing data providers to define their own. Centralizing valid values in the schema enables standardization across datasets and removes a whole class of wrangling issues associated with different sources using "Water" vs. "water", etc.
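A minimal Python sketch of what schema-side enforcement buys, assuming the four immersion media from the quoted YAML are the full list of valid values. The function name `check_enum` and the message wording are hypothetical; the point is only that a central list lets a validator reject "water" and point at the intended spelling.

```python
# Valid values copied from the quoted Immersion Medium entry.
VALID_IMMERSION_MEDIA = ["Air", "Water", "Glycerin", "Oil"]


def check_enum(value, valid_values):
    """Return None when value is valid, otherwise a message describing the problem."""
    if value in valid_values:
        return None
    # Look for a match that differs only in case or surrounding whitespace.
    by_folded = {v.casefold(): v for v in valid_values}
    suggestion = by_folded.get(value.strip().casefold())
    if suggestion is not None:
        return f"{value!r} is not valid; did you mean {suggestion!r}?"
    return f"{value!r} is not one of {valid_values}"


if __name__ == "__main__":
    for candidate in ["Water", "water", "Saline"]:
        print(candidate, "->", check_enum(candidate, VALID_IMMERSION_MEDIA))
```

Rejecting near-misses rather than silently normalizing them is a design choice; it keeps the stored values exactly equal to the schema's spelling.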

Your point about certain scenarios not being covered by existing definitions is very valid. However, given that MITI is still in its infancy, I would advocate for iterative refinement of the schema (even with simple additions of Other to valid-value fields where appropriate) to help with conformance.

Maybe @santas01, @arenasg, @acraquel, @clarenceyapp and others who helped define existing valid-values can comment further.

jimmymathews commented on August 14, 2024

Sorry, I think I wrongly suggested the general desirability of user choice of alternate "valid values". Of course, the standard should take a hard line on valid values. My point is that standardizing names only isn't enough.

Air, Water, etc. are indeed very much needed, for the reasons you point out, as standardized names for real things, kinds of immersion media. But these real things have still not been explained or described.

As things stand a prospective data provider is still left wondering what state of affairs they are claiming holds if they list Water on a file record. (Should they do so if distilled water was used, for example?) At the very least, a 2-column "immersion media" table is needed, with "Name" (to link up with values in other tables) and "Description" (to explain what is meant when that value is chosen).
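The 2-column table described above could look roughly like the following Python sketch. The column names `Name` and `Description` come from the proposal; everything else is an assumption for illustration, including the helper names and the placeholder text. The Water description uses the wording offered elsewhere in this thread, and the other descriptions are left empty rather than guessed.

```python
import csv
import io

# Illustrative "immersion media" table; only Water has a definition here.
IMMERSION_MEDIA_CSV = """Name,Description
Air,
Water,"Distilled water or any of the commercially sold aqueous solutions, which are essentially distilled water."
Glycerin,
Oil,
"""


def load_definitions(text):
    """Parse the 2-column table into a Name -> Description mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Name"]: row["Description"] for row in reader}


def describe(value, definitions):
    """Look up what a value recorded on a file record is claimed to mean."""
    if value not in definitions:
        raise KeyError(f"{value!r} is not a known immersion medium")
    return definitions[value] or "(no definition recorded yet)"


if __name__ == "__main__":
    definitions = load_definitions(IMMERSION_MEDIA_CSV)
    for name in definitions:
        print(f"{name}: {describe(name, definitions)}")
```

The `Name` column doubles as the key that values in other tables link to, so the controlled vocabulary and its definitions live in one place.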

This seems to be a systematic issue with the dozens of fields with mere controlled string values in the spec. On the basis of the specification, data providers do not know what they are claiming scientifically by putting out a compliant dataset, and they do not know what is claimed in datasets they encounter.

Here is the autogenerated fields table I have been referring to (not perfect yet!). And the tables table.

clarenceyapp commented on August 14, 2024

Hi @jimmymathews. I was one of the members who selected the metadata fields for the microscopy/imaging tables such as immersion media. I'd like to clarify what issue you're finding with some of the fields. For common objective lenses, there are only a limited number of possible immersion media one can physically use without damaging the lens, which we've listed as valid values. Lenses can usually take only one type of immersion medium, which is labelled on the side of the lens. If it's a water immersion lens, then the user should be using water. If it's a multi-immersion lens, this should still match one of the valid values we've included. By 'Water', we do mean distilled water or any of the commercially sold aqueous solutions, which are essentially distilled water. Are you suggesting we need to include further granularity between water and distilled water? Any other types of water are not suitable for this sort of image capture.

You mentioned that data providers do not know what they are claiming scientifically by putting out a compliant dataset. How does one design an experiment without knowing what immersion media (and other settings) they've been instructed to use or have used? This would have been done prospectively before data acquisition.

I understand that it would be nice to have a detailed description of every single field, but as you know, there are a lot of fields and it would take a long time to curate. In some cases, a concise description of what a field means is simply not possible without turning it into a lecture. The original point of the MITI standards was to hold metadata, not to act as instructional media. There are several microscopy websites that give thorough explanations which we can link out to if that is useful.

jimmymathews commented on August 14, 2024

Thanks @clarenceyapp, yes, I think you have really homed in on the precise issue! Also, thank you all for your patience with my somewhat unclear posts.

Definitions. You are correct that the main thing I am after is a "description of every single field" (and every hard-coded value in the specification). You are also correct that this is an onerous task to do comprehensively. However:

  • The schema should not be designed to prevent the inclusion of descriptions where they are concise and known.
  • By making a public/community standard, as opposed to an internal/proprietary one, the standard gets to benefit from the input of many people. In the community context, the task of annotating 275 fields should not be regarded as too difficult to attempt. Even 50% coverage would be great. Frankly, if they are as self-evident as you seem to hope they are, this should take an expert only a couple of minutes per field.

I would also distinguish between "definitions" and "descriptions", and suggest that definitions are needed rather than "detailed descriptions". You are of course correct that a metadata specification ought not to be instructional media. But a minimum of semantic content -- one-sentence definitions -- should be considered the minimally sound basis for sharing data.

Best case scenario. The absolute best case scenario, that handles all the problems raised in this thread, would be cases where a pre-existing formal ontology already covers the field. For example, the clinical file has a 'Gender' field. The definition need not be any more complicated than:

"phenotypic sex" as defined by the Phenotype and Trait Ontology: http://purl.obolibrary.org/obo/PATO_0001894

(Link phenotypic sex.)

This is the best case scenario because it wholly reuses others' work on the problem, answering your complaint that this onerous annotation task isn't the point of the metadata standard.

Of course, in any scenario, one still has to decide what one means, and a term/name is not enough. It might seem that what is meant by "gender" is self-evident to experienced researchers in the microscopy domain, but a moment's reflection shows this not to be the case. Some species have a karyotypic sex concept. The karyotypic sex system is different for birds and for mammals. And so on. Choosing a definition, even an incomplete one, helps to immunize against getting mired in such tangential issues, raised by pedants like myself.

Clarification. You write:

"You mentioned that data providers do not know what they are claiming scientifically by putting out a compliant dataset. How does one design an experiment without knowing what immersion media (and other settings) they've been instructed to use or have used? This would have been done prospectively before data acquisition."

I think you misunderstood my point. The investigator of course knows what they would like to claim scientifically. What is problematic is that the MITI specification currently prevents them from communicating the content of this claim via MITI-compliant datasets. My complaint is answered in this case if Water comes with this definition text:

Distilled water or any of the commercially sold aqueous solutions, which are essentially distilled water.

jimmymathews commented on August 14, 2024

This plan makes sense. It seems this Issue is more or less settled. Vote to close?

I plan to submit a PR with scripts to support part of item (1). Items (2) and (3) should perhaps get their own Issues.

ArtemSokolov commented on August 14, 2024

Thanks, @jimmymathews
Maybe close via PR? But it's up to you if you want to close now.

jimmymathews commented on August 14, 2024

That would be great, count me in. Thank you @DenisSch
