
marda-alliance / metadata_extractors_registry


Archive. See Datatractor Yard, below:

Home Page: https://github.com/datatractor/yard

License: MIT License

Procfile 0.29% Python 42.27% xBase 0.58% Dockerfile 3.02% CSS 14.72% HTML 39.13%
chemistry extract-transform-load materials-science metadata registry

metadata_extractors_registry's People

Contributors

edan-bainglass, jdbocarsly, ml-evs, peterkraus, pre-commit-ci[bot]


metadata_extractors_registry's Issues

Allow example files to be provided as remote resources (with bibliographic data)

Committing an example file for everything under the sun into this repo directly is probably not the best way forward. We should have a mechanism for providing persistent links to example files (e.g., archived files with DOIs) that the registry can download and use as test data. Probably this ends up being a registry of example files too, in that case...
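
One way such a mechanism could look is sketched below: each example file is described by a small record holding a persistent link and its DOI, and the download is checked against a stored checksum before being used as test data. The `RemoteExample` record, its field names, and the SHA-256 check are all assumptions for illustration and are not part of the registry.

```python
import hashlib
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteExample:
    """Hypothetical record for an example file stored outside the repo."""

    filetype_id: str  # ID of the registered file type this example belongs to
    url: str          # persistent download link
    doi: str          # bibliographic identifier of the archived file
    sha256: str       # expected checksum (an assumption, not in the issue)


def matches_checksum(data: bytes, expected_sha256: str) -> bool:
    """Return True if the SHA-256 digest of data equals the expected hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def fetch_example(example: RemoteExample, timeout: float = 30.0) -> bytes:
    """Download an example file and refuse to return it if the checksum differs."""
    with urllib.request.urlopen(example.url, timeout=timeout) as response:
        data = response.read()
    if not matches_checksum(data, example.sha256):
        raise ValueError(f"Checksum mismatch for {example.url}")
    return data
```

Storing a checksum next to the link would let the registry notice when a supposedly persistent resource has changed, which matters if the files are used as test data.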

Automated ingestion pipelines

Potential idea, as discussed: each extractor's GitHub repo contains a `.marda.yaml` file that describes its entry in this registry, and the author simply submits a link to the repo here.

The registry's CI can then watch the repo and update the entry whenever the file changes.

Can also make a form/UI for creating the initial yaml file.
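
As a rough sketch of the first step of such a pipeline, the registry could turn a submitted repo link into the raw-file URL of its `.marda.yaml`. The function name `marda_yaml_url` and the default branch name `main` are assumptions; only the `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` URL layout is taken from GitHub itself.

```python
from urllib.parse import urlparse


def marda_yaml_url(repo_url, ref="main", filename=".marda.yaml"):
    """Build the raw-content URL of the registry file in a submitted GitHub repo.

    `ref` defaults to "main", which is an assumption about the repo's branch.
    """
    parsed = urlparse(repo_url)
    if parsed.netloc != "github.com":
        raise ValueError(f"Not a GitHub repository URL: {repo_url}")

    # The path looks like "/<owner>/<repo>", optionally with a ".git" suffix.
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"Expected <owner>/<repo> in URL: {repo_url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]

    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{filename}"
```

A CI job could call this for every submitted link, fetch the file, and open an update to the registry entry only when its contents differ from the stored copy.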

Immediate file types and extractors we are interested in registering

This issue can be used to vaguely track specific filetypes and extractors we want in the registry:

[image attachment not preserved]

Fly deployment failures

Intermittent build problems, e.g., "remote builder app unavailable", seem to be related to Docker layer caching. Destroying the "builder" (not the app) with `fly destroy <builder name>` seems to fix this: the next build then succeeds.

Add lookup endpoint for extractors that support a given file type

This could either be additional data returned by the single-entry file type endpoint, e.g.,

`/registry/filetypes/biologic-mpr` would also return

"relationships": {
    "extractors": [
        "yadg"
    ]
}

or it could be a search endpoint, e.g., `/registry/extractors?supported_filetypes=biologic-mpr`.

I think I prefer the former to start with.
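
The former option could be sketched roughly as follows: build a reverse index from file type IDs to extractor IDs once, then attach it to each file type response. The in-memory `EXTRACTORS` list, the `supported_filetypes` field name (borrowed from the query parameter above), and both function names are assumptions, not the registry's actual models.

```python
from collections import defaultdict

# Hypothetical in-memory registry data; the real registry models may differ.
EXTRACTORS = [
    {"id": "yadg", "supported_filetypes": ["biologic-mpr"]},
]


def build_filetype_index(extractors):
    """Map each file type ID to the sorted IDs of the extractors supporting it."""
    index = defaultdict(set)
    for extractor in extractors:
        for filetype_id in extractor.get("supported_filetypes", []):
            index[filetype_id].add(extractor["id"])
    return {filetype_id: sorted(ids) for filetype_id, ids in index.items()}


def filetype_response(filetype_id, index):
    """Build a single-entry payload carrying the proposed "relationships" block."""
    return {
        "id": filetype_id,
        "relationships": {"extractors": index.get(filetype_id, [])},
    }


index = build_filetype_index(EXTRACTORS)
response = filetype_response("biologic-mpr", index)
```

The same index would also answer the search-endpoint variant, so starting with the embedded relationship does not rule out adding the query parameter later.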

`FileType`: Include example files for the filetype.

The most important thing in our "schema" was perhaps the link to an example file (as simply the name of the instrument and extension oftentimes doesn't describe much). Perhaps one could also consider adding it here.

To make this possible, I once started a "chemical files registry", where I also use a YAML schema similar to yours: https://github.com/kjappelbaum/chemical-files-registry/blob/master/fileDescriptions/analyticalMethods/thermogravimetricAnalysis/ta-txt/description.yml.

I didn't have any time to work on this, but the idea was to collect example files and link to them (and the filetype schema) from the parser registry.

Originally posted by @kjappelbaum in marda-alliance/metadata_extractors_schema#2

Make use of example files for each type in registry

  • Make sure the URLs are resolvable through Fly (probably pointing to GitHub raw links)
  • Add the files as examples in the file types models and expose them for validation
  • Validate entries in the data folder and make sure they correspond to registered file types
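
The last item in the list above could be checked with something like the sketch below. It assumes the data folder contains one subfolder per file type, named after the file type ID; that layout and the function name `find_unregistered_examples` are assumptions for illustration.

```python
from pathlib import Path


def find_unregistered_examples(data_dir, registered_ids):
    """Return the names of subfolders of data_dir that match no registered file type ID."""
    return sorted(
        entry.name
        for entry in Path(data_dir).iterdir()
        if entry.is_dir() and entry.name not in registered_ids
    )
```

A CI step could fail the build whenever this returns a non-empty list, so that example files cannot be added for file types that are missing from the registry.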
