Sinatra app to parse people's names from biodiversity occurrence data, apply basic regular expressions and heuristics to disambiguate them, and present the occurrence records as entities that people can claim via ORCID.

Home Page: https://bionomia.net

License: MIT License

Ruby 33.55% CSS 1.10% JavaScript 4.13% Haml 61.07% HTML 0.15%

bionomia's Introduction

Bionomia

Sinatra app to parse people's names from structured biodiversity occurrence data, apply basic regular expressions and heuristics to disambiguate them, and then allow the associated occurrence records to be claimed by authenticated users via ORCID. Authenticated users may also help other users who have either ORCID or Wikidata identifiers. The web application lives at https://bionomia.net.

Translations

Strings of text in the user interface are translatable via config/locales. Large pages of text are fully translatable in the views/static_i18n/ directory.

Requirements

  1. Ruby 3.2.1+
  2. Elasticsearch 8.10.2+
  3. MySQL 8.0.34+
  4. Redis 7.0.12+
  5. Apache Spark 3+
  6. Unix-based operating system to use GNU parallel to process GBIF downloads

Installation

 $ git clone https://github.com/bionomia/bionomia.git
 $ cd bionomia
 $ gem install bundler
 $ bundle install
 $ mysql -u root bionomia < db/bionomia.sql
 $ cp config/settings/development.yml.sample config/settings/development.yml
 # Adjust content of development.yml
 # Copy and edit production.yml and test.yml as above
 $ RUBY_YJIT_ENABLE=true rackup -p 4567 config.ru

License

The MIT License (MIT)

Copyright (c) David P. Shorthouse

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

bionomia's People

Contributors

dependabot[bot], dshorthouse, jcgiron, mattblissett, msimoes123

bionomia's Issues

Dedup organizations

  1. Populate grid identifiers
  2. Remove duplicate organizations
  3. Improve code in how organizations are queried/found/stored when new ORCID record is created.

How to deal with expeditions (of named collectors)?

Hi David,

I have been talking to one of our more prolific and engaged collectors here at the RBGE, Martin Gardner. He has just created ORCID and Bloodhound accounts and claimed a number of his specimens.

But he was worried/thinking about his expedition collections.

These are specimens collected as part of a named and coded expedition. (see examples below)

[two images of example expedition specimens, not reproduced]

There are two ways to solve the issue I think -

  1. Should we be exporting the participant's string as well as the code and name i.e.

DCI Darwin Chilean Initiative (2002 - 2005) Gardner, Martin Fraser Thomas, Philip Ian Hechenleitner Vega, Paulina Martínez, C. Brownless, Peter

Or

  2. Add the code of the expedition (i.e. DCI) or its name (i.e. DCI Darwin Chilean Initiative (2002 - 2005)) to the ORCID record of each participant.

IMO the first is easiest, as it's a simple change in the data going to GBIF. The second is more complicated, as it asks each collector to add the expedition to their own ORCID record.

Yours

Rob Cubey

sort

Is there any way to sort the list of found records so that I could more easily spot records that do not belong to the right person simply by looking at the collection year? Sorting by the collector or determiner column would also be amazing.
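
A minimal sketch of the sorting requested here, assuming each candidate record is a Hash with hypothetical `:collector`, `:determiner` and `:year` keys (these are illustrative and not Bionomia's actual schema). Records that lack a value in the chosen column are placed last so they do not crowd the top of the list.

```ruby
# Hypothetical record shape; the keys are assumptions for illustration only.
RECORDS = [
  { collector: "Collector B", determiner: "Determiner A", year: 2003 },
  { collector: "Collector C", determiner: nil,            year: nil  },
  { collector: "Collector A", determiner: "Determiner B", year: 1998 }
].freeze

# Sort records by the given column, keeping records with a missing value last.
def sort_records(records, column)
  present, missing = records.partition { |r| r[column] }
  present.sort_by { |r| r[column] } + missing
end

by_year      = sort_records(RECORDS, :year)
by_collector = sort_records(RECORDS, :collector)
```

The same helper covers the collector and determiner columns because it only relies on the values in a column being comparable with each other.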

Create verification routine for users

Bulk selection of records has resulted in potential for error. Create a user-facing utility for them to "double-check" suspiciously tagged records that may not have actually been recorded or identified by the user.

Add a settings option in one's profile to show/hide particular records

Siobhan Leachman would like to prioritize her linking efforts to more openly licensed records https://twitter.com/SiobhanLeachman/status/1189944123257737216?s=20. One way to do this is to add a config setting in one's profile to show/hide specimen records with a particular license(s). The other side of this is the need for a more robust way to show per-institution records, not currently possible. Such a view ought to additionally show the consequences of choosing CC-BY-NC over licences without the NC clause. Have the NC clause on your data? Siobhan might not ever help attribute your institution's specimens.

Cataloged by

Hi David

Love this!!! Awesome idea to provide attribution for tasks undertaken given my interest in attribution and integration of data. I think this would be very useful as metrics for hiring purposes, not only for researchers but for collection managers to gain an idea of how well "rounded" they are.

With this in mind, cataloged by is another metric that would help with this and one that we routinely share as part of Darwin Core (maybe not as ubiquitously across collections as the other two). This would give a metric of curatorial effort along with the collecting and identified by effort already being collected. Any chance of adding this as a third category?

I also agree with the "not me" button proposed by Rich so that you can purge stuff that is not related.

Make lists of recordedByID or identifiedByID values that did not make the cut

A number of data providers to GBIF are making use of identifiedByID and recordedByID, and several use identifiers such as VIAF, ISNI, HUH or others. When these are discovered, they are queried in Wikidata to find a Q number. On occasion, however, these items do not have birth or death dates and so the attributions are not drawn in. It would help if we could see which of the items found in Wikidata are missing birth/death dates, so that the attributions can be properly shown.

Feature request: sort country listings by # of specimens

Rather than showing an alpha-order listing of all possible claimants, provide ability to sort (if not providing as a default view) the list by number of specimens.

Filtering or separating entries without claims might also de-clutter the view, imho

a wrinkle on "sort", "filter"

Hi David,
I would find it immensely useful to filter or sort on institution code, as within an institution a common name string will often be unique to a particular collector, and one is often pretty safe in ascribing the specimen (collected) to that person.
Cheers,
Margaret Donald

Enable I18n as subdomain

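  1. In application.rb, add the following:
  before do
    # Use the leftmost host label (e.g. "fr" in fr.example.org) as the locale
    locale = request.host.split('.').first.to_sym
    # Switch only when that label is one of the configured locales
    I18n.locale = locale if I18n.available_locales.include?(locale)
  end
  2. Add wild-card DNS
  3. Add drop-down in nav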
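
The locale lookup in step 1 can be tried outside Sinatra. This is a minimal sketch only: `locale_for_host` is a hypothetical helper, and `AVAILABLE_LOCALES` is a hard-coded stand-in for `I18n.available_locales` whose codes are assumptions. Unlike the filter above, it falls back to a default when the host is empty or the label is not a known locale.

```ruby
# Stand-in for I18n.available_locales; the codes listed here are assumptions.
AVAILABLE_LOCALES = %i[en fr es].freeze

# Hypothetical helper: return the locale named by the leftmost host label,
# or the default when that label is missing or not a known locale.
def locale_for_host(host, default: :en)
  label = host.to_s.split(".").first
  return default if label.nil?

  locale = label.downcase.to_sym
  AVAILABLE_LOCALES.include?(locale) ? locale : default
end

locale_for_host("fr.example.org")   # leftmost label "fr" is a known locale
locale_for_host("www.example.org")  # "www" is not, so the default is used
```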

Create a map interface to claim candidate collected specimens

Potentially a big new feature. Considerations:

  1. Use leaflet, popups for a paged list of specimen records within which each record may be claimed
  2. Migrate to PostgreSQL?
  3. Churn into basic, flat map on public user profile as generally viewable "places I've collected"

Allow users to create groups and to solicit membership

Half-baked idea, but perhaps it would be useful for users to create groups (+ custom theme banner) and to solicit membership. There are many such taxonomically- and/or geographically-themed associations but few would ever have an opportunity to aggregate all their specimens as a showcase of their group's collective effort. In effect, this would be comparable to iNaturalist but for preserved specimens already shared to GBIF.

Add caching to profile overview to accelerate login

Users with many attributions and a large number of other stats experience slow log-in because there is no caching of profile overview page as there is for public profile pages. Add caching and a button to refresh stats to accelerate login.
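
A sketch of the caching idea, using an in-memory Hash as a stand-in for whatever store the app would actually use (Redis is listed in the requirements, but choosing it here would be an assumption). `StatsCache` and its methods are hypothetical names.

```ruby
# Hypothetical cache for expensive per-user profile statistics.
class StatsCache
  def initialize
    @store = {}
  end

  # Return cached stats for user_id; on a miss, compute them with the
  # caller's block and remember the result.
  def fetch(user_id)
    @store.fetch(user_id) { @store[user_id] = yield }
  end

  # Backs a "refresh stats" button: drop the entry so that the next
  # fetch recomputes it.
  def refresh(user_id)
    @store.delete(user_id)
  end
end

cache = StatsCache.new
stats = cache.fetch(42) { { attributions: 1_000 } } # computed once
stats = cache.fetch(42) { { attributions: 0 } }     # served from the cache
```

With this shape, login only pays the cost of computing the overview on the first visit or after an explicit refresh.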

Use ORCID data to filter candidate list of specimens

When ORCID publication lists or keywords are made available by the user, an effort should be made to filter the candidate resultset by Kingdom by (a) name-finding titles & resolving hierarchy (use GNRD and EOL APIs for this), then (b) cross-referencing against the Kingdom of candidate specimens

Add suspect records to Frictionless Data downloads

Create dedicated spreadsheets in Frictionless Data downloads for datasets that contain occurrence records flagged because dateIdentified or eventDate are at odds with date of birth/death of determiner or collector, respectively.

Replace xml-sitemap gem

xml-sitemap gem apparently was designed for small sites and is hard-coded to error out with >= 50,000 pages. Replace with gem that can deal with many URLs.
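
The sitemap protocol caps each sitemap file at 50,000 URLs and expects larger sites to split their URLs across several files listed in a sitemap index. A minimal sketch of that splitting step follows; `partition_sitemaps` and the file naming are hypothetical and do not reflect any particular gem's API.

```ruby
# The sitemap protocol allows at most 50,000 URLs per sitemap file.
MAX_URLS_PER_SITEMAP = 50_000

# Hypothetical helper: split URLs into per-file chunks and name each file.
# Returns a Hash of file name => Array of URLs for that file.
def partition_sitemaps(urls, per_file: MAX_URLS_PER_SITEMAP)
  urls.each_slice(per_file).each_with_index.map do |chunk, i|
    ["sitemap-#{i + 1}.xml", chunk]
  end.to_h
end

urls  = (1..5).map { |i| "https://example.org/page/#{i}" }
files = partition_sitemaps(urls, per_file: 2)
# files.keys lists the entries that a sitemap index would need to reference
```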

Hide records when birth or death dates make claim illogical

Birth and death dates are now available from Wikidata. These should be used to filter out records whose collection or determination dates (when present) fall outside that range.

Issues:

  1. Processing. Dates are horrible.
  2. Store or parse on-the-fly?
  3. Make rows visible & merely highlight with yellow warning colour?
  4. Performance. If parsing is done on-the-fly, what will rendering 200 rows do to performance?
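
One way to sidestep some of the date-parsing pain is to compare at year granularity only. The sketch below assumes that simplification; `extract_year` and `outside_lifespan?` are hypothetical helpers, and the regular expression only recognises years from 1500 to 2099. It takes the first year it finds, so a range such as "1998/2003" is judged by its start year.

```ruby
# Hypothetical helper: pull a four-digit year out of a free-text date such as
# "1998-07-04", "July 1998" or "1998". Returns nil when no year is found.
def extract_year(text)
  match = text.to_s.match(/\b(1[5-9]\d{2}|20\d{2})\b/)
  match && match[1].to_i
end

# True when the event year falls outside the person's lifespan.
# A missing year on either side means "cannot tell", so nothing is flagged.
def outside_lifespan?(event_date, birth_year:, death_year:)
  year = extract_year(event_date)
  return false if year.nil?
  return true if birth_year && year < birth_year
  return true if death_year && year > death_year

  false
end

outside_lifespan?("1998-07-04", birth_year: 1900, death_year: 1980)
```

Because the result is a plain boolean, the caller can either hide a flagged row or keep it visible and highlight it.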

Make a UI, search, utility, anything that will help convert "agents" to "users"

There are approx. 1.7M agents - raw names of people - but a mere 30k users - those linked to an ORCID or wikidata identifier. This is scary and shows the magnitude of the problem. Need to find a way to help churn the former into the latter, illustrate progress, and do this as quickly as possible.

Ideas

  • make a hero list of agents with many specimens but none apparently attributed to a user
  • subdivide the hero list by country and/or organization & show these in the country & organization pages
  • subdivide the list by decades for date collected and identified as a naive filter for living/deceased or for other reasons
  • subdivide the list by taxonomic Family

And finally

  • give instructions on how to make a wikidata page and/or programmatically seed one with basics pre-filled
  • "share me", temporary URL for existing users to share with a colleague that does not yet have an ORCID; a "hey, this is you and you should get an ORCID" email message
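
The "hero list" idea above could start as a simple filter and sort. This is a sketch under assumptions: `Agent` and its fields are hypothetical and do not mirror Bionomia's tables, and `user_id` being nil stands for "no specimens attributed to a user".

```ruby
# Hypothetical agent rows; the fields are assumptions, not Bionomia's schema.
Agent = Struct.new(:name, :specimen_count, :user_id, keyword_init: true)

# Agents with no linked user, ordered by specimen count (largest first).
def hero_list(agents, limit: 10)
  agents.select { |a| a.user_id.nil? }
        .sort_by { |a| -a.specimen_count }
        .first(limit)
end

agents = [
  Agent.new(name: "Agent A", specimen_count: 120,   user_id: nil),
  Agent.new(name: "Agent B", specimen_count: 9_500, user_id: nil),
  Agent.new(name: "Agent C", specimen_count: 7_000, user_id: 17)
]
top = hero_list(agents, limit: 2)
```

Subdividing by country, organization, decade or family would then amount to grouping the input before calling the same helper.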
