
Sinatra app to parse people names from biodiversity occurrence data, apply basic regular expressions and heuristics to disambiguate them, and present the resulting occurrence records as entities that can be claimed by people via ORCID.

Home Page: https://bionomia.net

License: MIT License

Ruby 33.55% CSS 1.10% JavaScript 4.13% Haml 61.07% HTML 0.15%

bionomia's Introduction

Bionomia

Sinatra app to parse people names from structured biodiversity occurrence data, apply basic regular expressions and heuristics to disambiguate them, and then allow the associated records to be claimed by authenticated users via ORCID. Authenticated users may also help other users who have either ORCID or Wikidata identifiers. The web application lives at https://bionomia.net.

Translations

Strings of text in the user interface are translatable via config/locales. Large pages of text are fully translatable in the views/static_i18n/ directory.

Requirements

Ruby 3.2.1+
Elasticsearch 8.10.2+
MySQL 8.0.34+
Redis 7.0.12+
Apache Spark 3+
Unix-based operating system to use GNU parallel to process GBIF downloads

Installation

 $ git clone https://github.com/bionomia/bionomia.git
 $ cd bionomia
 $ gem install bundler
 $ bundle install
 $ mysql -u root bionomia < db/bionomia.sql
 $ cp config/settings/development.yml.sample config/settings/development.yml
 # Adjust content of development.yml
 # Copy and edit production.yml and test.yml as above
 $ RUBY_YJIT_ENABLE=true rackup -p 4567 config.ru

License

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

bionomia's Issues

Dedup organizations

Populate grid identifiers
Remove duplicate organizations
Improve code in how organizations are queried/found/stored when new ORCID record is created.

Create a repeatable ingestion routine for GBIF records

Gulp.

How to deal with expeditions (of named collectors)?

Hi David,

I have been talking to one of our more prolific and engaged collectors here at the RBGE, Martin Gardner. He has just created ORCID and Bloodhound accounts and claimed a number of his specimens.

But he was worried/thinking about his expedition collections.

These are specimens collected as part of a named and coded expedition. (see examples below)

There are two ways to solve the issue, I think:

First, should we be exporting the participants' string as well as the code and name, i.e.

DCI Darwin Chilean Initiative (2002 - 2005) Gardner, Martin Fraser Thomas, Philip Ian Hechenleitner Vega, Paulina Martínez, C. Brownless, Peter

Second, add the code of the expedition, i.e. DCI, or the name, i.e. DCI Darwin Chilean Initiative (2002 - 2005), to the ORCID record of each participant.

IMO the first is easiest, as it's a simple change in the data going to GBIF. The second is more complicated, as it asks collectors to add to their own ORCID records, and it has to be added for each collector.

Yours

Rob Cubey

Create better spreadsheet download, upload

Investigate roo gem https://github.com/roo-rb/roo for native Excel I/O for offline claims & attributions.

sort

Is there any way to sort the list of found records so that I could more easily determine things that are not the right person simply by looking at the collection year? sorting by collector or determiner column would also be amazing.

Create verification routine for users

Bulk selection of records has resulted in potential for error. Create a user-facing utility for them to "double-check" suspiciously tagged records that may not have actually been recorded or identified by the user.

Add a settings option in one's profile to show/hide particular records

Siobhan Leachman would like to prioritize her linking efforts to more openly licensed records https://twitter.com/SiobhanLeachman/status/1189944123257737216?s=20. One way to do this is to add a config setting in one's profile to show/hide specimen records with a particular license(s). The other side of this is the need for a more robust way to show per-institution records, not currently possible. Such a view ought to additionally show the consequences of choosing CC-BY-NC over licences without the NC clause. Have the NC clause on your data? Siobhan might not ever help attribute your institution's specimens.

Internationalize all text strings in preparation for translations

Placeholder ticket to continue making locales for text strings and blocks of text throughout the site. First release will be support for FR. Plan for ways to seek help for other languages.

Cataloged by

Hi David

Love this!!! Awesome idea to provide attribution for tasks undertaken given my interest in attribution and integration of data. I think this would be very useful as metrics for hiring purposes, not only for researchers but for collection managers to gain an idea of how well "rounded" they are.

With this in mind, cataloged by is another metric that would help with this and one that we routinely share as part of Darwin Core (maybe not as ubiquitously across collections as the other two). This would give a metric of curatorial effort along with the collecting and identified by effort already being collected. Any chance of adding this as a third category?

I also agree with the "not me" button proposed by Rich so that you can purge stuff that is not related.

Make lists of recordedByID or identifiedByID values that did not make the cut

A number of data providers to GBIF are making use of identifiedByID and recordedByID and several use identifiers like VIAF, ISNI, HUH or other. When these are discovered, they are queried in wikidata to find a Q number. On occasion however, these items do not have birth or death dates and so the attributions are not drawn-in. It would help if we could see which of these are missing birth/death dates, included in wikidata, so that the attributions can be properly shown.

Churn institution codes back into organizations

Metrics like this https://bloodhound-tracker.net/organization/7864/metrics are incomprehensible, known only to insiders who recognize these institution codes. Please churn these institution codes back into organization names & URLs so a human can follow what's going on.

Feature request: sort country listings by # of specimens

Rather than showing an alpha-order listing of all possible claimants, provide ability to sort (if not providing as a default view) the list by number of specimens.

Filtering or separating entries without claims might also de-clutter the view, imho
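
The ordering requested here could be sketched roughly as follows. This is a minimal illustration only: the `order_claimants` method and the `:name` and `:specimen_count` keys are hypothetical and not taken from the Bionomia codebase.

```ruby
# Hypothetical sketch: order a country listing by number of specimens and
# keep entries without any claims in a separate group.
def order_claimants(rows)
  # Rows with a missing or zero count go into the "without claims" group.
  with_claims, without_claims = rows.partition { |r| r[:specimen_count].to_i > 0 }
  {
    # Largest counts first; ties fall back to alphabetical order by name.
    with_claims: with_claims.sort_by { |r| [-r[:specimen_count], r[:name]] },
    without_claims: without_claims.sort_by { |r| r[:name] }
  }
end
```

Keeping the two groups apart would let a view render the unclaimed entries in a collapsed section instead of dropping them.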

Use GBIF's normalized Family names from provider

As in the subject. Raw Family names from publishers tend to be all over the place, caps, etc.

Add attribution, licensing to images when present

Gather the attribution and license from wikidata for images that are not public domain & then show.

Improve performance of rendering dataset or organization & dataset pages

When there is an exceptional number of specimen records attributed to people AND an exceptional number of specimen records overall, rendering of the organization and/or dataset pages falls on its face. Find more efficient ways to present these pages.

a wrinkle on "sort", "filter"

Hi David,
I would find it immensely useful to filter or sort on institution code, as within an institution a common name string will often be unique to a particular collector, and one is often pretty safe in ascribing the specimen (collected) to that person.
Cheers,
Margaret Donald

Deposited at is incomplete

I have collected quite some specimens https://bloodhound-tracker.net/0000-0002-4329-4892 that are deposited at Naturalis Biodiversity Center, a merger of the former Naturalis, Leiden herbarium (L), Utrecht herbarium (U), Wageningen herbarium (Wag) and the zoological museum of Amsterdam, like this one https://www.gbif.org/occurrence/2291505282. The page - Deposited at - however does not show any of these institutions. Any reasons why this is the case?

Explore other means of authentication

Need authentication other than ORCID. Time to expand beyond reach of a single niche.

Enable I18n as subdomain

In application.rb, add the following:

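  # Take the locale from the leading subdomain (a host starting with "fr." gives :fr).
  # I18n.locale is thread-local, so reset to the default when there is no match.
  before do
    locale = request.host.split('.').first.to_s.to_sym
    I18n.locale = I18n.available_locales.include?(locale) ? locale : I18n.default_locale
  end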

Add wild-card DNS
Add drop-down in nav

Add privacy and terms of service links to homepage footer

As in subject.

New (better!) visualizations in profile tabs

bubble charts for Deposited At, https://observablehq.com/@d3/bubble-chart, https://stackoverflow.com/questions/24336898/d3-bubble-chart-pack-layout-how-to-make-bubbles-radiate-out-from-the-largest
bar charts for Specialties

Create a map interface to claim candidate collected specimens

Potentially a big new feature. Considerations:

Use leaflet, popups for a paged list of specimen records within which each record may be claimed
Migrate to PostgreSQL?
Churn into basic, flat map on public user profile as generally viewable "places I've collected"

Make a “might be them, but not sure” button + interface

Should there be a “might be them, but not sure” button + interface to bubble these up for review by others for confirmation?

+1 on this suggestion—it would be very helpful at times.

Originally posted by @kcopas in https://github.com/dshorthouse/bloodhound/issues/65#issuecomment-515748104

Prevent creation of deceased user with ORCID through resolution of recordedByID or identifiedByID

GUIDs in recordedByID and identifiedByID might be ORCIDs of users that are deceased and also have a wikidata item. Find a way to prevent creation of these seeming duplicate people when attributions are ingested from these terms in GBIF downloads.

Allow users to create groups and to solicit membership

Half-baked idea, but perhaps it would be useful for users to create groups (+ custom theme banner) and to solicit membership. There are many such taxonomically- and/or geographically-themed associations but few would ever have an opportunity to aggregate all their specimens as a showcase of their group's collective effort. In effect, this would be comparable to iNaturalist but for preserved specimens already shared to GBIF.

Use "not me" declarations to prevent showing similar records when GBIF data are re-ingested

As in subject.

Add taxonomic Family to advanced search & filter

As in the subject.

Visually differentiate user's private profile from their public profile

As in the subject. The two views are very similar and need to be made visibly distinct.

Add caching to profile overview to accelerate login

Users with many attributions and a large number of other stats experience slow log-in because there is no caching of profile overview page as there is for public profile pages. Add caching and a button to refresh stats to accelerate login.

Investigate use of websockets to accommodate group attribution activity

Though it does not happen often, there have been instances where websockets to dynamically flash & then remove rows in others' UIs could be handy. @tmcelrath and Sarah Tassell's Twitter stream suggest that this could be useful https://twitter.com/BloodhoundTrack/status/1257461732802465795?s=20. This is a placeholder ticket to investigate effort and requirements for the server.

Deal with merged items on wikidata that do not have watched properties

Merge events are only tracked for watched properties. However, merges might still happen for items that do not have any of the watched properties. Example: https://www.wikidata.org/wiki/Q55974055. Find a solution to this, perhaps using the Bionomia ID property.

Find instances of recordedBy with duplicate family names

See https://twitter.com/kylecopas/status/1164262567512027137?s=20 for rationale. Make data available...somewhere.

Use ORCID data to filter candidate list of specimens

When ORCID publication lists or keywords are made available by the user, an effort should be made to filter the candidate resultset by Kingdom by (a) name-finding titles & resolving hierarchy (use GNRD and EOL APIs for this), then (b) cross-referencing against the Kingdom of candidate specimens

Add suspect records to Frictionless Data downloads

Create dedicated spreadsheets in Frictionless Data downloads for datasets that contain occurrence records flagged because dateIdentified or eventDate are at odds with date of birth/death of determiner or collector, respectively.

Find existing ORCID in recordedBy or identifiedBy

Some GBIF data providers have already started to put ORCID IDs in occurrence data. See: https://www.gbif.org/occurrence/1571471067. Should actually use this.
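
One way to detect such identifiers is sketched below, assuming they appear in the usual hyphenated 16-character form somewhere in the raw string. The `OrcidFinder` module and its method names are hypothetical and not part of Bionomia. The check character follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its iDs.

```ruby
# Hypothetical helper, not Bionomia's code: find ORCID iDs in free text.
module OrcidFinder
  # Four hyphenated groups of four; the final character may be the check "X".
  PATTERN = /\b(\d{4}-\d{4}-\d{4}-\d{3}[\dX])\b/

  # Compute the ISO 7064 MOD 11-2 check character from the first 15 digits.
  def self.check_digit(base_digits)
    total = base_digits.each_char.reduce(0) { |sum, c| (sum + c.to_i) * 2 }
    result = (12 - (total % 11)) % 11
    result == 10 ? "X" : result.to_s
  end

  # True when the string has the right shape and a matching check character.
  def self.valid?(orcid)
    digits = orcid.to_s.delete("-")
    return false unless digits.match?(/\A\d{15}[\dX]\z/)

    check_digit(digits[0, 15]) == digits[15]
  end

  # Return each distinct, checksum-valid ORCID iD found in the text.
  def self.extract(text)
    text.to_s.scan(PATTERN).flatten.select { |candidate| valid?(candidate) }.uniq
  end
end
```

Validating the check character keeps catalogue numbers or other digit runs that merely look like an ORCID iD from being treated as one.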

Create a notification system when new GBIF data are ingested

There is currently no way to alert users via email or via change in UI that new specimen data have been ingested and processed. Need to figure out how best to do that. Related bionomia/bloodhound#26.

Add Family lists to private profile as another filtering option

As in the subject line.

Replace xml-sitemap gem

xml-sitemap gem apparently was designed for small sites and is hard-coded to error out with >= 50,000 pages. Replace with gem that can deal with many URLs.
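
Whichever gem is chosen, the usual approach is to split the URLs across several sitemap files and list those files in a sitemap index, since the sitemap protocol caps a single file at 50,000 URLs. The sketch below uses only core Ruby to illustrate the idea. The `build_sitemaps` method, the file names, and the `base_url` parameter are hypothetical.

```ruby
# Hypothetical sketch, not Bionomia's code: split URLs into sitemap files of
# at most 50,000 entries each, plus an index file that lists those files.
MAX_URLS_PER_SITEMAP = 50_000

def build_sitemaps(urls, base_url:, per_file: MAX_URLS_PER_SITEMAP)
  files = {}
  urls.each_slice(per_file).with_index(1) do |chunk, number|
    # String#encode(xml: :text) escapes &, < and > for use in XML text nodes.
    entries = chunk.map { |u| "  <url><loc>#{u.encode(xml: :text)}</loc></url>" }
    files["sitemap-#{number}.xml"] = [
      '<?xml version="1.0" encoding="UTF-8"?>',
      '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
      *entries,
      '</urlset>'
    ].join("\n")
  end

  # Build the index from the chunk files created above.
  index_entries = files.keys.map do |name|
    loc = "#{base_url}/#{name}".encode(xml: :text)
    "  <sitemap><loc>#{loc}</loc></sitemap>"
  end
  files["sitemap-index.xml"] = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    *index_entries,
    '</sitemapindex>'
  ].join("\n")
  files
end
```

The method returns a hash of file name to XML content, so writing to disk or uploading elsewhere is left to the caller.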

Adjust homepage to include more details about benefits

From Sarah Tassell, "expand(ing) the information & the explanation of benefits and uses to the home page"

Re-evaluate how parsing is done for accounts/names drawn in from ORCID

ORCID profiles like https://orcid.org/0000-0003-3659-8019 do have names that parse. Re-evaluate how these are processed.

Parse TL2 and make lists of people not in wikidata

Parse TL2 from https://library.si.edu/data and cross-reference against people already in Bloodhound via wikidata and make list of residuals for the wikidata "community"

Investigate poor search or parsing with hyphenated names

As in the subject. Suspect the issue is more to do with search than with parsing.

Show user's organization when end date is in the future

As in the subject. Current behaviour of user.current_organization is to look for nil on end_year but a year in the future is equally possible from ORCID data.
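
A minimal sketch of the adjusted rule, treating an affiliation as current when its end year is missing or has not yet passed. The `Affiliation` struct and the `current_affiliation` method are hypothetical stand-ins; only the `end_year` attribute name comes from the description above, and counting an end year equal to the present year as current is an assumption.

```ruby
# Hypothetical stand-in for an ORCID affiliation record.
Affiliation = Struct.new(:name, :start_year, :end_year, keyword_init: true)

# Pick the current affiliation: no end year, or an end year that is the
# present year or later. When several qualify, prefer the latest start.
def current_affiliation(affiliations, this_year: Time.now.year)
  affiliations
    .select { |a| a.end_year.nil? || a.end_year >= this_year }
    .max_by { |a| a.start_year || 0 }
end
```

Passing `this_year` explicitly keeps the rule easy to test without depending on the system clock.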

Add country list to profile overview page

Because we cannot zoom on a region of the flat Google maps, small countries are obscured. Make a collapsible list of countries so these can be used as a filter.

Use user-selected records to refine parsing and reconciling

As more users tag their records, there is the potential to use this new information to help refine name parsing and reconciling.

Downloads produce files more than 10k records

See https://twitter.com/KnitMeAThneed/status/1305635828777902080?s=20

Write a utility to check for deleted items on wikidata

A Wikidata admin might delete an item but that will result in an orphaned user in Bionomia. Create a utility to check for deleted items on wikidata and then...do what?!

"not me" button should be "not them" when helping others

When helping others e.g.: https://bionomia.net/help-others/Q22113468 the button to indicate that specimens were not associated with a person should read "not them" rather than "not me"

Hide records when birth or death dates make claim illogical

Now have birth and death dates from wikidata. These should be used to filter out records whose dates collected or determined (when present) fall outside that range.

Issues:

Processing. Dates are horrible.
Store or parse on-the-fly?
Make rows visible & merely highlight with yellow warning colour?
Performance. If parsing is done on-the-fly, what will rendering 200 rows do to performance?
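
To make the trade-offs above more concrete, here is a deliberately conservative sketch that compares only the first four-digit year found in a raw date string against the birth and death years. The `lifespan_check` method and its return values are hypothetical. It flags rows instead of hiding them and reports `:unknown` when no year can be read.

```ruby
require "date"

# Hypothetical sketch: is a raw eventDate or dateIdentified string plausible
# given a person's birth and death dates? Returns :ok, :suspect or :unknown.
def lifespan_check(raw_date, born:, died:)
  # Read the first standalone four-digit year between 1500 and 2099.
  year = raw_date.to_s[/\b(1[5-9]\d{2}|20\d{2})\b/, 1]&.to_i
  return :unknown if year.nil?
  return :suspect if born && year < born.year
  return :suspect if died && year > died.year

  :ok
end
```

Because the comparison works on whole years, a record dated within the year of birth or death is never flagged. This trades some precision for tolerance of partial dates and ranges.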

Make a UI, search, utility, anything that will help convert "agents" to "users"

There are approx. 1.7M agents - raw names of people - but a mere 30k users - those linked to an ORCID or wikidata identifier. This is scary and shows the magnitude of the problem. Need to find a way to help churn the former into the latter, illustrate progress, and do this as quickly as possible.

Ideas

make a hero list of agents with many specimens but none apparently attributed to a user
subdivide the hero list by country and/or organization & show these in the country & organization pages
subdivide the list by decades for date collected and identified as a naive filter for living/deceased or for other reasons
subdivide the list by taxonomic Family
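
The first and third ideas could be prototyped along these lines. This is a rough sketch: the `hero_list` and `by_decade` methods and the `:specimen_count`, `:claimed_count` and `:first_year` keys are hypothetical and do not reflect Bionomia's schema.

```ruby
# Hypothetical sketch: agents with many specimens but none attributed to a user.
def hero_list(agents, limit: 10)
  agents
    .select { |a| a[:claimed_count].to_i.zero? }
    .sort_by { |a| -a[:specimen_count].to_i }
    .first(limit)
end

# Group agents by the decade of their earliest dated record; agents without
# a usable year are grouped under nil.
def by_decade(agents)
  agents.group_by { |a| a[:first_year] ? (a[:first_year] / 10) * 10 : nil }
end
```

The same `group_by` shape would apply to country, organization or taxonomic Family by swapping the key used for grouping.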

And finally

give instructions on how to make a wikidata page and/or programmatically seed one with basics pre-filled
"share me", temporary URL for existing users to share with a colleague that does not yet have an ORCID; a "hey, this is you and you should get an ORCID" email message